This topic describes the Keyword Extraction component provided by Machine Learning Designer (formerly known as Machine Learning Studio).
Keyword extraction is an important technique in natural language processing that identifies the most representative words in a document. The Keyword Extraction component is based on TextRank, a variation of the PageRank algorithm. TextRank builds a network from the relationships between words, calculates the importance of each word in that network, and returns the words with the largest weights as keywords.
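The description above can be illustrated with a minimal Python sketch. It is not the component's implementation: it assumes that words are linked when they co-occur within a sliding window, that scores follow the normalized PageRank update, and that iteration stops when the largest score change falls below a threshold. The function name `textrank_keywords` and its argument names are hypothetical; the default values mirror the defaults documented for the component.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_n=5, window_size=2, damping=0.85,
                      max_iter=100, epsilon=1e-6):
    """Rank tokens with a TextRank-style iteration over a co-occurrence graph."""
    # Build an undirected weighted graph: two distinct words are linked when
    # they appear within `window_size` consecutive tokens (an assumption).
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window_size, len(tokens))):
            other = tokens[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    words = list(weights)
    if not words:
        return []

    # Start from a uniform score and precompute each node's total edge weight.
    scores = {w: 1.0 / len(words) for w in words}
    out_sum = {w: sum(weights[w].values()) for w in words}

    for _ in range(max_iter):
        new_scores = {}
        for w in words:
            # Each neighbor passes on its score in proportion to edge weight.
            rank = sum(weights[v][w] / out_sum[v] * scores[v]
                       for v in weights[w])
            new_scores[w] = (1.0 - damping) / len(words) + damping * rank
        delta = max(abs(new_scores[w] - scores[w]) for w in words)
        scores = new_scores
        if delta < epsilon:  # converged
            break

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    sample = "grid solver grid flow solver grid".split()
    for word, score in textrank_keywords(sample, top_n=2):
        print(word, score)
```

In this sketch, a request for more keywords than the graph contains simply returns every word, which matches the documented behavior of topN.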
The keyword extraction process includes the following steps:
- Raw corpora preparation
- Tokenization
- Word-based filtering
- Keyword extraction
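The tokenization and filtering steps listed above might look like the following Python sketch for English text. It is only an illustration: it assumes whitespace tokenization, and the stop-word list and the `prepare_document` helper are hypothetical. In a pipeline, these steps are normally handled before the data reaches the Keyword Extraction component.

```python
import string

# Hypothetical stop-word list for illustration only; a real pipeline would
# use a fuller list suited to the corpus language.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def prepare_document(text, stop_words=STOP_WORDS):
    """Return a space-separated word string without stop words or punctuation."""
    kept = []
    for raw in text.split():
        # Strip punctuation from both ends of the token but keep inner
        # hyphens, so "blended-wing-body" survives as one word.
        word = raw.strip(string.punctuation)
        if word and word.lower() not in stop_words:
            kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    print(prepare_document("The shape of the blended-wing-body aircraft, is optimized."))
```

The returned string has the shape that the word column of the input table is expected to have: words separated by spaces, with stop words and punctuation removed.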
Configure the component
You can use one of the following methods to configure the Keyword Extraction component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Keyword Extraction component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). The following table describes the parameters.
Tab | Parameter | Description |
---|---|---|
Fields Setting | Column of Marked Document IDs | The name of the document ID column. |
Fields Setting | Word Splitting Result of Marked Documents | The word splitting results of marked documents. |
Parameters Setting | Output First N Keywords | The number of top N keywords to be provided. The value must be an integer. Default value: 5. |
Parameters Setting | Window Size | The window size. The value must be an integer. Default value: 2. |
Parameters Setting | Damping Coefficient | The damping coefficient. Default value: 0.85. |
Parameters Setting | Maximum Iterations | The maximum number of iterations. Default value: 100. |
Parameters Setting | Convergence Coefficient | The convergence coefficient. Default value: 0.000001. |
Tuning | Cores. Auto-assigned by default. | The number of cores. By default, the system determines the value. |
Tuning | Memory size per core. Auto-assigned by default. | The memory size of each core. By default, the system determines the value. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name KeywordsExtraction
-DinputTableName=maple_test_keywords_basic_input
-DdocIdCol=docid -DdocContent=word
-DoutputTableName=maple_test_keywords_basic_output
-DtopN=19;
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | No default value |
inputTablePartitions | No | The partitions selected from the input table as input, in the format of Partition_name=value. To specify multi-level partitions, use the following format: name1=value1/name2=value2. If you specify multiple partitions, separate them with commas (,). | All partitions |
outputTableName | Yes | The name of the output table. | No default value |
docIdCol | Yes | The name of the document ID column. You can specify only one column. | No default value |
docContent | Yes | The name of the word column. You can specify only one column. | No default value |
topN | No | The number of top-ranked keywords to return. If the value is greater than the total number of keywords, all keywords are returned. | 5 |
windowSize | No | The window size of the TextRank algorithm. | 2 |
dumpingFactor | No | The damping coefficient of the TextRank algorithm. | 0.85 |
maxIter | No | The maximum number of iterations of the TextRank algorithm. | 100 |
epsilon | No | The convergence residual threshold of the TextRank algorithm. | 0.000001 |
lifecycle | No | The lifecycle of the output table. | No default value |
coreNum | No | The number of cores. | Determined by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. | Determined by the system |
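When the command is generated from a script, the parameters in the table above can be assembled programmatically. The following Python sketch only builds the command string and does not submit it. The helper `build_keywords_extraction_command` is hypothetical, and the parameter names it accepts are taken from the table.

```python
# Optional parameter names copied from the table above.
OPTIONAL_PARAMS = {
    "inputTablePartitions", "topN", "windowSize", "dumpingFactor",
    "maxIter", "epsilon", "lifecycle", "coreNum", "memSizePerCore",
}


def build_keywords_extraction_command(input_table, output_table, doc_id_col,
                                      doc_content_col, **optional):
    """Assemble a PAI KeywordsExtraction command string (hypothetical helper)."""
    unknown = set(optional) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")

    # Required parameters first, then any optional ones in the order given.
    params = {
        "inputTableName": input_table,
        "docIdCol": doc_id_col,
        "docContent": doc_content_col,
        "outputTableName": output_table,
    }
    params.update(optional)

    flags = " ".join(f"-D{name}={value}" for name, value in params.items())
    return f"PAI -name KeywordsExtraction {flags};"


if __name__ == "__main__":
    print(build_keywords_extraction_command(
        "maple_test_keywords_basic_input",
        "maple_test_keywords_basic_output",
        "docid",
        "word",
        topN=19,
    ))
```

Rejecting unknown names catches a misspelled parameter before the command is run. Note that the damping parameter is spelled dumpingFactor in the table, so the sketch uses that spelling.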
Example
- Input data
Separate words in the input table with spaces, and filter out stop words such as "of" and all punctuation marks.
docid:string | word:string |
---|---|
doc0 | The blended-wing-body aircraft is a new direction for the future development in the aviation field Many research institutions inside and outside China have carried out research on the blended-wing-body aircraft while its fully automated shape optimization algorithm has become a new hot topic Based on the existing research achievements inside and outside China common modeling and flow solver tools have been analyzed and compared The geometric modeling grid flow field solver and shape optimization modules have been designed The pros and cons between different algorithms have been compared to achieve the optimized shape of the blended-wing-body aircraft in the conceptual design stage Geometric modeling and grid generation module are achieved based on the transfinite interpolation algorithm and spline based grid generation method The flow solver module includes the finite difference solver the finite element solver and the panel method solver The finite difference solver includes mathematical modeling of the potential flow the derivation of the Cartesian grid based variable step length difference scheme Cartesian grid generation and indexing algorithm the Cartesian grid based Neumann boundary conditions expression form derivation are achieved based on finite element difference solver The aerodynamic parameters of a two-dimensional airfoil are calculated based on the finite difference solver The finite element solver includes potential flow modeling based on the variational principle of the finite element theory the derivation of the two-dimensional finite element Kutta conditional least squares based speed solving algorithm Gmsh based two-dimensional field grid generator of airfoil with wakes design The aerodynamic parameters of a two-dimensional airfoil are calculated based on the finite element solver The panel method solver includes modeling and automatic wake generation the design of the three-dimensional flow solver of the blended-wing-body drag estimation based on the Blasius solution solver implemented in the Fortran language a mixed compilation of Python and Fortran OpenMP and CUDA based acceleration algorithm The aerodynamic parameters of a three-dimensional wing body are calculated based on the panel method solver The shape optimization module includes free form deformation algorithm genetic algorithms differential evolution algorithm Aircraft surface area calculation algorithm is based on the moments integration algorithm The volume of an aircraft calculation algorithm is based on VKT data visualization format tool |
- PAI command
PAI -name KeywordsExtraction -DinputTableName=maple_test_keywords_basic_input -DdocIdCol=docid -DdocContent=word -DoutputTableName=maple_test_keywords_basic_output -DtopN=19;
- Output description
docid | keywords | weight |
---|---|---|
doc0 | based on | 0.041306752223538405 |
doc0 | algorithm | 0.03089845626854151 |
doc0 | modeling | 0.021782865850562882 |
doc0 | grid | 0.020669749212693957 |
doc0 | solver | 0.020245609506360847 |
doc0 | aircraft | 0.019850761705313365 |
doc0 | research | 0.014193732541852615 |
doc0 | finite element | 0.013831122054200538 |
doc0 | solving | 0.012924593244133104 |
doc0 | module | 0.01280216562287212 |
doc0 | derivation | 0.011907588923852495 |
doc0 | shape | 0.011505456605632607 |
doc0 | difference | 0.011477831662367547 |
doc0 | flow | 0.010969269350293957 |
doc0 | design | 0.010830986516637251 |
doc0 | implementation | 0.010747536556701583 |
doc0 | two-dimensional | 0.010695570768457084 |
doc0 | development | 0.010527342662670088 |
doc0 | new | 0.010096978306668461 |