This topic describes the N-gram Counting component provided by Machine Learning Designer (formerly known as Machine Learning Studio).
N-gram counting is a step in language model training. The component generates N-grams from the words in the input text and counts how many times each N-gram occurs across the entire corpus. The result is the total count over all documents, not a per-document count. For more information, see ngram-count.
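The corpus-wide counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the component's implementation: the function name `count_ngrams`, whitespace tokenization, and counting every length from 1 up to the maximum are assumptions made for the example, and sentence boundary markers are omitted.

```python
from collections import Counter

def count_ngrams(sentences, max_order=3):
    """Count N-grams of length 1..max_order over all sentences together."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: words are separated by whitespace
        for n in range(1, max_order + 1):
            # Slide a window of n words over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

corpus = ["a b c", "a b"]
counts = count_ngrams(corpus, max_order=3)
for ngram, count in sorted(counts.items()):
    print(ngram, count)
```

Because a single counter is shared across all sentences, the bigram `a b` is reported once with the total of its occurrences in both sentences, rather than once per sentence.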
Configure the component
You can use one of the following methods to configure the N-gram Counting component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the N-gram Counting component on the pipeline page of Machine Learning Designer of Platform for AI (PAI). The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Column of Sentences in Input Table | The column that contains the sentences in the input table. |
Fields Setting | Column of Words in the Bag-of-Words | The column that contains the words in the bag of words. |
Fields Setting | Words Column in Input Counting Result Table | The word column in the input counting result table. |
Fields Setting | Count Column in Input Counting Result Table | The count column in the input counting result table. |
Fields Setting | Sentence Weight Column | The column that contains weights of input sentences. |
Parameters Setting | Maximum N-gram Length | The maximum length of N-grams. Default value: 3. |
Tuning | Optional. The number of cores. | The number of cores. By default, the system determines the value. |
Tuning | Optional. Memory size per core. | The memory size of each core. By default, the system determines the value. Unit: MB. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name ngram_count
-project algo_public
-DinputTableName=pai_ngram_input
-DoutputTableName=pai_ngram_output
-DinputSelectedColNames=col0
-DweightColName=weight
-DcoreNum=2
-DmemSizePerCore=1000;
Parameter | Required | Default value | Description |
inputTableName | Yes | No default value | The name of the input table. |
outputTableName | Yes | No default value | The name of the output table. |
inputSelectedColNames | No | Name of the first STRING column | The names of the columns selected from the input table. |
weightColName | No | 1 | The name of the weight column. If you do not specify this parameter, each sentence has a weight of 1. |
inputTablePartitions | No | All partitions | The partitions selected from the input table. |
countTableName | No | No default value | The name of a previously generated N-gram counting output table. The counts in this table are merged into the output result. |
countWordColName | No | Second column | The name of the word column in the counting table. |
countCountColName | No | Third column | The name of the count column in the counting table. |
countTablePartitions | No | No default value | The partitions in the counting table. |
vocabTableName | No | No default value | The name of the bag-of-words table. The words that are not contained in the bag of words are marked as <unk>. |
vocabSelectedColName | No | First STRING column | The name of the column that contains the words in the bag of words. |
vocabTablePartitions | No | No default value | The partitions in the bag-of-words table. |
order | No | 3 | The maximum length of N-grams. |
lifecycle | No | No default value | The lifecycle of the output table. |
coreNum | No | No default value | The number of cores. |
memSizePerCore | No | No default value | The memory size for each core. Unit: MB. |
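To make the interaction between the weight, bag-of-words, and counting-table parameters more concrete, the following Python sketch mimics what these options could do on in-memory data. It is a hypothetical illustration under stated assumptions, not the component's actual algorithm: the function `count_ngrams`, its arguments, and the way weights are added to counts are invented for this example and only loosely mirror `weightColName`, `vocabTableName`, `countTableName`, and `order`.

```python
from collections import Counter

UNK = "<unk>"  # marker for words that are not in the bag of words

def count_ngrams(rows, order=3, vocab=None, previous=None):
    """rows: iterable of (sentence, weight) pairs (hypothetical input shape)."""
    # Start from previously generated counts so they are merged into the result.
    counts = Counter(previous or {})
    for sentence, weight in rows:
        tokens = sentence.split()  # assumption: whitespace-separated words
        if vocab is not None:
            # Replace out-of-vocabulary words with the <unk> marker.
            tokens = [t if t in vocab else UNK for t in tokens]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                # Assumption: each occurrence contributes the sentence weight.
                counts[" ".join(tokens[i:i + n])] += weight
    return counts

rows = [("a b z", 2), ("a b", 1)]
result = count_ngrams(rows, order=2, vocab={"a", "b"}, previous={"a b": 5})
for ngram, count in sorted(result.items()):
    print(ngram, count)
```

In this sketch the word `z` is not in the vocabulary, so it is counted only under `<unk>`, and the earlier count for `a b` is carried over and increased by the weighted occurrences in the new rows.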