This topic describes the Word2Vec component provided by Machine Learning Designer.
The Word2Vec component trains a neural network to map each word to a vector in a K-dimensional space. You can perform operations on the resulting vectors to explore the semantic relationships between words. The input is a word column or a vocabulary, and the output is a vector table and a vocabulary.
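One common operation on word vectors is cosine similarity, which scores how closely two words are related. The following Python snippet is a minimal sketch of that idea and is not part of the component. The three-dimensional vectors and the `most_similar` helper are hypothetical values chosen for illustration; real output vectors have K dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; a trained vector table would supply real values.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for other, score in most_similar("king", vectors):
    print(other, round(score, 3))
```

Words whose vectors point in similar directions receive scores close to 1, and unrelated words score lower.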
Usage notes
The Word2Vec component must be connected as a downstream node of the Word Frequency Statistics component.
Note: The Word Frequency Statistics component generates triple tables that contain words and word statistics. When it is connected as the upstream node, the Word2Vec component reads the data that it generates, converts the data to single words, and then processes all the data as one document.
Configure the component
You can use one of the following methods to configure the Word2Vec component:
Method 1: Configure the component on the pipeline configuration tab in the console
Configure the component on the pipeline configuration tab of Machine Learning Designer in the Machine Learning Platform for AI (PAI) console. The following table describes the component parameters.
Tab | Parameter | Description |
---|---|---|
Fields Setting | Word Column | The word column used for training. |
Parameters Setting | Word Feature Dimension | The number of dimensions of the word vectors. Valid values: 0 to 1000. Default value: 100. |
Parameters Setting | Language Model | The language model used for training. Valid values: Skip-gram and Cbow. Default value: Skip-gram. |
Parameters Setting | Word Window Size | The window size of words. The value must be a positive integer. Default value: 5. |
Parameters Setting | Random Window | Specifies whether to use a random window. By default, Random Window is selected. |
Parameters Setting | Minimum Word Truncation Frequency | The minimum frequency of words for truncation. The value must be a positive integer. Default value: 5. |
Parameters Setting | Hierarchical Softmax | Specifies whether to use hierarchical softmax. By default, Hierarchical Softmax is selected. |
Parameters Setting | Negative Sampling | The negative sampling size. The default value is 0, which indicates that negative sampling is disabled. |
Parameters Setting | Downsampling Threshold | The threshold for downsampling. The default value is 0, which indicates that downsampling is disabled. |
Parameters Setting | Initial Learning Rate | The initial learning rate. The value must be greater than 0. Default value: 0.025. |
Parameters Setting | Number of Iterations | The number of training iterations. The value must be greater than or equal to 1. Default value: 1. |
Tuning | Number of Cores | The number of cores. By default, the system determines the value. |
Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
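
To make the Word Window Size and Random Window parameters more concrete, the following Python sketch generates the (center word, context word) pairs that a skip-gram model would typically train on. It is an illustration only and does not reproduce the component's internal implementation. The `skipgram_pairs` function is hypothetical, and the sketch assumes that a random window is drawn uniformly between 1 and the configured window size.

```python
import random

def skipgram_pairs(words, window=5, random_window=True, rng=None):
    """Yield (center, context) pairs for a list of words.

    With random_window=True, the effective window for each center word is
    drawn uniformly from 1..window (an assumption made for this sketch).
    Otherwise the effective window is always `window`.
    """
    rng = rng or random.Random()
    for i, center in enumerate(words):
        span = rng.randint(1, window) if random_window else window
        start = max(0, i - span)
        stop = min(len(words), i + span + 1)
        for j in range(start, stop):
            if j != i:
                yield center, words[j]

sentence = ["the", "quick", "brown", "fox", "jumps"]

# Fixed window of 1: each word is paired only with its direct neighbors.
fixed = list(skipgram_pairs(sentence, window=1, random_window=False))
print(fixed)

# Random window of up to 2: the number of pairs varies from run to run.
sampled = list(skipgram_pairs(sentence, window=2, rng=random.Random(0)))
print(len(sampled))
```

A larger window pairs each word with more distant neighbors, and a random window gives nearby words more weight on average because they fall inside the window more often.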
Method 2: Run PAI commands
Configure the Word2Vec component by using a PAI command. You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following example shows a sample command, and the table after the example describes the command parameters.
pai -name Word2Vec
-project algo_public
-DinputTableName=w2v_input
-DwordColName=word
-DoutputTableName=w2v_output;
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input vocabulary. | None |
inputTablePartitions | No | The names of the partitions used for word segmentation in the input vocabulary. This value must be in the partition_name=value format. To specify a multi-level partition, use the following format: name1=value1/name2=value2 . When you specify multiple partitions, separate them with commas (,). | None |
wordColName | Yes | The name of the word column. Each cell in the word column contains only a single word. The </s> tag indicates a line feed. | None |
inVocabularyTableName | No | The name of the input vocabulary table, which is the output of the wordcount operation that is performed on the input table. | By default, the system performs a wordcount operation on the input table. |
inVocabularyPartitions | No | The names of the partitions in the output after a wordcount operation is performed on the input vocabulary. | All partitions in the output of inVocabularyTableName |
layerSize | No | The number of dimensions of the word vectors. Valid values: 0 to 1000. | 100 |
cbow | No | The language model used for training. Valid values: 0 and 1. A value of 0 indicates the skip-gram model, and a value of 1 indicates the CBOW model. | 0 |
window | No | The window size of words. The value must be a positive integer. | 5 |
minCount | No | The minimum frequency of words for truncation. The value must be a positive integer. | 5 |
hs | No | Specifies whether to use hierarchical softmax. Valid values: 0 and 1. A value of 0 indicates that hierarchical softmax is not used, and a value of 1 indicates that hierarchical softmax is used. | 1 |
negative | No | The negative sampling size. The value must be a non-negative integer. A value of 0 indicates that negative sampling is disabled. | 0 |
sample | No | The threshold for downsampling. Valid values: 1e-5 to 1e-3. A value of 0 indicates that downsampling is disabled. | 0 |
alpha | No | The initial learning rate. The value must be greater than 0. | 0.025 |
iterTrain | No | The number of training iterations. The value must be greater than or equal to 1. | 1 |
randomWindow | No | Specifies how the word window size is determined. Valid values: 0 and 1. A value of 0 indicates that the window size is the value of the window parameter, and a value of 1 indicates that the window size is a random value from 1 to 5. | 1 |
outVocabularyTableName | No | The name of the output vocabulary. | None |
outputTableName | Yes | The name of the output vector table. | None |
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | None |
coreNum | No | The number of cores. This parameter and the memSizePerCore parameter take effect only when they are both set. The value must be a positive integer. | Automatically allocated |
memSizePerCore | No | The memory size of each core. This parameter and the coreNum parameter take effect only when they are both set. The value must be a positive integer. | Automatically allocated |
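
If you generate PAI commands from a script, a small helper can assemble the command text from the parameters in the preceding table. The following Python snippet is a minimal sketch under that assumption. The `build_pai_command` function is hypothetical and is not part of PAI. It only builds the command string and does not submit it.

```python
def build_pai_command(name, project, params):
    """Render a PAI command string from a dict of -D parameters."""
    lines = [f"pai -name {name}", f"-project {project}"]
    for key, value in params.items():
        if value is None:
            continue  # skip optional parameters so their defaults apply
        lines.append(f"-D{key}={value}")
    return "\n".join(lines) + ";"

command = build_pai_command(
    "Word2Vec",
    "algo_public",
    {
        "inputTableName": "w2v_input",
        "wordColName": "word",
        "outputTableName": "w2v_output",
        "layerSize": 100,
        "cbow": 0,      # 0 selects the skip-gram model
        "window": 5,
        "minCount": 5,
        "lifecycle": None,  # omitted from the command
    },
)
print(command)
```

You can paste the printed text into the SQL Script component. Parameters that are set to `None` are left out so that the documented defaults take effect.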
FAQ
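The error message "Vocab size is zero! vocab_size: 0" is reported if the dictionary is empty. This can happen when every word in the input occurs fewer times than the value of the minCount parameter, so all words are truncated. To resolve the issue, set the minCount parameter to a smaller value.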
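
The following Python sketch illustrates how a frequency threshold can leave the dictionary empty. It is an illustration only, not the component's code. The `build_vocabulary` function and the sample corpus are hypothetical, and the sketch assumes that words that occur fewer than minCount times are dropped.

```python
from collections import Counter

def build_vocabulary(words, min_count=5):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n >= min_count}

# Hypothetical small corpus in which no word occurs five times.
corpus = ["data"] * 3 + ["model"] * 2 + ["vector"]

# With the default threshold of 5, every word is truncated.
print(len(build_vocabulary(corpus, min_count=5)))

# Lowering the threshold keeps the more frequent words.
print(build_vocabulary(corpus, min_count=2))
```

On a small input table, the default value of 5 can remove every word, which is why lowering minCount resolves the error.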