This topic describes the Word2Vec component provided by Machine Learning Designer.

The Word2Vec component trains a neural network to map each word to a vector in a K-dimensional space. The component supports operations on these vectors that reflect the semantics of the corresponding words. The input is a word column or a vocabulary, and the output is a vector table and a vocabulary.
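
As an illustration of what an operation on word vectors can look like, the following minimal Python sketch compares vectors by cosine similarity. The choice of cosine similarity, the helper names, and the three-dimensional vectors are all assumptions made for illustration. They are not taken from the component, whose output vectors have as many dimensions as you configure.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical vectors, made up for illustration only.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word`."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )
```

With these made-up vectors, `most_similar("king", vectors)` picks the word whose vector points in nearly the same direction, which is the sense in which nearby vectors are read as semantically related.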

Usage notes

The Word2Vec component must be connected to the Word Frequency Statistics component as a downstream node.
Note: The Word Frequency Statistics component generates triple tables that contain words and word statistics. When it is connected as an upstream node, the Word2Vec component reads the data that it generates, converts the data to single words, and then processes all the data as one document.

Configure the component

You can use one of the following methods to configure the Word2Vec component:

Method 1: Configure the component on the pipeline configuration tab in the console

Configure the component on the pipeline configuration tab of Machine Learning Designer in the Machine Learning Platform for AI (PAI) console. The following table describes the component parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Word Column | The word column used for training. |
| Parameters Setting | Word Feature Dimension | The number of dimensions of the word vectors. Valid values: 0 to 1000. Default value: 100. |
| Parameters Setting | Language Model | The language model used for training. Valid values: Skip-gram and Cbow. Default value: Skip-gram. |
| Parameters Setting | Word Window Size | The window size of words. The value must be a positive integer. Default value: 5. |
| Parameters Setting | Random Window | Specifies whether to use a random window. By default, Random Window is selected. |
| Parameters Setting | Minimum Word Truncation Frequency | The minimum frequency of words for truncation. The value must be a positive integer. Default value: 5. |
| Parameters Setting | Hierarchical Softmax | Specifies whether to use hierarchical softmax. By default, Hierarchical Softmax is selected. |
| Parameters Setting | Negative Sampling | The window size of negative sampling. The default value is 0, which indicates that negative sampling is disabled. |
| Parameters Setting | Downsampling Threshold | The threshold for downsampling. The default value is 0, which indicates that downsampling is disabled. |
| Parameters Setting | Initial Learning Rate | The initial learning rate. The value must be greater than 0. Default value: 0.025. |
| Parameters Setting | Number of Iterations | The number of iterations. The value must be greater than or equal to 1. Default value: 1. |
| Tuning | Number of Cores | The number of cores. By default, the system determines the value. |
| Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |

Method 2: Run PAI commands

Configure the Word2Vec component by using a PAI command. You can use the SQL Script component to run PAI commands. For more information, see SQL Script. A sample command is shown first, and the table after it describes the parameters of the command.
pai -name Word2Vec
    -project algo_public
    -DinputTableName=w2v_input
    -DwordColName=word
    -DoutputTableName=w2v_output;
| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input vocabulary. | None |
| inputTablePartitions | No | The names of the partitions used for word segmentation in the input vocabulary. This value must be in the partition_name=value format. To specify a multi-level partition, use the following format: name1=value1/name2=value2. When you specify multiple partitions, separate them with commas (,). | None |
| wordColName | Yes | The name of the word column. Each cell in the word column contains only a single word. The </s> tag indicates a line feed. | None |
| inVocabularyTableName | No | The output of the wordcount operation that is performed on the input vocabulary. | The result of the wordcount operation that the system performs on the output table |
| inVocabularyPartitions | No | The names of the partitions in the output after a wordcount operation is performed on the input vocabulary. | All partitions in the output of inVocabularyTableName |
| layerSize | No | The number of dimensions of the word vectors. Valid values: 0 to 1000. | 100 |
| cbow | No | The language model used for training. Valid values: 0 and 1. A value of 0 indicates the skip-gram model, and a value of 1 indicates the CBOW model. | 0 |
| window | No | The window size of words. The value must be a positive integer. | 5 |
| minCount | No | The minimum frequency of words for truncation. The value must be a positive integer. | 5 |
| hs | No | Specifies whether to use hierarchical softmax. Valid values: 0 and 1. A value of 0 indicates that hierarchical softmax is not used, and a value of 1 indicates that hierarchical softmax is used. | 1 |
| negative | No | The window size of negative sampling. Valid values: 0 and positive integers. A value of 0 indicates that negative sampling is disabled. | 0 |
| sample | No | The threshold for downsampling. Valid values: 1e-5 to 1e-3. A value of 0 indicates that downsampling is disabled. | 0 |
| alpha | No | The initial learning rate. The value must be greater than 0. | 0.025 |
| iterTrain | No | The number of iterations. The value must be greater than or equal to 1. | 1 |
| randomWindow | No | Specifies how the size of the word window is determined. Valid values: 0 and 1. A value of 0 indicates that the size is specified by the window parameter, and a value of 1 indicates that the size is a random value from 1 to 5. | 1 |
| outVocabularyTableName | No | The name of the output vocabulary. | None |
| outputTableName | Yes | The name of the output vector table. | None |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | None |
| coreNum | No | The number of cores. This parameter and the memSizePerCore parameter take effect only when they are both set. The value must be a positive integer. | Automatically allocated |
| memSizePerCore | No | The memory size of each core. This parameter and the coreNum parameter take effect only when they are both set. The value must be a positive integer. | Automatically allocated |
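
To show how the optional parameters combine with the required ones, the following Python sketch assembles a command string and checks a few of the constraints listed above. The function `build_word2vec_command` is a hypothetical helper written for this example and is not part of PAI. The table names and option values are illustrative, and the checks cover only a subset of the documented constraints.

```python
def build_word2vec_command(input_table, word_col, output_table, **options):
    """Assemble a `pai -name Word2Vec` command string (hypothetical helper)."""
    # These checks mirror constraints from the parameter table.
    if "layerSize" in options and not 0 <= options["layerSize"] <= 1000:
        raise ValueError("layerSize must be in the range 0 to 1000")
    for flag in ("cbow", "hs", "randomWindow"):
        if flag in options and options[flag] not in (0, 1):
            raise ValueError(f"{flag} must be 0 or 1")
    # coreNum and memSizePerCore take effect only when both are set.
    if ("coreNum" in options) != ("memSizePerCore" in options):
        raise ValueError("set coreNum and memSizePerCore together")

    params = {
        "inputTableName": input_table,
        "wordColName": word_col,
        "outputTableName": output_table,
        **options,
    }
    lines = ["pai -name Word2Vec", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


# Illustrative values only: CBOW model, 200 dimensions, lower minCount.
command = build_word2vec_command(
    "w2v_input", "word", "w2v_output",
    layerSize=200, cbow=1, window=5, minCount=2,
)
print(command)
```

The generated string can be pasted into an SQL Script component. Validating in the helper catches mistakes such as setting coreNum without memSizePerCore before the job is submitted.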

FAQ

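The error message "Vocab size is zero! vocab_size: 0" is reported if the dictionary is empty, for example, when every word in the input occurs less frequently than the value of the minCount parameter. To resolve the issue, set the minCount parameter to a smaller value.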
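
The following Python sketch illustrates how a frequency cutoff can leave the dictionary empty on a small input. It assumes that words occurring fewer than minCount times are dropped. The function `build_vocabulary` and the sample corpus are hypothetical and do not reproduce the component's internal logic.

```python
from collections import Counter


def build_vocabulary(words, min_count):
    """Keep only the words that occur at least `min_count` times (assumed rule)."""
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical corpus: no word reaches the default cutoff of 5.
corpus = ["cat", "dog", "cat", "bird", "dog", "cat"]

vocab_default = build_vocabulary(corpus, min_count=5)  # empty dictionary
vocab_lowered = build_vocabulary(corpus, min_count=2)  # "cat" and "dog" remain
```

On small inputs, the default cutoff of 5 can remove every word, which is why lowering minCount resolves the error.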