Platform for AI: Word2Vec

Last Updated: Dec 18, 2024

This topic describes the Word2Vec component provided by Machine Learning Designer.

The Word2Vec component uses a neural network to map words to vectors in a K-dimensional space by training on a large corpus. Operations on the resulting vectors reflect the semantics of the corresponding words. The input is a word column or a vocabulary, and the output is a vector table and a vocabulary.
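To illustrate what "operations on the vectors" can look like, the following minimal sketch compares word vectors by cosine similarity. The three-dimensional vectors and the helper names cosine_similarity and most_similar are hypothetical and chosen only for illustration. They are not output of the component, which produces K-dimensional vectors (100 dimensions by default).

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real component output.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word`."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )


print(most_similar("king", vectors))
```

Words that appear in similar contexts are expected to end up with a higher cosine similarity than unrelated words.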

Usage notes

The Word2Vec component must be connected to the Word Frequency Statistics component as a downstream node.

Note

The Word Frequency Statistics component generates triple tables that contain words and word statistics. You can connect the Word Frequency Statistics component as an upstream node of the Word2Vec component. Then, the Word2Vec component obtains data generated by the Word Frequency Statistics component, converts the data to single words, and then processes all the data as a document.

Configure the component

You can use one of the following methods to configure the Word2Vec component:

Method 1: Configure the component on the pipeline configuration tab in the console

Configure the component on the pipeline details page of Machine Learning Designer in the Platform for AI (PAI) console. The following table describes the component parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Word Column | The word column used for training. We recommend that the number of words does not exceed 10 million. |
| Parameters Setting | Word Feature Dimension | The number of dimensions of the word vectors. Valid values: 0 to 1000. Default value: 100. |
| Parameters Setting | Language Model | The language model used for training. Valid values: Skip-gram and Cbow. Default value: Skip-gram. |
| Parameters Setting | Word Window Size | The window size of words. The value must be a positive integer. Default value: 5. |
| Parameters Setting | Random Window | Specifies whether to use a random window. By default, Random Window is selected. |
| Parameters Setting | Minimum Word Truncation Frequency | The minimum frequency of words for truncation. The value must be a positive integer. Default value: 5. |
| Parameters Setting | Hierarchical Softmax | Specifies whether to use hierarchical softmax. By default, Hierarchical Softmax is selected. |
| Parameters Setting | Negative Sampling | The window size of negative sampling. The default value is 0, which indicates that negative sampling is disabled. |
| Parameters Setting | Downsampling Threshold | The threshold for downsampling. The default value is 0, which indicates that downsampling is disabled. |
| Parameters Setting | Initial Learning Rate | The initial learning rate. The value must be greater than 0. Default value: 0.025. |
| Parameters Setting | Number of Iterations | The number of iterations. The value must be greater than or equal to 1. Default value: 1. |
| Tuning | Number of Cores | The number of cores. By default, the system determines the value. |
| Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
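The effect of Word Window Size and Random Window can be sketched as follows. This is a minimal illustration of how skip-gram (center, context) pairs are commonly generated in word2vec-style training. It assumes the common convention that a random window draws a size between 1 and the configured window for each center word. The function name skipgram_pairs is hypothetical, and the component's internal implementation may differ.

```python
import random


def skipgram_pairs(words, window=5, random_window=True, rng=None):
    """Generate (center, context) pairs for a list of words.

    Assumption: with random_window=True, the effective window for each
    center word is drawn uniformly from 1..window.
    """
    rng = rng or random.Random()
    pairs = []
    for i, center in enumerate(words):
        span = rng.randint(1, window) if random_window else window
        lo = max(0, i - span)
        hi = min(len(words), i + span + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


# Fixed window of size 1: each word is paired with its direct neighbors.
print(skipgram_pairs(["a", "b", "c"], window=1, random_window=False))
```

With a random window, nearby words are sampled as context more often than distant ones, because small window sizes still include them.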

Method 2: Run PAI commands

Configure the Word2Vec component by using a PAI command. You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following sample command is followed by a table that describes the parameters of the PAI command.

pai -name Word2Vec
    -project algo_public
    -DinputTableName=w2v_input
    -DwordColName=word
    -DoutputTableName=w2v_output;
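If the command is generated by a script, it can be assembled from a set of parameters. The following Python sketch is one way to do that. The function build_word2vec_command is hypothetical and is not part of any PAI SDK. It only formats text and assumes that every parameter is passed in the -Dname=value form shown in the preceding command.

```python
def build_word2vec_command(input_table, word_col, output_table, **options):
    """Format a Word2Vec PAI command as text (hypothetical helper).

    Required parameters are positional. Optional parameters such as
    layerSize or cbow are passed as keyword arguments.
    """
    params = {
        "inputTableName": input_table,
        "wordColName": word_col,
        "outputTableName": output_table,
    }
    params.update(options)
    lines = ["pai -name Word2Vec", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


print(build_word2vec_command("w2v_input", "word", "w2v_output", layerSize=200, cbow=1))
```

The generated text can then be pasted into the SQL Script component.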

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input vocabulary. | None |
| inputTablePartitions | No | The names of the partitions used for word segmentation in the input vocabulary. The value must be in the partition_name=value format. To specify a multi-level partition, use the following format: name1=value1/name2=value2. If you specify multiple partitions, separate them with commas (,). | None |
| wordColName | Yes | The name of the word column. Each cell in the word column contains only a single word. The </s> tag indicates a line feed. | None |
| inVocabularyTableName | No | The output of the wordcount operation that is performed on the input vocabulary. | Wordcount operation that the system performs on the output table |
| inVocabularyPartitions | No | The names of the partitions in the output after a wordcount operation is performed on the input vocabulary. | All partitions in the output of inVocabularyTableName |
| layerSize | No | The number of dimensions of the word vectors. Valid values: 0 to 1000. | 100 |
| cbow | No | The language model used for training. Valid values: 0 and 1. A value of 0 indicates the skip-gram model, and a value of 1 indicates the CBOW model. | 0 |
| window | No | The window size of words. The value must be a positive integer. | 5 |
| minCount | No | The minimum frequency of words for truncation. The value must be a positive integer. | 5 |
| hs | No | Specifies whether to use hierarchical softmax. Valid values: 0 and 1. A value of 0 indicates that hierarchical softmax is not used, and a value of 1 indicates that hierarchical softmax is used. | 1 |
| negative | No | The window size of negative sampling. The value must be a non-negative integer. A value of 0 indicates that negative sampling is disabled. | 0 |
| sample | No | The threshold for downsampling. Valid values: 1e-5 to 1e-3. A value of 0 indicates that downsampling is disabled. | 0 |
| alpha | No | The initial learning rate. The value must be greater than 0. | 0.025 |
| iterTrain | No | The number of iterations. The value must be greater than or equal to 1. | 1 |
| randomWindow | No | Specifies how the word window size is determined. Valid values: 0 and 1. A value of 0 indicates that the size is specified by the window parameter, and a value of 1 indicates a random value from 1 to 5. | 1 |
| outVocabularyTableName | No | The name of the output vocabulary. | None |
| outputTableName | Yes | The name of the output vector table. | None |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | None |
| coreNum | No | The number of cores. This parameter and the memSizePerCore parameter take effect only when both are set. The value must be a positive integer. | Automatically allocated |
| memSizePerCore | No | The memory size of each core. This parameter and the coreNum parameter take effect only when both are set. The value must be a positive integer. | Automatically allocated |
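Before a command is submitted, the documented constraints can be checked on the client side. The following Python sketch encodes a subset of the rules from the parameter table. The function validate_word2vec_params is hypothetical and is not part of PAI. The service itself remains the authority on which values are accepted.

```python
def validate_word2vec_params(params):
    """Return messages for values that violate the documented constraints."""
    errors = []

    # Parameters marked as required in the table.
    for name in ("inputTableName", "wordColName", "outputTableName"):
        if not params.get(name):
            errors.append(f"{name} is required")

    # Flags that accept only 0 or 1.
    for name in ("cbow", "hs", "randomWindow"):
        if name in params and params[name] not in (0, 1):
            errors.append(f"{name} must be 0 or 1")

    # Parameters documented as positive integers.
    for name in ("window", "minCount", "lifecycle", "coreNum", "memSizePerCore"):
        if name in params:
            value = params[name]
            if not isinstance(value, int) or value < 1:
                errors.append(f"{name} must be a positive integer")

    if "layerSize" in params and not 0 <= params["layerSize"] <= 1000:
        errors.append("layerSize must be in the range 0 to 1000")
    if "alpha" in params and not params["alpha"] > 0:
        errors.append("alpha must be greater than 0")
    if "iterTrain" in params and params["iterTrain"] < 1:
        errors.append("iterTrain must be greater than or equal to 1")

    # coreNum and memSizePerCore take effect only when both are set.
    if ("coreNum" in params) != ("memSizePerCore" in params):
        errors.append("coreNum and memSizePerCore take effect only when both are set")

    return errors


base = {"inputTableName": "w2v_input", "wordColName": "word", "outputTableName": "w2v_output"}
print(validate_word2vec_params({**base, "cbow": 2, "coreNum": 4}))
```

An empty list means that none of the checked rules is violated.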

FAQ

The error message "Vocab size is zero! vocab_size: 0" is reported if the dictionary is empty. To resolve the issue, set the minCount parameter to a smaller value.
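The following sketch shows how a frequency cutoff can leave the dictionary empty. It assumes the usual word2vec behavior in which words that occur fewer than minCount times are dropped from the vocabulary. The function build_vocabulary is hypothetical and only imitates that filtering step.

```python
from collections import Counter


def build_vocabulary(words, min_count=5):
    """Keep only words that occur at least min_count times (assumed rule)."""
    counts = Counter(words)
    return {word: count for word, count in counts.items() if count >= min_count}


# A small corpus in which no word reaches the default cutoff of 5.
words = ["a"] * 3 + ["b"] * 2

print(len(build_vocabulary(words, min_count=5)))  # empty vocabulary
print(len(build_vocabulary(words, min_count=2)))  # both words are kept
```

On small input tables, the default minCount value of 5 can filter out every word, which is why lowering it resolves the error.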