During training, the Doc2Vec component treats the document ID as a word in the document. The component represents each document as a sentence vector and obtains word vectors by using the document ID as context. You can use the Doc2Vec component to map articles to vectors. The input is a vocabulary table. The outputs are a document vector table, a word vector table, and optionally a vocabulary table. This topic describes how to configure the Doc2Vec component provided by Platform for AI (PAI).
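To make the idea of treating a document ID as a word more concrete, the following minimal Python sketch builds (context, target) training pairs in which the document ID is added to every context window. It is an illustration under stated assumptions, not the PAI implementation. The function name `training_pairs` and the sample data are hypothetical.

```python
from typing import Iterator, List, Tuple


def training_pairs(
    doc_id: str, words: List[str], window: int = 5
) -> Iterator[Tuple[List[str], str]]:
    """Yield (context, target) pairs with the document ID in each context.

    Illustrative assumption: the document ID behaves like an extra word
    that is present in every context window of its document.
    """
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]      # up to `window` words before the target
        right = words[i + 1:i + 1 + window]     # up to `window` words after the target
        yield [doc_id] + left + right, target


# Hypothetical sample document, already segmented into words.
pairs = list(
    training_pairs("doc_1", ["pai", "maps", "articles", "to", "vectors"], window=1)
)
for context, target in pairs:
    print(context, "->", target)
```

Because the document ID appears in every window of its document, the vector learned for it can serve as a summary of the whole document.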
Limits
The Doc2Vec component runs on the computing resources of MaxCompute.
Configure the component
You can use one of the following methods to configure the Doc2Vec component:
Method 1: Configure the component in the PAI console
You can configure the parameters of the Doc2Vec component on the pipeline page of Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Document ID Column | The name of the document ID column that is used for training. |
Fields Setting | Document Content | The column that contains the words used for training. The words must be separated by spaces. |
Parameters Setting | Dimensions of Word Features | The number of dimensions of the word features. Valid values: 0 to 1000. Default value: 100. |
Parameters Setting | Language Model | The language model used for training. Valid values: the skip-gram model and the CBOW model. |
Parameters Setting | Word Window Size | The window size of words. The value must be a positive integer. Default value: 5. |
Parameters Setting | Minimum Frequency of Words | The minimum frequency of words for truncation. The value must be a positive integer. Default value: 5. |
Parameters Setting | Hierarchical Softmax | Specifies whether to use hierarchical softmax. By default, Hierarchical Softmax is selected. |
Parameters Setting | Negative Sampling | The window size of negative sampling. The value must be 0 or a positive integer. Default value: 5. A value of 0 indicates that negative sampling is disabled. |
Parameters Setting | Downsampling Threshold | The threshold for downsampling. Valid values: 1e-5 to 1e-3. Default value: 1e-3. A value of 0 indicates that downsampling is disabled. |
Parameters Setting | Initial Learning Rate | The initial learning rate. The value must be greater than 0. Default value: 0.025. |
Parameters Setting | Training Iterations | The number of training iterations. The value must be greater than or equal to 1. Default value: 1. |
Parameters Setting | Use Random Window | Specifies how the word window size is determined. Valid values: A Random Value Between 1 to 5 and Specified by the Window Parameter. Default value: Specified by the Window Parameter. |
Tuning | Number of Computing Cores | The number of computing cores. By default, the system determines the value. |
Tuning | Memory Size per Core (MB) | The memory size of each core. Unit: MB. By default, the system determines the value. |
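The Document Content field expects words that are separated by spaces. The following Python sketch shows one possible way to turn segmented documents into (document ID, content) rows before you load them into the input table. The helper name `to_input_rows` and the sample data are hypothetical, and loading the rows into MaxCompute is outside the scope of the sketch.

```python
from typing import Dict, List, Tuple


def to_input_rows(docs: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Turn {doc_id: [word, ...]} into (doc_id, space-separated words) rows."""
    rows = []
    for doc_id, words in docs.items():
        # Strip stray whitespace and drop empty tokens so that the
        # content column holds exactly one space between words.
        cleaned = [w.strip() for w in words if w.strip()]
        rows.append((doc_id, " ".join(cleaned)))
    return rows


# Hypothetical, already segmented documents.
rows = to_input_rows({
    "doc_1": ["machine", "learning", " designer "],
    "doc_2": ["word", "", "vector"],
})
for row in rows:
    print(row)
```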
Method 2: Configure the component by using PAI commands
You can configure the component by using a PAI command. You can use SQL scripts to call PAI commands. For more information, see SQL Script. The following sample command is followed by a table that describes the parameters.
PAI -name pai_doc2vec
-project algo_public
-DinputTableName="d2v_input"
-DdocIdColName="docid"
-DdocColName="text_seg"
-DoutputWordTableName="d2v_word_output"
-DoutputDocTableName="d2v_doc_output";
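If you generate the command from code, a small helper can assemble it from a parameter dictionary. The following Python sketch is one possible approach, not an official client. The function name `build_doc2vec_command` is hypothetical, and the sketch assumes that every value can be passed as a double-quoted string, as in the sample command above.

```python
from typing import Dict, Union

# Parameters that this topic marks as required.
REQUIRED = (
    "inputTableName",
    "docIdColName",
    "docColName",
    "outputWordTableName",
    "outputDocTableName",
)


def build_doc2vec_command(params: Dict[str, Union[str, int, float]]) -> str:
    """Assemble a pai_doc2vec command string from a parameter dictionary."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = ["PAI -name pai_doc2vec", "-project algo_public"]
    for name, value in params.items():
        # Assumption: each value is passed in double quotes.
        lines.append(f'-D{name}="{value}"')
    return "\n".join(lines) + ";"


command = build_doc2vec_command({
    "inputTableName": "d2v_input",
    "docIdColName": "docid",
    "docColName": "text_seg",
    "outputWordTableName": "d2v_word_output",
    "outputDocTableName": "d2v_doc_output",
})
print(command)
```

The helper only builds the text of the command. Submitting it, for example through an SQL script, is left to your own tooling.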
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input vocabulary table. | N/A |
inputTablePartitions | No | The names of the partitions used for word segmentation in the input vocabulary table. Format: | N/A |
docIdColName | Yes | The name of the document ID column that is used for training. | N/A |
docColName | Yes | The name of the column that contains the words used for training. The words must be separated by spaces. | N/A |
layerSize | No | The number of dimensions of the word features. Valid values: 0 to 1000. | 100 |
cbow | No | The language model used for training. Valid values: 0 and 1. A value of 0 indicates the skip-gram model, and a value of 1 indicates the CBOW model. | 0 |
window | No | The window size of words. The value must be a positive integer. | 5 |
minCount | No | The minimum frequency of words for truncation. The value must be a positive integer. | 5 |
hs | No | Specifies whether to use hierarchical softmax. Valid values: 0 and 1. A value of 0 indicates that hierarchical softmax is not used, and a value of 1 indicates that hierarchical softmax is used. | 1 |
negative | No | The window size for negative sampling. The value must be 0 or a positive integer. A value of 0 indicates that negative sampling is disabled. | 5 |
sample | No | The threshold for downsampling. Valid values: 1e-5 to 1e-3. A value of 0 indicates that downsampling is disabled. | 1e-3 |
alpha | No | The initial learning rate. The value must be greater than 0. | 0.025 |
iterTrain | No | The number of training iterations. The value must be greater than or equal to 1. | 1 |
randomWindow | No | Specifies how the word window size is determined. Valid values: 0 and 1. A value of 0 indicates that the size is specified by the window parameter, and a value of 1 indicates a random value from 1 to 5. | 1 |
outVocabularyTableName | No | The name of the output vocabulary table. | N/A |
outputWordTableName | Yes | The name of the output word vector table. | N/A |
outputDocTableName | Yes | The name of the output document vector table. | N/A |
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | N/A |
coreNum | No | The number of cores. This parameter takes effect only when the memSizePerCore parameter is also configured. The value must be a positive integer. | Automatically allocated |
memSizePerCore | No | The memory size of each core. Unit: MB. This parameter takes effect only when the coreNum parameter is also configured. The value must be a positive integer. | Automatically allocated |
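The constraints in the preceding table can be checked before a command is submitted. The following Python sketch covers the optional numeric parameters. It is a convenience check based only on the ranges listed in this topic, and the function name `check_doc2vec_params` is hypothetical. The component itself may apply different or additional validation.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def _is_int(value: object) -> bool:
    """True for real integers; bool is excluded on purpose."""
    return isinstance(value, int) and not isinstance(value, bool)


def check_doc2vec_params(params: Dict[str, Number]) -> List[str]:
    """Return a list of problems; an empty list means no violation was found."""
    problems = []
    p = params

    if "layerSize" in p and not (_is_int(p["layerSize"]) and 0 <= p["layerSize"] <= 1000):
        problems.append("layerSize must be an integer from 0 to 1000")
    for flag in ("cbow", "hs", "randomWindow"):
        if flag in p and not (_is_int(p[flag]) and p[flag] in (0, 1)):
            problems.append(f"{flag} must be 0 or 1")
    for name in ("window", "minCount", "lifecycle", "coreNum", "memSizePerCore"):
        if name in p and not (_is_int(p[name]) and p[name] > 0):
            problems.append(f"{name} must be a positive integer")
    # A value of 0 disables negative sampling, so 0 is accepted here.
    if "negative" in p and not (_is_int(p["negative"]) and p["negative"] >= 0):
        problems.append("negative must be 0 or a positive integer")
    # A value of 0 disables downsampling; otherwise the range is 1e-5 to 1e-3.
    if "sample" in p and not (p["sample"] == 0 or 1e-5 <= p["sample"] <= 1e-3):
        problems.append("sample must be 0 or from 1e-5 to 1e-3")
    if "alpha" in p and not p["alpha"] > 0:
        problems.append("alpha must be greater than 0")
    if "iterTrain" in p and not (_is_int(p["iterTrain"]) and p["iterTrain"] >= 1):
        problems.append("iterTrain must be an integer greater than or equal to 1")
    # coreNum and memSizePerCore take effect only when both are configured.
    if ("coreNum" in p) != ("memSizePerCore" in p):
        problems.append("coreNum and memSizePerCore must be configured together")
    return problems


print(check_doc2vec_params({"layerSize": 100, "window": 5, "sample": 1e-3}))
print(check_doc2vec_params({"layerSize": 2000, "coreNum": 4}))
```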
References
For information about Machine Learning Designer, see Overview of Machine Learning Designer.