This topic describes the Semantic Vector Distance component provided by Machine Learning Designer.
The component finds the extension words or sentences of specified words or sentences based on precomputed semantic vectors, such as the word vectors generated by the Word2Vec component. The extensions of an item are the items whose vectors are closest to the vector of that item. For example, you can use the word vectors returned by the Word2Vec component to generate a list of the words that are most similar to a given word.
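The idea of "a set of vectors that are closest to a certain vector" can be illustrated with a minimal Python sketch. This is not the component's implementation. The function name `closest_vectors`, the toy vectors, and the choice to exclude the query item from its own results are assumptions made for illustration.

```python
import math

def closest_vectors(query_id, vectors, top_n=5):
    """Return the IDs of the top_n vectors closest to vectors[query_id]."""
    query = vectors[query_id]
    # Euclidean distance from the query vector to every other vector.
    # Assumption: the query item itself is not reported as its own neighbor.
    scored = [
        (math.dist(query, vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    scored.sort()  # smallest distance first
    return [other_id for _, other_id in scored[:top_n]]

# Made-up 2-dimensional vectors, for illustration only.
# Real word vectors would come from a component such as Word2Vec.
toy_vectors = {
    "hello": (1.0, 1.0),
    "hi": (1.1, 0.9),
    "man": (5.0, 5.0),
    "woman": (5.2, 4.9),
}
print(closest_vectors("hello", toy_vectors, top_n=2))
```

In this toy data, "hi" lies much closer to "hello" than the other words do, so it is listed first.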
Configure the component
You can configure the component by using Machine Learning Designer or running a Machine Learning Platform for AI command.
- Configure the component in Machine Learning Designer
Tab | Parameter | Description |
---|---|---|
Fields Setting | ID Column | The name of the ID column. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. The ID column contains the ID list imported by using the second input port. Each ID occupies one cell. Example IDs: 1, 2, 4, 6, 8. |
Fields Setting | Vector Columns | The names of columns that contain vectors. Example: f1,f2. |
Parameters Setting | Number of Closest Vectors to Output | The number of the closest vectors in the output. Default value: 5. |
Parameters Setting | Distance Calculation Mode | The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan. Default value: euclidean. |
Parameters Setting | Distance Threshold | The threshold for the distance between vectors. A pair of vectors is included in the output only if the distance between them is less than this value. Default value: +∞. |
Tuning | Computing Cores | The number of cores used for calculation. The value is automatically allocated. |
Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. The value is automatically allocated. |
- Configure the component by using a Machine Learning Platform for AI command
PAI -name SemanticVectorDistance -project algo_public -DinputTableName="test_input" -DoutputTableName="test_output" -DidColName="word" -DvectorColNames="f0,f1,f2,f3,f4,f5" -Dlifecycle=30
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | None |
inputTablePartitions | No | The partitions selected from the input table for calculation. | All partitions |
outputTableName | Yes | The name of the output table. | None |
idTableName | No | The name of the vector ID table for vector calculation. The table contains only a single column, and each row stores a vector ID. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. | None |
idTablePartitions | No | The partitions selected from the ID table for calculation. | All partitions |
idColName | Yes | The name of the ID column. | 3 |
vectorColNames | No | The names of columns that contain vectors. Example: f1,f2. | None |
topN | No | The number of the closest vectors in the output. Valid values: [1,+∞). | 5 |
distanceType | No | The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan. | euclidean |
distanceThreshold | No | The threshold for the distance between vectors. A pair of vectors is included in the output only if the distance between them is less than this value. Valid values: (0,+∞). | +∞ |
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | None |
coreNum | No | The number of cores used for calculation. The value must be a positive integer. | Determined by the system |
memSizePerCore | No | The memory size of each core. The value must be a positive integer. | Determined by the system |
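The three distance calculation modes can be sketched in plain Python as follows. This is an illustrative sketch, not the component's code. The function name `vector_distance` is hypothetical, and the definition of cosine distance as 1 minus cosine similarity is an assumption, because this topic does not state the exact formula that the component uses.

```python
import math

def vector_distance(a, b, mode="euclidean"):
    """Distance between two equal-length vectors under the given mode."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if mode == "euclidean":
        # Square root of the sum of squared coordinate differences.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if mode == "manhattan":
        # Sum of absolute coordinate differences.
        return sum(abs(x - y) for x, y in zip(a, b))
    if mode == "cosine":
        # Assumption: cosine distance = 1 - cosine similarity.
        # Raises ZeroDivisionError if either vector has zero length.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)
    raise ValueError(f"unknown mode: {mode}")

print(vector_distance((0, 0), (3, 4), "euclidean"))
print(vector_distance((0, 0), (3, 4), "manhattan"))
print(vector_distance((1, 0), (0, 1), "cosine"))
```

Euclidean and Manhattan distances depend on vector magnitude, whereas the cosine variant shown here depends only on direction, so the choice of mode can change which vectors count as closest.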
Example
The output table contains the following four columns: original_id, near_id, distance, and rank.
original_id | near_id | distance | rank |
---|---|---|---|
hello | hi | 0.2 | 1 |
hello | xxx | xx | 2 |
Man | Woman | 0.3 | 1 |
Man | xx | xx | 2 |
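To make the output layout concrete, the following Python sketch builds rows of the form (original_id, near_id, distance, rank) from a small set of made-up vectors. It is a sketch under stated assumptions, not the component's implementation. The function name `nearest_rows` and the sample vectors are hypothetical, Euclidean distance is used, `top_n` and `distance_threshold` mirror the topN and distanceThreshold parameters, and the assumption that rank starts at 1 for the closest vector follows the sample table.

```python
import math

def nearest_rows(ids, vectors, top_n=5, distance_threshold=math.inf):
    """Build (original_id, near_id, distance, rank) rows for each ID in ids."""
    rows = []
    for original_id in ids:
        candidates = []
        for near_id, vec in vectors.items():
            if near_id == original_id:
                continue  # assumption: an item is not its own neighbor
            d = math.dist(vectors[original_id], vec)
            # Keep a pair only if its distance is below the threshold.
            if d < distance_threshold:
                candidates.append((d, near_id))
        candidates.sort()  # closest first
        # Rank 1 is the closest vector, rank 2 the next closest, and so on.
        for rank, (d, near_id) in enumerate(candidates[:top_n], start=1):
            rows.append((original_id, near_id, d, rank))
    return rows

# Made-up vectors, for illustration only.
sample_vectors = {
    "hello": (0, 0),
    "hi": (3, 4),
    "Man": (10, 10),
    "Woman": (10, 11),
}
for row in nearest_rows(["hello", "Man"], sample_vectors, top_n=2, distance_threshold=6):
    print(row)
```

Passing a list of IDs plays the role of the optional ID table: only the listed items are used as query vectors, while neighbors are drawn from all vectors.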