The Semantic Vector Distance algorithm assesses the similarity of words or text fragments in semantic space by calculating the distance between word vectors generated by word embedding models such as Word2Vec. Common distance measurement methods include Euclidean distance, cosine similarity, and Manhattan distance. This algorithm is widely used in natural language processing tasks, such as synonym generation, text similarity computation, and semantic search.
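The three measures named above can be sketched in plain Python. This is a minimal illustration of the standard formulas, not the component's internal implementation; the function names are hypothetical and chosen only for this example.

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [0.0, 0.0]
v = [3.0, 4.0]
print(euclidean_distance(u, v))
print(manhattan_distance(u, v))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Note that Euclidean and Manhattan distance shrink as vectors become more alike, whereas cosine similarity grows, so a threshold has the opposite direction for the two kinds of measure.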
Configure the component
Method 1: Configure the component on the pipeline page
Add a Semantic Vector Distance component on the pipeline page and configure the following parameters:
Category | Parameter | Description |
Fields Setting | ID Column | The name of the ID column. |
Fields Setting | Vector Columns | The names of columns that contain vectors. Example: f1,f2. |
Parameters Setting | Number of Closest Vectors to Output | The number of closest vectors to output. Default value: 5. |
Parameters Setting | Distance Calculation Mode | The method that is used to calculate the distance between vectors. Default value: Euclidean. |
Parameters Setting | Distance Threshold | The threshold for the distance between vectors. Only vectors with a distance less than or equal to this threshold are considered similar and output. Default value: +∞. For example, if a cosine similarity threshold is set at 0.8, only vectors with a similarity greater than or equal to 0.8 are regarded as similar vectors. |
Tuning | Computing Cores | The number of cores used for calculation. The value is automatically allocated. |
Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. The value is automatically allocated. |
Method 2: Use PAI commands
Configure the component by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name SemanticVectorDistance
-project algo_public
-DinputTableName="test_input"
-DoutputTableName="test_output"
-DidColName="word"
-DvectorColNames="f0,f1,f2,f3,f4,f5"
-Dlifecycle=30
Parameter | Required | Default value | Description |
inputTableName | Yes | None | The name of the input table. |
inputTablePartitions | No | All partitions | The partitions selected from the input table for calculation. |
outputTableName | Yes | None | The name of the output table. |
idTableName | No | None | The name of the vector ID table for vector calculation. The table contains only a single column, and each row stores a vector ID. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. |
idTablePartitions | No | None | The partitions selected from the ID table for calculation. By default, all partitions are selected for calculation. |
idColName | Yes | 3 | The name of the ID column. |
vectorColNames | No | None | The names of columns that contain vectors. Example: f1,f2. |
topN | No | 5 | The number of closest vectors to output. Valid values: [1,+∞). |
distanceType | No | euclidean | The method that is used to calculate the distance between vectors. |
distanceThreshold | No | +∞ | The threshold for the distance between vectors. A result is output only if the distance between two vectors is less than this value. Valid values: (0,+∞). |
lifecycle | No | None | The lifecycle of the output table. The value must be a positive integer. |
coreNum | No | Determined by the system | The number of cores used for calculation. The value must be a positive integer. |
memSizePerCore | No | Determined by the system | The memory size of each core. Unit: MB. The value must be a positive integer. |
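To make the interplay of topN and distanceThreshold concrete, the following Python sketch emulates the selection logic on a toy in-memory dictionary. It is an illustration under stated assumptions, not the component's implementation: the function name nearest_vectors is hypothetical, Euclidean distance is used, a vector is assumed not to be reported as its own neighbor, the threshold is treated as inclusive, and ranks are assumed to start at 1.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_vectors(vectors, top_n=5, threshold=math.inf):
    """Return rows of (original_id, near_id, distance, rank).

    vectors maps an ID to a list of floats (hypothetical input shape).
    """
    rows = []
    for original_id, vec in vectors.items():
        candidates = []
        for other_id, other_vec in vectors.items():
            if other_id == original_id:
                continue  # assumption: skip the vector itself
            d = euclidean(vec, other_vec)
            if d <= threshold:  # assumption: inclusive threshold
                candidates.append((d, other_id))
        candidates.sort()  # closest first
        for rank, (d, near_id) in enumerate(candidates[:top_n], start=1):
            rows.append((original_id, near_id, d, rank))
    return rows

# Toy vectors invented for this example.
toy = {
    "hello": [0.0, 0.0],
    "hi": [0.0, 0.2],
    "man": [3.0, 0.0],
    "woman": [3.0, 0.3],
}
for row in nearest_vectors(toy, top_n=1):
    print(row)
```

With the default threshold of +∞, every other vector is a candidate and only topN limits the output. A finite threshold can leave an ID with fewer than topN rows, or with none.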
Sample output
The output table contains the following four columns: original_id, near_id, distance, and rank.
original_id | near_id | distance | rank |
hello | hi | 0.2 | 1 |
hello | xxx | xx | 2 |
Man | Woman | 0.3 | 1 |
Man | xx | xx | 2 |
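Downstream code often needs the neighbors of each ID grouped together, for example to build a synonym lookup. The sketch below assumes the output rows have already been read into Python tuples in the column order shown above; the function name group_neighbors is hypothetical, and only the fully specified sample rows are used.

```python
from collections import defaultdict

def group_neighbors(rows):
    """Group (original_id, near_id, distance, rank) rows by original_id,
    keeping neighbors in rank order."""
    grouped = defaultdict(list)
    for original_id, near_id, distance, rank in sorted(
        rows, key=lambda r: (r[0], r[3])
    ):
        grouped[original_id].append((near_id, distance))
    return dict(grouped)

sample_rows = [
    ("hello", "hi", 0.2, 1),
    ("Man", "Woman", 0.3, 1),
]
neighbors = group_neighbors(sample_rows)
print(neighbors["hello"])
```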