Semantic Vector Distance

The Semantic Vector Distance algorithm assesses the similarity of words or text fragments in semantic space by calculating the distance between word vectors generated by word embedding models such as Word2Vec. Common distance measurement methods include Euclidean distance, cosine similarity, and Manhattan distance. This algorithm is widely used in natural language processing tasks, such as synonym generation, text similarity computation, and semantic search.
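The three measures named above can be sketched in plain Python as follows. This is only an illustration of the standard formulas, not the component's internal implementation; the function names are hypothetical, and this page does not state whether the component reports the cosine measure as a similarity or as a distance.

```python
import math


def euclidean(u, v):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    # Cosine of the angle between u and v.
    # Assumes neither vector is all zeros (otherwise this divides by zero).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    u, v = [0.0, 0.0], [3.0, 4.0]
    print(euclidean(u, v))   # 5.0
    print(manhattan(u, v))   # 7.0
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Smaller Euclidean and Manhattan values mean closer vectors, whereas a larger cosine similarity means closer vectors.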

Configure the component

Method 1: Configure the component on the pipeline page

Add a Semantic Vector Distance component on the pipeline page and configure the following parameters:

Category	Parameter	Description
Fields Setting	ID Column	The name of the ID column. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. The ID column contains the ID list imported by using the second input port. Each ID occupies one cell. Example IDs: 1, 2, 4, 6, 8.
Fields Setting	Vector Columns	The names of columns that contain vectors. Example: f1,f2.
Parameters Setting	Number of Closest Vectors to Output	The number of the closest vectors in the output. Default value: 5.
Parameters Setting	Distance Calculation Mode	The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan. Default value: euclidean.
Parameters Setting	Distance Threshold	The threshold for the distance between vectors. Two vectors are regarded as similar and output only if the distance between them is within this threshold. Default value: +∞. For example, if a cosine similarity threshold is set to 0.8, only vectors with a similarity greater than or equal to 0.8 are regarded as similar vectors.
Tuning	Computing Cores	The number of cores used for calculation. The value is automatically allocated.
Tuning	Memory Size per Core (Unit: MB)	The memory size of each core. The value is automatically allocated.

Method 2: Use PAI commands

Configure the component by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name SemanticVectorDistance 
    -project algo_public    
    -DinputTableName="test_input"    
    -DoutputTableName="test_output"    
    -DidColName="word"    
    -DvectorColNames="f0,f1,f2,f3,f4,f5"    
    -Dlifecycle=30

Parameter	Required	Default value	Description
inputTableName	Yes	None	The name of the input table.
inputTablePartitions	No	All partitions	The partitions selected from the input table for calculation.
outputTableName	Yes	None	The name of the output table.
idTableName	No	None	The name of the vector ID table for vector calculation. The table contains only a single column, and each row stores a vector ID. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation.
idTablePartitions	No	All partitions	The partitions selected from the ID table for calculation. By default, all partitions are selected for calculation.
idColName	Yes	None	The name of the ID column.
vectorColNames	No	None	The names of columns that contain vectors. Example: f1,f2.
topN	No	5	The number of the closest vectors in the output. Valid values: [1,+∞).
distanceType	No	euclidean	The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan.
distanceThreshold	No	+∞	The threshold for the distance between vectors. Two vectors are output as a result only if the distance between them is less than this value. Valid values: (0,+∞).
lifecycle	No	None	The lifecycle of the output table. The value must be a positive integer.
coreNum	No	Determined by the system	The number of cores used for calculation. The value must be a positive integer.
memSizePerCore	No	Determined by the system	The memory size of each core. Unit: MB. The value must be a positive integer.
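When optional parameters such as topN or distanceType are added, it can be convenient to assemble the command text programmatically. The following is a minimal sketch, assuming a hypothetical helper named build_pai_command that is not part of PAI. It only formats a string in the same style as the sample command above (string values quoted, numeric values unquoted) and does not submit anything.

```python
def build_pai_command(params, name="SemanticVectorDistance", project="algo_public"):
    """Format a PAI command string from a dict of parameter names and values.

    Hypothetical helper for illustration; parameter names must come from
    the parameter table of the component.
    """
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        if isinstance(value, str):
            # String values are quoted, as in the sample command.
            lines.append(f'    -D{key}="{value}"')
        else:
            # Numeric values such as lifecycle or topN are left unquoted.
            lines.append(f"    -D{key}={value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_pai_command({
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "idColName": "word",
        "vectorColNames": "f0,f1,f2,f3,f4,f5",
        "distanceType": "cosine",
        "topN": 10,
        "lifecycle": 30,
    }))
```

The resulting text can then be pasted into an SQL Script component.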

Sample output

The output table contains the following four columns: original_id, near_id, distance, and rank.

original_id	near_id	distance	rank
hello	hi	0.2	1
hello	xxx	xx	2
Man	Woman	0.3	1
Man	xx	xx	2
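To make the meaning of the four output columns concrete, the following sketch produces rows in the same shape with a brute-force search over a small in-memory dictionary. It is not the component's implementation. The function nearest_vectors and the toy data are hypothetical, and several behaviors are assumptions: a vector is never matched with itself, candidates are kept when the distance is less than or equal to the threshold, and rank starts at 1 for the closest vector.

```python
import math


def euclidean(u, v):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_vectors(vectors, top_n=5, threshold=math.inf, query_ids=None):
    """Return (original_id, near_id, distance, rank) rows.

    vectors:   dict that maps an ID to a list of floats.
    query_ids: optional list of IDs to search for; all IDs if None
               (mirrors the optional ID table, as an assumption).
    """
    rows = []
    ids = query_ids if query_ids is not None else list(vectors)
    for original_id in ids:
        candidates = []
        for near_id, vec in vectors.items():
            if near_id == original_id:
                continue  # assumption: skip the vector itself
            d = euclidean(vectors[original_id], vec)
            if d <= threshold:
                candidates.append((d, near_id))
        candidates.sort()  # closest first
        for rank, (d, near_id) in enumerate(candidates[:top_n], start=1):
            rows.append((original_id, near_id, d, rank))
    return rows


if __name__ == "__main__":
    toy = {"a": [0, 0], "b": [1, 0], "c": [0, 3]}
    for row in nearest_vectors(toy, top_n=2):
        print(row)
```

Each ID in the search list contributes at most top_n rows, ordered by rank.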