Platform For AI: Semantic Vector Distance

Last Updated: Nov 28, 2024

The Semantic Vector Distance algorithm assesses the similarity of words or text fragments in semantic space by calculating the distance between word vectors generated by word embedding models such as Word2Vec. Common distance measurement methods include Euclidean distance, cosine similarity, and Manhattan distance. This algorithm is widely used in natural language processing tasks, such as synonym generation, text similarity computation, and semantic search.
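As a rough illustration of the three measures named above, the following Python sketch computes each one for a pair of small vectors. The function names and sample vectors are illustrative assumptions and are not part of the component. The source does not say whether the cosine mode reports a similarity or a distance, so the sketch shows both and takes the distance to be 1 minus the similarity.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)

u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]
print(euclidean(u, v))   # 5.0
print(manhattan(u, v))   # 7.0
print(cosine_distance(u, v))
```

Euclidean and Manhattan distance grow with vector magnitude, while the cosine measure depends only on direction, which is why it is often preferred for embeddings whose lengths vary.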

Configure the component

Method 1: Configure the component on the pipeline page

Add a Semantic Vector Distance component on the pipeline page and configure the following parameters:

| Category | Parameter | Description |
| --- | --- | --- |
| Fields Setting | ID Column | The name of the ID column. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. The ID column contains the ID list imported through the second input port, with one ID per cell. Example: 1, 2, 4, 6, 8. |
| Fields Setting | Vector Columns | The names of the columns that contain vectors. Example: f1,f2. |
| Parameters Setting | Number of Closest Vectors to Output | The number of closest vectors to include in the output. Default value: 5. |
| Parameters Setting | Distance Calculation Mode | The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan. Default value: euclidean. |
| Parameters Setting | Distance Threshold | The threshold for the distance between vectors. Only vectors whose distance falls within this threshold are considered similar and output. Default value: +∞. For example, if a cosine similarity threshold is set at 0.8, only vectors with a similarity greater than or equal to 0.8 are regarded as similar vectors. |
| Tuning | Computing Cores | The number of cores used for calculation. The value is automatically allocated. |
| Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. The value is automatically allocated. |
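To make the interaction between Number of Closest Vectors to Output and Distance Threshold more concrete, here is a minimal Python sketch of one plausible reading: candidates are filtered by the threshold first, and the nearest remaining ones are kept. The function `closest`, its arguments, and the toy vectors are hypothetical. The sketch assumes Euclidean distance and a strict "less than" comparison; how the component handles a distance exactly equal to the threshold is not confirmed by the source.

```python
import math

def closest(query, candidates, top_n=5, threshold=math.inf):
    """Return up to top_n (id, distance) pairs, nearest first.

    Assumption: a candidate is kept only if its distance is strictly
    less than threshold; the component's boundary handling may differ.
    """
    scored = []
    for cand_id, vec in candidates.items():
        d = math.dist(query, vec)  # Euclidean distance (Python 3.8+)
        if d < threshold:
            scored.append((cand_id, d))
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]

candidates = {"a": [0.0, 1.0], "b": [3.0, 4.0], "c": [0.0, 0.5]}
print(closest([0.0, 0.0], candidates, top_n=2))
# [('c', 0.5), ('a', 1.0)]
print(closest([0.0, 0.0], candidates, threshold=0.8))
# [('c', 0.5)]
```

With the default threshold of +∞ nothing is filtered out, so the result is simply the top N nearest vectors.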

Method 2: Use PAI commands

Configure the component by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name SemanticVectorDistance 
    -project algo_public    
    -DinputTableName="test_input"    
    -DoutputTableName="test_output"    
    -DidColName="word"    
    -DvectorColNames="f0,f1,f2,f3,f4,f5"    
    -Dlifecycle=30

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The name of the input table. |
| inputTablePartitions | No | All partitions | The partitions selected from the input table for calculation. |
| outputTableName | Yes | None | The name of the output table. |
| idTableName | No | None | The name of the vector ID table for vector calculation. The table contains only a single column, and each row stores a vector ID. This parameter is empty by default, which indicates that all vectors in the input table are used for calculation. |
| idTablePartitions | No | None | The partitions selected from the ID table for calculation. By default, all partitions are selected for calculation. |
| idColName | Yes | 3 | The name of the ID column. |
| vectorColNames | No | None | The names of the columns that contain vectors. Example: f1,f2. |
| topN | No | 5 | The number of closest vectors to include in the output. Valid values: [1,+∞). |
| distanceType | No | euclidean | The method that is used to calculate the distance between vectors. Valid values: euclidean, cosine, and manhattan. |
| distanceThreshold | No | +∞ | The threshold for the distance between vectors. A result is output only if the distance between the two vectors is less than this value. Valid values: (0,+∞). |
| lifecycle | No | None | The lifecycle of the output table. The value must be a positive integer. |
| coreNum | No | Determined by the system | The number of cores used for calculation. The value must be a positive integer. |
| memSizePerCore | No | Determined by the system | The memory size of each core. The value must be a positive integer. |
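The optional parameters above are passed with the same -D prefix as the required ones. The following Python sketch assembles such a command as a string. `build_pai_command` is a hypothetical helper, not part of any PAI SDK, and the values chosen for topN, distanceType, and distanceThreshold are arbitrary illustrations. The quoting follows the sample command: string values are quoted and numeric values are not.

```python
def build_pai_command(name, project, params):
    """Assemble a PAI command string from a dict of -D parameters."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        # Strings are quoted; numbers are passed bare, as with -Dlifecycle=30.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"    -D{key}={rendered}")
    return "\n".join(lines)

cmd = build_pai_command(
    "SemanticVectorDistance",
    "algo_public",
    {
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "idColName": "word",
        "vectorColNames": "f0,f1,f2,f3,f4,f5",
        "topN": 10,                 # illustrative value
        "distanceType": "cosine",   # one of euclidean, cosine, manhattan
        "distanceThreshold": 0.5,   # illustrative value
        "lifecycle": 30,
    },
)
print(cmd)
```

The resulting text can be pasted into an SQL Script component in the same way as the sample command.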

Sample output

The output table contains the following four columns: original_id, near_id, distance, and rank.

| original_id | near_id | distance | rank |
| --- | --- | --- | --- |
| hello | hi | 0.2 | 1 |
| hello | xxx | xx | 2 |
| Man | Woman | 0.3 | 1 |
| Man | xx | xx | 2 |
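The shape of this output can be reproduced locally with a short Python sketch that emits one (original_id, near_id, distance, rank) row per neighbour. The helper `nearest_rows` and the toy vectors are hypothetical. The sketch assumes Euclidean distance, that a vector is never paired with itself, and that rank starts at 1 for the nearest neighbour; the source confirms none of these details.

```python
import math

def nearest_rows(vectors, ids=None, top_n=5):
    """Build (original_id, near_id, distance, rank) rows.

    vectors: dict mapping an ID to its vector.
    ids: optional list of IDs to query; None means every ID,
         mirroring an empty ID table.
    """
    rows = []
    for original_id in (ids if ids is not None else vectors):
        others = [
            (near_id, math.dist(vectors[original_id], vec))
            for near_id, vec in vectors.items()
            if near_id != original_id  # assumed: skip the vector itself
        ]
        others.sort(key=lambda pair: pair[1])
        for rank, (near_id, distance) in enumerate(others[:top_n], start=1):
            rows.append((original_id, near_id, distance, rank))
    return rows

toy = {"hello": [0.0, 0.0], "hi": [0.0, 1.0], "bye": [3.0, 4.0]}
for row in nearest_rows(toy, ids=["hello"], top_n=2):
    print(row)
# ('hello', 'hi', 1.0, 1)
# ('hello', 'bye', 5.0, 2)
```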