Document Similarity

Document similarity measures how similar two articles or sentences are and is calculated based on string similarity. The words in a document or sentence must be separated by spaces. This topic describes how to configure the Document Similarity algorithm component provided by Platform for AI (PAI).

Background information

Document similarity is calculated in the same manner that string similarity is calculated. Document similarity supports the following calculation methods: Levenshtein Distance (Levenshtein), Longest Common SubString (LCS), String Subsequence Kernel (SSK), Cosine, and Simhash_Hamming.

The Levenshtein method supports the calculation of distance and similarity.
- The distance is expressed as the levenshtein parameter.
- The similarity is calculated by using the following formula: Similarity = 1 - Distance. The similarity is expressed as the levenshtein_sim parameter.
The LCS method supports the calculation of distance and similarity.
- The distance is expressed as the lcs parameter.
- The similarity is calculated by using the following formula: Similarity = 1 - Distance. The similarity is expressed as the lcs_sim parameter.
The SSK method supports similarity calculation and is expressed as the ssk parameter.
The Cosine method supports similarity calculation and is expressed as the cosine parameter.
In the Simhash_Hamming method, the SimHash algorithm maps each original document to a 64-bit binary fingerprint. The Hamming distance is the number of bit positions at which the two binary fingerprints differ. The Simhash_Hamming method supports the calculation of distance and similarity.
- The distance is expressed as the simhash_hamming parameter.
- The similarity is calculated by using the following formula: Similarity = 1 - Distance/64.0. The similarity is expressed as the simhash_hamming_sim parameter.
Note
- For more information about SimHash, see Similarity Estimation Techniques from Rounding Algorithms.
- For more information about the Hamming distance, see Wikipedia.
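
The Simhash_Hamming description above can be sketched in Python as follows. This is a minimal illustration, not the PAI implementation: the per-word hash (the 8-byte BLAKE2b digest), the equal weighting of words, and the function names are all assumptions, so the fingerprints will not match the values that PAI produces. Only the 64-bit fingerprint size and the formula Similarity = 1 - Distance/64.0 come from this topic.

```python
import hashlib


def simhash64(document: str) -> int:
    """Return a 64-bit SimHash fingerprint of a space-separated document."""
    weights = [0] * 64
    for word in document.split():
        # Assumption: an 8-byte BLAKE2b digest stands in for the per-word hash.
        digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
        word_hash = int.from_bytes(digest, "big")
        # Each word votes +1 or -1 on every bit position.
        for bit in range(64):
            weights[bit] += 1 if (word_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash_hamming_sim(doc_a: str, doc_b: str) -> float:
    """Apply Similarity = 1 - Distance/64.0 to two documents."""
    distance = hamming_distance(simhash64(doc_a), simhash64(doc_b))
    return 1 - distance / 64.0
```

Because the distance is at most 64, the similarity always falls between 0 and 1, and two identical documents always yield a distance of 0 and a similarity of 1.0.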

Limits

The Document Similarity component runs only on the computing resources of MaxCompute.

Configure the component

You can use one of the following methods to configure the Document Similarity component.

Method 1: Configure the component in the PAI console

You can configure the parameters of the Document Similarity component on the pipeline page of Machine Learning Designer. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	First Column for Similarity Calculation	The default value is the name of the first string column in the table.
	Second Column for Similarity Calculation	The default value is the name of the second string column in the table.
	Columns Appended to Output Table	The names of the columns appended to the output table.
	Similarity Column in Output Table	The name of the similarity column in the output table. Default value: output. Note The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter.
Parameters Setting	Similarity Calculation Method	The method that is used for similarity calculation. Valid values: levenshtein levenshtein_sim (default) lcs lcs_sim ssk cosine simhash_hamming simhash_hamming_sim
	Substring Length (Available in SSK and Cosine)	This parameter takes effect only when the Similarity Calculation Method parameter is set to ssk or cosine. Valid values: (0,100]. Default value: 2.
	Matching Word Pair Weight (Available in SSK)	This parameter takes effect only when the Similarity Calculation Method parameter is set to ssk. The value must be between 0 and 1. Default value: 0.5.
Tuning	Computing Cores	The number of cores used for calculation. By default, the system determines the value.
Tuning	Memory Size per Core (Unit: MB)	The memory size of each core. By default, the system determines the value.

Method 2: Configure the parameters by using PAI commands

You can configure the component by using PAI commands, which you can call from SQL scripts. For more information, see SQL Script. The table after the following sample command describes the parameters.

PAI -name doc_similarity
    -project algo_public
    -DinputTableName="pai_test_doc_similarity"
    -DoutputTableName="pai_test_doc_similarity_output"
    -DinputSelectedColName1="col0"
    -DinputSelectedColName2="col1";

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	N/A
outputTableName	Yes	The name of the output table.	N/A
inputSelectedColName1	No	The first column that is used for similarity calculation.	The name of the first string column in the table
inputSelectedColName2	No	The second column that is used for similarity calculation.	The name of the second string column in the table
inputAppendColNames	No	The columns appended to the output table.	No column appended
inputTablePartitions	No	The partitions that are selected from the input table.	Full table
outputColName	No	The name of the similarity column in the output table. Note The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter.	output
method	No	The method that is used for similarity calculation. Valid values: levenshtein levenshtein_sim lcs lcs_sim ssk cosine simhash_hamming simhash_hamming_sim	levenshtein_sim
lambda	No	The weight of a matched word pair. The SSK method supports this parameter. Valid values: (0,1).	0.5
k	No	The length of the substring. The SSK and Cosine methods support this parameter. Valid values: (0,100].	2
lifecycle	No	The lifecycle of the output table.	N/A
coreNum	No	The number of cores that are used for calculation.	Automatically allocated
memSizePerCore	No	The memory size of each core. Unit: MB.	Automatically allocated
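
As a hedged illustration of how the method, k, and lambda parameters relate, the following Python sketch renders a doc_similarity command and checks the documented value ranges first. The helper name build_doc_similarity_command and its arguments are hypothetical and not part of PAI. Rejecting k or lambda for methods that do not use them is a stricter choice made for this sketch; this topic states only that the parameters take effect for those methods.

```python
VALID_METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}


def build_doc_similarity_command(input_table, output_table,
                                 method="levenshtein_sim", k=None, lam=None):
    """Render a doc_similarity PAI command (hypothetical helper)."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown method: {method}")
    args = [
        "PAI -name doc_similarity",
        "-project algo_public",
        f"-DinputTableName={input_table}",
        f"-DoutputTableName={output_table}",
        f"-Dmethod={method}",
    ]
    if k is not None:
        # k is supported by the SSK and Cosine methods; valid values: (0,100].
        if method not in ("ssk", "cosine"):
            raise ValueError("k applies only to ssk and cosine")
        if not 0 < k <= 100:
            raise ValueError("k must be in (0,100]")
        args.append(f"-Dk={k}")
    if lam is not None:
        # lambda is supported by the SSK method; valid values: (0,1).
        if method != "ssk":
            raise ValueError("lambda applies only to ssk")
        if not 0 < lam < 1:
            raise ValueError("lambda must be in (0,1)")
        args.append(f"-Dlambda={lam}")
    return "\n    ".join(args) + ";"


print(build_doc_similarity_command(
    "pai_test_doc_similarity", "pai_test_doc_similarity_output",
    method="ssk", k=2, lam=0.5))
```

The printed text can be pasted into an SQL script. Parameters that are omitted keep the default values listed in the table.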

Examples

Input

Use an ODPS SQL node to create a table named pai_doc_similarity_input. For more information, see Develop a MaxCompute SQL task. Sample command:

drop table if exists pai_doc_similarity_input;
create table pai_doc_similarity_input as
select * from 
(
select 0 as id, "Beijing Shanghai" as col0, "Beijing Shanghai" as col1 from dual
union all
select 1 as id, "Beijing Shanghai" as col0, "Beijing Shanghai Shenzhen" as col1 from dual
) tmp;

After you run the command, the input table pai_doc_similarity_input contains the following data:

id	col0	col1
1	Beijing Shanghai	Beijing Shanghai Shenzhen
0	Beijing Shanghai	Beijing Shanghai

PAI command

You can use an SQL script component or an ODPS SQL node to run the following PAI commands.

drop table if exists pai_doc_similarity_output;
PAI -name doc_similarity    
    -project algo_public    
    -DinputTableName=pai_doc_similarity_input    
    -DoutputTableName=pai_doc_similarity_output    
    -DinputSelectedColName1=col0    
    -DinputSelectedColName2=col1    
    -Dmethod=levenshtein_sim    
    -DinputAppendColNames=id,col0,col1;

Output

The following table is the output table named pai_doc_similarity_output.

id	col0	col1	output
1	Beijing Shanghai	Beijing Shanghai Shenzhen	0.6666666666666667
0	Beijing Shanghai	Beijing Shanghai	1.0
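
The output values can be reproduced with a word-level Levenshtein calculation. The following Python sketch is an illustration under one stated assumption: the edit distance is divided by the word count of the longer document before it is subtracted from 1. This topic does not spell out that normalization, but it is consistent with the sample output. The function names are chosen for this example only.

```python
def levenshtein(words_a, words_b):
    """Edit distance between two word lists (insert, delete, substitute)."""
    previous = list(range(len(words_b) + 1))
    for i, word_a in enumerate(words_a, start=1):
        current = [i]
        for j, word_b in enumerate(words_b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # delete word_a
                               current[j - 1] + 1,       # insert word_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def levenshtein_sim(doc_a, doc_b):
    """Similarity of two space-separated documents, each word being one unit."""
    words_a, words_b = doc_a.split(), doc_b.split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0
    # Assumption: normalize the distance by the longer document's word count.
    return 1 - levenshtein(words_a, words_b) / longest


print(levenshtein_sim("Beijing Shanghai", "Beijing Shanghai Shenzhen"))
print(levenshtein_sim("Beijing Shanghai", "Beijing Shanghai"))
```

For the row with id 1, one word (Shenzhen) must be inserted and the longer document has three words, which gives 1 - 1/3. For the row with id 0, the distance is 0 and the similarity is 1.0.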

FAQ

Similarity calculation is based on the result of word segmentation. Words are separated by spaces. Each word serves as a unit of similarity calculation. If the input is a string as a whole, use the string similarity method.
In the method parameter, levenshtein, lcs, and simhash_hamming are used to calculate the distance. levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim are used to calculate the similarity. The distance is calculated by using the following formula: Distance = 1.0 - Similarity.
If you set the method parameter to cosine or ssk, the k parameter is available, which indicates that k words are used as a combination for similarity calculation. If the value of k is greater than the number of words in a string, the similarity output is 0 even if the two strings are the same. In this case, change the value of k to a value less than or equal to the minimum number of words.
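
The effect of k can be seen in a small cosine sketch. The following Python code assumes that a combination is a window of k consecutive words and that each document is represented by the counts of its windows. These are assumptions made for illustration, and the function names are not PAI identifiers, so the actual implementation may differ.

```python
import math
from collections import Counter


def k_word_combinations(document: str, k: int) -> Counter:
    """Count every window of k consecutive words (assumed combination rule)."""
    words = document.split()
    return Counter(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def cosine_sim(doc_a: str, doc_b: str, k: int = 2) -> float:
    """Cosine similarity between the k-word count vectors of two documents."""
    vec_a = k_word_combinations(doc_a, k)
    vec_b = k_word_combinations(doc_b, k)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # No k-word combination exists, so nothing can be compared.
        return 0.0
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    return dot / (norm_a * norm_b)


# Identical two-word documents: k=3 leaves no combinations to compare.
print(cosine_sim("Beijing Shanghai", "Beijing Shanghai", k=3))
print(cosine_sim("Beijing Shanghai", "Beijing Shanghai", k=2))
```

With k=3, neither two-word document contains a three-word combination, so the result is 0 even though the documents are identical. Lowering k to 2 restores a meaningful result.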

References

For more information about Machine Learning Designer components, see Overview of Machine Learning Designer.
You can use the String Similarity component to calculate the similarity between two strings. For more information, see String Similarity.
Machine Learning Designer provides various preset algorithm components. You can select a component to process data based on different scenarios. For more information, see Component reference: Overview of all components.
