Document similarity measures how similar two articles or sentences are and is calculated based on string similarity. The words in each document or sentence must be separated by spaces. This topic describes how to configure the Document Similarity algorithm component provided by Platform for AI (PAI).
Background information
Document similarity is calculated in the same manner that string similarity is calculated. Document similarity supports the following calculation methods: Levenshtein Distance (Levenshtein), Longest Common SubString (LCS), String Subsequence Kernel (SSK), Cosine, and Simhash_Hamming.
The Levenshtein method supports the calculation of distance and similarity.
The distance is expressed as the levenshtein parameter.
The similarity is calculated by using the following formula: Similarity = 1 - Distance. The similarity is expressed as the levenshtein_sim parameter.
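The following Python sketch illustrates a word-level Levenshtein calculation. It is not the component's implementation. The function names are hypothetical, and the normalization of the distance by the larger word count is an assumption, chosen because it reproduces the value 0.6666666666666667 shown in the Examples section of this topic.

```python
def word_levenshtein(a: str, b: str) -> int:
    """Edit distance between two space-separated documents, counted in words."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distance from an empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_levenshtein_sim(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / max(word counts)."""
    longest = max(len(a.split()), len(b.split()))
    if longest == 0:
        return 1.0  # two empty documents are treated as identical
    return 1.0 - word_levenshtein(a, b) / longest
```

For example, `word_levenshtein_sim("Beijing Shanghai", "Beijing Shanghai Shenzhen")` needs one word insertion across three words, so the result is 1 - 1/3.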
The LCS method supports the calculation of distance and similarity.
The distance is expressed as the lcs parameter.
The similarity is calculated by using the following formula: Similarity = 1 - Distance. The similarity is expressed as the lcs_sim parameter.
The SSK method supports similarity calculation and is expressed as the ssk parameter.
The Cosine method supports similarity calculation and is expressed as the cosine parameter.
In the Simhash_Hamming method, the SimHash algorithm is used to map the original documents to 64-bit binary fingerprints. The Hamming distance is the number of bit positions at which the two binary fingerprints differ. The Simhash_Hamming method supports distance and similarity calculation.
The distance is expressed as the simhash_hamming parameter.
The similarity is calculated by using the following formula: Similarity = 1 - Distance/64.0. The similarity is expressed as the simhash_hamming_sim parameter.
Note: For more information about SimHash, see Similarity Estimation Techniques from Rounding Algorithms.
For more information about the Hamming distance, see Wikipedia.
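The Simhash_Hamming steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's implementation: the per-word hash (the first 8 bytes of MD5), the weighting by word frequency, and all function names are hypothetical choices. Only the 64-bit fingerprint size and the formula Similarity = 1 - Distance/64.0 come from this topic.

```python
import hashlib
from collections import Counter


def simhash64(doc: str) -> int:
    """Map a space-separated document to a 64-bit fingerprint."""
    weights = [0] * 64
    for word, freq in Counter(doc.split()).items():
        # Assumption: a 64-bit word hash taken from the first 8 bytes of MD5.
        h = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += freq if (h >> bit) & 1 else -freq
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash_hamming_sim(doc_a: str, doc_b: str) -> float:
    """Similarity = 1 - Distance/64.0, as given in this topic."""
    return 1.0 - hamming_distance(simhash64(doc_a), simhash64(doc_b)) / 64.0
```

Identical documents produce identical fingerprints, so their distance is 0 and their similarity is 1.0.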
Limits
The Document Similarity component can run only on MaxCompute computing resources.
Configure the component
You can use one of the following methods to configure the Document Similarity component.
Method 1: Configure the component in the PAI console
You can configure the parameters of the Document Similarity component on the pipeline page of Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | First Column for Similarity Calculation | The default value is the name of the first string column in the table. |
Fields Setting | Second Column for Similarity Calculation | The default value is the name of the second string column in the table. |
Fields Setting | Columns Appended to Output Table | The names of the columns appended to the output table. |
Fields Setting | Similarity Column in Output Table | The name of the similarity column in the output table. Default value: output. Note: The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter. |
Parameters Setting | Similarity Calculation Method | The method that is used for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. |
Parameters Setting | Substring Length (Available in SSK and Cosine) | This parameter takes effect only when the Similarity Calculation Method parameter is set to ssk or cosine. Valid values: (0,100]. Default value: 2. |
Parameters Setting | Matching Word Pair Weight (Available in SSK) | This parameter takes effect only when the Similarity Calculation Method parameter is set to ssk. Valid values: (0,1). Default value: 0.5. |
Tuning | Computing Cores | The number of cores used for calculation. By default, the system determines the value. |
Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. By default, the system determines the value. |
Method 2: Configure the parameters by using PAI commands
Configure the component parameters by using PAI commands. The following section describes the parameters. You can use SQL scripts to call PAI commands. For more information, see SQL Script.
PAI -name doc_similarity
-project algo_public
-DinputTableName="pai_test_doc_similarity"
-DoutputTableName="pai_test_doc_similarity_output"
-DinputSelectedColName1="col0"
-DinputSelectedColName2="col1"
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | N/A |
outputTableName | Yes | The name of the output table. | N/A |
inputSelectedColName1 | No | The first column that is used for similarity calculation. | The name of the first string column in the table |
inputSelectedColName2 | No | The second column that is used for similarity calculation. | The name of the second string column in the table |
inputAppendColNames | No | The columns appended to the output table. | No column appended |
inputTablePartitions | No | The partitions that are selected from the input table. | Full table |
outputColName | No | The name of the similarity column in the output table. Note: The column name can be up to 128 characters in length and can contain letters, digits, and underscores (_). It must start with a letter. | output |
method | No | The method that is used for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. | levenshtein_sim |
lambda | No | The weight of a matched word pair. The SSK method supports this parameter. Valid values: (0,1). | 0.5 |
k | No | The length of the substring. The SSK and Cosine methods support this parameter. Valid values: (0,100]. | 2 |
lifecycle | No | The lifecycle of the output table. | N/A |
coreNum | No | The number of cores that are used for calculation. | Automatically allocated |
memSizePerCore | No | The memory size of each core. Unit: MB. | Automatically allocated |
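The constraints in the preceding table can be checked before a command is submitted. The following Python sketch is a hypothetical client-side helper, not part of PAI. The function name `check_params` and its argument names are assumptions. The ranges, the column name rule, and the method values are taken from this topic.

```python
import re

# Column name rule from this topic: starts with a letter, then letters,
# digits, or underscores, with a total length of at most 128 characters.
COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")

# Method values listed in the FAQ section of this topic.
METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}


def check_params(method="levenshtein_sim", k=2, lam=0.5, output_col="output"):
    """Return a list of problems found; an empty list means all checks passed."""
    errors = []
    if method not in METHODS:
        errors.append(f"unknown method: {method}")
    if not (0 < k <= 100):
        errors.append("k must be in (0,100]")
    if not (0 < lam < 1):
        errors.append("lambda must be in (0,1)")
    if not COLUMN_NAME.fullmatch(output_col):
        errors.append(f"invalid outputColName: {output_col}")
    return errors
```

Calling `check_params()` with the documented defaults returns an empty list.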
Examples
Input
Use an ODPS SQL node to create a table named pai_doc_similarity_input. For more information, see Develop a MaxCompute SQL task. Sample command:
drop table if exists pai_doc_similarity_input;
create table pai_doc_similarity_input as
select * from
(
  select 0 as id, "Beijing Shanghai" as col0, "Beijing Shanghai" as col1 from dual
  union all
  select 1 as id, "Beijing Shanghai" as col0, "Beijing Shanghai Shenzhen" as col1 from dual
)tmp;
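After you run the command, the input table pai_doc_similarity_input contains the following data:
id | col0 | col1 |
1 | Beijing Shanghai | Beijing Shanghai Shenzhen |
0 | Beijing Shanghai | Beijing Shanghai |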
PAI command
You can use an SQL script component or an ODPS SQL node to run the following PAI commands.
drop table if exists pai_doc_similarity_output;
PAI -name doc_similarity
-project algo_public
-DinputTableName=pai_doc_similarity_input
-DoutputTableName=pai_doc_similarity_output
-DinputSelectedColName1=col0
-DinputSelectedColName2=col1
-Dmethod=levenshtein_sim
-DinputAppendColNames=id,col0,col1;
Output
The following table is the output table named pai_doc_similarity_output.
id | col0 | col1 | output |
1 | Beijing Shanghai | Beijing Shanghai Shenzhen | 0.6666666666666667 |
0 | Beijing Shanghai | Beijing Shanghai | 1.0 |
FAQ
Similarity calculation is based on the result of word segmentation. Words are separated by spaces. Each word serves as a unit of similarity calculation. If the input must be treated as a single string rather than a sequence of words, use the String Similarity component.
In the method parameter, levenshtein, lcs, and simhash_hamming are used to calculate the distance. levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim are used to calculate the similarity. The distance is calculated by using the following formula: Distance = 1.0 - Similarity.
If you set the method parameter to cosine or ssk, the k parameter is available, which indicates that k words are used as a combination for similarity calculation. If the value of k is greater than the number of words, the similarity output is 0 even if the two strings are the same. In this case, change the value of k to a value less than or equal to the minimum number of words.
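The effect of k described above can be seen in a small Python sketch of cosine similarity over combinations of k consecutive words. The way the component builds its vectors is not documented in this topic, so the count-based vectors and the function names here are assumptions made for illustration only.

```python
import math
from collections import Counter


def word_kgrams(doc: str, k: int) -> Counter:
    """Count every run of k consecutive words in a space-separated document."""
    words = doc.split()
    # If k exceeds the word count, the range is empty and no k-grams exist.
    return Counter(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def cosine_sim(a: str, b: str, k: int = 2) -> float:
    """Cosine similarity between the k-gram count vectors of two documents."""
    va, vb = word_kgrams(a, k), word_kgrams(b, k)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    # An empty vector (k larger than the word count) yields a similarity of 0.
    return dot / norm if norm else 0.0
```

In this sketch, two identical two-word documents have a similarity of 1.0 when k is 2. When k is 3, neither document contains any three-word combination, so the result is 0.0, which matches the behavior described in the FAQ.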
References
For more information about Machine Learning Designer components, see Overview of Machine Learning Designer.
You can use the String Similarity component to calculate the similarity of a string. For more information, see String Similarity.
Machine Learning Designer provides various preset algorithm components. You can select a component to process data based on different scenarios. For more information, see Component reference: Overview of all components.