String similarity calculation is a basic machine learning operation that is commonly used in industries such as information retrieval, natural language processing, and bioinformatics. This topic describes how to configure the String Similarity algorithm component in Platform for AI (PAI).
Background information
The component supports five calculation methods: Levenshtein distance (Levenshtein), longest common substring (LCS), string subsequence kernel (SSK), cosine, and SimHash_Hamming. The component reads two STRING columns from the input table and calculates the similarity between the values of the two columns.
The Levenshtein method supports distance and similarity calculation.
To calculate the distance, set the method parameter to levenshtein.
The similarity is calculated by using the following formula: Similarity = 1 - Distance. To calculate the similarity, set the method parameter to levenshtein_sim.
The LCS method supports distance and similarity calculation.
To calculate the distance, set the method parameter to lcs.
The similarity is calculated by using the following formula: Similarity = 1 - Distance. To calculate the similarity, set the method parameter to lcs_sim.
The SSK method supports similarity calculation. To use this method, set the method parameter to ssk.
The cosine method supports similarity calculation. To use this method, set the method parameter to cosine.
In SimHash_Hamming, the SimHash algorithm maps each original document to a 64-bit binary fingerprint. The Hamming distance is the number of bit positions at which two fingerprints differ. The SimHash_Hamming method supports distance and similarity calculation.
To calculate the distance, set the method parameter to simhash_hamming.
The similarity is calculated by using the following formula: Similarity = 1 - Distance/64.0. To calculate the similarity, set the method parameter to simhash_hamming_sim.
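The relationship between distance and similarity described above can be sketched in Python. This is an illustration only, not the component's implementation. The function names are hypothetical. The Levenshtein similarity here assumes that the distance is first normalized by the length of the longer string, which this topic does not state. The SimHash fingerprints are assumed to be already available as 64-bit integers, and fingerprint generation is left out.

```python
MASK64 = (1 << 64) - 1  # keep fingerprints within 64 bits


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """ASSUMPTION: the distance is divided by the longer length before 1 - Distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def hamming_distance(fingerprint1: int, fingerprint2: int) -> int:
    """Number of bit positions at which two 64-bit fingerprints differ."""
    return bin((fingerprint1 ^ fingerprint2) & MASK64).count("1")


def simhash_hamming_similarity(fingerprint1: int, fingerprint2: int) -> float:
    """Similarity = 1 - Distance/64.0, as given in this topic."""
    return 1.0 - hamming_distance(fingerprint1, fingerprint2) / 64.0
```

For example, `levenshtein_distance("kitten", "sitting")` is 3, and two fingerprints that differ in 2 of 64 bits have a similarity of 1 - 2/64.0.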
Configure the component
You can use one of the following methods to configure the parameters of the String Similarity component.
Method 1: Configure the component in the PAI console
You can configure the parameters of the String Similarity component in Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Columns Appended Output Table | The columns appended to the specified output table. |
Fields Setting | First Column for Similarity Calculation | The default value is the first STRING column in the input table. |
Fields Setting | Second Column for Similarity Calculation | The default value is the second STRING column in the input table. |
Fields Setting | Similarity Columns in Output Table | The name of the similarity column in the output table. |
Parameters Setting | Similarity Calculation Method | The method that is used for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. Default value: levenshtein_sim. |
Parameters Setting | Substring Length | This parameter is required only when the Similarity Calculation Method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). Default value: 2. |
Parameters Setting | Weight of Matching String | This parameter is required only when the Similarity Calculation Method parameter is set to ssk, cosine, or simhash_hamming_sim. Valid values: (0,1). Default value: 0.5. |
Execution Tuning | Number of Computing Cores | The number of computing cores. By default, the system determines the value. |
Execution Tuning | Memory Size per Core (MB) | The memory size of each core. Unit: MB. By default, the system determines the value. |
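To make the role of a substring length more concrete, the following Python sketch computes a cosine similarity over substrings of a fixed length. This is an assumption made for illustration: this topic does not describe how the component uses the substring length, and the function names `kgrams` and `cosine_similarity` are hypothetical. The sketch counts every substring of `k` consecutive characters and compares the two count vectors.

```python
import math
from collections import Counter


def kgrams(text: str, k: int) -> Counter:
    """Count every substring of k consecutive characters in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def cosine_similarity(a: str, b: str, k: int = 2) -> float:
    """Cosine of the angle between the k-gram count vectors of a and b."""
    counts_a, counts_b = kgrams(a, k), kgrams(b, k)
    # Counter returns 0 for substrings that are missing from counts_b.
    dot = sum(count * counts_b[gram] for gram, count in counts_a.items())
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one string is shorter than k characters
    return dot / (norm_a * norm_b)
```

With `k=2`, the strings "abcd" and "bcde" share two of their three substrings ("bc" and "cd"), so the sketch returns approximately 2/3. A larger `k` makes the comparison stricter because longer runs of characters must match.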
Method 2: Configure the component by using PAI commands
You can configure the component by using a PAI command. The following example shows a sample command, and the table after it describes the parameters. You can use SQL scripts to call PAI commands. For more information, see SQL Script.
PAI -name string_similarity
-project algo_public
-DinputTableName="pai_test_string_similarity"
-DoutputTableName="pai_test_string_similarity_output"
-DinputSelectedColName1="col0"
-DinputSelectedColName2="col1";
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | N/A |
outputTableName | Yes | The name of the output table. | N/A |
inputSelectedColName1 | No | The first column for similarity calculation. | The first STRING column in the input table |
inputSelectedColName2 | No | The second column for similarity calculation. | The second STRING column in the input table |
inputAppendColNames | No | The columns appended to the output table. | N/A |
inputTablePartitions | No | The partitions in the input table. | All partitions |
outputColName | No | The name of the similarity column in the output table. The name must start with a letter, can contain only letters, digits, and underscores (_), and can be up to 128 bytes in length. | output |
method | No | The method that is used for similarity calculation. Valid values: levenshtein, levenshtein_sim, lcs, lcs_sim, ssk, cosine, simhash_hamming, and simhash_hamming_sim. | levenshtein_sim |
lambda | No | The weight of the matching string. This parameter is required only when the method parameter is set to ssk. Valid values: (0,1). | 0.5 |
k | No | The substring length. This parameter is required only when the method parameter is set to ssk, cosine, simhash_hamming, or simhash_hamming_sim. Valid values: (0,100). | 2 |
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | N/A |
coreNum | No | The number of cores that are used in computing. | Automatically allocated |
memSizePerCore | No | The memory size of each core. | Automatically allocated |
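The constraints in the preceding table can be checked before a command is submitted. The following Python sketch only assembles the command text in the same `-D<parameter>="value"` form as the sample command. It does not connect to PAI or run anything. The helper name `build_string_similarity_command` and its arguments are hypothetical, and the list of valid methods is taken from the method names in this topic.

```python
import re

VALID_METHODS = {
    "levenshtein", "levenshtein_sim", "lcs", "lcs_sim",
    "ssk", "cosine", "simhash_hamming", "simhash_hamming_sim",
}
# Starts with a letter; only letters, digits, and underscores.
COLUMN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def build_string_similarity_command(input_table, output_table,
                                    method="levenshtein_sim",
                                    k=None, lam=None, output_col="output"):
    """Return the PAI command text (hypothetical helper, builds a string only)."""
    if method not in VALID_METHODS:
        raise ValueError(f"unsupported method: {method}")
    if not COLUMN_NAME.fullmatch(output_col) or len(output_col.encode("utf-8")) > 128:
        raise ValueError(f"invalid outputColName: {output_col}")
    options = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "outputColName": output_col,
        "method": method,
    }
    if k is not None:
        if not 0 < k < 100:  # valid values: (0,100)
            raise ValueError("k must be in (0,100)")
        options["k"] = k
    if lam is not None:
        if not 0 < lam < 1:  # valid values: (0,1)
            raise ValueError("lambda must be in (0,1)")
        options["lambda"] = lam
    lines = ["PAI -name string_similarity", "    -project algo_public"]
    lines += [f'    -D{name}="{value}"' for name, value in options.items()]
    return "\n".join(lines) + ";"
```

For example, `build_string_similarity_command("pai_test_string_similarity", "pai_test_string_similarity_output", method="ssk", k=3, lam=0.4)` returns a command that sets method, k, and lambda explicitly instead of relying on the default values.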
References
For information about Machine Learning Designer, see Overview of Machine Learning Designer.
You can also use the String Similarity - top N component to calculate string similarity and obtain the top N data records that best match the mapping table. For information about how to use this component, see String Similarity - top N.