The String Similarity - top N component calculates string similarity and, for each input record, returns the top N records in the mapping table that best match it. This topic describes how to configure the String Similarity - top N component in Platform for AI (PAI).
Configure the component
You can use one of the following methods to configure the String Similarity - top N component:
Method 1: Configure the component in the PAI console
You can configure the parameters of the String Similarity - top N component in Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Columns from the Input Table Appended to the Output Table | The names of the columns that you want to append to the output table from the input table. |
Fields Setting | Columns from the Mapping Table Appended to the Output Table | The names of the columns that you want to append to the output table from the mapping table. |
Fields Setting | Columns from Left Table for Similarity Calculation | The names of the left-table columns that are used for similarity calculation. |
Fields Setting | Columns from the Mapping Table for Similarity Calculation | The names of the mapping table columns that are used for similarity calculation. The similarities between the rows in the left table and all strings in the mapping table are calculated, and the top N results are returned. |
Fields Setting | Similarity Column in Output Table | The name of the similarity column in the output table. The name can be up to 128 characters in length and can contain only letters, digits, and underscores (_). The name must start with a letter. Default value: output. |
Parameters Setting | Number of Similarity Maximums in the End | The number of top similarity results to return, that is, the N in top N. The value must be a positive integer. Default value: 10. |
Parameters Setting | Similarity Calculation Methods | The method that is used for similarity calculation. The methods named in this topic are levenshtein_sim (the default in PAI commands), ssk, cosine, and simhash_hamming_sim. |
Parameters Setting | Length of Substring | This parameter is required only if you set the Similarity Calculation Methods parameter to ssk, cosine, or simhash_hamming_sim. Value range: (0,100). Default value: 2. |
Parameters Setting | Weight of Matching String | This parameter is required only if you set the Similarity Calculation Methods parameter to ssk, cosine, or simhash_hamming_sim. Value range: (0,1). Default value: 0.5. |
Tuning | Number of Computing Cores | The number of computing cores. By default, the system determines the value. |
Tuning | Memory Size per Core (MB) | The memory size of each core. Unit: MB. By default, the system determines the value. |
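The matching behavior described in the table (each left-table string is compared with every mapping table string, and the top N results are kept) can be illustrated with a small sketch. This is not the component's implementation. The function names are hypothetical, and the normalization used for levenshtein_sim (1 minus the edit distance divided by the longer string length) is an assumption made for illustration.

```python
import heapq


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """ASSUMPTION: similarity = 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def top_n_matches(inputs, mapping, n=10):
    """For each input string, keep the n most similar mapping strings."""
    results = {}
    for s in inputs:
        scored = [(levenshtein_sim(s, m), m) for m in mapping]
        results[s] = heapq.nlargest(n, scored, key=lambda pair: pair[0])
    return results
```

Each input string is scored against the whole mapping list, which is the M × N pairing that drives the resource usage of the component.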
Method 2: Configure the component by using PAI commands
You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following sample command calls the component, and the table after the command describes the parameters.
PAI -name string_similarity_topn
-project algo_public
-DinputTableName="pai_test_string_similarity_topn"
-DoutputTableName="pai_test_string_similarity_topn_output"
-DmapTableName="pai_test_string_similarity_map_topn"
-DinputSelectedColName="col0"
-DmapSelectedColName="col1";
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | N/A |
mapTableName | Yes | The name of the mapping table. | N/A |
outputTableName | Yes | The name of the output table. | N/A |
inputSelectedColName1 | No | The names of the left table columns that are used for similarity calculation. | Name of the first STRING column in the left table |
inputSelectedColName2 | No | The names of the mapping table columns that are used for similarity calculation. | Name of the first STRING column in the mapping table |
inputAppendColNames | No | The names of the columns that you want to append to the output table from the input table. | N/A |
inputAppendRenameColNames | No | The aliases of the columns that you want to append to the output table from the input table. | N/A |
mapSelectedColName | Yes | The names of the mapping table columns that are used for similarity calculation. | N/A |
mapAppendColNames | No | The names of the columns that you want to append to the output table from the mapping table. | N/A |
mapAppendRenameColNames | No | The aliases of the columns that you want to append to the output table from the mapping table. | N/A |
inputTablePartitions | No | The names of the partitions in the input table. | All partitions |
mapTablePartitions | No | The names of the partitions in the mapping table. | All partitions |
outputColName | No | The name of the similarity column in the output table. The name can be up to 128 characters in length and can contain only letters, digits, and underscores (_). The name must start with a letter. | output |
method | No | The method that is used for similarity calculation. The methods named in this topic are levenshtein_sim, ssk, cosine, and simhash_hamming_sim. | levenshtein_sim |
lambda | No | The weight of the matching string. This parameter is required only if you set the method parameter to ssk, cosine, or simhash_hamming_sim. Value range: (0,1). | 0.5 |
k | No | The length of the substring. This parameter is required only if you set the method parameter to ssk, cosine, or simhash_hamming_sim. Value range: (0,100). | 2 |
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | N/A |
coreNum | No | The number of cores that are used. | Specified by the system |
memSizePerCore | No | The memory size of each core. | Specified by the system |
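If you generate PAI commands from a script, a helper along the following lines may be useful. It is a minimal sketch, not part of PAI: the function name build_pai_command is hypothetical, quoting every value (including numbers) is an assumption based on the sample command, and the required-parameter list follows the Required column above together with the outputTableName parameter used in the sample command.

```python
# Methods for which k and lambda apply, as listed in the parameter table.
SUBSTRING_METHODS = {"ssk", "cosine", "simhash_hamming_sim"}
# Parameters marked "Yes" in the Required column (outputTableName as in the sample).
REQUIRED = ("inputTableName", "outputTableName", "mapTableName", "mapSelectedColName")


def build_pai_command(params: dict) -> str:
    """Render a string_similarity_topn command from a dict of -D parameters."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    if params.get("method") in SUBSTRING_METHODS:
        k = float(params.get("k", 2))             # documented default: 2
        lam = float(params.get("lambda", 0.5))    # documented default: 0.5
        if not 0 < k < 100:
            raise ValueError("k must be in the open range (0,100)")
        if not 0 < lam < 1:
            raise ValueError("lambda must be in the open range (0,1)")

    lines = ["PAI -name string_similarity_topn", "-project algo_public"]
    # ASSUMPTION: every value is double-quoted, as in the sample command.
    lines += [f'-D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"
```

Calling the helper with the five parameters of the sample command, in the same order, reproduces that command text. The range checks run only for the three methods to which k and lambda apply.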
Resource usage and cost estimates
The String Similarity - top N component has a time complexity of O(M × N), where M is the total number of data records and N is the number of data records for which you want to find the best matching strings. The component measures the similarity of samples by calculating the distance between sample data M × N times, so the amount of resources that it consumes is proportional to the product of M and N.
When you use the String Similarity - top N component, you can apply for up to 1,000 worker nodes, each with 4 GB to 64 GB of memory. The required number of worker nodes is calculated by using the following formula: M × N/(1024 × 1024 × 32). The memory of each worker node is calculated by using the following formula: N/8 MB. For example, if 1 CU provides 4 GB of memory, this component can consume up to 16,000 CUs, which is calculated by using the following formula: 1000 × 64/4. For more information, see Billing example of Designer (formerly known as Machine Learning Studio).
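The worker node formula and the CU upper bound can be checked with a few lines of arithmetic. The following sketch only restates the numbers in this section. The function names are hypothetical, and rounding the worker count up to a whole node with a minimum of one node is an assumption, because this topic does not say how fractional results are handled.

```python
import math

MAX_WORKERS = 1000        # upper limit on worker nodes stated in this section
MAX_MEMORY_GB = 64        # upper limit on memory per worker node


def estimate_workers(m: int, n: int) -> int:
    """Worker nodes from M x N / (1024 x 1024 x 32), capped at 1,000.

    ASSUMPTION: the result is rounded up and never drops below one node.
    """
    raw = m * n / (1024 * 1024 * 32)
    return min(MAX_WORKERS, max(1, math.ceil(raw)))


def max_cus(gb_per_cu: float = 4) -> float:
    """Upper bound on CUs: maximum workers x maximum memory / memory per CU."""
    return MAX_WORKERS * MAX_MEMORY_GB / gb_per_cu
```

With 4 GB per CU, max_cus reproduces the 1000 × 64/4 = 16,000 figure from the example. Because the worker estimate grows with the product of M and N, two tables of one million rows each already reach the 1,000-node limit.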
References
For more information about Machine Learning Designer, see Overview of Machine Learning Designer.
You can use the String Similarity component to calculate string similarity in industries such as information retrieval, natural language processing, and bioinformatics. For more information about how to use this component, see String Similarity.