String Similarity - top N

The String Similarity - top N component is used to calculate string similarity and obtain the top N data records that best match the mapping table. This topic describes how to configure the String Similarity - top N component in Platform for AI (PAI).

Configure the component

You can use one of the following methods to configure the String Similarity - top N component:

Method 1: Configure the component in the PAI console

You can configure the parameters of the String Similarity - top N component in Machine Learning Designer. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Columns from the Input Table Appended to the Output Table	The names of the columns that you want to append to the output table from the input table.
	Columns from the Mapping Table Appended to the Output Table	The names of the columns that you want to append to the output table from the mapping table.
	Columns from Left Table for Similarity Calculation	The names of the columns in the left table (the input table) that are used for similarity calculation.
	Columns from the Mapping Table for Similarity Calculation	The names of the mapping table columns that are used for similarity calculation. The similarities between the rows in the left table and all strings in the mapping table are calculated, and the top N results are returned.
	Similarity Column in Output Table	The name of the similarity column in the output table. The name can be up to 128 characters in length and can contain only letters, digits, and underscores (_). The name must start with a letter. Default value: output.
Parameters Setting	Number of Similarity Maximums in the End	The number of top N similarity values. The value must be a positive integer. Default value: 10.
	Similarity Calculation Methods	The method that is used for similarity calculation. Valid values: levenshtein_sim (default), lcs_sim, ssk, cosine, and simhash_hamming_sim.
	Length of Substring	This parameter is required only if you set the Similarity Calculation Methods parameter to ssk, cosine, or simhash_hamming_sim. Valid values: (0,100). Default value: 2.
	Weight of Matching String	This parameter is required only if you set the Similarity Calculation Methods parameter to ssk, cosine, or simhash_hamming_sim. Value range: (0,1). Default value: 0.5.
Tuning	Number of Computing Cores	The number of computing cores. By default, the system determines the value.
	Memory Size per Core (MB)	The memory size of each core, in MB. By default, the system determines the value.
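
To make the matching behavior described in the table concrete, the following Python sketch scores every input string against every mapping string and keeps the N best matches. It is a single-machine illustration only, not the component's implementation. The function names are hypothetical, and the normalization used for levenshtein_sim (1 minus the edit distance divided by the length of the longer string) is an assumption that this topic does not confirm.

```python
import heapq


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string in the outer loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def top_n_matches(input_rows, map_rows, n=10):
    """For each input string, return the n most similar mapping strings.

    The result is a list of (input_string, [(similarity, map_string), ...])
    with each inner list sorted from most to least similar.
    """
    results = []
    for text in input_rows:
        scored = [(levenshtein_sim(text, candidate), candidate)
                  for candidate in map_rows]
        best = heapq.nlargest(n, scored, key=lambda pair: pair[0])
        results.append((text, best))
    return results


if __name__ == "__main__":
    for text, best in top_n_matches(["apple"], ["apple", "apply", "angle", "zebra"], n=2):
        print(text, best)
```

Every input row is compared with every mapping row, which is why the cost of the component grows with the product of the two table sizes.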

Method 2: Configure the component by using PAI commands

You can use the SQL Script component to run PAI commands. For more information, see SQL Script. The following example shows a sample PAI command, and the table after it describes the parameters.

PAI -name string_similarity_topn
    -project algo_public
    -DinputTableName="pai_test_string_similarity_topn"
    -DoutputTableName="pai_test_string_similarity_topn_output"
    -DmapTableName="pai_test_string_similarity_map_topn"
    -DinputSelectedColName="col0"
    -DmapSelectedColName="col1";

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	N/A
mapTableName	Yes	The name of the mapping table.	N/A
outputTableName	Yes	The name of the output table.	N/A
inputSelectedColName	No	The names of the left table columns that are used for similarity calculation.	Name of the first STRING column in the left table
inputSelectedColName2	No	The names of the mapping table columns that are used for similarity calculation.	Name of the first STRING column in the mapping table
inputAppendColNames	No	The names of the columns that you want to append to the output table from the input table.	N/A
inputAppendRenameColNames	No	The aliases of the columns that you want to append to the output table from the input table.	N/A
mapSelectedColName	Yes	The names of the mapping table columns that are used for similarity calculation.	N/A
mapAppendColNames	No	The names of the columns that you want to append to the output table from the mapping table.	N/A
mapAppendRenameColNames	No	The aliases of the columns that you want to append to the output table from the mapping table.	N/A
inputTablePartitions	No	The names of the partitions in the input table.	All partitions
mapTablePartitions	No	The names of the partitions in the mapping table.	All partitions
outputColName	No	The name of the similarity column in the output table. The name can be up to 128 characters in length and can contain only letters, digits, and underscores (_). The name must start with a letter.	output
method	No	The method that is used for similarity calculation. Valid values: levenshtein_sim, lcs_sim, ssk, cosine, and simhash_hamming_sim.	levenshtein_sim
lambda	No	The weight of the matching string. This parameter is required only if you set the method parameter to ssk, cosine, or simhash_hamming_sim. Value range: (0,1).	0.5
k	No	The length of the substring. This parameter is required only if you set the method parameter to ssk, cosine, or simhash_hamming_sim. Value range: (0,100).	2
lifecycle	No	The lifecycle of the output table. The value must be a positive integer.	N/A
coreNum	No	The number of cores that are used.	Specified by the system
memSizePerCore	No	The memory size of each core.	Specified by the system
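
The value constraints in the preceding table can be checked before a command is submitted. The following Python sketch is a hypothetical pre-flight check, not part of PAI. The function name validate_params and the dictionary layout are assumptions made for illustration, and "letters" is assumed to mean ASCII letters.

```python
import re

VALID_METHODS = {"levenshtein_sim", "lcs_sim", "ssk", "cosine", "simhash_hamming_sim"}
# Methods for which the table says k and lambda apply.
METHODS_USING_K_AND_LAMBDA = {"ssk", "cosine", "simhash_hamming_sim"}
# Starts with a letter; letters, digits, and underscores only; at most 128 characters.
OUTPUT_COL_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,127}")


def validate_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    errors = []

    method = params.get("method", "levenshtein_sim")
    if method not in VALID_METHODS:
        errors.append(f"method: unsupported value {method!r}")

    output_col = params.get("outputColName", "output")
    if not OUTPUT_COL_PATTERN.fullmatch(output_col):
        errors.append("outputColName: must start with a letter, contain only "
                      "letters, digits, and underscores, and be at most 128 characters")

    # k and lambda are only checked for the methods that use them.
    if method in METHODS_USING_K_AND_LAMBDA:
        k = params.get("k", 2)
        if not 0 < k < 100:
            errors.append("k: must be in the open interval (0,100)")
        weight = params.get("lambda", 0.5)
        if not 0 < weight < 1:
            errors.append("lambda: must be in the open interval (0,1)")

    lifecycle = params.get("lifecycle")
    if lifecycle is not None and not (isinstance(lifecycle, int) and lifecycle > 0):
        errors.append("lifecycle: must be a positive integer")

    return errors


if __name__ == "__main__":
    print(validate_params({"method": "ssk", "k": 100, "outputColName": "1sim"}))
```

Returning a list of messages instead of raising on the first problem lets a caller report every invalid value at once.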

Resource usage and cost estimates

The String Similarity - top N component has a time complexity of O(M × N), where M is the total number of data records and N is the number of data records for which you want to find the best matching strings. The component measures the similarity of samples by calculating the distance between sample data M × N times, so the amount of resources that it consumes is proportional to the product of M and N.

To use the String Similarity - top N component, you can apply for up to 1,000 worker nodes, each with 4 GB to 64 GB of memory. The required number of worker nodes is calculated by using the following formula: M × N/(1024 × 1024 × 32). The memory of each worker node, in MB, is calculated by using the following formula: N/8. For example, if 1 CU provides 4 GB of memory, this component can consume up to 1000 × 64/4 = 16,000 CUs. For more information, see Billing example of Designer (formerly known as Machine Learning Studio).
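
The formulas above can be turned into a rough estimate. The following Python sketch is illustrative only. The function name estimate_resources is hypothetical, and rounding the worker count up, using at least one worker, and clamping memory to the 4 GB to 64 GB range are assumptions that this topic does not spell out. The ratio of 4 GB per CU is taken from the example above.

```python
import math

MAX_WORKERS = 1000          # upper limit on worker nodes
MIN_MEM_MB = 4 * 1024       # 4 GB per worker, lower limit
MAX_MEM_MB = 64 * 1024      # 64 GB per worker, upper limit
GB_PER_CU = 4               # example ratio: 1 CU provides 4 GB of memory


def estimate_resources(m: int, n: int) -> dict:
    """Estimate workers, memory per worker (MB), and CUs for given M and N."""
    # Workers: M x N / (1024 x 1024 x 32), rounded up (assumption) and capped.
    workers = math.ceil(m * n / (1024 * 1024 * 32))
    workers = max(1, min(workers, MAX_WORKERS))

    # Memory per worker: N / 8 MB, clamped to the allowed range (assumption).
    mem_mb = max(MIN_MEM_MB, min(n / 8, MAX_MEM_MB))

    # CUs: total memory in GB divided by the memory that one CU provides.
    cus = workers * (mem_mb / 1024) / GB_PER_CU
    return {"workers": workers, "mem_mb_per_worker": mem_mb, "cus": cus}


if __name__ == "__main__":
    print(estimate_resources(10_000_000, 100_000))
```

With both limits reached, the estimate is 1000 × 64/4 = 16,000 CUs, which matches the upper bound stated above.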

References

For more information about Machine Learning Designer, see Overview of Machine Learning Designer.
You can use the String Similarity component to calculate string similarity in industries such as information retrieval, natural language processing, and bioinformatics. For more information about how to use this component, see String Similarity.