Platform for AI: LLM-Text Quality Predict and Language Identification-FastText (MaxCompute)

Last Updated: Dec 18, 2024

The LLM-Text Quality Predict and Language Identification-FastText (MaxCompute) component of Platform for AI (PAI) identifies the language of each text sample, calculates a confidence score for the prediction, and filters samples based on the language and the score. You can use the component during text preprocessing for large language models (LLMs).

Limits

The LLM-Text Quality Predict and Language Identification-FastText (MaxCompute) component supports only MaxCompute resources.

Algorithm

The algorithm uses FastText to identify the language of each text and to calculate a confidence score for the prediction. The algorithm can identify 176 languages, which are represented by the following codes:

['af', 'als', 'am', 'an', 'ar', 'arz', 'as', 'ast', 'av', 'az', 'azb', 'ba', 'bar', 'bcl', 'be', 'bg', 'bh', 'bn', 'bo', 'bpy', 'br', 'bs', 'bxr', 'ca', 'cbk', 'ce', 'ceb', 'ckb', 'co', 'cs', 'cv', 'cy', 'da', 'de', 'diq', 'dsb', 'dty', 'dv', 'el', 'eml', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'frr', 'fy', 'ga', 'gd', 'gl', 'gn', 'gom', 'gu', 'gv', 'he', 'hi', 'hif', 'hr', 'hsb', 'ht', 'hu', 'hy', 'ia', 'id', 'ie', 'ilo', 'io', 'is', 'it', 'ja', 'jbo', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'krc', 'ku', 'kv', 'kw', 'ky', 'la', 'lb', 'lez', 'li', 'lmo', 'lo', 'lrc', 'lt', 'lv', 'mai', 'mg', 'mhr', 'min', 'mk', 'ml', 'mn', 'mr', 'mrj', 'ms', 'mt', 'mwl', 'my', 'myv', 'mzn', 'nah', 'nap', 'nds', 'ne', 'new', 'nl', 'nn', 'no', 'oc', 'or', 'os', 'pa', 'pam', 'pfl', 'pl', 'pms', 'pnb', 'ps', 'pt', 'qu', 'rm', 'ro', 'ru', 'rue', 'sa', 'sah', 'sc', 'scn', 'sco', 'sd', 'sh', 'si', 'sk', 'sl', 'so', 'sq', 'sr', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tk', 'tl', 'tr', 'tt', 'tyv', 'ug', 'uk', 'ur', 'uz', 'vec', 'vep', 'vi', 'vls', 'vo', 'wa', 'war', 'wuu', 'xal', 'xmf', 'yi', 'yo', 'yue', 'zh']
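The following is a minimal sketch of this kind of language identification, not the component's actual implementation. It assumes that the model returns fastText-style labels that carry the __label__ prefix together with a probability, as the open-source fasttext Python package does. The names identify_language and stub_predict are hypothetical and introduced only for illustration; the stub stands in for a real model so that the example runs without a model file.

```python
from typing import Callable, Sequence, Tuple

LABEL_PREFIX = "__label__"  # assumption: fastText-style label prefix

# A predictor takes one line of text and k, and returns (labels, scores).
Predictor = Callable[[str, int], Tuple[Sequence[str], Sequence[float]]]


def identify_language(text: str, predict: Predictor) -> Tuple[str, float]:
    """Return (language code, confidence score) for the top prediction."""
    # fastText-style predictors handle one line at a time, so flatten newlines.
    single_line = " ".join(text.split())
    labels, scores = predict(single_line, 1)
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, float(scores[0])


def stub_predict(text: str, k: int = 1) -> Tuple[Sequence[str], Sequence[float]]:
    """Hypothetical stand-in for a real model: text with CJK characters is 'zh'."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    if has_cjk:
        return ("__label__zh",), (0.95,)
    return ("__label__en",), (0.90,)


print(identify_language("Hello,\nworld", stub_predict))
print(identify_language("你好，世界", stub_predict))
```

With the open-source fasttext package, the predict method of a loaded model could be passed in place of stub_predict. Whether the component uses the same model and label format is an assumption.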

Configure the component

You can configure the parameters of the LLM-Text Quality Predict and Language Identification-FastText (MaxCompute) component in Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Select Target Column | Yes | The columns that you want to process. | No default value |
| Fields Setting | Whether to save the language id and score | No | Specifies whether to save the identified language code and the confidence score to the output table. If you select this check box, the system adds two columns to the output table to save the results. Otherwise, the results are not saved. Language id saved column name: the name of the column in which the language code is saved. Default value: language_id. Language score saved column name: the name of the column in which the confidence score is saved. Default value: language_score. | No default value |
| Fields Setting | SQL Script | No | A WHERE clause that filters samples based on the identification results. By default, the language code is stored in the language_id column and the confidence score in the language_score column. Example: where language_id = 'en' and language_score >= 0.8. If you save the results and change the column names, use the modified column names in the clause. | No default value |
| Fields Setting | Output table lifecycle | No | The lifecycle of the table. The value must be a positive integer. Unit: days. After the lifecycle elapses, the temporary tables generated by the component are recycled. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: 50 to 800. | 100 |
| Tuning | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. | 1024 |
| Tuning | The maximum size of input data for a map | No | The maximum amount of input data that each instance of a map task can process. You can use this parameter to control the input of a map. Unit: MB. Valid values: 1 to Integer.MAX_VALUE. | 256 |
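To illustrate how a filter such as where language_id = 'en' and language_score >= 0.8 behaves, the following sketch applies that clause to a few sample rows. It uses an in-memory SQLite table as a stand-in for a MaxCompute table, so SQL dialect details may differ. The sample rows, the samples table, the content column, and the filter_samples helper are all hypothetical; only the language_id and language_score column names and the clause itself come from the parameter description.

```python
import sqlite3

# Hypothetical sample rows: (content, language_id, language_score).
ROWS = [
    ("The quick brown fox.", "en", 0.97),
    ("Short txt", "en", 0.55),
    ("Bonjour tout le monde.", "fr", 0.99),
]

# The example filter from the parameter description.
WHERE_CLAUSE = "where language_id = 'en' and language_score >= 0.8"


def filter_samples(rows, where_clause):
    """Apply a WHERE clause to rows held in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE samples "
            "(content TEXT, language_id TEXT, language_score REAL)"
        )
        conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
        # The clause is interpolated as-is; acceptable only for trusted input.
        query = f"SELECT content FROM samples {where_clause}"
        return conn.execute(query).fetchall()
    finally:
        conn.close()


kept = filter_samples(ROWS, WHERE_CLAUSE)
print(kept)
```

Only the first row is kept: the second row is English but scores below 0.8, and the third row is not English. If the result columns are saved under different names, the clause would reference those names instead.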

References

For more information about Machine Learning Designer, see Overview of Machine Learning Designer.