You can use the LLM-N-Gram Repetition Filter (MaxCompute) component of Platform for AI (PAI) to preprocess the text data that is used to train large language models (LLMs). The component filters texts based on the repetition ratio of character-level or word-level N-Grams.
Limits
You can use the LLM-N-Gram Repetition Filter (MaxCompute) component based only on the resources of MaxCompute.
Algorithm description
The LLM-N-Gram Repetition Filter (MaxCompute) component moves a window of size N across a text to generate sequences of N characters or words. Each sequence is called an N-gram. The component calculates the frequency of each N-gram and then calculates the repetition ratio by using the following formula: Cumulative frequency of N-grams that occur more than once / Total frequency of all N-grams. This allows the component to filter texts based on the repetition ratio.
If the N-grams are sequences of words, the component converts all words to lowercase before calculating the repetition ratio.
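The calculation described above can be illustrated with a minimal Python sketch. It is not the component's actual implementation. The function names `ngram_repetition_ratio` and `keep_text`, the whitespace-based word splitting, and the `max_ratio` threshold are assumptions made for illustration only.

```python
from collections import Counter


def ngram_repetition_ratio(text: str, n: int, level: str = "char") -> float:
    """Return (cumulative frequency of N-grams that occur more than once)
    divided by (total frequency of all N-grams)."""
    if level == "word":
        # Word-level: convert to lowercase first, as the description states.
        # Splitting on whitespace is an assumption of this sketch.
        tokens = text.lower().split()
    else:
        # Character-level: every character is one token.
        tokens = list(text)

    # Slide a window of size n across the tokens to build the N-grams.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n tokens: no N-grams, so nothing is repeated.
        return 0.0

    counts = Counter(ngrams)
    # Sum the frequencies of N-grams that occur more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def keep_text(text: str, n: int, max_ratio: float, level: str = "char") -> bool:
    # Hypothetical filter rule: keep a text whose ratio does not exceed max_ratio.
    return ngram_repetition_ratio(text, n, level) <= max_ratio
```

For example, the text `abab` with N = 2 yields the character-level N-grams `ab`, `ba`, and `ab`. Only `ab` occurs more than once, with a frequency of 2, so the repetition ratio is 2/3.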
Configure the component
You can configure the parameters of the LLM-N-Gram Repetition Filter (MaxCompute) component in Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Required | Description | Default value
Fields Setting | Select Target Column | Yes | The columns that you want to process. You can select multiple columns. | N/A
Fields Setting | Whether to Filter with Character-level N-Gram Repetition Ratio | No | | N/A
Fields Setting | Whether to Filter with Word-level N-Gram Repetition Ratio | No | | N/A
Fields Setting | Output table lifecycle | No | The lifecycle of the output table. The value must be a positive integer. Unit: days. By default, the temporary table generated by this component is recycled after 28 days. | 28
Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: 50 to 800. | 100
Tuning | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. | 1024
Tuning | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. You can use this parameter to control the size of the input data. Unit: MB. Valid values: 1 to Integer.MAX_VALUE. | 256
References
For information about Machine Learning Designer components, see Overview of Machine Learning Designer.