
Platform For AI: LLM-N-Gram Repetition Filter (MaxCompute)

Last Updated: May 31, 2024

You can use the LLM-N-Gram Repetition Filter (MaxCompute) component of Platform for AI (PAI) to preprocess the text data that is used to train large language models (LLMs). The component filters texts based on the repetition ratio of character-level or word-level N-Grams.


This component runs only on MaxCompute computing resources.

Algorithm description

The LLM-N-Gram Repetition Filter (MaxCompute) component slides a window of size N across a text to generate sequences of N consecutive characters or words. Each sequence is called an N-gram. The component counts how often each N-gram occurs and then calculates the repetition ratio by using the following formula: Repetition ratio = Cumulative frequency of N-grams that occur more than once / Total frequency of all N-grams. The component then filters texts based on this repetition ratio.

If the N-grams are sequences of words, the component converts all words to lowercase before calculating the repetition ratio.
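The formula above can be illustrated with a minimal Python sketch. This is not the component's actual implementation. The function names are hypothetical, and returning 0.0 for texts shorter than N is an assumption that the source does not specify.

```python
from collections import Counter


def ngram_repetition_ratio(tokens, n):
    """Return the share of N-gram occurrences that belong to an N-gram seen more than once."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Slide a window of size n across the token sequence.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Assumption: a text shorter than n has no N-grams and gets a ratio of 0.0.
        return 0.0
    counts = Counter(ngrams)
    # Cumulative frequency of N-grams that occur more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    # Total frequency of all N-grams equals the number of windows.
    return repeated / len(ngrams)


def char_ngram_repetition_ratio(text, n):
    """Character-level ratio: every character is one token."""
    return ngram_repetition_ratio(list(text), n)


def word_ngram_repetition_ratio(text, n, separator=" "):
    """Word-level ratio: split on the separator and lowercase the words first."""
    words = [w.lower() for w in text.split(separator) if w]
    return ngram_repetition_ratio(words, n)


if __name__ == "__main__":
    # "abab" yields the bigrams ab, ba, ab; two of the three occurrences are repeats.
    print(char_ngram_repetition_ratio("abab", 2))
    # Lowercasing makes "The cat" and "the cat" count as the same bigram.
    print(word_ngram_repetition_ratio("The cat the cat", 2))
```

In this sketch, every occurrence of a repeated N-gram counts toward the numerator, including the first one, which is a literal reading of "cumulative frequency of N-grams that occur more than once".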

Configure the component

You can configure the parameters of the LLM-N-Gram Repetition Filter (MaxCompute) component in Machine Learning Designer. The following table describes the parameters.






Fields Setting

Select Target Column


The columns that you want to process. You can select multiple columns.


Whether to Filter with Character-level N-Gram Repetition Ratio


  • Length N: the number of characters in each N-gram.

  • Minimum Ratio Value: Valid values: 0.0 to 1.0. Texts whose repetition ratio is smaller than this value are filtered out.

  • Maximum Ratio Value: Valid values: 0.0 to 1.0. Texts whose repetition ratio is greater than this value are filtered out.


Whether to Filter with Word-level N-Gram Repetition Ratio


  • Text Separator: the delimiter that is used to split the text into words. Default value: space (" "). Enclose the delimiter in double quotation marks ("").

  • Length N: the number of words in each N-gram.

  • Minimum Ratio Value: Valid values: 0.0 to 1.0. Texts whose repetition ratio is smaller than this value are filtered out.

  • Maximum Ratio Value: Valid values: 0.0 to 1.0. Texts whose repetition ratio is greater than this value are filtered out.


Output table lifecycle


The lifecycle of the output table. The value must be a positive integer. Unit: days. Default value: 28, which means that the temporary table generated by this component is recycled after 28 days.



Number of CPUs per instance of map task


The number of CPUs for each instance of a map task. Valid values: 50 to 800.


The memory size per instance of map task


The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288.


The maximum size of input data for a map


The maximum amount of data that each instance of a map task can process. You can use this parameter to control the size of the input data. Unit: MB. Valid values: 1 to Integer.MAX_VALUE (2,147,483,647).
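The way the Minimum Ratio Value and Maximum Ratio Value parameters act on a text can be sketched in Python as follows. This is an illustration under stated assumptions, not the component's code. The names `repetition_ratio` and `keep_text` are hypothetical, a ratio equal to a bound is assumed to be kept, and passing `separator=None` to select character-level N-grams is a convention of this sketch only.

```python
from collections import Counter


def repetition_ratio(tokens, n):
    """Frequency of N-grams that occur more than once, divided by the total N-gram frequency."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # Assumption: texts shorter than n get a ratio of 0.0.
    counts = Counter(grams)
    return sum(c for c in counts.values() if c > 1) / len(grams)


def keep_text(text, n, min_ratio, max_ratio, separator=None):
    """Return True if the text passes the filter, False if it is filtered out."""
    if separator is None:
        # Character-level N-grams: every character is a token.
        tokens = list(text)
    else:
        # Word-level N-grams: split on the separator and lowercase each word.
        tokens = [w.lower() for w in text.split(separator) if w]
    ratio = repetition_ratio(tokens, n)
    # Texts below the minimum or above the maximum are filtered out.
    return min_ratio <= ratio <= max_ratio


rows = [
    "buy now buy now buy now buy now",
    "the quick brown fox jumps over the lazy dog",
]
# Word-level bigrams, keeping texts whose repetition ratio is at most 0.5.
kept = [t for t in rows if keep_text(t, n=2, min_ratio=0.0, max_ratio=0.5, separator=" ")]
print(kept)
```

With these settings, the first row is dropped because every bigram in it is repeated, while the second row has no repeated bigram and is kept.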



For information about Machine Learning Designer components, see Overview of Machine Learning Designer.