Platform for AI: LLM-N-Gram Repetition Filter (MaxCompute)

Last Updated: May 31, 2024

You can use the LLM-N-Gram Repetition Filter (MaxCompute) component of Platform for AI (PAI) to preprocess the text data that is used to train large language models (LLMs). The component filters texts based on the repetition ratio of character-level or word-level N-Grams.

Limits

The LLM-N-Gram Repetition Filter (MaxCompute) component can run only on MaxCompute computing resources.

Algorithm description

The LLM-N-Gram Repetition Filter (MaxCompute) component slides a window of size N across a text to generate sequences of N consecutive characters or words. Each sequence is called an N-gram. The component counts how often each N-gram occurs and then calculates the repetition ratio by using the following formula:

Repetition ratio = Cumulative frequency of N-grams that occur more than once / Total frequency of all N-grams

The component then filters texts based on this repetition ratio.
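The formula can be illustrated with a minimal Python sketch at the character level. This is not the component's implementation. The function name `ngram_repetition_ratio` and the choice to return 0.0 for texts shorter than N are assumptions made for illustration only.

```python
from collections import Counter


def ngram_repetition_ratio(tokens, n):
    """Return the share of N-gram occurrences that belong to repeated N-grams."""
    # Slide a window of size n across the sequence to collect every N-gram.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Assumption: a text shorter than n has no N-grams, so the ratio is 0.0.
        return 0.0
    counts = Counter(ngrams)
    # Cumulative frequency of the N-grams that occur more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    # Total frequency of all N-grams is simply the number of windows.
    return repeated / len(ngrams)


# A string is a sequence of characters, so this is the character-level case.
# "abababcd" yields 7 bigrams: "ab" occurs 3 times, "ba" twice, "bc" and "cd" once.
ratio = ngram_repetition_ratio("abababcd", 2)
print(ratio)  # 5 of the 7 bigram occurrences belong to repeated bigrams
```

A text with no repeated N-grams scores 0.0, and a text made up entirely of repeated N-grams scores 1.0.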

If the N-grams are sequences of words, the component converts all words to lowercase before calculating the repetition ratio.
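The effect of lowercasing on the word-level ratio can be seen in the following minimal sketch. It only illustrates the description above under stated assumptions: the function name `word_ngram_repetition_ratio` is hypothetical, and dropping empty strings produced by consecutive separators is an assumption, not documented behavior of the component.

```python
from collections import Counter


def word_ngram_repetition_ratio(text, n, separator=" "):
    """Word-level repetition ratio, computed on lowercased words."""
    # Lowercase first so that "The" and "the" count as the same word.
    # Assumption: empty strings from consecutive separators are dropped.
    words = [w for w in text.lower().split(separator) if w]
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        # Assumption: fewer than n words means no N-grams and a ratio of 0.0.
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


# After lowercasing, the bigrams are: (the, cat), (cat, the), (the, cat),
# (cat, the), (the, dog). Four of the five occurrences are repeated bigrams.
print(word_ngram_repetition_ratio("The cat the CAT the dog", 2))
```

Without the lowercasing step, "The cat" and "the CAT" would be counted as different bigrams and the same text would score a lower ratio.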

Configure the component

You can configure the parameters of the LLM-N-Gram Repetition Filter (MaxCompute) component in Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Fields Setting | Select Target Column | Yes | The columns that you want to process. You can select multiple columns. | N/A |
| | Whether to Filter with Character-level N-Gram Repetition Ratio | No | • Length N: the length of the N-gram. • Minimum Ratio Value: Valid values: 0.0 to 1.0. Texts whose repetition ratio is smaller than this value are filtered out. • Maximum Ratio Value: Valid values: 0.0 to 1.0. Texts whose repetition ratio is greater than this value are filtered out. | N/A |
| | Whether to Filter with Word-level N-Gram Repetition Ratio | No | • Text Separator: the delimiter that is used to split the text into words. Default value: space (" "). Enclose the delimiter in double quotation marks (""). • Length N: the length of the N-gram. • Minimum Ratio Value: Valid values: 0.0 to 1.0. Texts whose repetition ratio is smaller than this value are filtered out. • Maximum Ratio Value: Valid values: 0.0 to 1.0. Texts whose repetition ratio is greater than this value are filtered out. | N/A |
| | Output table lifecycle | No | The value must be a positive integer. Unit: days. Default value: 28. By default, the temporary table generated by this component is recycled after 28 days. | 28 |
| Tuning | Number of CPUs per instance of map task | No | The number of CPUs for each instance of a map task. Valid values: 50 to 800. | 100 |
| | The memory size per instance of map task | No | The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288. | 1024 |
| | The maximum size of input data for a map | No | The maximum amount of data that each instance of a map task can process. You can use this parameter to control the size of the input data. Unit: MB. Valid values: 1 to Integer.MAX_VALUE. | 256 |
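The way Length N, Minimum Ratio Value, and Maximum Ratio Value interact can be sketched as follows for the character-level filter. This is an illustration only, not the component's code. The function names `repetition_ratio` and `keep_text` and the sample texts are hypothetical. Texts whose ratio equals a threshold are kept, because the descriptions above filter out only texts that are smaller than the minimum or greater than the maximum.

```python
from collections import Counter


def repetition_ratio(tokens, n):
    """Share of N-gram occurrences that belong to N-grams seen more than once."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Assumption: texts shorter than n have a ratio of 0.0.
        return 0.0
    counts = Counter(ngrams)
    return sum(c for c in counts.values() if c > 1) / len(ngrams)


def keep_text(text, n, min_ratio, max_ratio):
    """Keep a text only if min_ratio <= repetition ratio <= max_ratio."""
    ratio = repetition_ratio(text, n)
    return min_ratio <= ratio <= max_ratio


# Hypothetical sample texts with character-level bigram ratios of
# 0.0, 2/7 (about 0.29), and 1.0, in that order.
texts = ["abcdefgh", "abcabxyz", "abababab"]
kept = [t for t in texts if keep_text(t, n=2, min_ratio=0.0, max_ratio=0.5)]
print(kept)  # the fully repetitive text "abababab" is filtered out
```

Lowering the maximum tightens the filter against repetitive texts, while raising the minimum above 0.0 also removes texts that contain no repeated N-grams.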

References

For information about Machine Learning Designer components, see Overview of Machine Learning Designer.