Description of the LLM-N-Gram Repetition Filter (MaxCompute) component - Platform For AI

You can use the LLM-N-Gram Repetition Filter (MaxCompute) component of Platform for AI (PAI) to preprocess the text data that is used to train large language models (LLMs). The component filters texts based on the repetition ratio of character-level or word-level N-grams.

Limits

The LLM-N-Gram Repetition Filter (MaxCompute) component runs only on MaxCompute computing resources.

Algorithm description

The LLM-N-Gram Repetition Filter (MaxCompute) component slides a window of size N across a text to generate contiguous sequences of N characters or N words. Each sequence is called an N-gram. The component counts how many times each N-gram occurs and then calculates the repetition ratio by using the following formula: Repetition ratio = Cumulative frequency of N-grams that occur more than once / Total frequency of all N-grams. The component then filters texts based on the repetition ratio.

If the N-grams are sequences of words, the component converts all words to lowercase before calculating the repetition ratio.
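The calculation described above can be sketched in Python as follows. This is a minimal illustration, not the component's actual implementation. The function names `ngrams` and `repetition_ratio`, the `level` and `separator` arguments, the choice to drop empty tokens after splitting, and the choice to return 0.0 for texts shorter than N are all assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous windows of n tokens (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_ratio(text, n, level="char", separator=" "):
    """Cumulative frequency of N-grams that occur more than once,
    divided by the total frequency of all N-grams."""
    if level == "word":
        # Word-level: split on the separator and lowercase every word.
        # Dropping empty tokens is an assumption of this sketch.
        tokens = [w.lower() for w in text.split(separator) if w]
    else:
        # Character-level: every character is a token.
        tokens = list(text)

    grams = ngrams(tokens, n)
    if not grams:
        # Assumption: a text shorter than N has a ratio of 0.0.
        return 0.0

    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


# "abab" with N = 2 yields "ab", "ba", "ab": "ab" occurs twice, so 2 / 3.
print(repetition_ratio("abab", 2))
# Word-level, N = 1: "The" and "the" are counted as the same word.
print(repetition_ratio("The cat the cat", 1, level="word"))
```

Note that every occurrence of a repeated N-gram counts toward the numerator, including the first one, which is how the formula above reads.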

Configure the component

You can configure the parameters of the LLM-N-Gram Repetition Filter (MaxCompute) component in Machine Learning Designer. The following table describes the parameters.

Tab	Parameter	Required	Description	Default value
Fields Setting	Select Target Column	Yes	The columns that you want to process. You can select multiple columns.	N/A
	Whether to Filter with Character-level N-Gram Repetition Ratio	No	Length N: the length of the N-gram, in characters. Minimum Ratio Value: the lower limit of the repetition ratio. Valid values: 0.0 to 1.0. Texts whose repetition ratio is smaller than this value are filtered out. Maximum Ratio Value: the upper limit of the repetition ratio. Valid values: 0.0 to 1.0. Texts whose repetition ratio is greater than this value are filtered out.	N/A
	Whether to Filter with Word-level N-Gram Repetition Ratio	No	Text Separator: the delimiter that is used to split the text into words. Default value: space (" "). Enclose the delimiter in double quotation marks (""). Length N: the length of the N-gram, in words. Minimum Ratio Value: the lower limit of the repetition ratio. Valid values: 0.0 to 1.0. Texts whose repetition ratio is smaller than this value are filtered out. Maximum Ratio Value: the upper limit of the repetition ratio. Valid values: 0.0 to 1.0. Texts whose repetition ratio is greater than this value are filtered out.	N/A
	Output table lifecycle	No	The value must be a positive integer. Unit: days. Default value: 28. The temporary table generated by this component is recycled after 28 days.	28
Tuning	Number of CPUs per instance of map task	No	The number of CPUs for each instance of a map task. Valid values: 50 to 800.	100
	The memory size per instance of map task	No	The memory size of each instance of a map task. Unit: MB. Valid values: 256 to 12288.	1024
	The maximum size of input data for a map	No	The maximum amount of data that each instance of a map task can process. You can use this parameter to control the size of the input data. Unit: MB. Valid values: 1 to Integer.MAX_VALUE.	256
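The effect of Minimum Ratio Value and Maximum Ratio Value in the table above can be sketched as a simple predicate. This is only an illustration of the documented threshold semantics. The function name `keep_text`, its default bounds of 0.0 and 1.0, and the check that the minimum does not exceed the maximum are assumptions of this example, not documented behavior of the component.

```python
def keep_text(ratio, min_ratio=0.0, max_ratio=1.0):
    """Return True if a text with the given repetition ratio is kept.

    Texts whose ratio is smaller than min_ratio or greater than
    max_ratio are filtered out. A ratio equal to either bound is kept.
    """
    # Assumption: both bounds lie in [0.0, 1.0] and min_ratio <= max_ratio.
    if not (0.0 <= min_ratio <= max_ratio <= 1.0):
        raise ValueError("expected 0.0 <= min_ratio <= max_ratio <= 1.0")
    return min_ratio <= ratio <= max_ratio


# With a maximum of 0.3, a highly repetitive text (ratio 0.5) is dropped.
print(keep_text(0.5, min_ratio=0.0, max_ratio=0.3))  # False
print(keep_text(0.2, min_ratio=0.1, max_ratio=0.3))  # True
```

In practice, a low Maximum Ratio Value removes texts dominated by repeated fragments, while a nonzero Minimum Ratio Value removes texts with almost no repetition.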

References

For information about Machine Learning Designer components, see Overview of Machine Learning Designer.