Machine Learning Designer of Platform for AI (PAI) provides various data processing components to help you edit, convert, filter, and deduplicate data. You can combine different components to select high-quality data and generate text samples that meet your requirements. You can then use the processed data to train large language models (LLMs). This topic describes how to use the data processing components of PAI to clean and process the thesis data obtained from the arXiv repository. The example uses a small amount of arXiv data extracted from the open source RedPajama dataset.
Dataset
Machine Learning Designer provides a preset template for processing thesis data from the arXiv repository. The template uses 5,000 sample data records extracted from the open source RedPajama dataset.
Create and run a pipeline
Go to the Visualized Modeling (Designer) page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
In the workspace, choose Model Training > Visualized Modeling (Designer) in the left-side navigation pane.
Create a pipeline.
On the Visualized Modeling (Designer) page, click the Preset Templates tab, select Business Area from the drop-down list, and then click the Large Language Model (LLM) tab. Find the LLM Data Processing-arXiv (Thesis Data) template and click Create.
Configure the pipeline parameters and click OK. You can retain the default values.
In the pipeline list, click the pipeline that you created and then click Open.
Configure the pipeline.
The pipeline contains the following key components:
LLM-Sensitive Content Mask (MaxCompute)-1
Masks the sensitive content in the text field. Examples:
Replaces an email address with [EMAIL].
Replaces a phone number with [TELEPHONE] or [MOBILEPHONE].
Replaces an ID card number with IDNUM.
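The masking behavior can be sketched roughly as follows. This is a minimal illustration that assumes simple regular expressions; the patterns and the function name `mask_sensitive` are hypothetical and are not the rules that the component itself applies.

```python
import re

# Illustrative patterns only; the component's actual matching rules may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: a mobile number is 11 digits, starts with 1, and is not part of a longer digit run.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def mask_sensitive(text: str) -> str:
    """Replace email addresses and mobile numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = MOBILE_RE.sub("[MOBILEPHONE]", text)
    return text


print(mask_sensitive("Contact alice@example.com or 13812345678."))
```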
LLM-Clean Special Content (MaxCompute)-1
Deletes URLs from the text field.
LLM-Text Normalizer (MaxCompute)-1
Normalizes text samples in the text field to the Unicode format and converts Chinese text from traditional to simplified characters.
LLM-Count Filter (MaxCompute)-1
Deletes text samples in the text field whose number or ratio of alphanumeric characters does not meet the requirement. Most of the characters in an arXiv dataset are letters and digits, so this component helps remove samples that consist mostly of other characters.
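A ratio-based filter of this kind might look like the sketch below. The function names and the `min_ratio` value of 0.6 are assumptions chosen for illustration, not the component's defaults. Note that Python's `str.isalnum()` also counts non-ASCII letters and digits, which may differ from how the component defines alphanumeric characters.

```python
def alnum_ratio(text: str) -> float:
    """Return the fraction of characters that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_by_alnum_ratio(samples, min_ratio=0.6):
    """Keep samples whose alphanumeric ratio reaches the hypothetical threshold."""
    return [s for s in samples if alnum_ratio(s) >= min_ratio]


print(keep_by_alnum_ratio(["hello world", "$$$ ### !!!"]))
```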
LLM-Length Filter (MaxCompute)-1
Filters text samples in the text field based on the average length of lines. Text is split into lines by line breaks (\n). The component calculates the average length of lines and filters text samples based on the specified threshold.
LLM-N-Gram Repetition Filter (MaxCompute)-1
Filters text samples in the text field based on the character-level N-gram repetition rate. The component moves an N-character window across the text to generate contiguous sequences of N characters. Each sequence is called an N-gram. The component counts the occurrences of each N-gram and calculates the repetition rate by using the following formula:
Repetition rate = Total frequencies of N-grams that occur more than once / Total frequencies of all N-grams
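The formula can be sketched as follows. The function name `ngram_repetition_rate` is hypothetical, and returning 0.0 for text shorter than N is an assumption; the component may handle that case differently.

```python
from collections import Counter


def ngram_repetition_rate(seq, n):
    """Share of N-gram occurrences that belong to N-grams seen more than once."""
    total = len(seq) - n + 1  # number of window positions
    if total <= 0:
        return 0.0  # assumption: too-short input has no repetition
    counts = Counter(tuple(seq[i:i + n]) for i in range(total))
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


print(ngram_repetition_rate("aaab", 2))
```

In "aaab" with N = 2, the N-grams are "aa", "aa", and "ab". Only "aa" occurs more than once, so the rate is 2/3. Because the function accepts any sequence, passing a list of words instead of a string gives a word-level rate.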
LLM-Sensitive Keywords Filter (MaxCompute)-1
Filters text samples in the text field that contain sensitive keywords specified in the Preset sensitive keywords file.
LLM-Length Filter (MaxCompute)-2
Filters text samples based on the length of the longest line. Text is split into lines by line breaks (\n). The component calculates the length of the longest line and filters text samples based on the specified threshold.
LLM-Perplexity Filter (MaxCompute)-1
Calculates the perplexity of text and filters text samples based on the specified perplexity threshold.
LLM-Special Characters Ratio Filter (MaxCompute)-1
Deletes text samples in the text field whose ratio of special characters does not meet the requirement.
LLM-Length Filter (MaxCompute)-3
Filters text samples based on the text length.
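The three length measures used by the Length Filter components above (average line length, longest line, and total text length) could be computed as in this sketch. The function names are hypothetical. Counting empty lines toward the average and counting line breaks toward the text length are assumptions, and the thresholds are whatever you configure on each component.

```python
def length_metrics(text: str) -> dict:
    """Compute line-based and overall length measures for one text sample."""
    line_lengths = [len(line) for line in text.split("\n")]
    return {
        "avg_line_length": sum(line_lengths) / len(line_lengths),
        "max_line_length": max(line_lengths),
        "text_length": len(text),  # includes the line break characters
    }


def in_range(value, low, high):
    """Return True if a metric lies inside a hypothetical [low, high] threshold range."""
    return low <= value <= high


metrics = length_metrics("abcd\nab")
print(metrics, in_range(metrics["max_line_length"], 1, 100))
```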
LLM-Tokenization (MaxCompute)-1
Breaks text into tokens and saves the results to a new column.
LLM-Length Filter (MaxCompute)-4
Splits text into a list of words by using the space (" ") as the separator and filters text samples based on the number of words.
LLM-N-Gram Repetition Filter (MaxCompute)-2
Filters text samples in the text field based on the word-level N-gram repetition rate. The component converts all words to lowercase and moves an N-word window across the text to generate contiguous sequences of N words. Each sequence is called an N-gram. The component counts the occurrences of each N-gram and calculates the repetition rate by using the following formula:
Repetition rate = Total frequencies of N-grams that occur more than once / Total frequencies of all N-grams
LLM-Document Deduplicator (MaxCompute)-1
Deduplicates text samples based on the specified threshold of the Jaccard similarity index and the Levenshtein distance.
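For reference, the two similarity measures named for the deduplicator can be sketched as below. This only illustrates the measures themselves. How the component tokenizes text, combines the two measures, and scales to large tables is not described here, so the word-set tokenization and the function names are assumptions.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index of the two texts' lowercase word sets (assumed tokenization)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(jaccard_similarity("the cat sat", "the cat ran"))
print(levenshtein_distance("kitten", "sitting"))
```

A higher Jaccard index or a lower Levenshtein distance indicates that two samples are more likely to be duplicates.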
Run the pipeline.
After the pipeline finishes running, right-click the Write Table-1 component and choose View Data > Output to view the processed dataset.
References
For more information about LLM data processing components, see LLM data processing (MaxCompute).
PAI provides a series of components for data processing, model training, and model inference. After data processing is complete, you can use the components to implement the end-to-end process from LLM development to application. For more information, see E2E Development and Usage of LLM: Data Processing + Model Training + Model Inference.