Data Processing for LLM (Github Code) - Platform For AI - Alibaba Cloud Documentation Center

Machine Learning Designer of Platform for AI (PAI) provides various data processing components to help you edit, convert, filter, identify, and deduplicate data. You can combine different components to filter high-quality data and generate text samples that meet your business requirements. You can use the processed data to train large language models (LLMs). This topic describes how to use the LLM data processing components provided by PAI to cleanse and process GitHub code data. In this topic, a small amount of data obtained from the open source RedPajama-Data project is used.

Dataset

In this topic, 5,000 samples that are extracted from the raw data of the open source RedPajama-Data project are used as a dataset in the preset template Data Processing for LLM (Github Code) of Machine Learning Designer.

Create and run a pipeline

Go to the Visualized Modeling (Designer) page.
1. Log on to the PAI console.
2. In the upper-left corner, select a region based on your business requirements.
3. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
4. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer).
Create a pipeline.
1. On the Preset Templates tab, choose Business Area > LLM. In the Data Processing for LLM (Github Code) section, click Create.
2. In the Create Pipeline dialog box, configure the pipeline parameters and click OK. You can retain the default values.
3. In the pipeline list, find the pipeline that you created and click Open.

Configure the pipeline.

The pipeline contains the following key components:

LLM-Sensitive Content Mask (DLC)-1
Masks sensitive information in text samples in the content field. Examples:
- Replace an email address with [EMAIL].
- Replace a mobile phone number with [TELEPHONE] or [MOBILEPHONE].
- Replace an ID card number with IDNUM.
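The masking behavior described above can be sketched with regular expressions. This is only an illustration: the component's actual rules are not documented in this topic, so the three patterns below (a generic email pattern, an 11-digit mobile number starting with 1, and an 18-character ID card number) and the function name mask_sensitive are assumptions.

```python
import re

# Hypothetical patterns for illustration; the real component may use different rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: an 11-digit mobile number that starts with 1 and is not part of a longer digit run.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
# Assumption: an 18-character ID card number (17 digits followed by a digit or X).
IDNUM_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def mask_sensitive(text: str) -> str:
    """Replace email addresses, ID card numbers, and mobile numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = IDNUM_RE.sub("IDNUM", text)
    text = MOBILE_RE.sub("[MOBILEPHONE]", text)
    return text
```

For example, mask_sensitive("mail bob@example.com") returns "mail [EMAIL]".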
LLM-Clean Special Content (DLC)-1
Removes URLs from text samples in the content field.
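A minimal sketch of URL removal, assuming that a URL is any run of non-whitespace characters that starts with http://, https://, or ftp://. The pattern and the function name remove_urls are illustrative and are not taken from the component.

```python
import re

# Assumption: a URL starts with a scheme and ends at whitespace, a quote, or a bracket.
URL_RE = re.compile(r"(?:https?|ftp)://[^\s<>\"')]+")


def remove_urls(text: str) -> str:
    """Delete every substring that matches the URL pattern."""
    return URL_RE.sub("", text)
```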
LLM-Text Normalizer (DLC)-1
Applies Unicode normalization to text samples in the content field so that equivalent characters use a consistent representation.
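Unicode normalization can be illustrated with the Python standard library. The normalization form that the component uses is not stated in this topic, so NFKC below is an assumption, and normalize_text is a hypothetical helper name.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Return the text in the given Unicode normalization form (NFKC is an assumed default)."""
    return unicodedata.normalize(form, text)
```

With NFKC, compatibility characters such as full-width letters and ligatures are mapped to their plain equivalents, and combining sequences are composed where possible.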
LLM-Clean Copyright Information (DLC)-1
Deletes copyright information from text samples in the content field.
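One way to approximate copyright removal is to drop a leading comment header that mentions the word "copyright". The sketch below makes that assumption and handles only two header styles (a leading /* ... */ block and a leading run of # or // line comments). The function name clean_copyright is hypothetical, and the component itself may work differently.

```python
import re

BLOCK_COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT_PREFIXES = ("#", "//")


def clean_copyright(text: str) -> str:
    """Remove a leading comment header if it mentions copyright; otherwise return the text unchanged."""
    stripped = text.lstrip()
    # Case 1: the file starts with a /* ... */ block comment.
    match = BLOCK_COMMENT_RE.match(stripped)
    if match:
        if "copyright" in match.group(0).lower():
            return stripped[match.end():].lstrip("\n")
        return text
    # Case 2: the file starts with a run of line comments.
    lines = text.split("\n")
    end = 0
    while end < len(lines) and lines[end].lstrip().startswith(LINE_COMMENT_PREFIXES):
        end += 1
    header = "\n".join(lines[:end])
    if "copyright" in header.lower():
        return "\n".join(lines[end:]).lstrip("\n")
    return text
```

Note that this sketch also drops a shebang line if it is part of a header that mentions copyright, which may not be desirable in practice.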
LLM-Count Filter (DLC)-1
Filters text samples in the content field based on the proportion of digits and letters and on the ratio of letters to text tokens. Because most characters in a GitHub code dataset are letters and digits, this component helps remove samples whose character composition is atypical for code.
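The two statistics can be sketched as follows. The tokenization (whitespace splitting), the threshold values, and the names alnum_ratio, alpha_token_ratio, and keep_sample are assumptions made for illustration. The component's own parameters and defaults are described in its reference documentation.

```python
def alnum_ratio(text: str) -> float:
    """Proportion of characters that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def alpha_token_ratio(text: str) -> float:
    """Number of letters divided by the number of whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(tokens)


def keep_sample(text: str,
                min_alnum_ratio: float = 0.25,        # hypothetical threshold
                min_alpha_token_ratio: float = 1.5    # hypothetical threshold
                ) -> bool:
    """Keep a sample only if both ratios reach their thresholds."""
    return (alnum_ratio(text) >= min_alnum_ratio
            and alpha_token_ratio(text) >= min_alpha_token_ratio)
```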
LLM-Length Filter (DLC)-1
Filters text samples in the content field based on the total text length, the average line length, and the maximum line length. The component splits text into lines at line breaks (\n), calculates the average and maximum line lengths, and filters text samples based on the specified thresholds.
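The length statistics can be computed as in the following sketch. Lengths are measured in characters here, which is an assumption, and the function names and threshold parameters are hypothetical.

```python
def line_stats(text: str) -> tuple[int, float, int]:
    """Return (total length, average line length, maximum line length) in characters."""
    lengths = [len(line) for line in text.split("\n")]
    return len(text), sum(lengths) / len(lengths), max(lengths)


def keep_by_length(text: str, min_len: int, max_len: int,
                   max_avg_line_len: float, max_line_len: int) -> bool:
    """Keep a sample only if all three statistics fall within the given thresholds."""
    total, avg_line, longest = line_stats(text)
    return (min_len <= total <= max_len
            and avg_line <= max_avg_line_len
            and longest <= max_line_len)
```

For the two-line sample "ab\ncdef", line_stats returns a total length of 7, an average line length of 3.0, and a maximum line length of 4.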
LLM-N-Gram Repetition Filter (DLC)-1
Filters text samples in the content field based on the repetition ratio of character-level or word-level N-grams. For word-level N-grams, the component converts all words to lowercase before calculating the repetition ratio. The component slides a window of size N across a text sample to generate sequences of N consecutive characters or words. Each sequence is called an N-gram. The component counts the occurrences of each N-gram and calculates the repetition ratio by using the following formula: Repetition ratio = (Sum of the frequencies of N-grams that occur more than once) / (Sum of the frequencies of all N-grams).
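The repetition ratio can be reproduced with a short function. This is a sketch of the formula as described in this topic, not the component's code. The function name, the level argument, and the use of whitespace splitting for word-level N-grams are assumptions.

```python
from collections import Counter


def ngram_repetition_ratio(text: str, n: int, level: str = "char") -> float:
    """Sum of frequencies of N-grams seen more than once, divided by the total number of N-grams."""
    # Word-level N-grams are lowercased first; character-level N-grams use the text as is.
    units = text.lower().split() if level == "word" else list(text)
    ngrams = [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)
```

For the text "abab" with N = 2, the N-grams are "ab", "ba", and "ab". Only "ab" occurs more than once (twice), so the ratio is 2/3.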
LLM-Length Filter (DLC)-2
Filters text samples in the content field based on the text length.
LLM-Document Deduplicator (DLC)-1
Deduplicates text samples based on the values of the window_size, num_blocks, and hamming_distance parameters.
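The parameter names suggest a SimHash-style near-duplicate check, but this topic does not state the algorithm, so the following sketch rests on that assumption. It builds a 64-bit fingerprint from character shingles of window_size characters and treats two samples as duplicates when their fingerprints differ in at most hamming_distance bits. The num_blocks parameter is omitted: in a blockwise index it would typically control how a fingerprint is split into blocks to find candidate pairs quickly, whereas this sketch compares every pair directly. All function names are hypothetical.

```python
import hashlib


def simhash(text: str, window_size: int = 6, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from overlapping character shingles."""
    shingles = [text[i:i + window_size]
                for i in range(max(len(text) - window_size + 1, 1))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(samples: list[str], window_size: int = 6,
                hamming_distance: int = 4) -> list[str]:
    """Keep the first sample of each group of near-duplicates (quadratic comparison)."""
    kept, fingerprints = [], []
    for sample in samples:
        fingerprint = simhash(sample, window_size)
        if all(hamming(fingerprint, other) > hamming_distance for other in fingerprints):
            kept.append(sample)
            fingerprints.append(fingerprint)
    return kept
```

The pairwise comparison is adequate for a 5,000-sample dataset but does not scale to large corpora, which is where a blockwise index becomes useful.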

Run the pipeline.
After you run the pipeline, right-click the LLM-Document Deduplicator (DLC)-1 component and choose View Data > Output Table to view the sample files that are processed by all the preceding components.

References

For more information about LLM components, see LLM Data Processing (DLC).
PAI provides a series of components for data processing, model training, and model inference. After data processing is complete, you can use the components to implement the end-to-end process from LLM development to application. For more information, see E2E Development and Usage of LLM: Data Processing + Model Training + Model Inference.