This topic describes how to use the Large Language Model (LLM) data processing components provided by Platform for AI (PAI) to clean and process GitHub code data. In this topic, the components are used to process a small amount of GitHub code data from the open source project RedPajama.
Prerequisites
A workspace is created. For more information, see Create a workspace.
MaxCompute resources are associated with the workspace. For more information, see Manage workspaces.
Dataset
In this topic, 5,000 sample data records are extracted from raw GitHub data of the open source project RedPajama.
To improve data quality and the effectiveness of model training, you can clean and process the data by performing the steps described in the Procedure section.
Procedure
Go to the Machine Learning Designer page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane of the workspace page, open the Machine Learning Designer page.
Create a pipeline.
On the Visualized Modeling (Designer) page, click the Preset Templates tab.
On the Preset Templates tab, click the Large Language Model (LLM) tab. In the Data Processing for LLM (Github Code) card of the Large Language Model (LLM) tab, click Create.
In the Create Pipeline dialog box, configure the parameters and click OK. You can use the default values of the parameters.
Set the Pipeline Data Path (Recommended) parameter to the path of an Object Storage Service (OSS) bucket. The path is used to store data generated during the runtime of the pipeline.
In the pipeline list, double-click the pipeline that you created to open it.
View the components of the pipeline on the canvas. The system automatically creates the pipeline based on the preset template. The following list describes each component.
LLM-Sensitive Content Mask-1
Masks sensitive information in the content field. Examples:
Replaces email addresses with [EMAIL].
Replaces phone numbers with [TELEPHONE] or [MOBILEPHONE].
Replaces ID card numbers with IDNUM.
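As a rough illustration of this kind of placeholder-based masking, the following Python sketch uses regular expressions. The patterns and the choice of phone placeholder are simplified assumptions and do not reflect the component's actual matching rules.

```python
import re

# Simplified patterns for illustration only; the component's actual rules
# for emails, phone numbers, and ID card numbers are not shown here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-\s]?\d{4}[-\s]?\d{4}\b")
IDNUM_RE = re.compile(r"\b\d{17}[\dXx]\b")

def mask_sensitive(text: str) -> str:
    """Replace sensitive substrings in the content field with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[TELEPHONE]", text)
    text = IDNUM_RE.sub("IDNUM", text)
    return text

print(mask_sensitive("Contact dev@example.com or 138-1234-5678"))
# -> Contact [EMAIL] or [TELEPHONE]
```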
LLM-Clean Special Content-1
Deletes URLs from the content field.
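A minimal sketch of URL removal, assuming a simplified http/https/www pattern that may differ from the component's actual rules:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # simplified URL pattern

def remove_urls(text: str) -> str:
    """Delete URLs from the content field."""
    return URL_RE.sub("", text)

print(remove_urls("See https://example.com/docs for details"))
# -> "See  for details"
```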
LLM-Text Normalizer-1
Normalizes the text in the content field into a standard Unicode form.
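The standard library's unicodedata module illustrates this kind of normalization. The NFKC form used below is an assumption; the component may apply a different normalization form.

```python
import unicodedata

def normalize_text(text: str, form: str = "NFKC") -> str:
    """Normalize text into a standard Unicode form."""
    return unicodedata.normalize(form, text)

# Full-width characters and ligatures are folded into their standard equivalents.
print(normalize_text("ﬁle　ｎａｍｅ"))  # -> "file name"
```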
LLM-Clean Copyright Information-1
Deletes copyright information from the content field.
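A rough sketch of one possible heuristic, which drops lines that mention a copyright notice; the component's actual cleaning rules, for example how it detects license header blocks, are not shown here.

```python
import re

COPYRIGHT_RE = re.compile(r"copyright", re.IGNORECASE)  # simplified detection

def remove_copyright_lines(code: str) -> str:
    """Drop lines that contain a copyright notice."""
    kept = [line for line in code.splitlines() if not COPYRIGHT_RE.search(line)]
    return "\n".join(kept)
```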
LLM-Count Filter-1
Filters out samples whose content field does not meet the required ratio of alphanumeric characters. Most characters in GitHub code are letters and digits, so this component removes certain kinds of dirty data. After this step, most of the dirty data is deleted.
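A minimal sketch of an alphanumeric-ratio filter; the threshold below is a hypothetical value that stands in for the component's configurable bounds.

```python
def alnum_ratio(text: str) -> float:
    """Fraction of characters in the content field that are letters or digits."""
    return sum(ch.isalnum() for ch in text) / len(text) if text else 0.0

def keep_by_alnum_ratio(text: str, min_ratio: float = 0.25) -> bool:
    return alnum_ratio(text) >= min_ratio
```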
LLM-Length Filter-1
Filters samples based on the total length, average line length, and maximum line length of the content field. The content is split by line feeds ("\n") before the average and maximum line lengths are measured. After this step, dirty data that is excessively short or excessively long is deleted.
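A minimal sketch of the three length statistics; the bounds below are hypothetical and stand in for the component's configurable thresholds.

```python
def length_stats(text: str):
    """Total length, average line length, and maximum line length of a sample."""
    lines = text.split("\n")
    lengths = [len(line) for line in lines]
    return len(text), sum(lengths) / len(lengths), max(lengths)

def keep_by_length(text: str, max_total=100_000, max_avg=100, max_line=1_000) -> bool:
    total, avg, longest = length_stats(text)
    return total <= max_total and avg <= max_avg and longest <= max_line
```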
LLM-N-Gram Repetition Filter-1
Filters samples based on the character-level and word-level N-gram repetition ratios of the content field.
The component slides a window of size N over the characters or words of a sample to produce a sequence of length-N segments. Each segment is called a gram, and the number of occurrences of each gram is counted. The repetition ratio is the total frequency of the grams that occur more than once divided by the total frequency of all grams, and samples are filtered based on this ratio.
Note: For word-level statistics, all words are converted to lowercase before the repetition ratio is calculated.
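The repetition ratio can be sketched as follows. The gram size and the decision to filter on a maximum ratio are assumptions; the component's actual defaults and thresholds are configurable.

```python
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 10, level: str = "char") -> float:
    """Total frequency of grams that occur more than once, divided by the
    total frequency of all grams."""
    tokens = text.lower().split() if level == "word" else list(text)
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / sum(counts.values())

def keep_by_repetition(text: str, max_ratio: float = 0.5) -> bool:
    return ngram_repetition_ratio(text) <= max_ratio
```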
LLM-Length Filter-2
Splits each sample into a list of words based on spaces and filters samples based on the number of words in the list.
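A minimal word-count filter, with hypothetical bounds in place of the component's configurable thresholds:

```python
def keep_by_word_count(text: str, min_words: int = 20, max_words: int = 10_000) -> bool:
    """Split the sample into words on whitespace and filter on the word count."""
    num_words = len(text.split())
    return min_words <= num_words <= max_words
```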
LLM-Document deduplicator-1
Removes samples that contain similar text.
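The component's exact deduplication algorithm is not described here. As a rough illustration of similarity-based deduplication, the following sketch keeps a document only if its word-level Jaccard similarity to every previously kept document stays below a hypothetical threshold.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def deduplicate(docs, threshold: float = 0.8):
    """Drop documents that are too similar to a document that is already kept."""
    kept, word_sets = [], []
    for doc in docs:
        words = set(doc.lower().split())
        if all(jaccard(words, seen) < threshold for seen in word_sets):
            kept.append(doc)
            word_sets.append(words)
    return kept
```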
Click the run button at the top of the canvas to run the pipeline.
After the pipeline is run as expected, right-click the Write Table-1 node and choose the option to view the output data. The output is the sample data that remains after it is filtered and processed by all of the components described in the preceding list.
References
After data processing is complete, you can use a series of LLM components provided by PAI to implement the end-to-end process from the development of LLM to the use of LLM. The LLM components include data processing components, training components, and inference components. For more information, see E2E Development and Usage of LLM: Data Processing + Model Training + Model Inference.