Machine Learning Designer of Platform for AI (PAI) provides various data processing components to help you edit, convert, filter, identify, and deduplicate data. You can combine different components to obtain high-quality data and generate text samples that meet your business requirements. You can then use the processed data to train large language models (LLMs). This topic describes how to use the data processing components of PAI to cleanse and process supervised fine-tuning (SFT) data. The example uses a small amount of data obtained from the open source Alpaca-CoT project.
Dataset
In this topic, 5,000 samples that are extracted from the raw data of the open source Alpaca-CoT project are used as a dataset in the preset template LLM Data Processing-Alpaca-Cot (SFT Data) of Machine Learning Designer.
Create and run a pipeline
Go to the Visualized Modeling (Designer) page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer).
Create a pipeline.
On the Preset Templates tab, choose Business Area > LLM. In the LLM Data Processing-Alpaca-Cot (SFT Data) section, click Create.
In the Create Pipeline dialog box, configure the pipeline parameters and click OK. You can retain the default values.
In the pipeline list, find the pipeline that you created and click Open.
Configure the pipeline.
The pipeline contains the following key components:
LLM-MD5 Deduplicator (DLC)-1
Calculates the hash values of text samples in the text field and deduplicates the text samples based on the hash values. If multiple text samples have the same hash value, only one text sample is retained.
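The hash-based deduplication described above can be sketched in a few lines of Python. This is a minimal illustration, not the component's actual implementation: the function name `md5_deduplicate`, the dictionary-per-sample layout, and the choice to keep the first occurrence of each hash are assumptions made for the example.

```python
import hashlib


def md5_deduplicate(samples, field="text"):
    """Keep one sample per distinct MD5 hash of sample[field].

    Assumption: the first sample seen for each hash is the one retained.
    """
    seen = set()
    kept = []
    for sample in samples:
        # Hash the UTF-8 bytes of the text field.
        digest = hashlib.md5(sample[field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


rows = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
unique_rows = md5_deduplicate(rows)  # the second {"text": "a"} is dropped
```

Because only exact hash matches are removed, two samples that differ by a single character are both kept.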
LLM-Count Filter (DLC)-1
Deletes text samples whose proportion of letters and digits in the text field does not meet the specified requirement. Most of the characters in an SFT dataset are letters and digits, so this component helps remove samples that consist largely of other characters.
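As a rough illustration of this kind of filter, the following sketch computes the share of alphanumeric characters in each sample and keeps only samples whose ratio falls in a range. The function names and the threshold values `min_ratio` and `max_ratio` are hypothetical and are not taken from the component's configuration.

```python
def alnum_ratio(text):
    """Return the fraction of characters that are letters or digits."""
    if not text:
        return 0.0
    # str.isalnum() is Unicode-aware, so non-Latin letters also count.
    return sum(ch.isalnum() for ch in text) / len(text)


def count_filter(samples, min_ratio=0.6, max_ratio=1.0, field="text"):
    """Keep samples whose alphanumeric ratio lies in [min_ratio, max_ratio].

    The default thresholds are placeholders chosen for this example.
    """
    return [s for s in samples if min_ratio <= alnum_ratio(s[field]) <= max_ratio]


rows = [{"text": "hello world"}, {"text": "!!!???"}]
kept = count_filter(rows)  # only the "hello world" sample passes
```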
LLM-N-Gram Repetition Filter (DLC)-1
Filters text samples in the text field based on the repetition ratio of character-level N-grams. The component moves an N-character window across a text sample to generate sequences of N characters. Each sequence is called an N-gram. The component counts the occurrences of each N-gram and calculates the repetition ratio by using the following formula:
Total frequencies of N-grams that occur more than once/Total frequencies of all N-grams
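The formula can be made concrete with a short sketch. The function name and the default window size `n` are assumptions for illustration. The window slides one character at a time, as described above.

```python
from collections import Counter


def char_ngram_repetition_ratio(text, n=10):
    """Repetition ratio of character-level N-grams in text.

    ratio = (total occurrences of N-grams that appear more than once)
            / (total occurrences of all N-grams)
    """
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        # Text shorter than n characters has no N-grams.
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

For example, with N = 2 the text "abab" yields the N-grams "ab", "ba", and "ab". The N-gram "ab" occurs twice and "ba" once, so the repetition ratio is 2/3. A filter would then drop samples whose ratio falls outside a configured range.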
LLM-Sensitive Keywords Filter (DLC)-1
Filters out text samples in the text field that contain sensitive keywords specified in the Preset sensitive keywords file.
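A keyword filter of this kind might look like the sketch below. The format of the preset keyword file is not described here, so the example assumes one keyword per line and uses case-insensitive substring matching. Both are assumptions, as are the function names.

```python
def load_keywords(lines):
    """Build a keyword set from an iterable of lines (assumed one keyword per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def sensitive_keyword_filter(samples, keywords, field="text"):
    """Drop samples whose text contains any keyword (case-insensitive substring match)."""
    kept = []
    for sample in samples:
        text = sample[field].lower()
        if not any(keyword in text for keyword in keywords):
            kept.append(sample)
    return kept


keywords = load_keywords(["badword\n", "\n", "Secret"])
rows = [{"text": "This is fine"}, {"text": "a SECRET plan"}]
kept = sensitive_keyword_filter(rows, keywords)  # only the first sample is kept
```

Substring matching is simple but can over-match, for example when a keyword appears inside a longer, harmless word. Token-based matching is a common alternative.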
LLM-Length Filter (DLC)-1
Filters text samples in the text field based on the text length and the maximum line length. Text is split into lines by line breaks (\n). The component calculates the maximum line length and filters text samples based on the specified threshold.
LLM-Document Deduplicator (DLC)-1
Deduplicates text samples based on the values of the window_size, num_blocks, and hamming_distance parameters.
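This topic does not state which algorithm the component uses. The parameter names suggest a SimHash-style near-duplicate check, so the following is a sketch under that assumption only: `window_size` is treated as the character shingle length, `num_blocks` as the number of fingerprint blocks used to find candidate pairs, and `hamming_distance` as the maximum number of differing bits for two samples to count as duplicates. The function names, the 64-bit fingerprint, and the default values are all hypothetical.

```python
import hashlib


def simhash(text, window_size=4, bits=64):
    """SimHash fingerprint built from character shingles of length window_size."""
    count = max(len(text) - window_size + 1, 1)
    shingles = [text[i:i + window_size] for i in range(count)]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A fingerprint bit is 1 where the summed votes are positive.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def block_keys(fp, num_blocks, bits=64):
    """Split a fingerprint into num_blocks contiguous blocks, tagged by position."""
    bounds = [bits * i // num_blocks for i in range(num_blocks + 1)]
    return [
        (i, (fp >> bounds[i]) & ((1 << (bounds[i + 1] - bounds[i])) - 1))
        for i in range(num_blocks)
    ]


def simhash_deduplicate(samples, window_size=4, num_blocks=6,
                        hamming_distance=4, field="text"):
    """Keep a sample only if no kept sample is within hamming_distance bits.

    If num_blocks > hamming_distance, two fingerprints within the distance
    must share at least one identical block, so only bucket-mates are compared.
    """
    kept, buckets = [], {}
    for sample in samples:
        fp = simhash(sample[field], window_size)
        keys = block_keys(fp, num_blocks)
        candidates = {c for k in keys for c in buckets.get(k, [])}
        if any(hamming(fp, c) <= hamming_distance for c in candidates):
            continue  # near-duplicate of an already kept sample
        kept.append(sample)
        for k in keys:
            buckets.setdefault(k, []).append(fp)
    return kept
```

Unlike the MD5-based component, an approach like this also removes samples that are nearly, but not exactly, identical. A larger `hamming_distance` makes the check more aggressive.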
Run the pipeline.
After you run the pipeline, right-click the LLM-Document Deduplicator (DLC)-1 component and choose View Data > Output Table to view the sample files that are processed by all the preceding components.
References
For more information about LLM components, see LLM Data Processing (DLC).
PAI provides a series of components for data processing, model training, and model inference. After data processing is complete, you can use the components to implement the end-to-end process from LLM development to application. For more information, see E2E Development and Usage of LLM: Data Processing + Model Training + Model Inference.