Machine Learning Designer of Platform for AI (PAI) provides various data processing components that help you edit, convert, filter, and deduplicate data. You can combine different components to obtain high-quality data and generate text samples that meet your requirements, and then use the processed data to train large language models (LLMs). This topic describes how to use the data processing components of PAI to clean and process supervised fine-tuning (SFT) data. The example in this topic uses a small amount of data obtained from the open source Alpaca-CoT project.
Dataset
Machine Learning Designer provides a preset template for processing SFT data. The template uses 5,000 sample data records extracted from the open source Alpaca-CoT project.
Create and run a pipeline
Go to the Visualized Modeling (Designer) page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
In the workspace, choose Model Training > Visualized Modeling (Designer) in the left-side navigation pane.
Create a pipeline.
On the Visualized Modeling (Designer) page, click the Preset Templates tab, select Business Area from the drop-down list, and then click the Large Language Model (LLM) tab. Find the LLM Data Processing-Alpaca-Cot (SFT Data) template and click Create.
Configure the pipeline parameters and click OK. You can retain the default values.
In the pipeline list, click the pipeline that you created and then click Open.
Configure the pipeline.
The pipeline contains the following key components:
LLM-MD5 Deduplicator (MaxCompute)-1
Calculates the MD5 hash value of each text sample. If multiple text samples have the same hash value, only one of them is retained. The following Python sketch shows this deduplication logic at a conceptual level. The component itself runs on MaxCompute; the function name and the in-memory implementation here are illustrative only.
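```python
import hashlib

def md5_deduplicate(samples):
    # Keep only the first sample for each distinct MD5 hash of the text.
    seen = set()
    kept = []
    for text in samples:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

print(md5_deduplicate(["a", "b", "a"]))  # ['a', 'b']
```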
LLM-Count Filter (MaxCompute)-1
Deletes text samples whose number or ratio of alphanumeric characters in the text field does not meet the specified requirement. Most characters in an SFT dataset are expected to be letters and digits, so this component helps remove low-quality samples from the dataset. A minimal Python sketch of ratio-based filtering is shown below. The function names and the threshold value are illustrative; the actual component exposes its own configurable parameters.
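```python
def alnum_ratio(text):
    # Ratio of alphanumeric characters to all characters in the text.
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)

def filter_by_alnum_ratio(samples, min_ratio=0.25):
    # Keep samples whose alphanumeric ratio meets the (illustrative) threshold.
    return [t for t in samples if alnum_ratio(t) >= min_ratio]

print(filter_by_alnum_ratio(["hello world 123", "!!! ???"]))  # ['hello world 123']
```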
LLM-N-Gram Repetition Filter (MaxCompute)-1
Filters text samples in the text field based on the character-level N-gram repetition rate. The component moves an N-character window across the text to generate contiguous sequences of N characters. Each sequence is called an N-gram. The component counts the occurrences of each N-gram and calculates the repetition rate by using the following formula:
Repetition rate = (Total frequency of N-grams that occur more than once) / (Total frequency of all N-grams)
The following Python sketch computes this repetition rate at the character level. The default window size n and the function name are illustrative; the component lets you configure N and the threshold.
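```python
from collections import Counter

def char_ngram_repetition_rate(text, n=10):
    # Repetition rate = (total count of N-grams that occur more than once)
    #                   / (total count of all N-grams)
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

print(char_ngram_repetition_rate("abcabcabcabc", n=3))  # 1.0
```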
LLM-Sensitive Keywords Filter (MaxCompute)-1
Filters out text samples whose text field contains sensitive keywords specified in the Preset sensitive keywords file. A simplified Python sketch of keyword-based filtering follows. The keyword set in the example is hypothetical; the component reads its keywords from the preset file.
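```python
def filter_sensitive(samples, keywords):
    # Drop samples that contain any of the given sensitive keywords.
    return [t for t in samples if not any(k in t for k in keywords)]

# The keyword set below is hypothetical; the component uses its preset keyword file.
print(filter_sensitive(["a clean sentence", "a sentence with badword"], {"badword"}))
```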
LLM-Length Filter (MaxCompute)-1
Filters text samples based on the length of the longest line. The text is split into lines by line breaks (\n). The component calculates the length of the longest line and filters text samples based on the specified threshold. The following Python sketch illustrates the calculation and filtering. The threshold of 512 characters is an assumed example value, not the component's default.
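```python
def max_line_length(text):
    # Length of the longest line after splitting the text on line breaks (\n).
    return max(len(line) for line in text.split("\n"))

def filter_by_max_line_length(samples, max_len=512):
    # Keep samples whose longest line does not exceed the (illustrative) threshold.
    return [t for t in samples if max_line_length(t) <= max_len]

print(filter_by_max_line_length(["short\nlines", "x" * 1000]))  # ['short\nlines']
```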
LLM-Document Deduplicator (MaxCompute)-1
Deduplicates text samples based on the specified thresholds of the Jaccard similarity index and the Levenshtein distance. The following Python sketch approximates this logic with a character-shingle Jaccard similarity and a classic dynamic-programming Levenshtein distance. The thresholds and the way the two measures are combined are assumptions for illustration; the component applies its own configured thresholds.
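```python
def jaccard_similarity(a, b, n=3):
    # Jaccard similarity between the sets of character n-grams of two texts.
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def levenshtein_distance(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def deduplicate(samples, jaccard_threshold=0.8, max_edit_distance=10):
    # Keep a sample only if it is not a near duplicate of an already kept sample.
    kept = []
    for text in samples:
        if not any(jaccard_similarity(text, other) >= jaccard_threshold
                   and levenshtein_distance(text, other) <= max_edit_distance
                   for other in kept):
            kept.append(text)
    return kept
```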
Run the pipeline.
After you run the pipeline, right-click the Write Table -1 component and choose View Data > Output to view the processed dataset.
References
For more information about LLM data processing components, see LLM data processing (MaxCompute).
PAI provides a series of components for data processing, model training, and model inference. After data processing is complete, you can use the components to implement the end-to-end process from LLM development to application. For more information, see E2E Development and Usage of LLM: Data Processing + Model Training + Model Inference.