Train a text summarization model to generate texts - Platform For AI

Text summarization is the process of extracting key information from lengthy and repetitive texts. For example, headlines are the results of text summarization. You can use the Text Summarization Training component of Platform for AI (PAI) to train models that generate headlines, which summarize the main points of news. This topic describes how to configure the Text Summarization Training component.

Limits

The Text Summarization Training component can use only Deep Learning Containers (DLC) computing resources.

Model architecture

The model uses the standard Transformer architecture, which consists of an encoder and a decoder. The encoder encodes the input text, and the decoder generates the output text. During training, the inputs are the original news articles and the target outputs are their headlines.

Usage notes

You can connect the input port of the Text Summarization Training component to the Sentence Splitting component to split a text into rows, each of which contains only one sentence.
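
This topic does not describe the rules that the Sentence Splitting component applies. As a rough illustration of the one-sentence-per-row layout only, the following Python sketch splits text at common Chinese and Western sentence-ending punctuation. The function name `split_sentences` and the punctuation set are assumptions made for this example, not the component's implementation.

```python
import re
from typing import List

# Assumed sentence-ending punctuation (Chinese and Western); adjust as needed.
# A sentence is a run of non-terminator characters plus any trailing terminators.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]*")


def split_sentences(text: str) -> List[str]:
    """Return the sentences in `text`, one per list item, without empty items."""
    return [piece.strip() for piece in _SENTENCE.findall(text) if piece.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("今天天气很好。我们去公园吧！"):
        print(sentence)
```

A rule this simple also splits at periods inside numbers or abbreviations, so it is only suitable for a quick look at the expected row layout.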

Configure the component in the PAI console

You can configure parameters for the Text Summarization Training component in Machine Learning Designer.

Input ports

Input port (from left to right)	Data type	Recommended upstream component	Required
Training data	OSS	Read File Data	Yes
Validation data	OSS	Read File Data	Yes

Component parameters

Tab	Parameter	Description
Fields Setting	Input Schema	The text columns of the input file. Default value: title_tokens:str:1,content_tokens:str:1.
	TextColumn	The name of the column that corresponds to the original text in the input table. Default value: content_tokens.
	SummaryColumn	The name of the column that corresponds to the summary in the input table. Default value: title_tokens.
	OSS Directory for Alink Model	The directory that is used to store the generated text summarization model in an Object Storage Service (OSS) bucket.
Parameters Setting	Pretrained Model	The name of the pre-trained model. You can select a model on the Parameters Setting tab. Default value: alibaba-pai/mt5-title-generation-zh.
	batchSize	The number of samples to be processed per batch. The value must be of the INT type. Default value: 8. If the model is trained on multiple servers that have multiple GPUs, this parameter indicates the number of samples to be processed by each GPU in a batch.
	sequenceLength	The maximum length of a sequence that can be processed by the system. The value must be of the INT type. Valid values: 1 to 512. Default value: 512.
	numEpochs	The number of epochs for model training. The value must be of the INT type. Default value: 3.
	LearningRate	The learning rate during model training. The value must be of the FLOAT type. Default value: 3e-5.
	Save Checkpoint Steps	The interval, in training steps, at which the system evaluates the model and saves the best-performing model. Default value: 150.
	The model language	The language of the model. Valid values: zh (Chinese) and en (English).
	Whether to copy text from input while decoding	Specifies whether to copy text from the input table to the output table. Valid values: false (default) and true.
	The Minimal Length of the Predicted Sequence	The minimum length of the output text, which is of the INT type. Default value: 12.
	The Maximal Length of the Predicted Sequence	The maximum length of the output text, which is of the INT type. Default value: 32.
	The Minimal Non-Repeated N-gram Size	The minimum size of an n-gram that must not be repeated in the output text. The value must be of the INT type. Default value: 2. For example, if you set this parameter to 1, the output text does not include strings in which a token is repeated, such as "天天".
	The Number of Beam Search Scope	The number of candidate sequences that beam search keeps at each step when it selects the best candidate sequences. The value must be of the INT type. Default value: 5. A larger value increases the search time.
	The Number of Returned Candidate Sequences	The number of top candidate sequences returned by the model, which is of the INT type. Default value: 5.
Execution Tuning	GPU Machine type	The GPU-accelerated instance type of the computing resource. Default value: gn5-c8g1.2xlarge.
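
The effect of The Minimal Non-Repeated N-gram Size can be illustrated with a small Python sketch. It assumes that the parameter behaves like the no-repeat n-gram constraint that is common in text generation libraries (for example, `no_repeat_ngram_size` in Hugging Face Transformers). This topic does not confirm how the component implements the constraint, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already in `tokens`.

    Assumption: a decoder that enforces the constraint excludes these tokens
    from the candidates for the next position.
    """
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    prefix_len = ngram_size - 1
    # The last (n - 1) generated tokens are the prefix of the next n-gram.
    current_prefix = tuple(tokens[len(tokens) - prefix_len:])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == current_prefix:
            banned.add(ngram[-1])
    return banned


if __name__ == "__main__":
    # Size 1: every token that already appears is banned, so "天天" cannot occur.
    print(banned_next_tokens(["天"], 1))
    # Size 2: single tokens may repeat, but the bigram ("a", "b") may not.
    print(banned_next_tokens(["a", "b", "c", "a"], 2))
```

Under this assumption, a smaller value restricts the output more strongly: a value of 1 forbids any repeated token, whereas the default value of 2 forbids only repeated two-token phrases.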

Output ports

Output port	Data type	Recommended downstream component	Required
output model	The OSS path of the output model. The value of this parameter is the same as the value of the ModelSavePath parameter that you set on the Fields Setting tab. The output model in the SavedModel format is stored in this OSS path.	Use the Text Summarization Prediction component	No

Examples

The following figure shows a sample workflow in which the Text Summarization Training component is used.

[Figure: workflow]

In this example, the components are configured and the pipeline is run in the following manner:

Prepare a training dataset (cn_train.txt) and an evaluation dataset (cn_dev.txt) and upload them to an OSS bucket. The training dataset and evaluation dataset used in this example are tab-delimited TXT files.
You can also upload CSV files to MaxCompute by running the Tunnel commands on a MaxCompute client. For more information about how to install and configure a MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands.
Use the Read File Data - 1 and Read File Data - 2 components to read the training dataset and the evaluation dataset. Set the OSS Data Path parameter of each Read File Data component to the OSS path in which the corresponding dataset is stored.
Configure the training dataset and evaluation dataset as the input files of the Text Summarization Training-1 component and set the other parameters. For more information, see the "Configure the component in the PAI console" section of this topic.
Run the pipeline. After the pipeline is run, you can view the output in the OSS path specified in the ModelSavePath parameter of Text Summarization Training-1.
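
As a rough illustration of the tab-delimited datasets mentioned in the first step, the following Python sketch writes rows in the column order given by the default Input Schema value. Several details are assumptions that this topic does not confirm: that the file has no header row, that columns appear in schema order, and that each schema entry has three colon-separated fields (the third is kept as an integer without interpretation). The helper names `parse_schema` and `write_rows` and the sample row are hypothetical.

```python
import io

# Default value of the Input Schema parameter, as listed in this topic.
DEFAULT_SCHEMA = "title_tokens:str:1,content_tokens:str:1"


def parse_schema(schema):
    """Split 'name:type:number,...' into a list of (name, type, number) tuples."""
    columns = []
    for field in schema.split(","):
        name, dtype, number = field.strip().split(":")
        columns.append((name, dtype, int(number)))
    return columns


def write_rows(rows, schema, out):
    """Write dict rows as tab-delimited lines in schema column order (no header)."""
    names = [name for name, _, _ in parse_schema(schema)]
    for row in rows:
        values = [row[name] for name in names]
        # A tab or line break inside a field would break the one-row-per-sample layout.
        if any("\t" in value or "\n" in value for value in values):
            raise ValueError("fields must not contain tabs or line breaks")
        out.write("\t".join(values) + "\n")


# Placeholder sample; real data would be written to a file such as cn_train.txt.
buffer = io.StringIO()
write_rows(
    [{"title_tokens": "Sample headline", "content_tokens": "Sample article body"}],
    DEFAULT_SCHEMA,
    buffer,
)
print(buffer.getvalue(), end="")
```

To produce a file for upload, pass a file object opened with `open("cn_train.txt", "w", encoding="utf-8")` instead of the in-memory buffer.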

References

For more information about how to configure the Text Summarization Prediction component, see Text Summarization Prediction.
