Text summarization extracts the key information from lengthy, repetitive text. A news headline, for example, is a summary of the article it introduces. You can use the Text Summarization Training component of Platform for AI (PAI) to train models that generate headlines summarizing the main points of news articles. This topic describes how to configure the Text Summarization Training component.
Limits
The Text Summarization Training component can use only Deep Learning Containers (DLC) computing resources.
Model architecture
The model uses the standard Transformer architecture, which consists of an encoder and a decoder. The encoder encodes the input text, and the decoder generates the output text. During training, the input is the original news text and the target output is its headline.
Usage notes
You can connect the input port of the Text Summarization Training component to the Sentence Splitting component to split a text into rows, each of which contains only one sentence.
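To illustrate what "one sentence per row" means, the following is a minimal sketch of sentence splitting. It is not the implementation of the Sentence Splitting component. The function name `split_sentences` and the set of sentence-ending punctuation marks are assumptions made for this example.

```python
import re

# Assumed rule: a sentence ends at Chinese or English terminal punctuation.
# This is an illustrative rule set, not the one used by the PAI component.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text):
    """Return a list of sentences, one per output row."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    for row in split_sentences("今天天气很好。我们去公园吧！你来吗？"):
        print(row)
```

Each returned string corresponds to one row in the split output. Periods are deliberately left out of the rule so that decimals such as 3.5 are not broken apart.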
Configure the component in the PAI console
You can configure parameters for the Text Summarization Training component in Machine Learning Designer.
Input ports
| Input port (from left to right) | Data type | Recommended upstream component | Required |
| --- | --- | --- | --- |
| Training data | OSS | | Yes |
| Validation data | OSS | | Yes |
Component parameters
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Input Schema | The text columns of the input file. Default value: title_tokens:str:1,content_tokens:str:1. |
| Fields Setting | TextColumn | The name of the column that contains the original text in the input table. Default value: content_tokens. |
| Fields Setting | SummaryColumn | The name of the column that contains the summary in the input table. Default value: title_tokens. |
| Fields Setting | OSS Directory for Alink Model | The directory in an Object Storage Service (OSS) bucket in which the generated text summarization model is stored. |
| Parameters Setting | Pretrained Model | The name of the pre-trained model. You can select a model on the Parameters Setting tab. Default value: alibaba-pai/mt5-title-generation-zh. |
| Parameters Setting | batchSize | The number of samples processed per batch. The value must be of the INT type. Default value: 8. If the model is trained on multiple servers that have multiple GPUs, this parameter indicates the number of samples processed by each GPU per batch. |
| Parameters Setting | sequenceLength | The maximum length of a sequence that the system can process. The value must be of the INT type. Valid values: 1 to 512. Default value: 512. |
| Parameters Setting | numEpochs | The number of epochs for model training. The value must be of the INT type. Default value: 3. |
| Parameters Setting | LearningRate | The learning rate during model training. The value must be of the FLOAT type. Default value: 3e-5. |
| Parameters Setting | Save Checkpoint Steps | The interval, in training steps, at which the system evaluates the model and saves the optimal model. Default value: 150. |
| Parameters Setting | The model language | The language of the model. Valid values: zh (Chinese) and en (English). |
| Parameters Setting | Whether to copy text from input while decoding | Specifies whether the model can copy text from the input during decoding. Valid values: false (default) and true. |
| Parameters Setting | The Minimal Length of the Predicted Sequence | The minimum length of the output text. The value must be of the INT type. Default value: 12. |
| Parameters Setting | The Maximal Length of the Predicted Sequence | The maximum length of the output text. The value must be of the INT type. Default value: 32. |
| Parameters Setting | The Minimal Non-Repeated N-gram Size | The size of the n-grams that cannot be repeated in the output text. The value must be of the INT type. Default value: 2. For example, if you set the parameter to 1, the output text does not include strings such as "天天". |
| Parameters Setting | The Number of Beam Search Scope | The search scope (beam width) that is used when beam search selects the best candidate sequences. The value must be of the INT type. Default value: 5. A greater value results in a longer search time. |
| Parameters Setting | The Number of Returned Candidate Sequences | The number of top candidate sequences returned by the model. The value must be of the INT type. Default value: 5. |
| Execution Tuning | GPU Machine type | The GPU-accelerated instance type of the computing resource. Default value: gn5-c8g1.2xlarge. |
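The following sketch illustrates three of the parameters above: the Input Schema string, the valid range of sequenceLength, and the non-repeated n-gram constraint. It is an illustration under stated assumptions and not the code that the component runs. The helper names are hypothetical, and reading the third schema field as a length is an assumption based on the default value title_tokens:str:1,content_tokens:str:1.

```python
def parse_input_schema(schema):
    """Parse 'name:type:length,...' into (name, type, length) tuples.

    Assumption: each field has exactly three colon-separated parts and
    the third part is an integer.
    """
    columns = []
    for field in schema.split(","):
        name, col_type, length = field.strip().split(":")
        columns.append((name, col_type, int(length)))
    return columns


def check_sequence_length(value):
    """Accept only integers in the documented range of 1 to 512."""
    if not isinstance(value, int) or not 1 <= value <= 512:
        raise ValueError("sequenceLength must be an INT from 1 to 512")
    return value


def has_repeated_ngram(tokens, n):
    """Return True if any n-gram occurs more than once in tokens."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


if __name__ == "__main__":
    print(parse_input_schema("title_tokens:str:1,content_tokens:str:1"))
    print(check_sequence_length(512))
    # Treating each character as a token (an assumption for this example),
    # "天天" repeats a 1-gram but contains no repeated 2-gram.
    print(has_repeated_ngram(list("天天开心"), 1))
    print(has_repeated_ngram(list("天天开心"), 2))
```

With an n-gram size of 1, a candidate that contains "天天" violates the constraint because the token "天" appears twice. With a size of 2 the same text is allowed, because no pair of adjacent tokens repeats.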
Output ports
| Output port | Data type | Recommended downstream component | Required |
| --- | --- | --- | --- |
| output model | The OSS path of the output model. The value of this parameter is the same as the value of the ModelSavePath parameter that you set on the Fields Setting tab. The output model in the SavedModel format is stored in this OSS path. | | No |
Examples
The following figure shows a sample workflow in which the Text Summarization Training component is used. In this example, the components are configured and the pipeline is run in the following manner:
1. Prepare a training dataset (cn_train.txt) and a validation dataset (cn_dev.txt) and upload them to an OSS bucket. The training dataset and validation dataset used in this example are tab-delimited TXT files.
   You can also upload CSV files to MaxCompute by running Tunnel commands on a MaxCompute client. For more information about how to install and configure a MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands.
2. Use the Read File Data - 1 and Read File Data - 2 components to read the training dataset and the validation dataset. Set the OSS Data Path parameter of each Read File Data component to the OSS path in which the corresponding dataset is stored.
3. Configure the training dataset and the validation dataset as the input files of the Text Summarization Training-1 component and set the other parameters. For more information, see the "Configure the component in the PAI console" section of this topic.
4. Run the pipeline. After the pipeline is run, you can view the output in the OSS path specified by the ModelSavePath parameter of Text Summarization Training-1.
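The dataset preparation step can be sketched as follows. This is a minimal example under assumptions, not an official data preparation script: the helper names are hypothetical, the sample row is placeholder text, and the column order (headline first, then article text) is assumed from the default Input Schema title_tokens:str:1,content_tokens:str:1. Whether the text must be tokenized beforehand is not stated in this topic, so the sketch writes the text as-is.

```python
def to_tsv_line(title, content):
    """Build one tab-delimited row: headline, then article text.

    Tabs and line breaks inside a field are collapsed to single spaces
    so that each record stays on one line with exactly two columns.
    """
    def clean(text):
        return " ".join(text.split())

    return clean(title) + "\t" + clean(content)


def write_dataset(path, pairs):
    """Write (title, content) pairs to a UTF-8 tab-delimited TXT file."""
    with open(path, "w", encoding="utf-8") as f:
        for title, content in pairs:
            f.write(to_tsv_line(title, content) + "\n")


if __name__ == "__main__":
    # Placeholder sample; replace with your own headline/article pairs.
    sample = [("Sample headline", "Sample article text.\nSecond line.")]
    write_dataset("cn_train.txt", sample)
```

After the files are written locally, upload them to the OSS bucket as described in the first step.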
References
For more information about how to configure the Text Summarization Prediction component, see Text Summarization Prediction.