The Text Summarization component automatically generates abstracts based on the TextRank model. An abstract is a short, coherent text that accurately reflects the main idea of a document. The component builds the abstract from the key sentences that it extracts from the document. This topic describes how to configure the Text Summarization component provided by Platform for AI (PAI).
Limits
The Text Summarization component runs only on MaxCompute computing resources.
Usage notes
You can use a Sentence Splitting component as an upstream component to split the text into rows. Each row contains only one sentence.
Configure the component
You can use one of the following methods to configure the Text Summarization component.
Method 1: Configure the component in the PAI console
You can configure the parameters of the Text Summarization component in Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Column of Marked Document IDs | The name of the document ID column. |
Fields Setting | Sentence Column | The sentence column. You can specify only one column. |
Parameters Setting | Output First N Key Sentences | The top N key sentences that you want to obtain. Default value: 3. |
Parameters Setting | Sentence Similarity Calculation Method | The method used to calculate sentence similarities. Valid values include ssk and Cosine. |
Parameters Setting | Weight of Matching String | The weight of a matched string. This parameter takes effect only if you set the Sentence Similarity Calculation Method parameter to ssk. Default value: 0.5. |
Parameters Setting | Length of Substring | The length of a substring. This parameter takes effect only if you set the Sentence Similarity Calculation Method parameter to ssk or Cosine. Default value: 2. |
Parameters Setting | Damping Coefficient | The damping coefficient. Default value: 0.85. |
Parameters Setting | Maximum Iterations | The maximum number of iterations. Default value: 100. |
Parameters Setting | Convergence Coefficient | The convergence coefficient. Default value: 0.000001. |
Tuning | Number of Cores | The number of cores used for calculation. By default, the system determines the value. |
Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
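How the damping coefficient, maximum iterations, and convergence coefficient interact can be illustrated with a minimal TextRank sketch in Python. This is not the component's implementation: the function names, the way the LCS similarity is normalized, and the initial scores are assumptions made for illustration only.

```python
from typing import Callable, List


def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length divided by the longer string length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def textrank_scores(
    sentences: List[str],
    similarity: Callable[[str, str], float] = lcs_similarity,
    damping: float = 0.85,
    max_iter: int = 100,
    epsilon: float = 1e-6,
) -> List[float]:
    """Score sentences with a weighted, PageRank-style iteration."""
    n = len(sentences)
    if n == 0:
        return []
    # Edge weights between sentences; a sentence has no edge to itself.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n  # assumed starting value
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional to w[j][i].
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) + damping * rank)
        delta = max(abs(new - old) for new, old in zip(new_scores, scores))
        scores = new_scores
        if delta < epsilon:  # stop early once the scores have converged
            break
    return scores
```

In this sketch, a larger damping value gives more weight to the scores passed along by similar sentences, while the iteration stops at whichever comes first: the maximum number of iterations or a change smaller than the convergence threshold.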
Method 2: Configure the component by using PAI commands
You can use SQL scripts to call PAI commands. For more information, see SQL Script. The following sample command shows how to call the component, and the table after it describes the parameters.
PAI -name TextSummarization
-project algo_public
-DinputTableName="test_input"
-DoutputTableName="test_output"
-DdocIdCol="doc_id"
-DsentenceCol="sentence"
-DtopN=2
-Dlifecycle=30;
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | N/A |
inputTablePartitions | No | The partitions selected from the input table for computing. | All partitions |
outputTableName | Yes | The name of the output table. | N/A |
docIdCol | Yes | The name of the document ID column. | N/A |
sentenceCol | Yes | The sentence column. You can specify only one column. | N/A |
topN | No | The top N key sentences that you want to obtain. | 3 |
similarityType | No | The method used to calculate sentence similarities. Valid values include lcs_sim, ssk, and cosine. | lcs_sim
lambda | No | The weight of a matched string. This parameter takes effect only if you set the similarityType parameter to ssk. | 0.5 |
k | No | The length of a substring. This parameter takes effect only if you set the similarityType parameter to ssk or cosine. | 2 |
dampingFactor | No | The damping coefficient. | 0.85 |
maxIter | No | The maximum number of iterations. | 100 |
epsilon | No | The convergence coefficient. | 0.000001 |
lifecycle | No | The lifecycle of the input and output tables. | N/A |
coreNum | No | The number of cores used for calculation. | Automatically allocated |
memSizePerCore | No | The memory size of each core. | Automatically allocated |
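When the command is generated from a script, the parameters in the preceding table can be assembled programmatically. The following Python sketch only builds the command text. The helper name build_pai_command is hypothetical, and the sketch assumes that string values are quoted and numeric values are not, as in the sample command.

```python
from typing import Mapping, Union

ParamValue = Union[str, int, float]


def build_pai_command(name: str, project: str, params: Mapping[str, ParamValue]) -> str:
    """Assemble a PAI command string from a mapping of -D parameters."""
    parts = [f"PAI -name {name}", f"-project {project}"]
    for key, value in params.items():
        if isinstance(value, str):
            # String values are wrapped in double quotes, as in the sample command.
            parts.append(f'-D{key}="{value}"')
        else:
            parts.append(f"-D{key}={value}")
    # One parameter per line, terminated by a semicolon.
    return "\n    ".join(parts) + ";"


command = build_pai_command(
    "TextSummarization",
    "algo_public",
    {
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "docIdCol": "doc_id",
        "sentenceCol": "sentence",
        "topN": 2,
        "lifecycle": 30,
    },
)
print(command)
```

The resulting text can then be pasted into an SQL script. Optional parameters such as similarityType or dampingFactor are added to the mapping in the same way.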
Examples
Prepare the input table test_input. The following section provides an example.
You can use the MaxCompute client to create a table and use Tunnel commands to upload data. For information about how to install and configure the MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands.
doc_id | sentence
1000897 | Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent. The issue brings huge risks to public health security, causing widespread concern in society. Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results. During the process of cracking down illegal activities related to wild animals, law enforcement departments realized that the huge consumption of wild animals, huge profits of poaching, and the difficulty and high costs of identification are important reasons for the persistence of poaching of wild animals.
Parameters:
doc_id: the document ID column.
sentence: the sentence column.
Use the Sentence Splitting component to split the text in the sentence column into rows. Each row contains only one sentence. The following table provides an example of the output table, which is named test_output. For more information, see Sentence Splitting.
doc_id | sentence
1000897 | Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent.
1000897 | The issue brings huge risks to public health security, causing widespread concern in society.
1000897 | Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results.
1000897 | During the process of cracking down illegal activities related to wild animals, law enforcement departments realized that the huge consumption of wild animals, huge profits of poaching, and the difficulty and high costs of identification are important reasons for the persistence of poaching of wild animals.
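The one-sentence-per-row layout can be illustrated with a small Python sketch. It is not the Sentence Splitting component: the function name split_into_rows is hypothetical, and the rule used here (split after a period, exclamation mark, or question mark that is followed by whitespace) is a simplifying assumption that does not handle abbreviations.

```python
import re
from typing import List, Tuple

# Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_rows(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Return (doc_id, sentence) rows with one sentence per row."""
    sentences = [part.strip() for part in _SENTENCE_END.split(text.strip())]
    # Every row keeps the ID of the document that the sentence came from.
    return [(doc_id, sentence) for sentence in sentences if sentence]


rows = split_into_rows("1000897", "First sentence. Second sentence! Third one?")
for row in rows:
    print(row)
```

Each row repeats the document ID, which is what allows the Text Summarization component to group sentences by document through the docIdCol parameter.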
Run the following PAI command to generate a text summary.
You can use an SQL script or an ODPS SQL node component to run the following PAI command.
PAI -name TextSummarization
    -project algo_public
    -DinputTableName="test_output"
    -DoutputTableName="test_output1"
    -DdocIdCol="doc_id"
    -DsentenceCol="sentence"
    -DtopN=2
    -Dlifecycle=30;
The output table contains the doc_id and abstract columns.
doc_id | abstract
1000897 | Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent. Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results.
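In this example, topN is set to 2 and the abstract consists of the first and third input sentences in their original order. The following Python sketch shows one way such an abstract could be assembled from per-sentence scores. The function name build_abstract and the score values are hypothetical, and keeping the selected sentences in document order is an assumption inferred from this single example.

```python
from typing import List


def build_abstract(sentences: List[str], scores: List[float], top_n: int = 3) -> str:
    """Join the top_n highest-scoring sentences, kept in document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Indexes of the highest-scoring sentences.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:top_n])  # restore the original sentence order
    return " ".join(sentences[i] for i in selected)


# Made-up scores for four sentences of one document.
abstract = build_abstract(
    ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."],
    [0.9, 0.1, 0.8, 0.2],
    top_n=2,
)
print(abstract)
```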
References
Use the Sentence Splitting component to split the text into rows. Each row contains only one sentence. For more information, see Sentence Splitting.
For more information about Machine Learning Designer, see Overview of Machine Learning Designer.