Text Summarization component - Platform for AI

The Text Summarization component automatically generates abstracts based on the TextRank model: it ranks the sentences of a document and extracts the key sentences as the abstract. An abstract is a short, coherent text that accurately reflects the main idea of a document. This topic describes how to configure the Text Summarization component provided by Platform for AI (PAI).

Limits

The Text Summarization component runs only on the computing resources of MaxCompute.

Usage notes

You can use a Sentence Splitting component as an upstream component to split the text into rows. Each row contains only one sentence.

Configure the component

You can use one of the following methods to configure the Text Summarization component.

Method 1: Configure the component in the PAI console

You can configure the parameters of the Text Summarization component in Machine Learning Designer. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Column of Marked Document IDs	The name of the document ID column.
Fields Setting	Sentence Column	The sentence column. You can specify only one column.
Parameters Setting	Output First N Key Sentences	The top N key sentences that you want to obtain. Default value: 3.
Parameters Setting	Sentence Similarity Calculation Method	The method used to calculate sentence similarities. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine.
Parameters Setting	Weight of Matching String	The weight of a matched string. This parameter takes effect only if you set the Sentence Similarity Calculation Method parameter to ssk. Default value: 0.5.
Parameters Setting	Length of Substring	The length of a substring. This parameter takes effect only if you set the Sentence Similarity Calculation Method parameter to ssk or cosine. Default value: 2.
Parameters Setting	Damping Coefficient	The damping coefficient. Default value: 0.85.
Parameters Setting	Maximum Iterations	The maximum number of iterations. Default value: 100.
Parameters Setting	Convergence Coefficient	The convergence coefficient. Default value: 0.000001.
Tuning	Number of Cores	The number of cores used for calculation. By default, the system determines the value.
Tuning	Memory Size per Core	The memory size of each core. By default, the system determines the value.
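
The Sentence Similarity Calculation Method and Length of Substring parameters can be easier to reason about with a small example. The following Python sketch shows one plausible reading of two of the methods: cosine similarity over character substrings of length k, and a similarity based on the longest common subsequence (LCS). It is an illustration only. The function names, the use of character substrings, and the normalization by the longer string are assumptions, and the component's actual formulas may differ.

```python
import math
from collections import Counter


def char_ngrams(text: str, k: int = 2) -> Counter:
    """Count every substring of length k (k mirrors Length of Substring)."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def cosine_sim(a: str, b: str, k: int = 2) -> float:
    """Cosine similarity between the k-gram count vectors of two sentences."""
    va, vb = char_ngrams(a, k), char_ngrams(b, k)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # Sentences shorter than k have no k-grams; treat them as dissimilar.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_sim(a: str, b: str) -> float:
    """LCS length divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0
```

Both functions return a value between 0 and 1, where a higher value means that the two sentences are more alike. A larger k makes the cosine method stricter, because longer substrings must match exactly.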

Method 2: Configure the component by using PAI commands

You can use SQL scripts to call PAI commands. For more information, see SQL Script. The following sample command shows the syntax, and the table after it describes the parameters.

PAI -name TextSummarization
    -project algo_public
    -DinputTableName="test_input"
    -DoutputTableName="test_output"
    -DdocIdCol="doc_id"
    -DsentenceCol="sentence"
    -DtopN=2
    -Dlifecycle=30;

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	N/A
inputTablePartitions	No	The partitions selected from the input table for computing.	All partitions
outputTableName	Yes	The name of the output table.	N/A
docIdCol	Yes	The name of the document ID column.	N/A
sentenceCol	Yes	The sentence column. You can specify only one column.	N/A
topN	No	The top N key sentences that you want to obtain.	3
similarityType	No	The method used to calculate sentence similarities. Valid values: lcs_sim, levenshtein_sim, ssk, and cosine.	lcs_sim
lambda	No	The weight of a matched string. This parameter takes effect only if you set the similarityType parameter to ssk.	0.5
k	No	The length of a substring. This parameter takes effect only if you set the similarityType parameter to ssk or cosine.	2
dampingFactor	No	The damping coefficient.	0.85
maxIter	No	The maximum number of iterations.	100
epsilon	No	The convergence coefficient.	0.000001
lifecycle	No	The lifecycle of the input and output tables.	N/A
coreNum	No	The number of cores used for calculation.	Automatically allocated
memSizePerCore	No	The memory size of each core.	Automatically allocated
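
To show how dampingFactor, maxIter, epsilon, and topN interact, the following Python sketch runs a generic TextRank iteration over the sentences of one document. It is a minimal illustration under stated assumptions, not the component's implementation: the function names are hypothetical, `word_overlap` is a stand-in for the similarityType methods, and returning the selected sentences in their original order is an assumption that matches the sample output in this topic.

```python
from typing import Callable, List


def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard overlap of words); not a PAI method."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def textrank(sentences: List[str],
             similarity: Callable[[str, str], float],
             damping: float = 0.85,    # mirrors dampingFactor
             max_iter: int = 100,      # mirrors maxIter
             epsilon: float = 1e-6) -> List[float]:  # mirrors epsilon
    """Score sentences by iterating the weighted TextRank update."""
    n = len(sentences)
    if n == 0:
        return []
    # Edge weights between every pair of distinct sentences.
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(w[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * incoming)
        delta = max(abs(x - y) for x, y in zip(new_scores, scores))
        scores = new_scores
        if delta < epsilon:  # stop early once the scores stabilize
            break
    return scores


def top_n_summary(sentences: List[str], scores: List[float],
                  top_n: int = 3) -> str:
    """Join the top_n highest-scoring sentences in document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(ranked))
```

In this sketch, a smaller epsilon or a larger max_iter lets the scores settle more precisely at the cost of more iterations, and the damping value controls how much of a sentence's score comes from its neighbors rather than from the constant term.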

Examples

Prepare the input table test_input. The following section provides an example.

You can use the MaxCompute client to create a table and use Tunnel commands to upload data. For information about how to install and configure the MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands.

doc_id	sentence
1000897	Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent. The issue brings huge risks to public health security, causing widespread concern in society. Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results. During the process of cracking down illegal activities related to wild animals, law enforcement departments realized that the huge consumption of wild animals, huge profits of poaching, and the difficulty and high costs of identification are important reasons for the persistence of poaching of wild animals.

Columns:

doc_id: the document ID column.
sentence: the sentence column.

Use the Sentence Splitting component to split the text in the sentence column into rows. Each row contains only one sentence. The following table provides an example of the output table, which is named test_output. For more information, see Sentence Splitting.

doc_id	sentence
1000897	Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent.
1000897	The issue brings huge risks to public health security, causing widespread concern in society.
1000897	Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results.
1000897	During the process of cracking down illegal activities related to wild animals, law enforcement departments realized that the huge consumption of wild animals, huge profits of poaching, and the difficulty and high costs of identification are important reasons for the persistence of poaching of wild animals.
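
The one-sentence-per-row layout above can be pictured with a short Python sketch. It uses a naive punctuation rule and a hypothetical function name, and it only illustrates the shape of the data. It is not how the Sentence Splitting component works, and it would mishandle abbreviations or text without terminal punctuation.

```python
import re
from typing import List, Tuple


def split_into_rows(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Return one (doc_id, sentence) row per sentence in the text."""
    # Split on whitespace that follows ., !, or ? (naive sentence boundary).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(doc_id, part) for part in parts if part]


rows = split_into_rows(
    "1000897",
    "Since the outbreak of the Covid-19 pandemic, the issue of consuming "
    "wild animals has been prominent. The issue brings huge risks to public "
    "health security, causing widespread concern in society.",
)
for doc_id, sentence in rows:
    print(doc_id, sentence, sep="\t")
```

Every row keeps the same doc_id, which is what lets the Text Summarization component group the sentences of one document together.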

Run the following PAI command to generate a text summary.

You can use an SQL script or an ODPS SQL node component to run the following PAI commands.

PAI -name TextSummarization
    -project algo_public
    -DinputTableName="test_output"
    -DoutputTableName="test_output1"
    -DdocIdCol="doc_id"
    -DsentenceCol="sentence"
    -DtopN=2
    -Dlifecycle=30;

The output table contains the doc_id and abstract columns.

doc_id	abstract
1000897	Since the outbreak of the Covid-19 pandemic, the issue of consuming wild animals has been prominent. Public security, forestry, and market regulation departments across the country carried out relevant special actions to crack down on the illegal hunting, selling and consumption of wild animals, achieving remarkable results.

References

Use the Sentence Splitting component to split the text into rows. Each row contains only one sentence. For more information, see Sentence Splitting.
For more information about Machine Learning Designer, see Overview of Machine Learning Designer.