Sentence Splitting

This topic describes the Sentence Splitting component provided by Machine Learning Designer.

The Sentence Splitting component splits the text of a document into sentences at punctuation delimiters and writes each sentence to its own row, so that each row contains only one sentence. It is used to preprocess text before text summarization.
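The splitting behavior described above can be sketched in Python. This is a minimal illustration, not the component's actual implementation: the function name split_sentences is hypothetical, and the sketch assumes that each delimiter stays attached to the sentence it ends and that surrounding whitespace is trimmed.

```python
import re

# Documented default delimiters: period, exclamation point, question mark.
DEFAULT_DELIMITERS = ".!?"


def split_sentences(text, delimiters=DEFAULT_DELIMITERS):
    """Return the sentences in text, one list item per sentence.

    Hypothetical helper for illustration only. A sentence is a run of
    non-delimiter characters followed by any delimiters that end it.
    """
    # re.escape keeps every delimiter literal inside the character class.
    cls = re.escape(delimiters)
    pattern = re.compile("[^{0}]+[{0}]*".format(cls))
    sentences = []
    for chunk in pattern.findall(text):
        sentence = chunk.strip()
        if sentence:  # skip chunks that contain only whitespace
            sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    for row in split_sentences("It rains. Stay in! Why?"):
        print(row)
```

A purely punctuation-based split like this one also breaks on periods inside numbers or abbreviations, so a custom delimiter set may be preferable for such text.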

Configure the component

You can use one of the following methods to configure the Sentence Splitting component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Sentence Splitting component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.


Tab	Parameter	Description
Fields Setting	Column of Marked Document IDs	The name of the document ID column.
Fields Setting	Marked Document Content Column	The name of the document content column.
Fields Setting	Sentence Delimiter Set	The delimiters used to separate sentences. The default delimiters are periods (.), exclamation points (!), and question marks (?).
Tuning	Cores	The number of cores. By default, the system determines the value.
Tuning	Memory Size per Core	The memory size of each core. By default, the system determines the value.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name SplitSentences    
    -project algo_public    
    -DinputTableName="test_input"    
    -DoutputTableName="test_output"    
    -DdocIdCol="doc_id"    
    -DdocContent="content"    
    -Dlifecycle=30


Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	No default value
inputTablePartitions	No	The partitions selected from the input table for computing.	All partitions
outputTableName	Yes	The name of the output table.	No default value
docIdCol	Yes	The name of the document ID column.	No default value
docContent	Yes	The name of the document content column. You can specify only one column.	No default value
delimiter	No	The delimiters used to separate sentences.	Period (.), exclamation point (!), and question mark (?)
lifecycle	No	The lifecycle of the input and output tables, in days.	No default value
coreNum	No	The number of cores used for calculation.	Determined by the system
memSizePerCore	No	The memory size of each core.	Determined by the system
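
As a rough illustration of how the parameters in the preceding table map to the command, the following Python sketch assembles a SplitSentences command string. The helper name build_split_sentences_command is hypothetical and is not part of PAI. The quoting convention (strings in double quotes, integers bare) is an assumption inferred from the sample command above.

```python
# Parameter names are taken from the table above.
REQUIRED = ("inputTableName", "outputTableName", "docIdCol", "docContent")
OPTIONAL = ("inputTablePartitions", "delimiter", "lifecycle",
            "coreNum", "memSizePerCore")


def build_split_sentences_command(**params):
    """Return a PAI command string for the SplitSentences component.

    Hypothetical helper for illustration only. It checks that required
    parameters are present and that no unknown parameters are passed.
    """
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    unknown = [name for name in params if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    lines = ["PAI -name SplitSentences", "    -project algo_public"]
    for name in REQUIRED + OPTIONAL:
        if name not in params:
            continue  # omitted optional parameters fall back to their defaults
        value = params[name]
        # Assumption: integers are written bare, everything else is quoted.
        rendered = str(value) if isinstance(value, int) else '"{}"'.format(value)
        lines.append("    -D{}={}".format(name, rendered))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_split_sentences_command(
        inputTableName="test_input",
        outputTableName="test_output",
        docIdCol="doc_id",
        docContent="content",
        lifecycle=30,
    ))
```

Called with the values from the sample command, the helper reproduces that command. Optional parameters that are left out are simply not emitted, so the defaults listed in the table apply.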

Example

The output table contains the doc_id and sentence columns.


doc_id	sentence
1000894	In 2008, the Shanghai Stock Exchange published disclosure guidelines on the corporate social responsibility (CSR) of listed companies. Three types of companies were urged to disclose their CSR reports, and other qualified listed companies were encouraged to voluntarily disclose their CSR reports.
1000894	In 2012, a total of 379 listed companies, which made up 40% of all listed companies, disclosed CSR reports. Among those companies, 305 were mandated to disclose CSR reports and 74 voluntarily disclosed CSR reports.
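
The shape of the output table, in which the document ID is repeated for every sentence from the same document, can be sketched as follows. This is an illustration under assumptions rather than the component's implementation: the function name explode_documents is hypothetical, and the input rows are made-up sample data, not the input that produced the table above.

```python
import re


def explode_documents(rows, delimiters=".!?"):
    """Yield (doc_id, sentence) pairs, one pair per sentence.

    Hypothetical helper for illustration only. rows is an iterable of
    (doc_id, content) pairs, which mirrors the docIdCol and docContent columns.
    """
    cls = re.escape(delimiters)
    pattern = re.compile("[^{0}]+[{0}]*".format(cls))
    for doc_id, content in rows:
        for chunk in pattern.findall(content):
            sentence = chunk.strip()
            if sentence:
                # The document ID is carried onto every sentence row.
                yield doc_id, sentence


if __name__ == "__main__":
    sample_rows = [(1, "First point. Second point!"), (2, "Only one?")]
    for doc_id, sentence in explode_documents(sample_rows):
        print("{}\t{}".format(doc_id, sentence))
```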