This topic describes the Sentence Splitting component provided by Machine Learning Designer (formerly known as Machine Learning Studio).
The Sentence Splitting component splits the text of a document into sentences at punctuation marks and writes each sentence to its own row. It is used to preprocess text before text summarization.
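The splitting idea can be illustrated with a minimal Python sketch. This is not the component's implementation: the function name `split_sentences` is hypothetical, and the sketch assumes that each sentence keeps its closing delimiter and that surrounding whitespace is trimmed. It is deliberately naive and also cuts at periods inside numbers or abbreviations.

```python
def split_sentences(text, delimiters=".!?"):
    """Split text into sentences, cutting right after each delimiter character.

    Illustrative only: the default delimiters mirror the period,
    exclamation point, and question mark used by default in the component.
    """
    sentences = []
    current = []
    for ch in text:
        current.append(ch)
        if ch in delimiters:
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    # Keep trailing text that has no closing delimiter.
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    # Each printed line corresponds to one output row.
    for row in split_sentences("It rained. Did you bring an umbrella? No!"):
        print(row)
```

Passing a different `delimiters` string, for example `".!?;"`, changes where the text is cut, which is the role the delimiter setting plays in the component.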
Configure the component
You can use one of the following methods to configure the Sentence Splitting component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Sentence Splitting component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description |
---|---|---|
Fields Setting | Column of Marked Document IDs | The name of the document ID column. |
Fields Setting | Marked Document Content Column | The name of the document content column. |
Fields Setting | Sentence Delimiter Set | The delimiters used to separate sentences. The default delimiters are periods (.), exclamation points (!), and question marks (?). |
Tuning | Cores | The number of cores. By default, the system determines the value. |
Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name SplitSentences
-project algo_public
-DinputTableName="test_input"
-DoutputTableName="test_output"
-DdocIdCol="doc_id"
-DdocContent="content"
-Dlifecycle=30
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | No default value |
inputTablePartitions | No | The partitions selected from the input table for computing. | All partitions |
outputTableName | Yes | The name of the output table. | No default value |
docIdCol | Yes | The name of the document ID column. | No default value |
docContent | Yes | The name of the document content column. You can specify only one column. | No default value |
delimiter | No | The delimiters used to separate sentences. | Period (.), exclamation point (!), and question mark (?) |
lifecycle | No | The lifecycle of the input and output tables. | No default value |
coreNum | No | The number of cores used for calculation. | Determined by the system |
memSizePerCore | No | The memory size of each core. | Determined by the system |
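If the command is generated from a script, the parameters in the preceding table can be assembled programmatically. The following Python sketch is one way to do that. `build_split_sentences_command` is a hypothetical helper, not part of PAI. It assumes that string values are wrapped in double quotation marks and integer values are written bare, as in the sample command above.

```python
# Parameter names taken from the parameter table; the helper itself is hypothetical.
REQUIRED = ("inputTableName", "outputTableName", "docIdCol", "docContent")
OPTIONAL = ("inputTablePartitions", "delimiter", "lifecycle", "coreNum", "memSizePerCore")


def build_split_sentences_command(params, project="algo_public"):
    """Return the text of a PAI SplitSentences command built from a dict."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    unknown = [name for name in params if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    lines = ["PAI -name SplitSentences", f"-project {project}"]
    for name in REQUIRED + OPTIONAL:
        if name in params:
            value = params[name]
            # Assumption: integers are written bare, everything else is quoted.
            rendered = str(value) if isinstance(value, int) else f'"{value}"'
            lines.append(f"-D{name}={rendered}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_split_sentences_command({
        "inputTableName": "test_input",
        "outputTableName": "test_output",
        "docIdCol": "doc_id",
        "docContent": "content",
        "lifecycle": 30,
    }))
```

Optional parameters that are omitted from the dict are left out of the command, so the defaults listed in the table apply.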
Example
The output table contains the doc_id and sentence columns.
doc_id | sentence |
---|---|
1000894 | In 2008, the Shanghai Stock Exchange published disclosure guidelines on the corporate social responsibility (CSR) of listed companies. Three types of companies were urged to disclose their CSR reports, and other qualified listed companies were encouraged to voluntarily disclose their CSR reports. |
1000894 | In 2012, a total of 379 listed companies, accounting for 40% of all listed companies, disclosed CSR reports. Among those companies, 305 were mandated to disclose CSR reports and 74 voluntarily disclosed CSR reports. |