Platform For AI: Word Splitting

Last Updated: May 17, 2024

This topic describes the Word Splitting component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

This component splits the words in the specified columns by using Alibaba Word Segmenter (AliWS). The words obtained after splitting are separated by spaces. If you enable the Pos Tagger or Semantic Tagger parameter, the output contains the split words together with the Part-of-Speech (POS) tagging results and the semantic tagging results. The POS tagging results are delimited by forward slashes (/), and the semantic tagging results are delimited by vertical bars (|).
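The output format described above can be read back with a small parser. The following is a minimal sketch, not the component's own logic. It assumes that each space-separated token looks like word, word/POS, or word/POS|SEM. This topic does not spell out that exact layout, so check it against real output first. The Token type, the parse_segmented function, and the tag values in the comments are hypothetical names used for illustration.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    pos: Optional[str]       # POS tag, or None if the token carries none
    semantic: Optional[str]  # semantic tag, or None if the token carries none


def parse_segmented(text: str) -> List[Token]:
    """Parse one output cell into tokens.

    Assumed layout (an assumption, not confirmed by this topic):
    tokens are separated by spaces, and each token is
    word, word/POS, or word/POS|SEM.
    """
    tokens = []
    for raw in text.split():
        # Split off the semantic tag at the first vertical bar, if any.
        body, _, semantic = raw.partition("|")
        # Split off the POS tag at the last forward slash, if any.
        word, slash, pos = body.rpartition("/")
        if not slash or not word:
            # No usable POS separator: treat the whole body as the word.
            word, pos = body, ""
        tokens.append(Token(word, pos or None, semantic or None))
    return tokens
```

For example, parse_segmented("good/a day/n|time") yields two tokens, the second of which carries both a POS tag and a semantic tag. Words that themselves contain a vertical bar are not handled by this sketch.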

The tokenizer can be TAOBAO_CHN or INTERNET_CHN.

You can configure the component by using the Machine Learning Platform for AI (PAI) console or a PAI command.

Configure the component

You can use one of the following methods to configure the Word Splitting component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Word Splitting component on the pipeline page of Machine Learning Designer. The following table describes the parameters.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Column | The columns used for word splitting. |
| Parameters Setting | Recognition Options | The types of content to recognize. Valid values: Recognize Simple Entities, Recognize Individual Names, Recognize Organization Names, Recognize Telephone Numbers, Recognize Times, Recognize Dates, and Recognize Alphanumeric Characters. By default, the following options are selected: Recognize Simple Entities, Recognize Telephone Numbers, Recognize Times, Recognize Dates, and Recognize Alphanumeric Characters. |
| Parameters Setting | Merge Options | The types of content to merge. Valid values: Merge Chinese Numbers, Merge Arabic Numerals, Merge Chinese Dates, and Merge Chinese Times. Default value: Merge Arabic Numerals. |
| Parameters Setting | Tokenizer | The type of the tokenizer. Valid values: TAOBAO_CHN and INTERNET_CHN. Default value: TAOBAO_CHN. |
| Parameters Setting | Pos Tagger | Specifies whether to enable POS tagging. By default, POS tagging is enabled. |
| Parameters Setting | Semantic Tagger | Specifies whether to enable semantic tagging. By default, semantic tagging is disabled. |
| Parameters Setting | Filter Out Words That Contain Only Numbers | Specifies whether to filter out segmented words that consist only of digits. By default, this option is cleared. |
| Parameters Setting | Filter Out Words That Contain Only English Letters | Specifies whether to filter out segmented words that consist only of English letters. By default, this option is cleared. |
| Parameters Setting | Filter Out Words That Contain Only Punctuations | Specifies whether to filter out segmented words that consist only of punctuation marks. By default, this option is cleared. |
| Tuning | Cores | The number of cores. By default, the system determines the value. |
| Tuning | Memory Size per Core | The memory size of each core. By default, the system determines the value. |
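The console options above correspond to the PAI command parameters described under Method 2. The following sketch pairs them by matching the descriptions in the two parameter tables of this topic. The pairing is inferred from those descriptions and is not an official mapping. OPTION_TO_PARAM and options_to_flags are hypothetical names.

```python
# Console option label -> PAI command parameter, paired by matching the
# descriptions in this topic (an inferred mapping, not an official one).
OPTION_TO_PARAM = {
    "Recognize Simple Entities": "enableDfa",
    "Recognize Individual Names": "enablePersonNameTagger",
    # The parameter name is spelled "Orgnization" in the command.
    "Recognize Organization Names": "enableOrgnizationTagger",
    "Recognize Telephone Numbers": "enableTelephoneRetrievalUnit",
    "Recognize Times": "enableTimeRetrievalUnit",
    "Recognize Dates": "enableDateRetrievalUnit",
    "Recognize Alphanumeric Characters": "enableNumberLetterRetrievalUnit",
    "Merge Chinese Numbers": "enableChnNumMerge",
    "Merge Arabic Numerals": "enableNumMerge",
    "Merge Chinese Dates": "enableChnDateMerge",
    "Merge Chinese Times": "enableChnTimeMerge",
    "Pos Tagger": "enablePosTagger",
    "Semantic Tagger": "enableSemanticTagger",
}


def options_to_flags(selected):
    """Return {parameter: "true" or "false"} for every mapped option."""
    unknown = set(selected) - set(OPTION_TO_PARAM)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    return {
        param: str(label in selected).lower()
        for label, param in OPTION_TO_PARAM.items()
    }


# The recognition and merge options that the table lists as selected by default.
selected = [
    "Recognize Simple Entities",
    "Recognize Telephone Numbers",
    "Recognize Times",
    "Recognize Dates",
    "Recognize Alphanumeric Characters",
    "Merge Arabic Numerals",
]
flags = options_to_flags(selected)
```

The three Filter Out options are left out of the mapping because the command parameter table in this topic does not list a counterpart for them.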

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
    -DcolName=content
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=true

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | No default value |
| inputTablePartitions | No | The partitions selected from the input table for word splitting. Specify each partition in the partition_name=value format. For a multi-level partition, use the name1=value1/name2=value2 format. If you specify multiple partitions, separate them with commas (,). | All partitions |
| selectedColNames | Yes | The names of the columns selected from the input table for word splitting. If multiple columns are specified, separate them with commas (,). | No default value |
| dictTableName | No | The name of the custom dictionary table. A custom dictionary has only one column, and each row contains only one word. | No default value |
| tokenizer | No | The type of the tokenizer. Valid values: TAOBAO_CHN and INTERNET_CHN. | TAOBAO_CHN |
| enableDfa | No | Specifies whether to recognize simple entities. Valid values: True and False. | True |
| enablePersonNameTagger | No | Specifies whether to recognize individual names. Valid values: True and False. | False |
| enableOrgnizationTagger | No | Specifies whether to recognize organization names. Valid values: True and False. | False |
| enablePosTagger | No | Specifies whether to enable POS tagging. Valid values: True and False. | False |
| enableTelephoneRetrievalUnit | No | Specifies whether to recognize telephone numbers. Valid values: True and False. | True |
| enableTimeRetrievalUnit | No | Specifies whether to recognize time expressions. Valid values: True and False. | True |
| enableDateRetrievalUnit | No | Specifies whether to recognize date expressions. Valid values: True and False. | True |
| enableNumberLetterRetrievalUnit | No | Specifies whether to recognize digits and letters. Valid values: True and False. | True |
| enableChnNumMerge | No | Specifies whether to merge Chinese numbers into a retrieval unit. Valid values: True and False. | False |
| enableNumMerge | No | Specifies whether to merge Arabic numerals into a retrieval unit. Valid values: True and False. | True |
| enableChnTimeMerge | No | Specifies whether to merge Chinese time expressions into a semantic unit. Valid values: True and False. | False |
| enableChnDateMerge | No | Specifies whether to merge Chinese date expressions into a semantic unit. Valid values: True and False. | False |
| enableSemanticTagger | No | Specifies whether to enable semantic tagging. Valid values: True and False. | False |
| outputTableName | Yes | The name of the output table. | No default value |
| outputTablePartition | No | The names of the partitions in the output table. | No default value |
| coreNum | No | The number of cores. This parameter takes effect only when the memSizePerCore parameter is set. The value must be a positive integer in the range of [1, 9999]. | Determined by the system |
| memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer in the range of [1024, 64 × 1024]. | Determined by the system |
| lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | No default value |
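The parameters above can be assembled into a command string programmatically, for example before the string is pasted into an SQL Script component. The following is a minimal sketch under stated assumptions. partition_spec and build_split_word_command are hypothetical helper names. Reading name1=value1/name2=value2 as the levels of one partition, with commas between partitions, is an interpretation of the inputTablePartitions description. The partition names ds and hh are made up for illustration.

```python
def partition_spec(partitions):
    """Format a value for inputTablePartitions.

    Each partition is a list of (name, value) pairs, one pair per level.
    Levels are joined with "/" and partitions with "," (an assumed reading
    of the parameter description).
    """
    return ",".join(
        "/".join(f"{name}={value}" for name, value in levels)
        for levels in partitions
    )


def build_split_word_command(input_table, selected_cols, output_table, **optional):
    """Assemble a `pai -name split_word` command from the required
    parameters plus any optional ones passed as keyword arguments."""
    params = {
        "inputTableName": input_table,
        "selectedColNames": ",".join(selected_cols),
        "outputTableName": output_table,
    }
    params.update(optional)
    lines = ["pai -name split_word", "    -project algo_public"]
    for key, value in params.items():
        if isinstance(value, bool):
            # The sample command in this topic writes booleans in lowercase.
            value = str(value).lower()
        lines.append(f"    -D{key}={value}")
    return "\n".join(lines)


# Hypothetical two-level partitions named ds and hh.
spec = partition_spec([
    [("ds", "20240517"), ("hh", "00")],
    [("ds", "20240518"), ("hh", "00")],
])
command = build_split_word_command(
    "pai_aliws_test", ["content"], "doc_test_split_word",
    inputTablePartitions=spec, enablePosTagger=True,
)
```

The helper performs no validation. Parameter names and value ranges still need to match the table above.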

If you use a regular table, we recommend that you do not set the coreNum and memSizePerCore parameters. The Word Splitting component automatically determines the parameter values by default.

If your resources are limited, you can use the following code to calculate the number of cores and the memory size of each core:

def CalcCoreNumAndMem(row, col, kOneCoreDataSize=1024):
    """Calculates the number of cores and the memory size of each core.

    Args:
        row: the number of rows in the input table.
        col: the number of columns in the input table.
        kOneCoreDataSize: the amount of data that each core can process. Unit: MB.
            The value must be a positive integer. Default value: 1024.
    Returns:
        (coreNum, memSizePerCore)
    Example:
        coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 100, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Number of cores: the input size, estimated as row * col * 1000 bytes and
    # converted to MB, divided by the data size per core. At least one core.
    coreNum = max(1, int(row * col * 1000 / kMBytes / kOneCoreDataSize))
    # Memory size per core (MB): twice the data size per core, at least 1024.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
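The calculated values still have to fall within the ranges documented for coreNum and memSizePerCore. The following sketch restates the same arithmetic so that it runs on its own, clamps the results to those ranges, and formats them as command flags. calc_core_num_and_mem and tuning_flags are hypothetical names. Clamping is an illustrative choice, not documented behavior of the component.

```python
def calc_core_num_and_mem(row, col, one_core_data_size=1024):
    """Same arithmetic as CalcCoreNumAndMem, restated so this block runs alone."""
    m_bytes = 1024.0 * 1024.0
    core_num = max(1, int(row * col * 1000 / m_bytes / one_core_data_size))
    mem_size_per_core = max(1024, int(one_core_data_size * 2))
    return core_num, mem_size_per_core


def tuning_flags(row, col, one_core_data_size=1024):
    """Return -DcoreNum and -DmemSizePerCore flags within the documented ranges."""
    core_num, mem = calc_core_num_and_mem(row, col, one_core_data_size)
    core_num = min(max(core_num, 1), 9999)   # coreNum range: [1, 9999]
    mem = min(max(mem, 1024), 64 * 1024)     # memSizePerCore range: [1024, 64 * 1024] MB
    return f"-DcoreNum={core_num} -DmemSizePerCore={mem}"


# 1,000,000 rows and 100 columns, with 2048 MB of data per core.
print(tuning_flags(1_000_000, 100, one_core_data_size=2048))
# prints "-DcoreNum=46 -DmemSizePerCore=4096"
```

For a small table such as 1,000 rows by 100 columns, the calculation bottoms out at one core.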

Example

  • Generated data

    create table pai_aliws_test
    as select
        1 as id,
        'Today is a good day. The weather is fine and sunny.' as content;
  • PAI command

    pai -name split_word
        -project algo_public
        -DinputTableName=pai_aliws_test
        -DselectedColNames=content
        -DoutputTableName=doc_test_split_word
  • Input description

    The input table consists of two columns: id and content.

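    +------------+-----------------------------------------------------+
    | id         | content                                             |
    +------------+-----------------------------------------------------+
    | 1          | Today is a good day. The weather is fine and sunny. |
    +------------+-----------------------------------------------------+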
  • Output description

    • The words in the columns selected for word splitting are split and then returned. The remaining columns are returned unchanged.

    • When a custom dictionary is used, the system splits words based on the custom dictionary and context.