All Products
Search
Document Center

Platform For AI:Normalization

Last Updated:May 17, 2024

This topic describes the Normalization component provided by Machine Learning Designer (formerly known as Machine Learning Studio).

Configure the component

You can use one of the following methods to configure the Normalization component.

Method 1: Configure the component on the pipeline page

Configure the component parameters on the pipeline page of Machine Learning Designer.

Tab

Parameter

Description

Fields Setting

All Selected by Default

By default, all columns in the input table are selected. Specific columns may not be used for training. These columns do not affect the prediction result.

Reserve Original Columns

Specifies whether to reserve original columns. Column names are prefixed with normalized_ after normalization. Only columns of the DOUBLE or BIGINT type can be reserved.

Tuning

Cores

The number of cores. The system automatically allocates cores used for training based on the volume of input data.

Memory Size per Core

The memory size of each core. The system automatically allocates the memory based on the volume of input data. Unit: MB.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

  • Command for dense data

    PAI -name Normalize
        -project algo_public
        -DkeepOriginal="true"
        -DoutputTableName="test_4"
        -DinputTablePartitions="pt=20150501"
        -DinputTableName="bank_data_partition"
        -DselectedColNames="emp_var_rate,euribor3m"
  • Command for sparse data

    PAI -name Normalize
        -project projectxlib4
        -DkeepOriginal="true"
        -DoutputTableName="kv_norm_output"
        -DinputTableName=kv_norm_test
        -DselectedColNames="f0,f1,f2"
        -DenableSparse=true
        -DoutputParaTableName=kv_norm_model
        -DkvIndices=1,2,8,6
        -DitemDelimiter=",";

Parameter

Required

Description

Default value

inputTableName

Yes

The name of the input table.

No default value

selectedColNames

No

The columns that are selected from the input table for training. The column names must be separated by commas (,). Columns of the INT and DOUBLE types are supported. If the input data is in the sparse format, columns of the STRING type are supported.

All columns

inputTablePartitions

No

The partitions that are selected from the input table for training. The following formats are supported:

  • Partition_name=value

  • name1=value1/name2=value2: multi-level partitions

Note

If you specify multiple partitions, separate them with commas (,).

All partitions

outputTableName

Yes

The name of the output table.

No default value

outputParaTableName

No

The name of the output parameter table.

No default value

inputParaTableName

Yes

The name of the input parameter table.

No default value

keepOriginal

No

Specifies whether to reserve original columns. Valid values:

  • true: renames the normalized columns with the normalized_ prefix and reserves original columns.

  • false: reserves all columns without renaming them.

false

lifecycle

No

The lifecycle of the output table. Valid values: [1,3650].

No default value

coreNum

No

The number of cores used in computing. The value must be a positive integer.

Determined by the system

memSizePerCore

No

The memory size of each core. Unit: MB. Valid values: (1,65536).

Determined by the system

enableSparse

No

Specifies whether to support the input data in the sparse format. Valid values:

  • true

  • false

false

itemDelimiter

No

The delimiter used between key-value pairs.

,

kvDelimiter

No

The delimiter used between keys and values.

:

kvIndices

No

The feature indexes that require normalization in the table that contains data in the key-value format.

No default value

Example

  • Generate input data

    drop table if exists normalize_test_input;
    create table normalize_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table normalize_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
    ) tmp;
  • Run PAI commands

    drop table if exists normalize_test_input_output;
    drop table if exists normalize_test_input_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output"
        -DinputTableName="normalize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    drop table if exists normalize_test_input_output_using_model;
    drop table if exists normalize_test_input_output_using_model_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_output_using_model_model_output"
        -DinputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output_using_model"
        -DinputTableName="normalize_test_input";
  • Input

    normalize_test_input

    col_string

    col_bigint

    col_double

    col_boolean

    col_datetime

    01

    10

    10.1

    true

    2016-07-01 10:00:00

    NULL

    11

    10.2

    false

    2016-07-02 10:00:00

    02

    NULL

    10.3

    true

    2016-07-03 10:00:00

    03

    12

    NULL

    false

    2016-07-04 10:00:00

    04

    13

    10.4

    NULL

    2016-07-05 10:00:00

    05

    14

    10.5

    true

    NULL

  • Output

    • normalize_test_input_output

      col_string

      col_bigint

      col_double

      col_boolean

      col_datetime

      normalized_col_bigint

      normalized_col_double

      01

      10

      10.1

      true

      2016-07-01 10:00:00

      0.0

      0.0

      NULL

      11

      10.2

      false

      2016-07-02 10:00:00

      0.25

      0.2499999999999989

      02

      NULL

      10.3

      true

      2016-07-03 10:00:00

      NULL

      0.5000000000000022

      03

      12

      NULL

      false

      2016-07-04 10:00:00

      0.5

      NULL

      04

      13

      10.4

      NULL

      2016-07-05 10:00:00

      0.75

      0.7500000000000011

      05

      14

      10.5

      true

      NULL

      1.0

      1.0

    • normalize_test_input_model_output

      feature

      json

      col_bigint

      {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}}

      col_double

      {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}}

    • normalize_test_input_output_using_model

      col_string

      col_bigint

      col_double

      col_boolean

      col_datetime

      01

      0.0

      0.0

      true

      2016-07-01 10:00:00

      NULL

      0.25

      0.2499999999999989

      false

      2016-07-02 10:00:00

      02

      NULL

      0.5000000000000022

      true

      2016-07-03 10:00:00

      03

      0.5

      NULL

      false

      2016-07-04 10:00:00

      04

      0.75

      0.7500000000000011

      NULL

      2016-07-05 10:00:00

      05

      1.0

      1.0

      true

      NULL

    • normalize_test_input_output_using_model_model_output

      feature

      json

      col_bigint

      {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}}

      col_double

      {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}}