All Products
Search
Document Center

Platform For AI:Standardization

Last Updated:Jan 06, 2025

Standardization is a data preprocessing technique that converts feature data to a unified dimension. The Standardization component unifies the dimensions of different features by setting the mean to 0 and the standard deviation to 1. Standardization can enhance the convergence and performance of the models, especially models that use the gradient descent optimization algorithm.

Algorithm description

  • You can standardize one or more columns in a table and save the generated data to a new table.

  • The following formula is used for standardization: (X - Mean)/(Standard deviation).

    • Mean: the mean of samples.

    • Standard deviation: the standard deviation of samples. The standard deviation is used when samples are used to calculate the total deviation. To make the value obtained after standardization closer to the mean, you must moderately increase the calculated standard deviation by using the formula standard deviation.

    • The formula used to calculate the standard deviation of samples is expression.

      x represents the mean of samples X1, X2, ..., and Xn.

Configure the component

Method 1: Configure the component on the pipeline page

Add the Standardization component on the pipeline page of Machine Learning Designer and configure the component parameters described in the following table on the right side of the page.

Tab

Parameter

Description

Fields Setting

All Selected by Default

Default select all. Extra columns do not affect the prediction results.

Reserve Original Columns

Processed columns are prefixed with "stdized_". Supports DOUBLE and BIGINT types.

Input Sparse Matrix

Specifies whether to support the input data in the sparse format. If you enable this feature, configure the following parameters:

  • KV Pair Delimiter

  • KV Delimiter

  • KV Index

Tuning

Cores

The number of cores. The system automatically allocates cores used for training based on the volume of input data.

Memory Size per Core

The memory size of each core. The system automatically allocates the memory based on the volume of input data. Unit: MB.

Method 2: Use PAI commands

Configure the Standardization component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.

  • Command for dense data

    PAI -name Standardize
        -project algo_public 
        -DkeepOriginal="false"    
        -DoutputTableName="test_5"
        -DinputTablePartitions="pt=20150501"  
        -DinputTableName="bank_data_partition" 
        -DselectedColNames="euribor3m,pdays"
  • Command for sparse data

    PAI -name Standardize    
        -project projectxlib4
        -DkeepOriginal="true"
        -DoutputTableName="kv_standard_output"
        -DinputTableName=kv_standard_test
        -DselectedColNames="f0,f1,f2"
        -DenableSparse=true
        -DoutputParaTableName=kv_standard_model    
        -DkvIndices=1,2,8,6
        -DitemDelimiter=",";

Parameter

Required

Default value

Description

inputTableName

Yes

None

The name of the input table.

selectedColNames

No

All columns

The columns that are selected from the input table for training. The column names must be separated by commas (,). Columns of the INT and DOUBLE types are supported. If the input data is in the sparse format, columns of the STRING type are supported.

inputTablePartitions

No

All partitions

The partitions that are selected from the input table for training. The following formats are supported:

  • Partition_name=value

  • name1=value1/name2=value2: multi-level partitions

Note

If you specify multiple partitions, separate them with commas (,).

outputTableName

Yes

None

The name of the output table.

outputParaTableName

Yes

None

The name of the output parameter table.

inputParaTableName

No

None

The name of the input parameter table.

keepOriginal

No

false

Specifies whether to reserve original columns. Valid values:

  • true: renames the standardized columns with the stdized_ prefix and reserves original columns.

  • false: reserves all columns without renaming them.

lifecycle

No

None

The lifecycle of the output table.

coreNum

No

Automatically assigned by the system

The number of cores.

memSizePerCore

No

Automatically assigned by the system

The memory size of each core.

enableSparse

No

false

Specifies whether to support the input data in the sparse format. Valid values:

  • true

  • false

itemDelimiter

No

","

The delimiter used between key-value pairs.

kvDelimiter

No

":"

The delimiter used between keys and values.

kvIndices

No

None

The feature indexes that require standardization in the table that contains data in the key-value format.

Examples

  • Generate input data

    drop table if exists standardize_test_input;
    create table standardize_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table standardize_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
    ) tmp;
  • Run PAI commands

    drop table if exists standardize_test_input_output;
    drop table if exists standardize_test_input_model_output;
    PAI -name Standardize
        -project algo_public
        -DoutputParaTableName="standardize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="standardize_test_input_output"
        -DinputTableName="standardize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    drop table if exists standardize_test_input_output_using_model;
    drop table if exists standardize_test_input_output_using_model_model_output;
    PAI -name Standardize
        -project algo_public
        -DoutputParaTableName="standardize_test_input_output_using_model_model_output"
        -DinputParaTableName="standardize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="standardize_test_input_output_using_model"
        -DinputTableName="standardize_test_input";
  • Input

    standardize_test_input

    col_string

    col_bigint

    col_double

    col_boolean

    col_datetime

    01

    10

    10.1

    true

    2016-07-01 10:00:00

    NULL

    11

    10.2

    false

    2016-07-02 10:00:00

    02

    NULL

    10.3

    true

    2016-07-03 10:00:00

    03

    12

    NULL

    false

    2016-07-04 10:00:00

    04

    13

    10.4

    NULL

    2016-07-05 10:00:00

    05

    14

    10.5

    true

    NULL

  • Output

    • standardize_test_input_output

      col_string

      col_bigint

      col_double

      col_boolean

      col_datetime

      stdized_col_bigint

      stdized_col_double

      01

      10

      10.1

      true

      2016-07-01 10:00:00

      -1.2649110640673518

      -1.2649110640683832

      NULL

      11

      10.2

      false

      2016-07-02 10:00:00

      -0.6324555320336759

      -0.6324555320341972

      02

      NULL

      10.3

      true

      2016-07-03 10:00:00

      NULL

      0.0

      03

      12

      NULL

      false

      2016-07-04 10:00:00

      0.0

      NULL

      04

      13

      10.4

      NULL

      2016-07-05 10:00:00

      0.6324555320336759

      0.6324555320341859

      05

      14

      10.5

      true

      NULL

      1.2649110640673518

      1.2649110640683718

    • standardize_test_input_model_output

      feature

      json

      col_bigint

      {“name”: “standardize”, “type”:”bigint”, “paras”:{“mean”:12, “std”: 1.58113883008419}}

      col_double

      {“name”: “standardize”, “type”:”double”, “paras”:{“mean”:10.3, “std”: 0.1581138830082909}}

    • standardize_test_input_output_using_model

      col_string

      col_bigint

      col_double

      col_boolean

      col_datetime

      01

      -1.2649110640673515

      -1.264911064068383

      true

      2016-07-01 10:00:00

      NULL

      -0.6324555320336758

      -0.6324555320341971

      false

      2016-07-02 10:00:00

      02

      NULL

      0.0

      true

      2016-07-03 10:00:00

      03

      0.0

      NULL

      false

      2016-07-04 10:00:00

      04

      0.6324555320336758

      0.6324555320341858

      NULL

      2016-07-05 10:00:00

      05

      1.2649110640673515

      1.2649110640683716

      true

      NULL

    • standardize_test_input_output_using_model_model_output

      feature

      json

      col_bigint

      {“name”: “standardize”, “type”:”bigint”, “paras”:{“mean”:12, “std”: 1.58113883008419}}

      col_double

      {“name”: “standardize”, “type”:”double”, “paras”:{“mean”:10.3, “std”: 0.1581138830082909}}