All Products
Search
Document Center

Platform For AI:Standardization

Last Updated:Dec 31, 2024

Standardization is a data preprocessing technique that scales feature data to a uniform range. It normalizes the dimensions of different features by setting the mean to 0 and the standard deviation to 1. This is particularly beneficial in gradient descent optimization algorithms, where it can markedly enhance the model's convergence speed and performance.

Algorithm description

  • Standardizing one or more columns in a table results in the creation of a new table with the processed data.

  • The standardization formula is: (X - Mean) / Standard Deviation.

    • Mean: The sample mean.

    • Standard Deviation: The sample standard deviation, which estimates the population deviation. To approximate the population level more closely, the calculated standard deviation is adjusted, i.e., standard deviation.

    • The formula for sample standard deviation is: expression.

      Here, x denotes the mean of the sample X1, X2, ..., Xn.

Configuration widget

Method 1: Visualization method

Place the Standardization widget onto the Designer workflow page and set the parameters on the right side of the interface:

Parameter type

Parameter

Description

Field set

Default select all

Default select all. Extra columns do not affect the prediction results.

Retain original columns

Processed columns are prefixed with "stdized_". Supports DOUBLE and BIGINT types.

Sparse support

Whether to enable sparse support. If enabled, configure:

  • Separator between KV pairs.

  • Separator between Key and Value.

  • Feature index in the KV table that needs normalization.

Execution optimization

Number of computing cores

The system automatically assigns the number of training instances based on the input data volume.

Memory per core

The system automatically assigns memory based on the input data volume. The unit is MB.

Method 2: PAI command method

Configure the Standardization widget parameters using the PAI command. The SQL script widget can be used to invoke the PAI command. For more information, see Scenario 4: Execute PAI command in SQL script widget.

  • Command for dense data

    PAI -name Standardize
        -project algo_public 
        -DkeepOriginal="false"    
        -DoutputTableName="test_5"
        -DinputTablePartitions="pt=20150501"  
        -DinputTableName="bank_data_partition" 
        -DselectedColNames="euribor3m,pdays"
  • Command for sparse data

    PAI -name Standardize    
        -project projectxlib4
        -DkeepOriginal="true"
        -DoutputTableName="kv_standard_output"
        -DinputTableName=kv_standard_test
        -DselectedColNames="f0,f1,f2"
        -DenableSparse=true
        -DoutputParaTableName=kv_standard_model    
        -DkvIndices=1,2,8,6
        -DitemDelimiter=",";

Parameter name

Required

Default value

Parameter description

inputTableName

Yes

None

Name of the input table.

selectedColNames

No

All columns

Columns in the input table that participate in training. Column names are separated by commas (,) and support INT and DOUBLE types. If the input is in sparse format, STRING type columns are supported.

inputTablePartitions

No

All partitions

Partitions in the input table that participate in training. The following formats are supported:

  • Partition_name=value

  • name1=value1/name2=value2: Multi-level format

Note

If multiple partitions are specified, separate them with commas (,).

outputTableName

Yes

None

Output result table.

outputParaTableName

Yes

None

Configure the output table.

inputParaTableName

No

None

Configure the input table.

keepOriginal

No

false

Whether to retain the original columns:

  • true: Processed columns are renamed (prefixed with "stdized_"), and the original columns are retained.

  • false: All columns are retained without renaming.

lifecycle

No

None

Lifecycle of the output table.

coreNum

No

Automatically assigned by the system

Number of cores.

memSizePerCore

No

Automatically assigned by the system

Memory used per core.

enableSparse

No

false

Whether to enable sparse support:

  • true

  • false

itemDelimiter

No

Default ","

Separator between KV pairs.

kvDelimiter

No

Default ":"

Separator between Key and Value.

kvIndices

No

None

Feature index in the KV table that needs normalization.

Usage examples

  • Data Generation

    drop table if exists standardize_test_input;
    create table standardize_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table standardize_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
    ) tmp;
  • PAI Command Line

    drop table if exists standardize_test_input_output;
    drop table if exists standardize_test_input_model_output;
    PAI -name Standardize
        -project algo_public
        -DoutputParaTableName="standardize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="standardize_test_input_output"
        -DinputTableName="standardize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    drop table if exists standardize_test_input_output_using_model;
    drop table if exists standardize_test_input_output_using_model_model_output;
    PAI -name Standardize
        -project algo_public
        -DoutputParaTableName="standardize_test_input_output_using_model_model_output"
        -DinputParaTableName="standardize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="standardize_test_input_output_using_model"
        -DinputTableName="standardize_test_input";
  • Input Description

    standardize_test_input

    col_string

    col_bigint

    col_double

    col_boolean

    col_datetime

    01

    10

    10.1

    true

    2016-07-01 10:00:00

    NULL

    11

    10.2

    false

    2016-07-02 10:00:00

    02

    NULL

    10.3

    true

    2016-07-03 10:00:00

    03

    12

    NULL

    false

    2016-07-04 10:00:00

    04

    13

    10.4

    NULL

    2016-07-05 10:00:00

    05

    14

    10.5

    true

    NULL

  • Output Description

    • standardize_test_input_output

      col_string

      col_bigint

      col_double

      col_boolean

      col_datetime

      stdized_col_bigint

      stdized_col_double

      01

      10

      10.1

      true

      2016-07-01 10:00:00

      -1.2649110640673518

      -1.2649110640683832

      NULL

      11

      10.2

      false

      2016-07-02 10:00:00

      -0.6324555320336759

      -0.6324555320341972

      02

      NULL

      10.3

      true

      2016-07-03 10:00:00

      NULL

      0.0

      03

      12

      NULL

      false

      2016-07-04 10:00:00

      0.0

      NULL

      04

      13

      10.4

      NULL

      2016-07-05 10:00:00

      0.6324555320336759

      0.6324555320341859

      05

      14

      10.5

      true

      NULL

      1.2649110640673518

      1.2649110640683718

    • standardize_test_input_model_output

      feature

      json

      col_bigint

      {“name”: “standardize”, “type”:”bigint”, “paras”:{“mean”:12, “std”: 1.58113883008419}}

      col_double

      {“name”: “standardize”, “type”:”double”, “paras”:{“mean”:10.3, “std”: 0.1581138830082909}}

    • standardize_test_input_output_using_model

      col_string

      col_bigint

      col_double

      col_boolean

      col_datetime

      01

      -1.2649110640673515

      -1.264911064068383

      true

      2016-07-01 10:00:00

      NULL

      -0.6324555320336758

      -0.6324555320341971

      false

      2016-07-02 10:00:00

      02

      NULL

      0.0

      true

      2016-07-03 10:00:00

      03

      0.0

      NULL

      false

      2016-07-04 10:00:00

      04

      0.6324555320336758

      0.6324555320341858

      NULL

      2016-07-05 10:00:00

      05

      1.2649110640673515

      1.2649110640683716

      true

      NULL

    • standardize_test_input_output_using_model_model_output

      feature

      json

      col_bigint

      {“name”: “standardize”, “type”:”bigint”, “paras”:{“mean”:12, “std”: 1.58113883008419}}

      col_double

      {“name”: “standardize”, “type”:”double”, “paras”:{“mean”:10.3, “std”: 0.1581138830082909}}