Standardization - Platform For AI - Alibaba Cloud Documentation Center

Standardization is a data preprocessing technique that scales feature data to a uniform range. It normalizes the dimensions of different features by setting the mean to 0 and the standard deviation to 1. This is particularly beneficial in gradient descent optimization algorithms, where it can markedly enhance the model's convergence speed and performance.

Algorithm description

Standardizing one or more columns in a table results in the creation of a new table with the processed data.
The standardization formula is: (X - Mean) / Standard Deviation.
- Mean: The sample mean.
- Standard Deviation: The sample standard deviation, which estimates the population deviation. To approximate the population level more closely, the calculated standard deviation is adjusted, i.e., .
- The formula for sample standard deviation is: .
  
  Here, denotes the mean of the sample X1, X2, ..., Xn.

Configuration widget

Method 1: Visualization method

Place the Standardization widget onto the Designer workflow page and set the parameters on the right side of the interface:

Parameter type	Parameter	Description
Field set	Default select all	Default select all. Extra columns do not affect the prediction results.
	Retain original columns	Processed columns are prefixed with "stdized_". Supports DOUBLE and BIGINT types.
	Sparse support	Whether to enable sparse support. If enabled, configure: Separator between KV pairs. Separator between Key and Value. Feature index in the KV table that needs normalization.
Execution optimization	Number of computing cores	The system automatically assigns the number of training instances based on the input data volume.
Execution optimization	Memory per core	The system automatically assigns memory based on the input data volume. The unit is MB.

Method 2: PAI command method

Configure the Standardization widget parameters using the PAI command. The SQL script widget can be used to invoke the PAI command. For more information, see Scenario 4: Execute PAI command in SQL script widget.

Command for dense data

PAI -name Standardize
    -project algo_public 
    -DkeepOriginal="false"    
    -DoutputTableName="test_5"
    -DinputTablePartitions="pt=20150501"  
    -DinputTableName="bank_data_partition" 
    -DselectedColNames="euribor3m,pdays"

Command for sparse data

PAI -name Standardize    
    -project projectxlib4
    -DkeepOriginal="true"
    -DoutputTableName="kv_standard_output"
    -DinputTableName=kv_standard_test
    -DselectedColNames="f0,f1,f2"
    -DenableSparse=true
    -DoutputParaTableName=kv_standard_model    
    -DkvIndices=1,2,8,6
    -DitemDelimiter=",";

Parameter name	Required	Default value	Parameter description
inputTableName	Yes	None	Name of the input table.
selectedColNames	No	All columns	Columns in the input table that participate in training. Column names are separated by commas (,) and support INT and DOUBLE types. If the input is in sparse format, STRING type columns are supported.
inputTablePartitions	No	All partitions	Partitions in the input table that participate in training. The following formats are supported: Partition_name=value name1=value1/name2=value2: Multi-level format Note If multiple partitions are specified, separate them with commas (,).
outputTableName	Yes	None	Output result table.
outputParaTableName	Yes	None	Configure the output table.
inputParaTableName	No	None	Configure the input table.
keepOriginal	No	false	Whether to retain the original columns: true: Processed columns are renamed (prefixed with "stdized_"), and the original columns are retained. false: All columns are retained without renaming.
lifecycle	No	None	Lifecycle of the output table.
coreNum	No	Automatically assigned by the system	Number of cores.
memSizePerCore	No	Automatically assigned by the system	Memory used per core.
enableSparse	No	false	Whether to enable sparse support: true false
itemDelimiter	No	Default ","	Separator between KV pairs.
kvDelimiter	No	Default ":"	Separator between Key and Value.
kvIndices	No	None	Feature index in the KV table that needs normalization.

Usage examples

Data Generation

drop table if exists standardize_test_input;
create table standardize_test_input(
    col_string string,
    col_bigint bigint,
    col_double double,
    col_boolean boolean,
    col_datetime datetime);
insert overwrite table standardize_test_input
select
    *
from
(
    select
        '01' as col_string,
        10 as col_bigint,
        10.1 as col_double,
        True as col_boolean,
        cast('2016-07-01 10:00:00' as datetime) as col_datetime
    union all
        select
            cast(null as string) as col_string,
            11 as col_bigint,
            10.2 as col_double,
            False as col_boolean,
            cast('2016-07-02 10:00:00' as datetime) as col_datetime
    union all
        select
            '02' as col_string,
            cast(null as bigint) as col_bigint,
            10.3 as col_double,
            True as col_boolean,
            cast('2016-07-03 10:00:00' as datetime) as col_datetime
    union all
        select
            '03' as col_string,
            12 as col_bigint,
            cast(null as double) as col_double,
            False as col_boolean,
            cast('2016-07-04 10:00:00' as datetime) as col_datetime
    union all
        select
            '04' as col_string,
            13 as col_bigint,
            10.4 as col_double,
            cast(null as boolean) as col_boolean,
            cast('2016-07-05 10:00:00' as datetime) as col_datetime
    union all
        select
            '05' as col_string,
            14 as col_bigint,
            10.5 as col_double,
            True as col_boolean,
            cast(null as datetime) as col_datetime
) tmp;

PAI Command Line

drop table if exists standardize_test_input_output;
drop table if exists standardize_test_input_model_output;
PAI -name Standardize
    -project algo_public
    -DoutputParaTableName="standardize_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="standardize_test_input_output"
    -DinputTableName="standardize_test_input"
    -DselectedColNames="col_double,col_bigint"
    -DkeepOriginal="true";
drop table if exists standardize_test_input_output_using_model;
drop table if exists standardize_test_input_output_using_model_model_output;
PAI -name Standardize
    -project algo_public
    -DoutputParaTableName="standardize_test_input_output_using_model_model_output"
    -DinputParaTableName="standardize_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="standardize_test_input_output_using_model"
    -DinputTableName="standardize_test_input";

Input Description

standardize_test_input

col_string	col_bigint	col_double	col_boolean	col_datetime
01	10	10.1	true	2016-07-01 10:00:00
NULL	11	10.2	false	2016-07-02 10:00:00
02	NULL	10.3	true	2016-07-03 10:00:00
03	12	NULL	false	2016-07-04 10:00:00
04	13	10.4	NULL	2016-07-05 10:00:00
05	14	10.5	true	NULL

Output Description

standardize_test_input_output

col_string	col_bigint	col_double	col_boolean	col_datetime	stdized_col_bigint	stdized_col_double
01	10	10.1	true	2016-07-01 10:00:00	-1.2649110640673518	-1.2649110640683832
NULL	11	10.2	false	2016-07-02 10:00:00	-0.6324555320336759	-0.6324555320341972
02	NULL	10.3	true	2016-07-03 10:00:00	NULL	0.0
03	12	NULL	false	2016-07-04 10:00:00	0.0	NULL
04	13	10.4	NULL	2016-07-05 10:00:00	0.6324555320336759	0.6324555320341859
05	14	10.5	true	NULL	1.2649110640673518	1.2649110640683718

standardize_test_input_model_output

feature	json
col_bigint	{“name”: “standardize”, “type”:”bigint”, “paras”:{“mean”:12, “std”: 1.58113883008419}}
col_double	{“name”: “standardize”, “type”:”double”, “paras”:{“mean”:10.3, “std”: 0.1581138830082909}}

standardize_test_input_output_using_model

col_string	col_bigint	col_double	col_boolean	col_datetime
01	-1.2649110640673515	-1.264911064068383	true	2016-07-01 10:00:00
NULL	-0.6324555320336758	-0.6324555320341971	false	2016-07-02 10:00:00
02	NULL	0.0	true	2016-07-03 10:00:00
03	0.0	NULL	false	2016-07-04 10:00:00
04	0.6324555320336758	0.6324555320341858	NULL	2016-07-05 10:00:00
05	1.2649110640673515	1.2649110640683716	true	NULL

standardize_test_input_output_using_model_model_output

feature	json
col_bigint	{“name”: “standardize”, “type”:”bigint”, “paras”:{“mean”:12, “std”: 1.58113883008419}}
col_double	{“name”: “standardize”, “type”:”double”, “paras”:{“mean”:10.3, “std”: 0.1581138830082909}}