Standardization is a data preprocessing technique that rescales features to a common scale. It normalizes the dimensions of different features by transforming each selected column to have a mean of 0 and a standard deviation of 1. This is particularly beneficial for gradient descent optimization algorithms, where it can markedly improve the model's convergence speed and performance.
Algorithm description
- Standardizing one or more columns in a table generates a new table that contains the processed data.
- The standardization formula is: (X - Mean) / Standard Deviation.
- Mean: the sample mean.
- Standard Deviation: the sample standard deviation, which serves as an estimate of the population standard deviation. To approximate the population value more closely, the calculated standard deviation uses the unbiased estimate, i.e., the sum of squared deviations is divided by n - 1 instead of n.
- The formula for the sample standard deviation is: S = sqrt( Σ (Xi - X̄)² / (n - 1) ), where the sum runs over i = 1, ..., n and X̄ denotes the mean of the sample X1, X2, ..., Xn.
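As a minimal sketch, the query below reproduces this formula directly in MaxCompute SQL, assuming the AVG and STDDEV_SAMP aggregate functions and CROSS JOIN are available in your project. The table my_table and column col_bigint are hypothetical placeholders; the Standardization component itself performs this computation for every selected column, including NULL handling and sparse input.
select t.col_bigint,
       (t.col_bigint - s.mean_val) / s.std_val as stdized_col_bigint
from my_table t
cross join (
    -- sample mean and sample standard deviation (divides by n - 1); NULL values are ignored
    select avg(col_bigint) as mean_val,
           stddev_samp(col_bigint) as std_val
    from my_table
) s;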
Configuration widget
Method 1: Visualization method
Place the Standardization widget onto the Designer workflow page and set the parameters on the right side of the interface:
Parameter type | Parameter | Description |
Field set | Select columns | All columns are selected by default. Extra columns do not affect the prediction results. |
Field set | Retain original columns | Whether to retain the original columns. The processed columns are prefixed with "stdized_". DOUBLE and BIGINT types are supported. |
Field set | Sparse support | Whether to enable sparse (key-value) support. If enabled, configure the delimiter between key-value pairs and the delimiter between a key and its value. |
Execution optimization | Number of computing cores | The system automatically assigns the number of cores based on the input data volume. |
Execution optimization | Memory per core | The system automatically assigns memory based on the input data volume. The unit is MB. |
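When sparse support is enabled, the selected columns are expected to hold key-value (KV) data. With the default delimiters (a colon between a key and its value, a comma between pairs), a single cell might look like the following hypothetical value:
1:10.1,2:0.5,8:3.2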
Method 2: PAI command method
Configure the Standardization widget parameters using the PAI command. The SQL script widget can be used to invoke the PAI command. For more information, see Scenario 4: Execute PAI command in SQL script widget.
- Command for dense data
PAI -name Standardize -project algo_public -DkeepOriginal="false" -DoutputTableName="test_5" -DinputTablePartitions="pt=20150501" -DinputTableName="bank_data_partition" -DselectedColNames="euribor3m,pdays"
- Command for sparse data
PAI -name Standardize -project projectxlib4 -DkeepOriginal="true" -DoutputTableName="kv_standard_output" -DinputTableName=kv_standard_test -DselectedColNames="f0,f1,f2" -DenableSparse=true -DoutputParaTableName=kv_standard_model -DkvIndices=1,2,8,6 -DitemDelimiter=",";
Parameter name | Required | Default value | Parameter description |
inputTableName | Yes | None | Name of the input table. |
selectedColNames | No | All columns | Columns in the input table that participate in training. Column names are separated by commas (,) and support INT and DOUBLE types. If the input is in sparse format, STRING type columns are supported. |
inputTablePartitions | No | All partitions | Partitions in the input table that participate in training. The following formats are supported: Partition_name=value for a single-level partition, and name1=value1/name2=value2 for multi-level partitions. Note: If multiple partitions are specified, separate them with commas (,). |
outputTableName | Yes | None | Output result table. |
outputParaTableName | Yes | None | Name of the output parameter table, which stores the mean and standard deviation computed for each processed column. |
inputParaTableName | No | None | Name of the input parameter table. If specified, the mean and standard deviation stored in that table are applied instead of being recomputed from the input data. |
keepOriginal | No | false | Whether to retain the original columns. If set to true, the original columns are retained and the processed columns are prefixed with "stdized_". If set to false, the original columns are replaced with the processed values. |
lifecycle | No | None | Lifecycle of the output table. |
coreNum | No | Automatically assigned by the system | Number of cores. |
memSizePerCore | No | Automatically assigned by the system | Memory used per core. |
enableSparse | No | false | Whether to enable sparse (key-value) support. Set to true for KV-formatted input and false for dense input. |
itemDelimiter | No | , | Delimiter between key-value (KV) pairs. |
kvDelimiter | No | : | Delimiter between a key and its value. |
kvIndices | No | None | Indexes of the features in the KV table that need to be standardized. |
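As a minimal sketch, the following command uses only the required parameters and relies on the defaults for everything else (all columns selected, keepOriginal=false, dense input); the table names are hypothetical placeholders:
PAI -name Standardize -project algo_public -DinputTableName="my_input_table" -DoutputTableName="my_output_table" -DoutputParaTableName="my_param_table";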
Usage examples
- Data Generation
drop table if exists standardize_test_input;
create table standardize_test_input(
    col_string   string,
    col_bigint   bigint,
    col_double   double,
    col_boolean  boolean,
    col_datetime datetime);
insert overwrite table standardize_test_input
select * from (
    select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime
    union all
    select cast(null as string) as col_string, 11 as col_bigint, 10.2 as col_double, False as col_boolean, cast('2016-07-02 10:00:00' as datetime) as col_datetime
    union all
    select '02' as col_string, cast(null as bigint) as col_bigint, 10.3 as col_double, True as col_boolean, cast('2016-07-03 10:00:00' as datetime) as col_datetime
    union all
    select '03' as col_string, 12 as col_bigint, cast(null as double) as col_double, False as col_boolean, cast('2016-07-04 10:00:00' as datetime) as col_datetime
    union all
    select '04' as col_string, 13 as col_bigint, 10.4 as col_double, cast(null as boolean) as col_boolean, cast('2016-07-05 10:00:00' as datetime) as col_datetime
    union all
    select '05' as col_string, 14 as col_bigint, 10.5 as col_double, True as col_boolean, cast(null as datetime) as col_datetime
) tmp;
- PAI Command Line
drop table if exists standardize_test_input_output;
drop table if exists standardize_test_input_model_output;
PAI -name Standardize -project algo_public -DoutputParaTableName="standardize_test_input_model_output" -Dlifecycle="28" -DoutputTableName="standardize_test_input_output" -DinputTableName="standardize_test_input" -DselectedColNames="col_double,col_bigint" -DkeepOriginal="true";
drop table if exists standardize_test_input_output_using_model;
drop table if exists standardize_test_input_output_using_model_model_output;
PAI -name Standardize -project algo_public -DoutputParaTableName="standardize_test_input_output_using_model_model_output" -DinputParaTableName="standardize_test_input_model_output" -Dlifecycle="28" -DoutputTableName="standardize_test_input_output_using_model" -DinputTableName="standardize_test_input";
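The first command standardizes col_bigint and col_double, keeps the original columns because keepOriginal="true", and writes the computed means and standard deviations to standardize_test_input_model_output. The second command reuses that table through inputParaTableName, so the same transformation is applied without recomputing the statistics; because keepOriginal is not set, it defaults to false and the original columns are replaced in place.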
- Input Description
standardize_test_input
col_string | col_bigint | col_double | col_boolean | col_datetime |
01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 |
02 | NULL | 10.3 | true | 2016-07-03 10:00:00 |
03 | 12 | NULL | false | 2016-07-04 10:00:00 |
04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 |
05 | 14 | 10.5 | true | NULL |
- Output Description
- standardize_test_input_output
col_string | col_bigint | col_double | col_boolean | col_datetime | stdized_col_bigint | stdized_col_double |
01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | -1.2649110640673518 | -1.2649110640683832 |
NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | -0.6324555320336759 | -0.6324555320341972 |
02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.0 |
03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.0 | NULL |
04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.6324555320336759 | 0.6324555320341859 |
05 | 14 | 10.5 | true | NULL | 1.2649110640673518 | 1.2649110640683718 |
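Rows with a NULL input value yield NULL in the corresponding stdized_ column, and NULL values are excluded when the mean and standard deviation are computed (for col_bigint, the statistics are computed over 10, 11, 12, 13, and 14).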
- standardize_test_input_model_output
feature | json |
col_bigint | {"name": "standardize", "type": "bigint", "paras": {"mean": 12, "std": 1.58113883008419}} |
col_double | {"name": "standardize", "type": "double", "paras": {"mean": 10.3, "std": 0.1581138830082909}} |
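Applying these stored parameters reproduces the standardized values above: for example, (10 - 12) / 1.58113883008419 ≈ -1.2649110640673518, which matches stdized_col_bigint in the first output row.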
- standardize_test_input_output_using_model
col_string | col_bigint | col_double | col_boolean | col_datetime |
01 | -1.2649110640673515 | -1.264911064068383 | true | 2016-07-01 10:00:00 |
NULL | -0.6324555320336758 | -0.6324555320341971 | false | 2016-07-02 10:00:00 |
02 | NULL | 0.0 | true | 2016-07-03 10:00:00 |
03 | 0.0 | NULL | false | 2016-07-04 10:00:00 |
04 | 0.6324555320336758 | 0.6324555320341858 | NULL | 2016-07-05 10:00:00 |
05 | 1.2649110640673515 | 1.2649110640683716 | true | NULL |
- standardize_test_input_output_using_model_model_output
feature | json |
col_bigint | {"name": "standardize", "type": "bigint", "paras": {"mean": 12, "std": 1.58113883008419}} |
col_double | {"name": "standardize", "type": "double", "paras": {"mean": 10.3, "std": 0.1581138830082909}} |