Data Conversion Module

The Data Conversion Module component performs normalization, discretization, indexation, or weight of evidence (WOE) conversion on data.

Configure the component

You can use one of the following methods to configure the Data Conversion Module component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Data Conversion Module component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.


Tab	Parameter	Description
Fields Setting	Feature Columns in Input Table	The feature columns that are selected from the input table. By default, all columns in the input table are selected.
	Columns without Data Conversion	The columns that are excluded from data conversion. These columns are copied to the output unchanged. For example, you can select label columns.
	Data Conversion Mode	Valid values: Normalization, Discretization, WOE Conversion, and Index.
	Default WOE Value	This parameter is valid only if the Data Conversion Mode parameter is set to WOE Conversion. If this parameter is specified and a sample value falls into a bin without WOE values, this value is used as the WOE value. If this parameter is not specified and a sample value falls into a bin without WOE values, the system reports an error.
Tuning	Number of Cores	The number of CPU cores that are required. By default, the system determines the value.
Tuning	Memory Size per Core	The memory size of each CPU core. By default, the system determines the value.
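
The fallback behavior of the Default WOE Value parameter can be sketched in Python as follows. This is a minimal illustration, not the component's implementation: the `lookup_woe` function, the `woe_by_bin` dictionary, the use of `None` to mark a bin without a WOE value, and the sample numbers are all assumptions made for the example.

```python
from typing import Dict, Optional


def lookup_woe(bin_id: str,
               woe_by_bin: Dict[str, Optional[float]],
               default_woe: Optional[float] = None) -> float:
    """Return the WOE value of a bin, falling back to default_woe if set."""
    woe = woe_by_bin.get(bin_id)
    if woe is not None:
        return woe
    if default_woe is not None:
        # Default WOE Value is specified: use it for bins without a WOE value.
        return default_woe
    # Default WOE Value is not specified: report an error.
    raise ValueError(f"bin {bin_id!r} has no WOE value and no default is set")


# Illustrative values only; bin "2" has no WOE value.
woe_table = {"0": -0.42, "1": 0.13, "2": None}
print(lookup_woe("1", woe_table))
print(lookup_woe("2", woe_table, default_woe=0.0))
```

Calling `lookup_woe("2", woe_table)` without a default raises an error, which mirrors the documented behavior when the parameter is left empty.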

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name data_transform
-project algo_public
-DinputFeatureTableName=feature_table
-DinputBinTableName=bin_table
-DoutputTableName=output_table
-DmetaColNames=label
-DfeatureColNames=feaname1,feaname2


Parameter	Description	Required	Default value
inputFeatureTableName	The name of the input feature table.	Yes	No default value
inputBinTableName	The name of the binning result table.	Yes	No default value
inputFeatureTablePartitions	The partitions that are selected from the input feature table.	No	Full table
outputTableName	The name of the output table.	Yes	No default value
featureColNames	The feature columns that are selected from the input table.	No	All columns
metaColNames	The columns that are excluded from data conversion. These columns are copied to the output unchanged. For example, you can specify label and sample ID columns.	No	No default value
transformType	The type of data conversion. Valid values: normalize (normalization), dummy (discretization), and woe (WOE conversion).	No	dummy
itemDelimiter	The delimiter that is used to separate features. This parameter is valid only if the transformType parameter is set to dummy.	No	,
kvDelimiter	The delimiter that is used to separate keys and values. This parameter is valid only if the transformType parameter is set to dummy.	No	:
lifecycle	The lifecycle of the output table.	No	No default value
coreNum	The number of CPU cores that are required.	No	Determined by the system
memSizePerCore	The memory size of each CPU core. Unit: MB.	No	Determined by the system
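
When the command is generated by a script, the parameters in the preceding table map directly to `-D` options. The following Python sketch assembles the command text and checks the three required parameters. The `build_pai_command` helper is hypothetical and is not part of PAI. It only builds a string and does not submit anything.

```python
from typing import Dict


def build_pai_command(params: Dict[str, str],
                      name: str = "data_transform",
                      project: str = "algo_public") -> str:
    """Assemble a PAI command string from -D parameters (hypothetical helper)."""
    required = ("inputFeatureTableName", "inputBinTableName", "outputTableName")
    missing = [key for key in required if key not in params]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    lines = [f"PAI -name {name}", f"-project {project}"]
    # Each parameter becomes one -Dkey=value line, in insertion order.
    lines += [f"-D{key}={value}" for key, value in params.items()]
    return "\n".join(lines)


command = build_pai_command({
    "inputFeatureTableName": "feature_table",
    "inputBinTableName": "bin_table",
    "outputTableName": "output_table",
    "metaColNames": "label",
    "transformType": "woe",
})
print(command)
```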

To implement normalization, the Data Conversion Module component converts variable values into values between 0 and 1 based on the input binning information, and sets missing values to 0. The following algorithm is used:

if feature_raw_value == null or feature_raw_value == 0 then
    feature_norm_value = 0.0
else
    bin_index = FindBin(bin_table, feature_raw_value)
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    feature_norm_value = 1.0 - (bin_count - bin_index - 1) * bin_width
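
The algorithm above can be sketched as runnable Python. The source does not define `FindBin` or the layout of the binning table, so this sketch assumes that bins are described by a sorted list of upper bounds, that `bin_index` is zero-based, and that values above the last bound fall into the last bin. The names `find_bin` and `normalize` are chosen for the example.

```python
import bisect
from typing import Optional, Sequence


def find_bin(upper_bounds: Sequence[float], value: float) -> int:
    """Assumed FindBin: zero-based index of the first bin whose upper bound >= value."""
    index = bisect.bisect_left(upper_bounds, value)
    # Assumption: values beyond the last bound are placed in the last bin.
    return min(index, len(upper_bounds) - 1)


def normalize(value: Optional[float], upper_bounds: Sequence[float]) -> float:
    """Map a raw feature value to a value between 0 and 1 based on its bin."""
    if value is None or value == 0:
        return 0.0  # missing values and zeros are set to 0
    bin_count = len(upper_bounds)
    bin_index = find_bin(upper_bounds, value)
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    return 1.0 - (bin_count - bin_index - 1) * bin_width


bounds = [10, 20, 30, 40]       # four bins, so bin_width is 0.25
print(normalize(None, bounds))  # 0.0
print(normalize(5, bounds))     # 0.25 (first bin)
print(normalize(15, bounds))    # 0.5  (second bin)
print(normalize(35, bounds))    # 1.0  (last bin)
```

Under these assumptions, the first bin maps to `bin_width` and the last bin maps to 1.0, so every non-missing, non-zero value lands strictly above 0.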

The format of the output table depends on the conversion type:

For normalization and WOE conversion, the component generates a regular table.
For discretization, which converts data into dummy variables, the component generates a table in the key-value format. Each variable in the table is named in the [${feaname}]_bin_${bin_id} format. The following examples use the sns variable:
- If sns falls into the second bin, the generated variable is [sns]_bin_2.
- If sns does not have a value, it falls into the empty bin, and the generated variable is [sns]_bin_null.
- If sns has a value but does not fall into a defined bin, it falls into the else bin, and the generated variable is [sns]_bin_else.
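
The naming rule can be illustrated with a short Python sketch. The `dummy_key` and `to_kv` helpers are hypothetical, as are the `age` feature and the value `1` written after each key. The source specifies only the key names and the default delimiters (`,` for itemDelimiter and `:` for kvDelimiter).

```python
from typing import Dict, Union

BinId = Union[int, str, None]


def dummy_key(feature: str, bin_id: BinId) -> str:
    """Build the key [feature]_bin_<bin_id>; None stands for the empty bin."""
    suffix = "null" if bin_id is None else str(bin_id)
    return f"[{feature}]_bin_{suffix}"


def to_kv(bins: Dict[str, BinId],
          item_delimiter: str = ",",
          kv_delimiter: str = ":") -> str:
    """Join the keys of one row into a key-value string (value 1 is assumed)."""
    return item_delimiter.join(
        f"{dummy_key(name, bin_id)}{kv_delimiter}1"
        for name, bin_id in bins.items()
    )


print(dummy_key("sns", 2))             # [sns]_bin_2
print(dummy_key("sns", None))          # [sns]_bin_null
print(dummy_key("sns", "else"))        # [sns]_bin_else
print(to_kv({"sns": 2, "age": None}))  # [sns]_bin_2:1,[age]_bin_null:1
```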