The Data Conversion Module component performs normalization, discretization, indexation, or weight of evidence (WOE) conversion on data.
Configure the component
You can use one of the following methods to configure the Data Conversion Module component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Data Conversion Module component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description |
---|---|---|
Fields Setting | Feature Columns in Input Table | The feature columns that are selected from the input table. By default, all columns in the input table are selected. |
Fields Setting | Columns without Data Conversion | The columns on which data conversion is not required. The selected columns in the output are the same as those in the input. You can specify labels in the columns. |
Fields Setting | Data Conversion Mode | Valid values: Normalization, Discretization, WOE Conversion, and Index. |
Fields Setting | Default WOE Value | This parameter is valid only if the Data Conversion Mode parameter is set to WOE Conversion. If this parameter is specified and a sample value falls into a bin without WOE values, this value is used as the WOE value. If this parameter is not specified and a sample value falls into a bin without WOE values, the system reports an error. |
Tuning | Number of Cores | The number of CPU cores that are required. By default, the system determines the value. |
Tuning | Memory Size per Core | The memory size of each CPU core. By default, the system determines the value. |
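
The fallback rule described for the Default WOE Value parameter can be sketched in Python as follows. This is only an illustration of the documented behavior, not the component's implementation: the function name `woe_for_bin`, the dictionary layout, and the sample WOE numbers are all assumptions made for the example.

```python
from typing import Mapping, Optional


def woe_for_bin(bin_woe: Mapping[str, float], bin_id: str,
                default_woe: Optional[float] = None) -> float:
    """Return the WOE value for a bin, using default_woe when the bin has none.

    Hypothetical helper: mirrors the documented rule that a missing WOE value
    is replaced by the default if one is set, and is an error otherwise.
    """
    if bin_id in bin_woe:
        return bin_woe[bin_id]
    if default_woe is not None:
        return default_woe
    raise ValueError(f"bin {bin_id!r} has no WOE value and no default is set")


# Made-up illustration values, keyed by bin ID.
bin_woe = {"0": -0.42, "1": 0.13}

print(woe_for_bin(bin_woe, "1"))                   # bin with a WOE value
print(woe_for_bin(bin_woe, "2", default_woe=0.0))  # bin without one, default used
```

Calling `woe_for_bin(bin_woe, "2")` without a default raises `ValueError`, which corresponds to the error that the system reports when no default is specified.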
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

    PAI -name data_transform
    -project algo_public
    -DinputFeatureTableName=feature_table
    -DinputBinTableName=bin_table
    -DoutputTableName=output_table
    -DmetaColNames=label
    -DfeatureColNames=feaname1,feaname2

Parameter | Description | Required | Default value |
---|---|---|---|
inputFeatureTableName | The name of the input feature table. | Yes | No default value |
inputBinTableName | The name of the binning result table. | Yes | No default value |
inputFeatureTablePartitions | The partitions that are selected from the input feature table. | No | Full table |
outputTableName | The name of the output table. | Yes | No default value |
featureColNames | The feature columns that are selected from the input table. | No | All columns |
metaColNames | The columns that do not need to be converted. These columns in the output are the same as those in the input. You can specify labels and sample IDs in the columns. | No | No default value |
transformType | The type of data conversion, for example dummy (discretization into dummy variables). | No | dummy |
itemDelimiter | The delimiter that is used to separate features. This parameter is valid only if the transformType parameter is set to dummy. | No | , |
kvDelimiter | The delimiter that is used to separate keys and values. This parameter is valid only if the transformType parameter is set to dummy. | No | : |
lifecycle | The lifecycle of the output table. | No | No default value |
coreNum | The number of CPU cores that are required. | No | Determined by the system |
memSizePerCore | The memory size of each CPU core. Unit: MB. | No | Determined by the system |
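
When the command is generated from a script, the parameters in the table map naturally onto `-D<name>=<value>` arguments. The following Python sketch assembles such a command string. The helper `build_pai_command` is hypothetical and only does string formatting; it is not part of any PAI SDK, and the table and column names are the placeholders used in the command shown earlier.

```python
from typing import Mapping, Optional


def build_pai_command(name: str, project: str,
                      params: Mapping[str, Optional[str]]) -> str:
    """Format a PAI command, one argument per line.

    Hypothetical helper: parameters whose value is None are skipped so that
    the component falls back to its documented defaults.
    """
    lines = [f"PAI -name {name}", f"-project {project}"]
    lines += [f"-D{key}={value}" for key, value in params.items()
              if value is not None]
    return "\n".join(lines)


command = build_pai_command(
    "data_transform",
    "algo_public",
    {
        "inputFeatureTableName": "feature_table",
        "inputBinTableName": "bin_table",
        "outputTableName": "output_table",
        "metaColNames": "label",
        "featureColNames": "feaname1,feaname2",
        "lifecycle": None,  # optional parameter left unset
    },
)
print(command)
```

The resulting string can then be pasted into an SQL Script component.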
To implement normalization, the Data Conversion Module component converts variable values into values between 0 and 1 based on the input binning information, and sets missing values and zero values to 0. The following algorithm is used:

    if feature_raw_value == null or feature_raw_value == 0 then
        feature_norm_value = 0.0
    else
        bin_index = FindBin(bin_table, feature_raw_value)
        bin_width = round(1.0 / bin_count * 1000) / 1000.0
        feature_norm_value = 1.0 - (bin_count - bin_index - 1) * bin_width
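
The normalization algorithm above can be written as a small runnable Python sketch. Two things here are assumptions, because the source does not specify them: bin indexes are taken to be 0-based, and `find_bin` is a hypothetical stand-in for FindBin that treats the binning table as a sorted list of bin upper bounds.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def find_bin(upper_bounds: Sequence[float], value: float) -> int:
    """Hypothetical FindBin: 0-based index of the first bin whose upper bound
    is >= value. Values above the last bound are placed in the last bin."""
    return min(bisect_left(upper_bounds, value), len(upper_bounds) - 1)


def normalize(value: Optional[float], upper_bounds: Sequence[float]) -> float:
    """Map a raw feature value to a bin-based score, following the pseudocode."""
    if value is None or value == 0:
        return 0.0
    bin_count = len(upper_bounds)
    bin_index = find_bin(upper_bounds, value)
    # The bin width is rounded to three decimal places, as in the pseudocode.
    bin_width = round(1.0 / bin_count * 1000) / 1000.0
    return 1.0 - (bin_count - bin_index - 1) * bin_width


bounds = [10, 20, 30, 40]        # four bins, made-up boundaries
print(normalize(None, bounds))   # missing value
print(normalize(5, bounds))      # first bin
print(normalize(15, bounds))     # second bin
print(normalize(100, bounds))    # last bin
```

Under these assumptions, with four bins the bin width is 0.25, so a value in the first bin maps to 0.25 and a value in the last bin maps to 1.0.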
The Data Conversion Module component can convert different types of data into different formats:
- For normalization and WOE conversion, the component generates a regular table.
- For discretization, in which data is converted into dummy variables, the component generates a table in the key-value format. Each variable in the table is named in the [${feaname}]_bin_${bin_id} format. The following examples use the sns variable:
  - If sns falls into the second bin, the generated variable is [sns]_bin_2.
  - If sns does not have a value, it falls into the empty bin, and the generated variable is [sns]_bin_null.
  - If sns has a value but does not fall into a defined bin, it falls into the else bin, and the generated variable is [sns]_bin_else.
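
The naming rule for dummy variables can be sketched in Python as follows. The helpers `dummy_key` and `encode_row` are hypothetical. The key format and the default delimiters (`,` between items, `:` between key and value) come from the text above, but the value `1` written after each key is an assumption, because the source does not state what value the component emits.

```python
from typing import Mapping, Union


def dummy_key(feaname: str, bin_id: Union[int, str]) -> str:
    """Build a dummy-variable name in the [feaname]_bin_<bin_id> format."""
    return f"[{feaname}]_bin_{bin_id}"


def encode_row(bins: Mapping[str, Union[int, str]],
               item_delimiter: str = ",", kv_delimiter: str = ":") -> str:
    """Join one row's dummy variables into a key-value string.

    The value 1 after each key is an assumed indicator value.
    """
    return item_delimiter.join(
        f"{dummy_key(name, bin_id)}{kv_delimiter}1"
        for name, bin_id in bins.items()
    )


print(dummy_key("sns", 2))       # second bin
print(dummy_key("sns", "null"))  # empty bin
print(dummy_key("sns", "else"))  # else bin
print(encode_row({"sns": 2, "age": "null"}))
```

Here `age` is a made-up second feature, included only to show how the item delimiter separates entries.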