The Data Pivoting component provided by Machine Learning Designer allows you to view the distributions of feature values, feature columns, and label columns. This facilitates follow-up data analysis. This component supports both sparse and dense data formats. This topic describes how to configure the component and provides an example on how to use the component.
Configure the component
You can use one of the following methods to configure the Data Pivoting component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Data Pivoting component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description |
---|---|---|
Fields Setting | Feature Columns | The columns that represent the features of data in training samples. |
Fields Setting | Target Column | The column that you want to use for training. |
Fields Setting | Enumeration Features | The features that you want to use as enumeration features. |
Fields Setting | Sparse Format (K:V,K:V) | Specifies whether data in the sparse format is used. |
Parameters Setting | Continuous Feature Discretization Intervals | The maximum number of intervals for the equal-distance division of continuous features. |
Tuning | Cores | The number of cores used in computing. The value must be a positive integer. |
Tuning | Memory Size per Core | The memory size of each core. Valid values: 1 to 65536. Unit: MB. |
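To make the Continuous Feature Discretization Intervals parameter more concrete, the following Python sketch shows one way an equal-distance (equal-width) division can be computed. It is an illustration only, not the component's implementation: the function name `equal_width_counts` and its handling of edge cases (identical values, the maximum value falling into the last interval) are assumptions. The sample values are the age column from the example later in this topic.

```python
from typing import List, Tuple


def equal_width_counts(values: List[float], max_bins: int) -> List[Tuple[float, float, int]]:
    """Split [min, max] into max_bins equal-width intervals and count the values in each.

    Hypothetical helper for illustration; the real component may choose
    fewer intervals or treat boundaries differently.
    """
    if max_bins < 1:
        raise ValueError("max_bins must be a positive integer")
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical: one degenerate interval holds every record.
        return [(lo, hi, len(values))]
    width = (hi - lo) / max_bins
    counts = [0] * max_bins
    for v in values:
        # Clamp the maximum value into the last interval.
        idx = min(int((v - lo) / width), max_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(max_bins)]


# The age column from the example input data in this topic.
ages = [39, 50, 38, 53, 28, 37, 49, 52, 31, 42]
bins = equal_width_counts(ages, max_bins=5)
for left, right, n in bins:
    print(f"{left:.1f} to {right:.1f}: {n} records")
```

With five intervals, the range 28 to 53 is split into intervals of width 5, and each record is counted in exactly one interval.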
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI
-name fe_meta_runner
-project algo_public
-DinputTable="pai_dense_10_10"
-DoutputTable="pai_temp_2263_20384_1"
-DmapTable="pai_temp_2263_20384_2"
-DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
-DlabelCol="y"
-DcategoryCols="previous"
-Dlifecycle="28"
-DmaxBins="5";
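If the command text is generated by a script before it is pasted into the SQL Script component, a small helper can keep the `-D` parameters consistent. The Python sketch below is a hypothetical convenience, not part of PAI: the function `build_pai_command` is an assumption, and it only formats text and checks that the four parameters marked as required in this topic are present.

```python
from typing import Dict

# Parameters that this topic marks as required for fe_meta_runner.
REQUIRED_PARAMS = ("inputTable", "outputTable", "mapTable", "selectedCols")


def build_pai_command(name: str, project: str, params: Dict[str, str]) -> str:
    """Format a PAI command string with one -D parameter per line (hypothetical helper)."""
    missing = [p for p in REQUIRED_PARAMS if p not in params]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    lines = ["PAI", f"-name {name}", f"-project {project}"]
    for key, value in params.items():
        lines.append(f'-D{key}="{value}"')
    # Terminate the statement with a semicolon, as in the command above.
    return "\n".join(lines) + ";"


command = build_pai_command(
    "fe_meta_runner",
    "algo_public",
    {
        "inputTable": "pai_dense_10_10",
        "outputTable": "pai_temp_2263_20384_1",
        "mapTable": "pai_temp_2263_20384_2",
        "selectedCols": "pdays,previous,age",
        "labelCol": "y",
        "maxBins": "5",
    },
)
print(command)
```

The helper does not validate table names or column names; it only guards against omitting a required parameter.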
Parameter | Required | Description | Default value |
---|---|---|---|
inputTable | Yes | The name of the input table. | None |
inputTablePartitions | No | The partitions that are selected from the input table for training. Note: If you specify multiple partitions, separate them with commas (,). | None |
outputTable | Yes | The name of the output table. | None |
mapTable | Yes | The output mapping table. The Data Pivoting component maps STRING-type data to INT-type data for PAI to use for training. | None |
selectedCols | Yes | The columns that are selected from the input table. | None |
labelCol | No | The column that you want to use for training. | None |
categoryCols | No | The INT- or DOUBLE-type columns that you want to use as enumeration features. | None |
maxBins | No | The maximum number of intervals for the equal-distance division of continuous features. | 100 |
isSparse | No | Specifies whether the input data is sparse. Valid values: true and false. | false |
itemSpliter | No | The delimiter that is used to separate key-value pairs if data in the input table is in the sparse format. | , |
kvSpliter | No | The delimiter that is used to separate keys and values if data in the input table is in the sparse format. | : |
lifecycle | No | The lifecycle of the output table. | 28 |
coreNum | No | The number of cores used in computing. The value must be a positive integer. Valid values: 1 to 9999. | Determined by the system |
memSizePerCore | No | The memory size of each core. Valid values: 1 to 65536. Unit: MB. | Determined by the system |
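The roles of itemSpliter and kvSpliter can be illustrated with a short Python sketch that parses one sparse-format cell into key-value pairs. This is a minimal illustration under stated assumptions, not the component's parser: the function name `parse_sparse` is hypothetical, keys are kept as strings, and values are assumed to be numeric.

```python
from typing import Dict


def parse_sparse(cell: str, item_spliter: str = ",", kv_spliter: str = ":") -> Dict[str, float]:
    """Parse a sparse cell such as "1:0.5,3:2" into {"1": 0.5, "3": 2.0}.

    item_spliter separates key-value pairs; kv_spliter separates a key from its value.
    The defaults mirror the default values of itemSpliter and kvSpliter.
    """
    result: Dict[str, float] = {}
    if not cell:
        return result
    for item in cell.split(item_spliter):
        item = item.strip()
        if not item:
            continue  # Skip empty fragments, for example from a trailing delimiter.
        key, sep, value = item.partition(kv_spliter)
        if not sep:
            raise ValueError(f"missing key-value delimiter in item: {item!r}")
        result[key.strip()] = float(value)
    return result


print(parse_sparse("1:0.5,3:2"))
print(parse_sparse("a=1;b=2", item_spliter=";", kv_spliter="="))
```

The second call shows how custom delimiters would correspond to non-default itemSpliter and kvSpliter settings.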
Examples
- Input data
age | workclass | fwlght | edu | edu_num | married | c | family | race | sex | gail | loss | work_year | country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 13.0 | United-States | <=50K |
38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.0 | 0.0 | 40.0 | Other | <=50K |
37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0.0 | 0.0 | 40.0 | United-States | <=50K |
49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0.0 | 0.0 | 16.0 | Jamaica | <=50K |
52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.0 | 0.0 | 45.0 | United-States | >50K |
31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084.0 | 0.0 | 50.0 | United-States | >50K |
42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178.0 | 0.0 | 40.0 | United-States | >50K |
- Modeling
Click the Data Pivoting component and then click the Fields Setting tab. Set the Target Column parameter to income and specify the other 14 columns for the Feature Columns parameter. The BIGINT-type values in the edu_num column are used as enumeration values.
- Result
- Right-click Data Pivoting and choose . The values in the family, race, sex, and income columns of the STRING data type are converted into numeric values for PAI to use for training. This is similar to data format conversion.
- Right-click Data Pivoting and choose . Note: If you do not specify STRING-type data for the Feature Columns parameter, the String Column Feature Mapping Table parameter is left empty in the output.
- Right-click Data Pivoting and choose . distribute_info indicates the number of records that fall into each interval when the range between the minimum value and the maximum value is divided into intervals of equal width.
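The conversion of STRING-type values into numeric values, mentioned in the first result item, can be pictured with the following Python sketch. It is only an illustration: the function name `build_string_mapping` is hypothetical, and assigning integers in first-seen order is an assumption, because this topic does not state how the component numbers the values. The sample values are the sex column from the input data.

```python
from typing import Dict, List


def build_string_mapping(values: List[str]) -> Dict[str, int]:
    """Assign each distinct string an integer in first-seen order (assumed ordering)."""
    mapping: Dict[str, int] = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


# The sex column from the example input data.
sex = ["Male", "Male", "Male", "Male", "Female",
       "Female", "Female", "Male", "Female", "Male"]

mapping = build_string_mapping(sex)    # Plays the role of a mapping table.
encoded = [mapping[v] for v in sex]    # Plays the role of the converted column.
print(mapping)
print(encoded)
```

Keeping the mapping separate from the encoded column is what makes the conversion reversible: the original strings can be looked up again from the integers.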