The Feature Selection (Filter Method) component selects the top N features from feature data in sparse or dense format by using a filter based on the feature selection method that you specify, and saves the importance weights of the features in a feature importance table. Filtering out weak features reduces the difficulty of model training and can improve the accuracy of the trained model. This topic describes how to configure the parameters of the Feature Selection (Filter Method) component provided by Machine Learning Designer (formerly known as Machine Learning Studio) and provides an example of how to use the component.
Limits
The Feature Selection (Filter Method) component does not support filtering of data in the LIBSVM or key-value pair format.
Configure the component
You can use one of the following methods to configure the Feature Selection (Filter Method) component.
Method 1: Configure the component on the pipeline page
Tab | Parameter | Description |
---|---|---|
Fields Setting | Feature Columns | The names of the feature columns that are selected from the input table for training. |
Target Column | The name of the label column that is selected from the input table. | |
Enumeration Features | The columns of features to be processed as enumeration features. Only columns of the INT and DOUBLE data types are supported. | |
Sparse Features (K:V,K:V) | Specifies whether the features are sparse features in the key-value pair format. | |
Parameters Setting | Feature Selection Method | The method that is used to select features. Valid values: iv, GiniGain, InfoGain, and Lasso. | |
Top N Features | The top N features to be selected. If the specified number is greater than the number of input features, all features are selected. | |
Continuous Feature Partitioning Mode | The partitioning mode of continuous features. Valid values: Equal Frequency Partitioning and Equal Width Partitioning. | |
Continuous Feature Discretization Intervals | The number of intervals into which continuous features are discretized. You need to set this parameter only if you set the Continuous Feature Partitioning Mode parameter to Equal Width Partitioning. | |
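The two partitioning modes behave quite differently on skewed data. The following minimal sketch (the function names are illustrative, not part of the PAI API) shows equal-width binning, which splits the value range into intervals of equal size, versus equal-frequency binning, which puts roughly the same number of samples into each interval:

```python
# Illustrative sketch of the two continuous-feature partitioning modes.
# These helpers are assumptions for explanation, not the component's code.

def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins

ages = [21, 22, 23, 24, 25, 60, 61, 62, 63, 99]
print(equal_width_bins(ages, 3))      # -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]
print(equal_frequency_bins(ages, 3))  # -> [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```

On this skewed sample, equal-width binning crowds half the values into the first interval, while equal-frequency binning keeps the bin sizes balanced, which is often preferable when computing per-bin statistics such as information value.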
Method 2: Use PAI commands
```shell
PAI -name fe_select_runner -project algo_public
    -DfeatImportanceTable=pai_temp_2260_22603_2
    -DselectMethod=iv
    -DselectedCols=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
    -DtopN=5
    -DlabelCol=y
    -DmaxBins=100
    -DinputTable=pai_dense_10_9
    -DoutputTable=pai_temp_2260_22603_1;
```
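Conceptually, a filter-method selector such as the one invoked above scores every feature against the label and keeps the top N. The following sketch illustrates this in Python for the InfoGain method (entropy reduction); it is an illustration of the technique, not the component's actual implementation, and the sample columns are made up:

```python
# Illustrative filter-method selection in the spirit of
# -DselectMethod=InfoGain -DtopN=N. Not the PAI implementation.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Entropy reduction of the label when split by the feature's values."""
    n = len(labels)
    by_value = {}
    for f, y in zip(feature, labels):
        by_value.setdefault(f, []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - cond

def select_top_n(columns, labels, n):
    """Return (name, weight) pairs for the n highest-scoring features."""
    scored = [(name, info_gain(col, labels)) for name, col in columns.items()]
    scored.sort(key=lambda kv: -kv[1])
    return scored[:n]  # if n exceeds the feature count, all features remain

columns = {
    "pdays":    [999, 999, 6, 999, 3, 999],   # perfectly separates the label
    "campaign": [1, 2, 2, 1, 2, 1],           # only partially informative
}
labels = [0, 0, 1, 0, 1, 0]
print(select_top_n(columns, labels, 1))  # pdays ranks first
```

The ranked (name, weight) pairs correspond to the feature importance table that the component writes to featImportanceTable.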
Parameter | Required | Description | Default value |
---|---|---|---|
inputTable | Yes | The name of the input table. | N/A |
inputTablePartitions | No | The partitions that are selected from the input table for training. The following formats are supported: partition_name=value for a single-level partition and partition_name1=value1/partition_name2=value2 for a multi-level partition. Note If you specify multiple partitions, separate them with commas (,). | All partitions |
outputTable | Yes | The feature result table that is generated after filtering. | N/A |
featImportanceTable | Yes | The table that stores the importance weight values of all input features. | N/A |
selectedCols | Yes | The feature columns that are used for training. | N/A |
labelCol | Yes | The label column that is selected from the input table. | N/A |
categoryCols | No | The columns of enumeration features. Only columns of the INT or DOUBLE data types are supported. | N/A |
maxBins | No | The maximum number of intervals for continuous feature partitioning. | 100 |
selectMethod | No | The method that is used to select features. Valid values: iv, GiniGain, InfoGain, and Lasso. | iv |
topN | No | The top N features to be selected. If the specified number is greater than the number of input features, all features are selected. | 10 |
isSparse | No | Specifies whether the features are sparse features in the key-value pair format. A value of false indicates dense features. | false |
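When isSparse is set to true, each row stores only the non-zero features as comma-separated key:value pairs. As a small illustration (the parsing helper is an assumption for explanation, not part of the PAI API), such a row can be expanded into a dense vector like this:

```python
# Illustrative sketch of the sparse key-value format (-DisSparse=true).
# Keys are feature indices here; the helper is not part of the PAI API.

def parse_sparse(row, n_features):
    """Expand a "k:v,k:v" string into a dense list; absent keys default to 0."""
    dense = [0.0] * n_features
    for pair in row.split(","):
        k, v = pair.split(":")
        dense[int(k)] = float(v)
    return dense

print(parse_sparse("0:1.5,3:2.0", 5))  # -> [1.5, 0.0, 0.0, 2.0, 0.0]
```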
Example
- Input data
Execute the following SQL statement to generate test data:
```sql
create table if not exists pai_dense_10_9 as
select age, campaign, pdays, previous, emp_var_rate, cons_price_idx,
       cons_conf_idx, euribor3m, nr_employed, y
from bank_data
limit 10;
```
- Parameter settings: The input table is pai_dense_10_9. Select the y column as the Target Column and the other columns as the Feature Columns. The following figures show the detailed parameter settings.
- Output: The left output is the filtered data, which is stored in the following table.

pdays | nr_employed | emp_var_rate | cons_conf_idx | cons_price_idx | y
---|---|---|---|---|---
999.0 | 5228.1 | 1.4 | -36.1 | 93.444 | 0.0
999.0 | 5195.8 | -0.1 | -42.0 | 93.2 | 0.0
6.0 | 4991.6 | -1.7 | -39.8 | 94.055 | 1.0
999.0 | 5099.1 | -1.8 | -47.1 | 93.075 | 0.0
3.0 | 5076.2 | -2.9 | -31.4 | 92.201 | 1.0
999.0 | 5228.1 | 1.4 | -42.7 | 93.918 | 0.0
999.0 | 5099.1 | -1.8 | -46.2 | 92.893 | 0.0
999.0 | 5099.1 | -1.8 | -46.2 | 92.893 | 0.0
3.0 | 5076.2 | -2.9 | -40.8 | 92.963 | 1.0
999.0 | 5099.1 | -1.8 | -47.1 | 93.075 | 0.0

The right output is the feature importance table shown below. The featname column stores the feature names. The weight column stores the weight values that are calculated by the feature selection method.

featname | weight
---|---
pdays | 30.675544191232486
nr_employed | 29.08332850085075
emp_var_rate | 29.08332850085075
cons_conf_idx | 28.02710269740324
cons_price_idx | 28.02710269740324
euribor3m | 27.829058450563718
age | 27.829058450563714
previous | 14.319325030742775
campaign | 10.658129656314467
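The default selection method is iv (information value), which is why pdays, the column that separates the label best in this sample, receives the highest weight. As a rough illustration of how an IV-style weight can be derived from the pdays and y columns above, here is a minimal sketch; the smoothing constant eps and the grouping by raw value are assumptions, not the component's documented formula, so the resulting number will not match the table exactly:

```python
# Illustrative information value (IV) computation; eps smoothing is an
# assumption to avoid division by zero, not the component's exact formula.
from math import log

def information_value(feature, labels, eps=0.5):
    """Sum of (event% - non-event%) * WOE over the groups of one feature."""
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    groups = {}
    for f, y in zip(feature, labels):
        pos, neg = groups.get(f, (0, 0))
        groups[f] = (pos + y, neg + (1 - y))
    iv = 0.0
    for pos, neg in groups.values():
        p = (pos + eps) / (pos_total + eps)   # smoothed event rate
        q = (neg + eps) / (neg_total + eps)   # smoothed non-event rate
        iv += (p - q) * log(p / q)            # weight-of-evidence term
    return iv

# pdays and y from the example output above
pdays = [999, 999, 6, 999, 3, 999, 999, 999, 3, 999]
labels = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
print(information_value(pdays, labels))  # large: pdays separates the label well
```

A feature that carries no information about the label (for example, a constant column) scores an IV of zero under this sketch, which matches the intuition behind the low weights of campaign and previous in the table above.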