Feature Selection (Filter Method) is a preprocessing technique that scores each feature with a statistical metric, such as a correlation coefficient or information gain, before modeling, and then selects the features that contribute most to the target variable. Because it operates independently of any specific machine learning algorithm, it is efficient and easy to implement, which makes it well suited to dimensionality reduction on large-scale datasets.
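The idea of scoring features independently of any model can be illustrated with a minimal Python sketch. It is not the component's implementation: it assumes absolute Pearson correlation as the scoring metric, and the function names and sample data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def filter_select(columns, target, top_n):
    """Rank features by |correlation| with the target and keep the top_n names."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n], scores

# Hypothetical data: three feature columns and one target column.
columns = {
    "a": [1, 2, 3, 4],   # perfectly correlated with the target
    "b": [1, 1, 1, 1],   # constant, so its score is 0
    "c": [4, 1, 3, 2],   # weakly (negatively) correlated
}
target = [2, 4, 6, 8]
selected, scores = filter_select(columns, target, top_n=2)
print(selected)
```

No model is trained at any point; the ranking depends only on the statistic computed between each feature and the target.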
Limits
The Feature Selection (Filter Method) algorithm cannot directly process data in the LIBSVM or key-value pair format.
Configure the component
Method 1: Configure the component on the pipeline page
Add a Feature Selection (Filter Method) component on the pipeline page and configure the following parameters:
Category | Parameter | Description |
Fields Setting | Feature Columns | The names of the feature columns that are selected from the input table for training. |
Fields Setting | Target Column | The name of the label column that is selected from the input table to calculate the correlation between features and the target. |
Fields Setting | Enumeration Features | Specifies which features are enumeration features and may require specific processing or encoding, such as one-hot encoding. |
Fields Setting | Sparse Features (K:V,K:V) | Specifies whether the features are sparse features in the key-value pair format, which is typical for high-dimensional sparse data, particularly in text processing. |
Parameters Setting | Feature Selection Method | The statistical method used for feature selection. The corresponding PAI command parameter, selectMethod, accepts iv, GiniGain, InfoGain, and Lasso. |
Parameters Setting | Top N Features | The top N features to be selected. If the specified number is greater than the number of input features, all features are selected. |
Parameters Setting | Continuous Feature Partitioning Method | The partitioning method for continuous features. Equal Width Partitioning is one of the valid values. |
Parameters Setting | Continuous Feature Discretization Intervals | The number of intervals for discretizing continuous features. This parameter is required only when Continuous Feature Partitioning Method is set to Equal Width Partitioning. |
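Equal-width partitioning can be sketched in Python as follows. This is a general illustration of the technique, not the component's code; the function name and the sample values are hypothetical, and the handling of the maximum value (placing it in the last interval) is an assumption.

```python
def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins intervals of equal width.

    The intervals span [min(values), max(values)]. The maximum value is
    placed in the last interval (an assumption of this sketch).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # all values identical: a single interval
    width = (hi - lo) / num_bins
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

# Hypothetical continuous feature split into 4 intervals of width 10.
ages = [18, 25, 33, 47, 58]
bins = equal_width_bins(ages, 4)
print(bins)
```

A larger number of intervals preserves more detail in the feature but leaves fewer samples in each interval, which can make per-interval statistics less stable.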
Method 2: Use PAI commands
Configure the Feature Selection (Filter Method) component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name fe_select_runner -project algo_public
-DfeatImportanceTable=pai_temp_2260_22603_2
-DselectMethod=iv
-DselectedCols=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign
-DtopN=5
-DlabelCol=y
-DmaxBins=100
-DinputTable=pai_dense_10_9
-DoutputTable=pai_temp_2260_22603_1;
Parameter | Required | Default Value | Description |
inputTable | Yes | None | The name of the input table. |
inputTablePartitions | No | All partitions | The partitions of the input table to be used in training. Supported formats include:
|
outputTable | Yes | None | The feature result table that is generated after filtering. |
featImportanceTable | Yes | None | The table that stores the importance weight values of all input features. |
selectedCols | Yes | None | The feature columns selected for training. |
labelCol | Yes | None | The target column selected from the input table. |
categoryCols | No | None | The columns of enumeration features. Only columns of the INT or DOUBLE data types are supported. |
maxBins | No | 100 | The maximum number of intervals for continuous feature partitioning. |
selectMethod | No | iv | The method used for feature selection. Valid options are iv, GiniGain, InfoGain, and Lasso. |
topN | No | 10 | The top N features to be selected. If the specified number is greater than the number of input features, all features are selected. |
isSparse | No | false | Specifies whether the features are sparse features in the key-value pair format. A value of false indicates dense features. |
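To show what ranking by the default selectMethod value, iv, might look like, the following Python sketch computes the standard information value (IV) of already discretized features against a binary label and keeps the top N. How the component itself computes IV is not described here, so the smoothing constant eps, the function names, and the sample data are all assumptions.

```python
import math
from collections import Counter

def information_value(bins, labels, eps=0.5):
    """Standard IV of a discretized feature against a binary 0/1 label.

    eps is an additive smoothing constant (an assumption of this sketch)
    that avoids log(0) when an interval contains only one class.
    """
    pos = Counter(b for b, y in zip(bins, labels) if y == 1)
    neg = Counter(b for b, y in zip(bins, labels) if y == 0)
    keys = set(pos) | set(neg)
    total_pos = sum(pos.values()) + eps * len(keys)
    total_neg = sum(neg.values()) + eps * len(keys)
    iv = 0.0
    for k in keys:
        p = (pos[k] + eps) / total_pos  # share of positives in interval k
        q = (neg[k] + eps) / total_neg  # share of negatives in interval k
        iv += (p - q) * math.log(p / q)
    return iv

def top_n_by_iv(binned_columns, labels, top_n=10):
    """Rank features by IV; slicing returns every feature if top_n is too large."""
    scores = {name: information_value(b, labels) for name, b in binned_columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n], scores

# Hypothetical discretized features and binary labels.
labels = [1, 1, 1, 0, 0, 0]
binned = {
    "strong": [0, 0, 0, 1, 1, 1],  # separates the classes perfectly
    "weak":   [0, 1, 0, 1, 0, 1],  # separates them only slightly
    "flat":   [0, 0, 0, 0, 0, 0],  # single interval, no information
}
selected, scores = top_n_by_iv(binned, labels)  # default top_n=10 exceeds 3 features
print(selected)
```

Because the default top_n of 10 is larger than the three input features, all three are returned, which mirrors the documented behavior of topN.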