
Platform for AI: Feature Selection (Filter Method)

Last Updated: Nov 28, 2024

Feature Selection (Filter Method) is a preprocessing technique that evaluates the importance of features using statistical metrics, such as correlation coefficients and information gain, before modeling. It selects the features that contribute most to predicting the target variable. Because it operates independently of any specific machine learning algorithm, it is efficient and easy to implement, which makes it well suited to dimensionality reduction on large-scale datasets.
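The rank-and-filter pattern is easy to see in code. The following is a minimal sketch, assuming scikit-learn and a synthetic dataset; it illustrates the general idea of scoring features against the target and keeping the top N, not this component's internal implementation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for an input table: 9 feature columns and a binary target.
X, y = make_classification(n_samples=500, n_features=9, n_informative=4, random_state=0)
# Score each feature against the target (here: mutual information, i.e.
# information gain), then keep the top N highest-scoring features.
scores = mutual_info_classif(X, y, random_state=0)
top_n = 5
selected = np.argsort(scores)[::-1][:top_n]
print("Selected feature indices:", selected)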

Limits

The Feature Selection (Filter Method) algorithm cannot directly process data in the LIBSVM or key-value pair format.
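If your data is stored in the key-value pair format, convert it to dense columns before using this component. The following is a minimal pandas sketch, assuming a hypothetical column named kv that holds strings such as 1:0.5,3:2.0 (the column name and data are illustrative):

import pandas as pd

# Hypothetical key-value column; keys become dense feature columns f1, f2, ...
df = pd.DataFrame({"kv": ["1:0.5,3:2.0", "2:1.0", "1:0.3,2:0.7"]})

def kv_to_dict(s):
    # "1:0.5,3:2.0" -> {"f1": 0.5, "f3": 2.0}
    return {f"f{k}": float(v) for k, v in (pair.split(":") for pair in s.split(","))}

# Missing keys become 0.0 in the dense representation.
dense = pd.DataFrame(df["kv"].map(kv_to_dict).tolist()).fillna(0.0)
print(dense)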

Configure the component

Method 1: Configure the component on the pipeline page

Add a Feature Selection (Filter Method) component on the pipeline page and configure the following parameters:

Fields Setting

  • Feature Columns: The names of the feature columns selected from the input table for training.

  • Target Column: The name of the label column selected from the input table, used to calculate the correlation between each feature and the target.

  • Enumeration Features: Specifies which features are enumeration features that may require specific processing or encoding, such as one-hot encoding.

  • Sparse Features (K:V,K:V): Specifies whether the features are sparse features in the key-value pair format, which is typical of high-dimensional sparse data, particularly in text processing.

Parameters Setting

  • Feature Selection Method: The statistical method used for feature selection (a worked IV sketch follows this parameter list). Options include:

      • IV: Measures the predictive capability of a feature with respect to the target variable; frequently applied in binary classification scenarios.

      • Gini Gain: Measures the importance of a feature by the reduction in Gini impurity it achieves; frequently employed in decision trees.

      • Information Gain: Quantifies the reduction in uncertainty about the target variable achieved by a single feature, thereby assessing the feature's contribution to predicting the target.

      • Lasso: A linear-model method that uses L1 regularization to perform dimensionality reduction and feature selection on large-scale feature sets.

  • Top N Features: The number of top-ranked features to select. If the specified number is greater than the number of input features, all features are selected.

  • Continuous Feature Partitioning Method: The partitioning method for continuous features. Valid values:

      • Automated Partitioning: The algorithm selects partition points automatically based on the data distribution.

      • Equal Width Partitioning: Divides the data range into intervals of equal width; a straightforward method that may be less effective on uneven distributions.

  • Continuous Feature Discretization Intervals: The number of intervals used to discretize continuous features. Required only when Continuous Feature Partitioning Method is set to Equal Width Partitioning.
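To make the IV option and the binning parameters concrete, the following is a minimal NumPy sketch, assuming a binary 0/1 target; it bins one continuous feature with equal-width partitioning and applies the standard IV formula, IV = sum over bins of (good% - bad%) * ln(good% / bad%). The function and variable names are illustrative, not part of the component:

import numpy as np

def information_value(feature, target, bins=10):
    # Equal-width partitioning (cf. Equal Width Partitioning above).
    edges = np.histogram_bin_edges(feature, bins=bins)
    idx = np.digitize(feature, edges[1:-1])  # bin index in [0, bins - 1]
    iv, eps = 0.0, 1e-9  # eps avoids log(0) for empty bins
    n_good, n_bad = (target == 1).sum(), (target == 0).sum()
    for b in range(bins):
        in_bin = idx == b
        good = (target[in_bin] == 1).sum() / n_good + eps  # share of positives in bin
        bad = (target[in_bin] == 0).sum() / n_bad + eps    # share of negatives in bin
        iv += (good - bad) * np.log(good / bad)
    return iv

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                                   # continuous feature
y = (x + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # correlated target
print(round(information_value(x, y, bins=10), 3))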

Method 2: Use PAI commands

Configure the Feature Selection (Filter Method) component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name fe_select_runner -project algo_public 
     -DfeatImportanceTable=pai_temp_2260_22603_2 
     -DselectMethod=iv 
     -DselectedCols=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign 
     -DtopN=5 
     -DlabelCol=y 
     -DmaxBins=100 
     -DinputTable=pai_dense_10_9 
     -DoutputTable=pai_temp_2260_22603_1;
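You can also submit the same job from Python through PyODPS, whose execute_xflow method runs XFlow algorithms such as fe_select_runner. The following is a hedged sketch, assuming a configured ODPS entry object and the tables from the command above; the credential placeholders are illustrative:

from odps import ODPS

# Placeholders; supply real credentials, project, and endpoint.
o = ODPS("<access_id>", "<access_key>", project="<project>", endpoint="<endpoint>")
# Keys mirror the -D parameters of the PAI command above, without the -D prefix.
o.execute_xflow(
    "fe_select_runner",
    "algo_public",
    parameters={
        "inputTable": "pai_dense_10_9",
        "outputTable": "pai_temp_2260_22603_1",
        "featImportanceTable": "pai_temp_2260_22603_2",
        "selectedCols": "pdays,previous,emp_var_rate,cons_price_idx,"
                        "cons_conf_idx,euribor3m,nr_employed,age,campaign",
        "labelCol": "y",
        "selectMethod": "iv",
        "topN": "5",
        "maxBins": "100",
    },
)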

The command parameters are described below (required status and default value in parentheses):

  • inputTable (required; default: none): The name of the input table.

  • inputTablePartitions (optional; default: all partitions): The partitions of the input table to be used in training. Supported formats include:

      • partition_name=value

      • name1=value1/name2=value2: for multi-level partitioning

    Note: Use a comma (,) to separate multiple partitions, for example, name1=value1,value2.

  • outputTable (required; default: none): The feature result table that is generated after filtering.

  • featImportanceTable (required; default: none): The table that stores the importance weight values of all input features.

  • selectedCols (required; default: none): The feature columns selected for training.

  • labelCol (required; default: none): The target column selected from the input table.

  • categoryCols (optional; default: none): The columns of enumeration features. Only columns of the INT or DOUBLE data type are supported.

  • maxBins (optional; default: 100): The maximum number of intervals for continuous feature partitioning.

  • selectMethod (optional; default: iv): The method used for feature selection. Valid values: iv, GiniGain, InfoGain, and Lasso.

  • topN (optional; default: 10): The top N features to be selected. If the specified number is greater than the number of input features, all features are selected.

  • isSparse (optional; default: false): Specifies whether the features are sparse features in the key-value pair format. A value of false indicates dense features.