
Platform for AI: Feature Selection (Filter Method)

Last Updated: Nov 28, 2024

Feature Selection (Filter Method) is a preprocessing technique that evaluates the importance of features using statistical metrics, such as correlation coefficients and information gain, before modeling. It selects the features that contribute most to predicting the target variable. Because it operates independently of any specific machine learning algorithm, it is efficient and easy to implement, which makes it well suited to dimensionality reduction on large-scale datasets.
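The rank-and-filter pattern is easy to see in code. The following is a minimal sketch, assuming scikit-learn and a synthetic dataset; it illustrates the general idea of scoring features against the target and keeping the top N, not this component's internal implementation:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for an input table: 9 feature columns and a binary target.
X, y = make_classification(n_samples=500, n_features=9, n_informative=4, random_state=0)
# Score each feature against the target (here: mutual information, i.e.
# information gain), then keep the top N highest-scoring features.
scores = mutual_info_classif(X, y, random_state=0)
top_n = 5
selected = np.argsort(scores)[::-1][:top_n]
print("Selected feature indices:", selected)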

Limits

The Feature Selection (Filter Method) algorithm cannot directly process data in the LIBSVM or key-value pair format.
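If your data is stored in the key-value pair format, convert it to dense columns before using this component. The following is a minimal pandas sketch, assuming a hypothetical column named kv that holds strings such as 1:0.5,3:2.0 (the column name and data are illustrative):

import pandas as pd

# Hypothetical key-value column; keys become dense feature columns f1, f2, ...
df = pd.DataFrame({"kv": ["1:0.5,3:2.0", "2:1.0", "1:0.3,2:0.7"]})

def kv_to_dict(s):
    # "1:0.5,3:2.0" -> {"f1": 0.5, "f3": 2.0}
    return {f"f{k}": float(v) for k, v in (pair.split(":") for pair in s.split(","))}

# Missing keys become 0.0 in the dense representation.
dense = pd.DataFrame(df["kv"].map(kv_to_dict).tolist()).fillna(0.0)
print(dense)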

Configure the component

Method 1: Configure the component on the pipeline page

Add a Feature Selection (Filter Method) component on the pipeline page and configure the following parameters:

Fields Setting

  • Feature Columns: The names of the feature columns selected from the input table for training.

  • Target Column: The name of the label column selected from the input table, used to calculate the correlation between each feature and the target.

  • Enumeration Features: Specifies which features are enumeration features that may require specific processing or encoding, such as one-hot encoding.

  • Sparse Features (K:V,K:V): Specifies whether the features are sparse features in the key-value pair format, which is typical of high-dimensional sparse data, particularly in text processing.

Parameters Setting

  • Feature Selection Method: The statistical method used for feature selection (a worked IV sketch follows this parameter list). Options include:

      • IV: Measures the predictive capability of a feature with respect to the target variable; frequently applied in binary classification scenarios.

      • Gini Gain: Measures the importance of a feature by the reduction in Gini impurity it achieves; frequently employed in decision trees.

      • Information Gain: Quantifies the reduction in uncertainty about the target variable achieved by a single feature, thereby assessing the feature's contribution to predicting the target.

      • Lasso: A linear-model method that uses L1 regularization to perform dimensionality reduction and feature selection on large-scale feature sets.

  • Top N Features: The number of top-ranked features to select. If the specified number is greater than the number of input features, all features are selected.

  • Continuous Feature Partitioning Method: The partitioning method for continuous features. Valid values:

      • Automated Partitioning: The algorithm selects partition points automatically based on the data distribution.

      • Equal Width Partitioning: Divides the data range into intervals of equal width; a straightforward method that may be less effective on uneven distributions.

  • Continuous Feature Discretization Intervals: The number of intervals used to discretize continuous features. Required only when Continuous Feature Partitioning Method is set to Equal Width Partitioning.
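To make the IV option and the binning parameters concrete, the following is a minimal NumPy sketch, assuming a binary 0/1 target; it bins one continuous feature with equal-width partitioning and applies the standard IV formula, IV = sum over bins of (good% - bad%) * ln(good% / bad%). The function and variable names are illustrative, not part of the component:

import numpy as np

def information_value(feature, target, bins=10):
    # Equal-width partitioning (cf. Equal Width Partitioning above).
    edges = np.histogram_bin_edges(feature, bins=bins)
    idx = np.digitize(feature, edges[1:-1])  # bin index in [0, bins - 1]
    iv, eps = 0.0, 1e-9  # eps avoids log(0) for empty bins
    n_good, n_bad = (target == 1).sum(), (target == 0).sum()
    for b in range(bins):
        in_bin = idx == b
        good = (target[in_bin] == 1).sum() / n_good + eps  # share of positives in bin
        bad = (target[in_bin] == 0).sum() / n_bad + eps    # share of negatives in bin
        iv += (good - bad) * np.log(good / bad)
    return iv

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                                   # continuous feature
y = (x + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # correlated target
print(round(information_value(x, y, bins=10), 3))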

Method 2: Use PAI commands

Configure the Feature Selection (Filter Method) component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name fe_select_runner -project algo_public 
     -DfeatImportanceTable=pai_temp_2260_22603_2 
     -DselectMethod=iv 
     -DselectedCols=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign 
     -DtopN=5 
     -DlabelCol=y 
     -DmaxBins=100 
     -DinputTable=pai_dense_10_9 
     -DoutputTable=pai_temp_2260_22603_1;
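You can also submit the same job from Python through PyODPS, whose execute_xflow method runs XFlow algorithms such as fe_select_runner. The following is a hedged sketch, assuming a configured ODPS entry object and the tables from the command above; the credential placeholders are illustrative:

from odps import ODPS

# Placeholders; supply real credentials, project, and endpoint.
o = ODPS("<access_id>", "<access_key>", project="<project>", endpoint="<endpoint>")
# Keys mirror the -D parameters of the PAI command above, without the -D prefix.
o.execute_xflow(
    "fe_select_runner",
    "algo_public",
    parameters={
        "inputTable": "pai_dense_10_9",
        "outputTable": "pai_temp_2260_22603_1",
        "featImportanceTable": "pai_temp_2260_22603_2",
        "selectedCols": "pdays,previous,emp_var_rate,cons_price_idx,"
                        "cons_conf_idx,euribor3m,nr_employed,age,campaign",
        "labelCol": "y",
        "selectMethod": "iv",
        "topN": "5",
        "maxBins": "100",
    },
)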

The command parameters are described below (required status and default value in parentheses):

  • inputTable (required; default: none): The name of the input table.

  • inputTablePartitions (optional; default: all partitions): The partitions of the input table to be used in training. Supported formats include:

      • partition_name=value

      • name1=value1/name2=value2: for multi-level partitioning

    Note: Use a comma (,) to separate multiple partitions, for example, name1=value1,value2.

  • outputTable (required; default: none): The feature result table that is generated after filtering.

  • featImportanceTable (required; default: none): The table that stores the importance weight values of all input features.

  • selectedCols (required; default: none): The feature columns selected for training.

  • labelCol (required; default: none): The target column selected from the input table.

  • categoryCols (optional; default: none): The columns of enumeration features. Only columns of the INT or DOUBLE data type are supported.

  • maxBins (optional; default: 100): The maximum number of intervals for continuous feature partitioning.

  • selectMethod (optional; default: iv): The method used for feature selection. Valid values: iv, GiniGain, InfoGain, and Lasso.

  • topN (optional; default: 10): The top N features to be selected. If the specified number is greater than the number of input features, all features are selected.

  • isSparse (optional; default: false): Specifies whether the features are sparse features in the key-value pair format. A value of false indicates dense features.