The Feature Discretization component discretizes continuous features based on a specific rule.
Overview
The Feature Discretization component supports the following types of discretization:
Discretization of dense features that are of numeric data types
Unsupervised discretization such as equal frequency discretization and equal width discretization
Note: The default unsupervised discretization method is equal width discretization.
Supervised discretization such as Gini gain-based discretization and entropy gain-based discretization
Note: The data type for label feature discretization must be ENUM, STRING, or BIGINT.
Supervised discretization searches for segmentation points based on entropy gains through continuous traversal of the data. This type of discretization may take a long time to run. The number of bins obtained after segmentation is not limited by the value of the maxBins parameter.
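The following Python sketch illustrates why a traversal-based search for segmentation points can be slow. It is not the component's implementation: the function names `entropy` and `best_split` and the sample data are hypothetical, and the sketch finds only a single cut point by trying every boundary between adjacent distinct values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, gain) for the single cut with the highest entropy gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([label for _, label in pairs])
    best_threshold, best_gain = None, 0.0
    # Traverse every boundary between two distinct adjacent values.
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            # Place the cut halfway between the two neighboring values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain


# Hypothetical data: low values are labeled "no", high values "yes".
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(values, labels))  # (6.5, 1.0)
```

Each candidate boundary requires recomputing the entropy of both sides, so the cost grows quickly with the number of distinct values. Applying such a search recursively to each side would also explain how the final number of bins can differ from a fixed target, although the component's actual stopping rule is not described here.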
Configure the component
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Feature Discretization component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Discrete Features | The features that require discretization. |
Fields Setting | Label Column | The label column. If this parameter is specified, you can view the x-y histograms that display the relationship between the features and the objective variables. |
Parameters Setting | Discretization Method | The method that is used for discretization. The supported methods are listed in the Overview section. |
Parameters Setting | Discretization Interval | The number of discrete intervals. The value must be a positive integer that is greater than 1. |
Tuning | Cores | The number of cores used in computing. The value must be a positive integer. |
Tuning | Memory Size per Core | The memory size of each core. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name fe_discrete_runner_1 -project algo_public
-DdiscreteMethod=SameFrequecy
-Dlifecycle=28
-DmaxBins=5
-DinputTable=pai_dense_10_1
-DdiscreteCols=nr_employed
-DoutputTable=pai_temp_2262_20382_1
-DmodelTable=pai_temp_2262_20382_2;
Parameter | Required | Description | Default value |
inputTable | Yes | The name of the input table. | None |
inputTablePartitions | No | The partitions that are selected from the input table for training. Multi-level partitions can be specified. If you specify multiple partitions, separate them with commas (,). | All partitions in the input table |
outputTable | Yes | The output table after discretization. | None |
discreteCols | Yes | The features that require discretization. Sparse features are automatically filtered by the system. | "" |
labelCol | No | The label column. If this parameter is specified, you can view the x-y histograms that display the relationship between the features and the objective variables. | None |
discreteMethod | No | The method that is used for discretization. | Isometric Discretization |
maxBins | No | The number of discrete intervals. The value must be a positive integer that is greater than 1. | 100 |
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | 7 |
coreNum | No | The number of cores. This parameter is used together with the memSizePerCore parameter. The value must be a positive integer. | Determined by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. | Determined by the system |
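If you generate the PAI command from a script, the constraints in the table above can be checked before the command is submitted. The following Python sketch shows one way to do this. The helper `build_command` is hypothetical and not part of PAI; it only assembles a command string and checks the required parameters and integer constraints documented above.

```python
REQUIRED = ("inputTable", "outputTable", "discreteCols")
POSITIVE_INTS = ("maxBins", "lifecycle", "coreNum", "memSizePerCore")


def build_command(params):
    """Assemble a fe_discrete_runner_1 command string from a dict of parameters."""
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    for name in POSITIVE_INTS:
        if name in params:
            value = params[name]
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer")
    # maxBins has the stricter documented constraint: greater than 1.
    if params.get("maxBins", 2) < 2:
        raise ValueError("maxBins must be greater than 1")
    lines = ["PAI -name fe_discrete_runner_1 -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


# Values taken from the sample command in this topic.
print(build_command({
    "discreteMethod": "SameFrequecy",
    "lifecycle": 28,
    "maxBins": 5,
    "inputTable": "pai_dense_10_1",
    "discreteCols": "nr_employed",
    "outputTable": "pai_temp_2262_20382_1",
    "modelTable": "pai_temp_2262_20382_2",
}))
```

The helper passes through any parameter name it is given, so parameters that appear in the sample command but not in the table, such as modelTable, are kept unchanged.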
Examples
Input data
Execute the following SQL statement to generate input data:
create table if not exists pai_dense_10_1 as select nr_employed from bank_data limit 10;
Configure the component
The input table is pai_dense_10_1. On the Fields Setting tab, set the Discrete Features parameter to nr_employed. On the Parameters Setting tab, set the Discretization Method parameter to Equal Width Discretization and the Discretization Interval parameter to 5.
Execution results
nr_employed
4.0
3.0
1.0
3.0
2.0
4.0
3.0
3.0
2.0
3.0
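To illustrate how continuous values map to bin indices like the ones above, the following Python sketch implements simple versions of equal width and equal frequency discretization. It does not reproduce the component's output: the function names and the sample values are hypothetical, the bin indices are 0-based, and the handling of interval edges and ties is an assumption.

```python
def equal_width_bins(values, max_bins):
    """Assign each value a 0-based bin index using intervals of equal width."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)
    width = (high - low) / max_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - low) / width), max_bins - 1) for v in values]


def equal_frequency_bins(values, max_bins):
    """Assign each value a 0-based bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Bins are assigned by rank, so tied values may land in different bins.
        bins[i] = rank * max_bins // len(values)
    return bins


# Hypothetical data with one outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
print(equal_width_bins(values, 5))      # [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
print(equal_frequency_bins(values, 5))  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

With an outlier in the data, equal width discretization places almost every value in the first bin, whereas equal frequency discretization spreads the values evenly. This difference is worth considering when you choose a discretization method.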