Feature engineering is essential to model training in machine learning. One of its goals is to find feature crosses, which are combinations of features that improve the model, and algorithm engineers generally spend a lot of effort on this work. Machine Learning Platform for AI (PAI) provides the Auto Feature Cross component to help you find effective feature crosses. You can then combine the features that form these crosses to optimize your model. This topic describes how to use the Auto Feature Cross component.
Flowchart
The Auto Feature Cross component is developed based on the deep learning framework TensorFlow. This component involves intensive parallel computing under the hood and requires GPU resources. Only the China (Beijing) and China (Shanghai) regions support the Auto Feature Cross component.
1. Authorize PAI to access your GPU resources and OSS bucket
- Log on to the PAI console and go to the Visualized Modeling (Machine Learning Studio) page. For more information, see Use DataWorks tasks to schedule experiments in Machine Learning Studio.
- On the page that appears, click Settings in the left-side navigation pane. On the General tab of the Settings page, select Authorize Machine Learning Platform for AI to access my OSS resources and enable GPU computing.
2. Bin data
The Auto Feature Cross component supports only the BIGINT data type. However, raw data in most business scenarios is of the DOUBLE data type, as shown in the following figure.
In this case, you must use the SQL Script or One Hot Encoding component to convert the raw data from the DOUBLE type to the BIGINT type. In addition, you must use the Feature Discretization component to discretize the feature data, so that values in different intervals fall into different bins. The following figure shows the data after binning.
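The idea of binning can be sketched in a few lines of Python. This is only an illustration of equal-width binning, which is one common strategy; it is not necessarily the method that the Feature Discretization component applies. The function name equal_width_bins and the sample oldpeak values are hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Map float values to integer bin indexes in the range [0, num_bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into bin 0.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum lies on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

# Hypothetical DOUBLE values for one feature column.
oldpeak = [0.0, 1.4, 2.3, 3.6, 6.2]
print(equal_width_bins(oldpeak, 5))
```

After this step every value is a small non-negative integer, which is the form that the Auto Feature Cross component expects.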
3. Determine the range of feature values
After binning, determine the maximum value of each feature. In the sample data, for example:
- The maximum value of the thalach feature is 4.
- The maximum value of the oldpeak feature is 3.
- The maximum value of the ca feature is 4.
You can execute the following SQL statement to query the maximum value of each feature:
select max(feature) from table;
In the sample data of this topic, no feature has a value greater than 4 after binning.
You must set the Feature length parameter of the Auto Feature Cross component in the [5,5,5,5,5,5,5,5,5,5,5,5,5] format, as shown in the following figure. Each 5 in the list corresponds to one feature and indicates the left-closed, right-open interval [0,5), which covers the values 0 through 4.
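The relationship between the maximum values and the Feature length value can be sketched as follows. The helper feature_length is hypothetical. It assumes that every feature uses the same length, namely the largest maximum plus 1, as in the sample value of this topic. Whether the component also accepts a different length for each feature is not covered here.

```python
def feature_length(max_values):
    """Build a Feature length string from the per-feature maximum values.

    Every entry is the largest maximum plus 1, so the interval
    [0, size) covers all values of all features.
    """
    size = max(max_values) + 1
    return "[" + ",".join([str(size)] * len(max_values)) + "]"

# Maximum values of thalach, oldpeak, and ca from the sample data.
print(feature_length([4, 3, 4]))

# The 13 features of the sample data, none of which exceeds 4.
print(feature_length([4] * 13))
```

You can obtain the maximum values with the SQL statement shown earlier, run once for each feature column.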
4. Prepare training data and test data
In this topic, the training data is the same as the test data. In actual use, the test data can differ from the training data, provided that the fields in the test data are the same as the fields in the training data.
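Before you run the component, you may want to confirm that the two tables have the same fields. The following sketch compares two lists of field names. The function name field_differences and the sample field lists are hypothetical, and how you retrieve the field names from your tables is left open.

```python
def field_differences(train_fields, test_fields):
    """Return (missing, extra) relative to the training fields.

    missing: fields in the training data that the test data lacks.
    extra:   fields in the test data that the training data lacks.
    """
    train, test = set(train_fields), set(test_fields)
    return sorted(train - test), sorted(test - train)

# Hypothetical field lists; the test data lacks the fbs field.
train_fields = ["sex", "cp", "fbs", "ifhealth"]
test_fields = ["sex", "cp", "ifhealth"]
print(field_differences(train_fields, test_fields))
```

Two empty lists mean that the test data has the same fields as the training data.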
5. Configure the Auto Feature Cross component
- Set the parameters on the Fields Setting tab
In the Auto Feature Cross component, the input port on the left is used to import training data and the input port on the right is used to import test data.
- Feature selection: the feature columns that are selected for feature crossing.
- if sparse data: specifies whether the input data is sparse. By default, this check box is not selected, which means that the data is dense.
- Label: the label column that is used to determine whether a feature cross is effective.
- Output path: the path in the OSS bucket in which the generated model is stored.
- Set the parameters on the Parameters Setting tab
- Ergodic number: the number of iterations.
- Feature order: the maximum number of features in each feature cross. For example, a value of 3 indicates that each feature cross involves a maximum of three features.
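To get a feel for what the Feature order parameter controls, the following sketch enumerates every possible feature cross of up to a given order. It only illustrates the size of the search space. It is not how the component searches, and the function name candidate_crosses is hypothetical.

```python
from itertools import combinations

def candidate_crosses(num_features, max_order):
    """Yield every tuple of 2 to max_order distinct feature indexes."""
    for order in range(2, max_order + 1):
        yield from combinations(range(num_features), order)

# 13 features with a Feature order of 3, as in the sample data.
crosses = list(candidate_crosses(13, 3))
print(len(crosses))   # number of candidate crosses
print(crosses[0])     # the first pairwise cross
```

The number of candidates grows quickly with the order, which is one reason to keep the Feature order value small.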
The following PAI command shows a sample configuration of the component:
PAI -name fives_ext -project algo_public
-DlabelColName="ifhealth" // The label column that is used to determine whether a feature cross is effective.
-Dmetric_file="metric_log.log" // The name of the system log file.
-Dfeature_meta="[5,5,5,5,5,5,5,5,5,5,5,5,5]"
-DtrainTable="odps://Project name/tables/Table name"
-Dbuckets="oss://{oss_bucket}/"
-Dthreshold="0.5"
-Dk="3"
-DossHost="oss-cn-beijing-internal.aliyuncs.com" // The internal OSS endpoint of the region in which OSS is activated.
-Demb_dims="16"
-DenableSparse="0"
-Dtemp_anneal_steps="30000"
-DfeatureColName="sex,cp,fbs,restecg,exang,slop,thal,age,trestbps,chol,thalach,oldpeak,ca" // The feature columns that are selected for feature crossing.
-DtestTable="odps://Project name/tables/Table name"
-Darn="acs:ram::********:role/aliyunodpspaidefaultrole" //rolearn
-Depochs="1500"
-DcheckpointDir="oss://{oss_bucket}/{path}/";
View the feature crosses
Find the interactions.json file in the root directory of your OSS bucket. The root directory is specified by the buckets parameter (-Dbuckets in the PAI command). The file lists the effective feature crosses. For example:
- [0,1] indicates that the cross of the first and second features is effective. Feature indexes start from 0, and the feature order in each feature cross is the same as the feature order in the input table.
- [8, 6, 5] indicates that the cross of the ninth, seventh, and sixth features is effective.
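The index lists are easier to read once they are mapped back to feature names. The following sketch assumes that interactions.json contains a JSON array of index arrays, such as [[0, 1], [8, 6, 5]], and that the features appear in the input table in the same order as in the featureColName parameter of the sample command. Both are assumptions, so check them against your own file and table. The function name name_crosses is hypothetical.

```python
import json

# Feature columns in the order used by the sample command (assumed to
# match the column order of the input table).
FEATURES = "sex,cp,fbs,restecg,exang,slop,thal,age,trestbps,chol,thalach,oldpeak,ca".split(",")

def name_crosses(interactions_json, feature_names):
    """Translate zero-based index lists into lists of feature names."""
    crosses = json.loads(interactions_json)
    return [[feature_names[i] for i in cross] for cross in crosses]

# Assumed file content, built from the two examples above.
sample = "[[0, 1], [8, 6, 5]]"
print(name_crosses(sample, FEATURES))
```

In practice you would download interactions.json from your OSS bucket and pass its content to the function instead of the sample string.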