This topic describes how to submit a hyperparameter tuning experiment that uses MaxCompute computing resources to run the PS-SMART Binary Classification, Prediction, and Binary Classification Evaluation components, and how to obtain an optimal hyperparameter combination for the PS-SMART algorithm.
Prerequisites
The permissions that are required to use AutoML are granted to your account. The first time you use AutoML, make sure this prerequisite is met. For more information, see Grant permissions that are required to use AutoML.
A workspace is created and associated with the MaxCompute resources. For more information, see Create a workspace.
Step 1: Prepare data
In this example, a feature-engineered dataset for predicting bank customer product subscriptions is used. Prepare the training dataset and test dataset by performing the following operations:
Run the following SQL commands on the MaxCompute client to create a table named bank_train_data and a table named bank_test_data. For information about how to install and configure the MaxCompute client, see MaxCompute client (odpscmd).
create table bank_train_data (
    id               bigint,
    age              double,
    job              double,
    marital          double,
    education        double,
    default          double,
    housing          double,
    loan             double,
    contact          double,
    month            double,
    day_of_week      double,
    duration         double,
    campaign         double,
    pdays            double,
    previous         double,
    poutcome         double,
    emp_var_rate     double,
    cons_price_index double,
    cons_conf_index  double,
    lending_rate3m   double,
    nr_employed      double,
    subscribe        bigint
);

create table bank_test_data (
    id               bigint,
    age              double,
    job              double,
    marital          double,
    education        double,
    default          double,
    housing          double,
    loan             double,
    contact          double,
    month            double,
    day_of_week      double,
    duration         double,
    campaign         double,
    pdays            double,
    previous         double,
    poutcome         double,
    emp_var_rate     double,
    cons_price_index double,
    cons_conf_index  double,
    lending_rate3m   double,
    nr_employed      double,
    subscribe        bigint
);
Run the following Tunnel commands on the MaxCompute client to upload the following training dataset to the bank_train_data table and the test dataset to the bank_test_data table. For information about how to use Tunnel commands, see Tunnel commands.
-- Upload the training dataset to the bank_train_data table. Replace xx/train_data.csv with the path of the train_data.csv file.
tunnel upload xx/train_data.csv bank_train_data;
-- Upload the test dataset to the bank_test_data table. Replace xx/test_data.csv with the path of the test_data.csv file.
tunnel upload xx/test_data.csv bank_test_data;
Training dataset: train_data.csv
Test dataset: test_data.csv
Step 2: Create an experiment
Go to the Create Experiment page, and perform the following steps to configure key parameters. For information about the settings of other parameters, see Create an experiment. After you configure the parameters, click Submit.
Configure parameters in the Execution Configurations section.
Parameter
Description
Job Type
The type of the job. Select MaxCompute.
Command
Configure the following five commands. The commands are executed in sequence.
cmd1: Run the PS-SMART Binary Classification component by using the training data that you prepared to build a binary classification model. For information about the parameters, see PS-SMART Binary Classification.
PAI -name ps_smart
    -project algo_public
    -DinputTableName='bank_train_data'
    -DmodelName='bi_ps_${exp_id}_${trial_id}'
    -DoutputTableName='bi_model_output_${exp_id}_${trial_id}'
    -DoutputImportanceTableName='bi_imp_${exp_id}_${trial_id}'
    -DlabelColName='subscribe'
    -DfeatureColNames='age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_index,cons_conf_index,lending_rate3m,nr_employed'
    -DenableSparse='false'
    -Dobjective='binary:logistic'
    -Dmetric='error'
    -DfeatureImportanceType='gain'
    -DtreeCount='${tree_count}'
    -DmaxDepth='${max_depth}'
    -Dshrinkage="0.3"
    -Dl2="1.0"
    -Dl1="0"
    -Dlifecycle="3"
    -DsketchEps="0.03"
    -DsampleRatio="1.0"
    -DfeatureRatio="1.0"
    -DbaseScore="0.5"
    -DminSplitLoss="0";
cmd2: Delete the prediction result table.
drop table if exists bi_output_${exp_id}_${trial_id};
cmd3: Run the Prediction component based on the model generated by cmd1 to predict the input data. For information about the parameters, see Prediction.
PAI -name prediction
    -project algo_public
    -DinputTableName='bank_test_data'
    -DmodelName='bi_ps_${exp_id}_${trial_id}'
    -DoutputTableName='bi_output_${exp_id}_${trial_id}'
    -DfeatureColNames='age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_index,cons_conf_index,lending_rate3m,nr_employed'
    -DappendColNames='subscribe,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_index,cons_conf_index,lending_rate3m,nr_employed'
    -DenableSparse='false'
    -Dlifecycle='3';
cmd4: Run the Binary Classification Evaluation component based on the prediction result generated by cmd3. For information about the parameters, see Binary Classification Evaluation.
PAI -name evaluate
    -project algo_public
    -DoutputDetailTableName='bi_0804_${exp_id}_${trial_id}_outputDetailTable'
    -DoutputMetricTableName='bi_0804_${exp_id}_${trial_id}_outputMetricTable'
    -DlabelColName='subscribe'
    -DscoreColName='prediction_score'
    -DpositiveLabel='1'
    -DbinCount='1000'
    -DdetailColName='prediction_detail'
    -DlabelMatch='true'
    -DinputTableName='bi_output_${exp_id}_${trial_id}';
cmd5: Obtain the evaluation metrics from the evaluation result table generated by cmd4.
INSERT OVERWRITE TABLE ps_smart_classification_metrics PARTITION (pt='${exp_id}_${trial_id}')
SELECT /*+MAPJOIN(b,c,d)*/
    REGEXP_EXTRACT(a.data_range, '\\\((.*?),') AS threshold,
    a.recall,
    a.precision,
    a.f1_score,
    c.value AS auc,
    d.value AS ks
FROM (
    SELECT recall, precision, f1_score, data_range, 'AUC' auc, 'KS' ks
    FROM bi_0804_${exp_id}_${trial_id}_outputDetailTable
) a
JOIN bi_0804_${exp_id}_${trial_id}_outputMetricTable b ON b.name='F1 Score' AND a.f1_score=b.value
JOIN bi_0804_${exp_id}_${trial_id}_outputMetricTable c ON c.name=a.auc
JOIN bi_0804_${exp_id}_${trial_id}_outputMetricTable d ON d.name=a.ks;
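cmd5 assumes that the ps_smart_classification_metrics table already exists; the preceding commands do not create it. A minimal sketch of a compatible partitioned table follows. The column names match the SELECT list in cmd5, but the types are assumptions (threshold is extracted as a string by REGEXP_EXTRACT):

```sql
-- Hypothetical DDL: column names follow the SELECT list in cmd5;
-- the types are assumptions, not taken from this topic.
create table if not exists ps_smart_classification_metrics (
    threshold string,
    recall    double,
    precision double,
    f1_score  double,
    auc       double,
    ks        double
) partitioned by (pt string);
```

Each trial writes its metrics into its own pt='${exp_id}_${trial_id}' partition, which is what the Metric Source query reads.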
Hyperparameter
Configure the constraint type and search space for the following hyperparameters:
tree_count:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: 50, 100, and 150.
max_depth:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: 6, 8, and 10.
The preceding configuration generates nine hyperparameter combinations. The system creates a trial for each hyperparameter combination and runs the preceding five commands in each trial by using one hyperparameter combination.
Configure parameters in the Trial Configuration section.
Parameter
Description
Metric Type
Select table.
Method
Select best.
Metric Weight
Configure the following metric weights:
Key: recall. Value: 0.5.
Key: precision. Value: 0.25.
Key: auc. Value: 0.25.
Metric Source
Set the value to the following SQL statement:
select * from ps_smart_classification_metrics where pt='${exp_id}_${trial_id}';
Optimization
Select Maximize.
Model Name
Set the value to bi_ps_${exp_id}_${trial_id}.
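Based on the Metric Weight settings above, the final metric of a trial is a weighted combination of the individual metrics: 0.5 × recall + 0.25 × precision + 0.25 × auc. AutoML computes this internally; as an illustration only, the combination corresponds to the following query against the metric table:

```sql
-- Illustration only: AutoML evaluates this weighted sum internally.
select 0.5 * recall + 0.25 * precision + 0.25 * auc as final_metric
from ps_smart_classification_metrics
where pt='${exp_id}_${trial_id}';
```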
Configure parameters in the Search Configurations section.
Parameter
Description
Search Algorithm
Select TPE.
Maximum Trials
Set the value to 5.
Maximum Concurrent Trials
Set the value to 2.
Step 3: View the experiment details and results
On the AutoML page, click the name of the experiment to go to the Experiment Details page. On the Experiment Details page, you can view the execution progress and status of trials. The system automatically creates five trials for the experiment based on the specified search algorithm and the maximum number of trials.
Click Trials to go to the Trials tab. You can view the trials that are automatically generated for the experiment, together with the execution status, final metric, and hyperparameter combination of each trial. In this example, the Optimization parameter is set to Maximize, so the trial with the highest final metric (0.688894) provides the optimal hyperparameter combination: tree_count: 50 and max_depth: 8.
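After the experiment finishes, you can also compare the evaluation metrics of all trials directly in MaxCompute by querying the metric table across partitions. This query is only a sketch; replace the <exp_id> placeholder with the actual experiment ID shown on the Experiment Details page:

```sql
-- List the evaluation metrics of every trial partition of one experiment.
select pt, threshold, recall, precision, f1_score, auc, ks
from ps_smart_classification_metrics
where pt like '<exp_id>%';  -- <exp_id> is a placeholder, not a real ID
```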