GBDT Regression

Gradient boosting decision tree (GBDT) is an iterative decision tree algorithm that is suitable for linear and nonlinear regression scenarios.

Configure the component

You can use one of the following methods to configure the GBDT Regression component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the GBDT Regression component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.

Tab	Parameter	Description
Fields Setting	Input Columns	The feature columns that are selected from the input table for training. Columns of the DOUBLE and BIGINT types are supported. Note: A maximum of 800 feature columns can be selected.
	Label Column	The label column. The columns of the DOUBLE and BIGINT types are supported.
	Group Column	The group column. Columns of the DOUBLE and BIGINT types are supported. By default, the full table is treated as a single group.
Parameters Setting	Loss Function Type	The type of the loss function. Valid values: Gbrank Loss, Lambdamart DCG Loss, Lambdamart NDCG Loss, and Regression Loss.
	Tau in gbrank loss	This parameter is required only if the Loss Function Type parameter is set to Gbrank Loss. Valid values: [0,1].
	Exponent Base of Gbrank and Regression Loss	This parameter is required only if the Loss Function Type parameter is set to Gbrank Loss or Regression Loss. Valid values: [1,10].
	Metric Type	The metric type. Valid values: NDCG and DCG.
	Number of Decision Trees	The number of trees. Valid values: 1 to 10000.
	Learning Rate	The learning rate. Valid values: (0,1).
	Maximum Leaf Quantity	The maximum number of leaf nodes on each tree. Valid values: 1 to 1000.
	Maximum Decision Tree Depth	The maximum depth of each tree. Valid values: 1 to 11.
	Minimum Sample Quantity on a Leaf Node	The minimum number of samples on each leaf node. Valid values: 1 to 1000.
	Sample Ratio	The proportion of samples that are selected for training. Valid values: (0,1).
	Feature Ratio	The proportion of features that are selected for training. Valid values: (0,1).
	Test Sample Ratio	The proportion of samples that are selected for testing. Valid values: [0,1).
	Random Seed	The random seed. Valid values: [0,10].
	Use Newton-Raphson Method	Specifies whether to use Newton's method.
	Maximum Feature Split Times	The maximum number of splits of each feature. Valid values: 1 to 1000.
Tuning	Number of Computing Cores	The number of cores. The system automatically allocates cores based on the volume of input data.
	Memory Size per Core	The memory size of each core. The system automatically allocates the memory based on the volume of input data. Unit: MB.
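
The Metric Type parameter selects between DCG and NDCG, both of which score a ranked list of labels. The following Python sketch uses one common definition (a gain of 2^rel - 1 with a log2 position discount). The source does not state the exact formula that PAI uses, so the formula and the function names dcg and ndcg are assumptions for illustration only.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if the ideal DCG is 0."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Labels listed in the order in which a model ranked the rows (illustrative data).
ranked_labels = [1, 0, 1, 0]
print(dcg(ranked_labels))
print(ndcg(ranked_labels))
```

NDCG is bounded by 1, which makes it easier to compare across groups of different sizes, whereas DCG grows with the number of relevant rows in a group.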

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name gbdt
    -project algo_public
    -DfeatureSplitValueMaxSize="500"
    -DlossType="0"
    -DrandSeed="0"
    -DnewtonStep="0"
    -Dshrinkage="0.05"
    -DmaxLeafCount="32"
    -DlabelColName="campaign"
    -DinputTableName="bank_data_partition"
    -DminLeafSampleCount="500"
    -DsampleRatio="0.6"
    -DgroupIDColName="age"
    -DmaxDepth="11"
    -DmodelName="xlab_m_GBDT_83602"
    -DmetricType="2"
    -DfeatureRatio="0.6"
    -DinputTablePartitions="pt=20150501"
    -Dtau="0.6"
    -Dp="1"
    -DtestRatio="0.0"
    -DfeatureColNames="previous,cons_conf_idx,euribor3m"
    -DtreeCount="500"

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	N/A
featureColNames	No	The feature columns that are selected from the input table for training. The columns of the DOUBLE and BIGINT types are supported.	All columns of numeric data types
labelColName	Yes	The label column in the input table. The columns of the DOUBLE and BIGINT types are supported.	N/A
inputTablePartitions	No	The partitions that are selected from the input table for training. Specify this parameter in one of the following formats: Partition_name=value for a single partition, or name1=value1/name2=value2 for multi-level partitions. Note: If you specify multiple partitions, separate them with commas (,).	All partitions
modelName	Yes	The name of the output model.	N/A
outputImportanceTableName	No	The name of the table that provides feature importance.	N/A
groupIDColName	No	The name of the group column.	Full table
lossType	No	The type of the loss function. Valid values: 0: GBRank. 1: LAMBDAMART_DCG. 2: LAMBDAMART_NDCG. 3: LEAST_SQUARE.	0
metricType	No	The metric type. Valid values: 0: normalized discounted cumulative gain (NDCG). 1: discounted cumulative gain (DCG). 2: area under the curve (AUC). AUC is deprecated and is suitable only if the label value is 0 or 1.	0
treeCount	No	The number of trees. Valid values: 1 to 10000.	500
shrinkage	No	The learning rate. Valid values: (0,1).	0.05
maxLeafCount	No	The maximum number of leaf nodes on each tree. Valid values: 1 to 1000.	32
maxDepth	No	The maximum depth of each tree. Valid values: 1 to 11.	10
minLeafSampleCount	No	The minimum number of samples on each leaf node. Valid values: 1 to 1000.	500
sampleRatio	No	The proportion of samples selected for training. Valid values: (0,1).	0.6
featureRatio	No	The proportion of features that are selected for training. Valid values: (0,1).	0.6
tau	No	The Tau parameter for the GBRank loss function. Valid values: [0,1].	0.6
p	No	The p parameter for the GBRank loss function. Valid values: [1,10].	1
randSeed	No	The random seed. Valid values: [0,10].	0
newtonStep	No	Specifies whether to use Newton's method. Valid values: 0 and 1.	1
featureSplitValueMaxSize	No	The maximum number of splits of each feature. Valid values: 1 to 1000.	500
lifecycle	No	The lifecycle of the output table.	N/A
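
When a command is assembled programmatically, it can help to check the required parameters before submission. The following Python sketch renders a command string in the same layout as the example above. The helper name build_pai_command is hypothetical and is not part of PAI. The helper only formats text; it does not submit anything.

```python
def build_pai_command(params, name="gbdt", project="algo_public"):
    """Render a PAI command string from a dict of -D parameters (hypothetical helper)."""
    # Parameters marked as required in the parameter table.
    required = ("inputTableName", "labelColName", "modelName")
    missing = [key for key in required if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = ["PAI -name " + name, "    -project " + project]
    for key, value in params.items():
        # Every value is quoted, matching the style of the documented commands.
        lines.append('    -D{}="{}"'.format(key, value))
    return "\n".join(lines)

command = build_pai_command({
    "inputTableName": "gbdt_ls_test_input",
    "labelColName": "label",
    "modelName": "gbdt_ls_test_model",
    "featureColNames": "f0,f1,f2,f3",
    "lossType": 3,    # 3 = LEAST_SQUARE according to the parameter table
    "treeCount": 10,
})
print(command)
```

Parameters that are omitted from the dict fall back to the default values listed in the table.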

Example

Execute the following SQL statements to generate test data:

drop table if exists gbdt_ls_test_input;
create table gbdt_ls_test_input
as
select
    *
from
(
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    union all
        select
            cast(0 as double) as f0,
            cast(1 as double) as f1,
            cast(0 as double) as f2,
            cast(0 as double) as f3,
            cast(0 as bigint) as label
    union all
        select
            cast(0 as double) as f0,
            cast(0 as double) as f1,
            cast(1 as double) as f2,
            cast(0 as double) as f3,
            cast(1 as bigint) as label
    union all
        select
            cast(0 as double) as f0,
            cast(0 as double) as f1,
            cast(0 as double) as f2,
            cast(1 as double) as f3,
            cast(1 as bigint) as label
    union all
        select
            cast(1 as double) as f0,
            cast(0 as double) as f1,
            cast(0 as double) as f2,
            cast(0 as double) as f3,
            cast(0 as bigint) as label
    union all
        select
            cast(0 as double) as f0,
            cast(1 as double) as f1,
            cast(0 as double) as f2,
            cast(0 as double) as f3,
            cast(0 as bigint) as label
) a;

The following test data table gbdt_ls_test_input is generated.

f0	f1	f2	f3	label
1.0	0.0	0.0	0.0	0
0.0	0.0	1.0	0.0	1
0.0	0.0	0.0	1.0	1
0.0	1.0	0.0	0.0	0
1.0	0.0	0.0	0.0	0
0.0	1.0	0.0	0.0	0

Run the following PAI command to train a GBDT regression model on the test data. The lossType parameter is set to 3 (LEAST_SQUARE):

drop offlinemodel if exists gbdt_ls_test_model;
PAI -name gbdt
    -project algo_public
    -DfeatureSplitValueMaxSize="500"
    -DlossType="3"
    -DrandSeed="0"
    -DnewtonStep="1"
    -Dshrinkage="0.5"
    -DmaxLeafCount="32"
    -DlabelColName="label"
    -DinputTableName="gbdt_ls_test_input"
    -DminLeafSampleCount="1"
    -DsampleRatio="1"
    -DmaxDepth="10"
    -DmetricType="0"
    -DmodelName="gbdt_ls_test_model"
    -DfeatureRatio="1"
    -Dp="1"
    -Dtau="0.6"
    -DtestRatio="0"
    -DfeatureColNames="f0,f1,f2,f3"
    -DtreeCount="10"

Run the following PAI command to generate predictions from the trained model by using the Prediction component:

drop table if exists gbdt_ls_test_prediction_result;
PAI -name prediction
    -project algo_public
    -DdetailColName="prediction_detail"
    -DmodelName="gbdt_ls_test_model"
    -DitemDelimiter=","
    -DresultColName="prediction_result"
    -Dlifecycle="28"
    -DoutputTableName="gbdt_ls_test_prediction_result"
    -DscoreColName="prediction_score"
    -DkvDelimiter=":"
    -DinputTableName="gbdt_ls_test_input"
    -DenableSparse="false"
    -DappendColNames="label"

View the prediction result table gbdt_ls_test_prediction_result. The table contains the following data.

label	prediction_result	prediction_score	prediction_detail
0	NULL	0.0	{"label": 0}
0	NULL	0.0	{"label": 0}
1	NULL	0.9990234375	{"label": 0.9990234375}
1	NULL	0.9990234375	{"label": 0.9990234375}
0	NULL	0.0	{"label": 0}
0	NULL	0.0	{"label": 0}
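
The score 0.9990234375 for the rows whose label is 1 is consistent with simple shrinkage arithmetic. The following Python sketch assumes, without confirmation from the source, that boosting starts from a prediction of 0 and that each tree fits the current residual exactly. That assumption is plausible for this data because minLeafSampleCount is 1 and the features separate the labels perfectly. With shrinkage set to 0.5, each of the 10 trees then closes half of the remaining gap. The function name boosted_score is illustrative and is not part of PAI.

```python
def boosted_score(label, shrinkage, tree_count, initial=0.0):
    """Simulate least-squares boosting when every tree fits the residual exactly."""
    prediction = initial
    for _ in range(tree_count):
        residual = label - prediction          # what the next tree is trained to predict
        prediction += shrinkage * residual     # the tree's output is scaled by the learning rate
    return prediction

# shrinkage="0.5" and treeCount="10" are taken from the training command above.
score_for_label_1 = boosted_score(1.0, 0.5, 10)
score_for_label_0 = boosted_score(0.0, 0.5, 10)
print(score_for_label_1)  # 0.9990234375
print(score_for_label_0)  # 0.0
```

Under these assumptions the score for a label of 1 equals 1 - 0.5^10, so more trees or a larger learning rate would move the score closer to 1.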