XGBoost is an extension of the gradient boosting algorithm that provides improved usability and robustness, and it is widely used in production machine learning systems and machine learning competitions. XGBoost can be used for both classification and regression tasks. The XGboost Train component is optimized based on the open source XGBoost library to improve its ease of use and its compatibility with Platform for AI (PAI). This topic describes how to configure the XGboost Train component.
Limits
You can run the XGboost Train component on MaxCompute, Flink, and Deep Learning Containers (DLC) resources.
Data formats
Table and LibSVM formats are supported.
Sample table-formatted data:
f0 | f1 | label |
0.1 | 1 | 0 |
0.9 | 2 | 1 |
Sample LibSVM-formatted data:
1 2:1 9:1 10:1 20:1 29:1 33:1 35:1 39:1 40:1 52:1 57:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 104:1 116:1 123:1
0 0:1 9:1 18:1 20:1 23:1 33:1 35:1 38:1 41:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 105:1 115:1 121:1
1 2:1 8:1 18:1 20:1 29:1 33:1 35:1 39:1 41:1 52:1 57:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 104:1 116:1 123:1
0 2:1 9:1 13:1 21:1 28:1 33:1 36:1 38:1 40:1 53:1 57:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 97:1 105:1 113:1 119:1
0 0:1 9:1 18:1 20:1 22:1 33:1 35:1 38:1 44:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 104:1 115:1 121:1
0 0:1 8:1 18:1 20:1 23:1 33:1 35:1 38:1 41:1 52:1 55:1 64:1 68:1 76:1 85:1 87:1 91:1 94:1 101:1 105:1 116:1 121:1
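If you want to inspect LibSVM-formatted data outside of PAI, the following Python sketch shows one way to load it with the open source xgboost library. The file name sample.libsvm is a placeholder for a local file that contains rows such as the preceding sample data.
# A minimal sketch, assuming a local file named sample.libsvm that contains
# rows in the LibSVM format shown above (a label followed by index:value pairs).
import xgboost as xgb

# xgboost parses LibSVM text files directly into a DMatrix.
dtrain = xgb.DMatrix("sample.libsvm?format=libsvm")
print(dtrain.num_row(), dtrain.num_col())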
Configure the component in the PAI console
You can configure the XGboost Train component in Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Type | Description |
Field Setting | labelCol | String | The label column. |
featureCols | String array | The feature columns in table format. The featureCols and vectorCol parameters are mutually exclusive. The input data must be of the table type. | |
vectorCol | String | The vector column in LibSVM format. The featureCols and vectorCol parameters are mutually exclusive. The input data must be of the LibSVM type. | |
weightCol | String | The name of the weight column. | |
Set the model file path | String | The Object Storage Service (OSS) path in which the trained model is stored. | |
Parameter Setting | The number of rounds for boosting | Integer | The number of training rounds. |
objective | String | The learning task and the corresponding learning objective. Default value: binary:logistic. | |
Base score | Floating-point number | The global bias, which is the initial prediction score of all instances. Default value: 0.5. | |
The number of classes | Integer | The number of classes. | |
Tree Method | String | The tree construction algorithm. Default value: auto. Valid values: auto, exact, approx, and hist. | |
L1 regularization term on weights | Floating-point number | The L1 regularization term on weights. Default value: 0.0. | |
L2 regularization term on weights | Floating-point number | The L2 regularization term on weights. Default value: 1.0. | |
eta | Floating-point number | The learning rate. Default value: 0.3. | |
scale_pos_weight | Floating-point number | The balance between positive and negative weights. Default value: 1.0. | |
sketch_eps | Floating-point number | You can use this parameter to control the number of bins. This parameter is available only if you set the Tree Method parameter to approx. Default value: 0.03. | |
Maximum number of discrete bins to bucket continuous features | Integer | The maximum number of discrete bins to bucket continuous features. This parameter is available only if you set the Tree Method parameter to hist. Default value: 256. | |
Maximum depth of a tree | Integer | The maximum depth of a tree. Default value: 6. | |
Max leaves | Integer | The maximum number of leaf nodes that you want to add. Default value: 0. | |
Min child weight | Floating-point number | The minimum weight required in a child node. Default value: 1.0. | |
Max delta step | Floating-point number | The maximum delta step allowed for a leaf node. This parameter allows you to adjust the granularity of the model. Default value: 0.0. | |
Subsample ratio of the training instances | Floating-point number | The subsample ratio of the training instances. Default value: 1.0. | |
Sampling method | String | The method used to sample the training instances. Default value: GRADIENT_BASED. Valid values: UNIFORM and GRADIENT_BASED. | |
Subsample ratio of columns for each level | Floating-point number | The subsample ratio of columns for each level. Default value: 1.0. | |
Subsample ratio of columns for each node (split) | Floating-point number | The subsample ratio of columns for each node. Default value: 1.0. | |
Subsample ratio of columns when constructing each tree | Floating-point number | The subsample ratio of columns when each tree is constructed. Default value: 1.0. | |
Grow Policy | String | The method used to add new nodes to the tree. Default value: depthwise. Valid values: depthwise and lossguide. | |
gamma | Floating-point number | The minimum loss reduction required to make a subsequent partition on a leaf node of the tree. Default value: 0.0. | |
Interaction constraints | String | The groups of variables that are allowed to interact. | |
Monotone constraints | String | The monotonicity constraints of a feature. | |
Tweedie variance power | Floating-point number | The power parameter that controls the variance of the Tweedie distribution. This parameter takes effect only when a Tweedie objective is used. Default value: 1.5. | |
Execution Tuning | Number of Workers | Positive integer | The number of worker nodes. This parameter must be used together with the Memory per worker, unit MB parameter. Valid values: [1, 9999]. |
Memory per worker, unit MB | Positive integer | The memory size of each worker node. Unit: MB. Valid values: [1024, 64 × 1024]. | |
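Most parameters in the preceding table map directly to parameters of the open source XGBoost library on which the component is based. The following Python sketch is for reference only: it shows roughly equivalent settings passed to xgboost.train, with randomly generated placeholder data instead of an input table, and it is not the PAI component itself.
# Illustrative only: approximate open source equivalents of the console parameters.
# X and y are random placeholder arrays; the PAI component reads tables instead.
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 2)             # feature columns, for example f0 and f1
y = np.random.randint(0, 2, size=100)  # label column

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",   # objective
    "base_score": 0.5,                # Base score
    "tree_method": "auto",            # Tree Method
    "alpha": 0.0,                     # L1 regularization term on weights
    "lambda": 1.0,                    # L2 regularization term on weights
    "eta": 0.3,                       # eta
    "scale_pos_weight": 1.0,          # scale_pos_weight
    "max_depth": 6,                   # Maximum depth of a tree
    "min_child_weight": 1.0,          # Min child weight
    "subsample": 1.0,                 # Subsample ratio of the training instances
    "colsample_bytree": 1.0,          # Subsample ratio of columns when constructing each tree
    "grow_policy": "depthwise",       # Grow Policy
    "gamma": 0.0,                     # gamma
}
# The number of rounds for boosting corresponds to num_boost_round.
booster = xgb.train(params, dtrain, num_boost_round=100)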
Procedure
This example uses a Higgs boson classification scenario to describe how to use the XGboost component in Machine Learning Designer. The pipeline used in this example is built based on a preset template. For information about how to create a pipeline based on the Use XGBoost algorithm to identify the Higgs boson template, see Create a pipeline from a preset template.
The XGboost Predict component outputs JSON strings that are serialized from the JSON objects generated by the open source XGBoost library. To evaluate this output, you must convert it to a format that an evaluation component supports. In this example, you can add an SQL Script component as a downstream component of the XGboost Predict component to convert its output to a format that the subsequent Binary Classification Evaluation V2 component supports. The following sample code shows how to configure the SQL Script component to convert the data format. For more information, see XGBoost Parameters.
-- Enable the new implementation of the get_json_object function in MaxCompute SQL.
set odps.sql.udf.getjsonobj.new=true;
-- Extract the positive-class probability from the JSON prediction result and
-- build a detail column in the format expected by the evaluation component.
SELECT *, CONCAT('{"0":', 1.0 - prob, ',"1":', prob, '}') AS detail
FROM (
    SELECT *, CAST(get_json_object(pred, '$[0]') AS DOUBLE) AS prob FROM ${t1}
) t;
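For example, assuming the pred value of a row is the JSON string [0.75], the inner query extracts prob = 0.75 and the outer query produces the detail string {"0":0.25,"1":0.75}, which the Binary Classification Evaluation V2 component can parse.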
References
You can use the XGboost Predict component to perform offline inference based on the model trained by the XGboost Train component. For information about how to configure the XGboost Predict component, see XGboost Predict.
Machine Learning Designer provides various preset algorithm components. You can select a component to process data based on your actual business scenario. For more information, see Component reference: Overview of all components.