A parameter server (PS) can process a large number of offline and online training tasks. SMART is short for Scalable Multiple Additive Regression Tree. PS-SMART is an iterative algorithm that implements gradient boosting decision trees (GBDT) on a parameter server. The PS-SMART Binary Classification Training component supports training tasks with tens of billions of samples and hundreds of thousands of features, and can run training tasks on thousands of nodes. The component also supports multiple data formats and optimization techniques such as histogram-based approximation.
Limits
You can use this component based only on the computing resources of MaxCompute.
Usage notes
Only columns of numeric data types can be used by the PS-SMART Binary Classification Training component. In the label column, 0 indicates a negative example and 1 indicates a positive example. If the label data in the MaxCompute table is of the STRING type, you must first convert it to numeric values. For example, convert Good/Bad to 1/0.
If data is in the key-value format, feature IDs must be positive integers, and feature values must be real numbers. If the data type of feature IDs is STRING, you must use the serialization component to serialize the data. If feature values are categorical strings, you must perform feature engineering such as feature discretization to process the values.
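The two preprocessing rules above (string labels converted to 1/0, and sparse features as positive-integer keys with real-number values) can be sketched as follows. This is an illustrative helper, not a PAI API; the sample values and function names are made up for this example.

```python
# Hypothetical preprocessing sketch: convert string labels to the 1/0 that
# PS-SMART expects, and validate sparse key-value features such as "1:0.3 3:0.9".

def convert_label(label: str) -> int:
    """Map a string label such as Good/Bad to 1/0."""
    mapping = {"Good": 1, "Bad": 0}
    return mapping[label]

def parse_sparse_features(kv: str) -> dict:
    """Parse 'key:value' pairs separated by spaces, e.g. '1:0.3 3:0.9'.
    Feature IDs must be positive integers; values must be real numbers."""
    features = {}
    for pair in kv.split():
        key, value = pair.split(":")
        fid = int(key)
        if fid <= 0:
            raise ValueError("feature ID must be a positive integer: %d" % fid)
        features[fid] = float(value)
    return features

print(convert_label("Good"))                  # 1
print(parse_sparse_features("1:0.3 3:0.9"))   # {1: 0.3, 3: 0.9}
```

If the feature IDs in your table are strings, the serialization component mentioned above performs the STRING-to-integer mapping for you; this sketch only shows the target format.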
The PS-SMART Binary Classification Training component supports training tasks with hundreds of thousands of features. However, such tasks are resource-intensive and time-consuming. GBDT algorithms are well suited to training on continuous features. You can perform one-hot encoding on categorical features to filter out low-frequency features. However, we recommend that you do not discretize continuous features of numeric data types.
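One-hot encoding with low-frequency filtering, as recommended above for categorical features, can be sketched as below. This is a minimal illustration, not part of PAI; in practice you would use PAI's feature engineering components or a library.

```python
# Illustrative sketch: one-hot encode a categorical feature and drop
# categories that occur fewer than min_count times, so low-frequency
# features are filtered out before training.
from collections import Counter

def one_hot_filtered(values, min_count=2):
    counts = Counter(values)
    # Keep only categories frequent enough to be useful features.
    kept = sorted(c for c, n in counts.items() if n >= min_count)
    index = {c: i for i, c in enumerate(kept)}
    rows = []
    for v in values:
        row = [0] * len(kept)
        if v in index:
            row[index[v]] = 1
        rows.append(row)
    return kept, rows

kept, rows = one_hot_filtered(["red", "blue", "red", "green", "blue"])
print(kept)     # ['blue', 'red'] -- 'green' occurs once and is dropped
print(rows[0])  # [0, 1]
```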
The PS-SMART algorithm may introduce randomness. For example, randomness may be introduced in the following scenarios: data and feature sampling based on data_sample_ratio and fea_sample_ratio, optimization of the PS-SMART algorithm by using histograms for approximation, and merging local sketches into a global sketch. When tasks run on multiple workers in distributed mode, the resulting tree structures can differ, but the trained models are theoretically equivalent in effect. It is therefore normal to obtain different results even if you use the same data and parameters during training.
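The sampling-related sources of randomness can be illustrated with a toy sketch. The names mirror the data_sample_ratio parameter above, but the sampling logic here is a single-machine illustration, not the PS-SMART implementation.

```python
# Toy illustration of seeded row sampling. A fixed seed is reproducible on
# one worker, but distributed execution or a different seed changes which
# rows each tree sees, so tree structures can differ between runs.
import random

def sample_rows(n_rows, data_sample_ratio, seed):
    rng = random.Random(seed)
    k = int(n_rows * data_sample_ratio)
    return sorted(rng.sample(range(n_rows), k))

# The same seed reproduces the same subset...
assert sample_rows(10, 0.5, seed=42) == sample_rows(10, 0.5, seed=42)
# ...but a different seed generally selects a different subset.
print(sample_rows(10, 0.5, seed=42))
print(sample_rows(10, 0.5, seed=7))
```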
If you want to accelerate training, you can set the Cores parameter to a larger value. However, the PS-SMART algorithm starts training only after all requested resources are allocated. Therefore, the more resources you request, the longer you may wait for the task to start.
Configure the component
You can use one of the following methods to configure the component.
Method 1: Use the console
Configure the component parameters in Machine Learning Designer. The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Use Sparse Format | Specifies whether the input data is sparse. If the input data is sparse data in the key-value format, separate key-value pairs with spaces and separate keys from values with colons (:). Example: 1:0.3 3:0.9. |
| | Feature Columns | The feature columns selected from the input table for training. If data in the input table is dense, only columns of the BIGINT and DOUBLE types are supported. If data in the input table is sparse key-value pairs whose keys and values are of numeric data types, only columns of the STRING type are supported. |
| | Label Column | The label column in the input table. Columns of the STRING type and numeric data types are supported, but the data they contain must be numeric, such as the 0 and 1 used in binary classification. |
| | Weight Column | The column that contains the weight of each row of samples. Columns of numeric data types are supported. |
| Parameters Setting | Evaluation Indicator Type | The evaluation metric type of the training set. |
| | Trees | The number of trees. The value must be an integer. The number of trees is proportional to the training time. |
| | Maximum Tree Depth | The maximum tree depth. The value must be a positive integer. The default value is 5, which allows a maximum of 32 (2^5) leaf nodes. |
| | Data Sampling Fraction | The ratio at which data is sampled when trees are built. The sampled data is used to build a weak learner to accelerate training. |
| | Feature Sampling Fraction | The ratio at which features are sampled when trees are built. The sampled features are used to build a weak learner to accelerate training. |
| | L1 Penalty Coefficient | Controls the size of leaf nodes. A larger value results in a more even distribution of leaf nodes. If overfitting occurs, increase this value. |
| | L2 Penalty Coefficient | Controls the size of leaf nodes. A larger value results in a more even distribution of leaf nodes. If overfitting occurs, increase this value. |
| | Learning Rate | The learning rate. Valid values: (0,1). |
| | Sketch-based Approximate Precision | The threshold for selecting quantiles when a sketch is built. A smaller value produces more bins. In most cases, the default value 0.03 is used. |
| | Minimum Split Loss Change | The minimum loss change required to split a node. A larger value makes node splitting less likely. |
| | Features | The number of features or the maximum feature ID. This value is used to estimate resource usage. If you do not specify this parameter, the system automatically runs an SQL task to calculate the value. |
| | Global Offset | The initial prediction value of all samples. |
| | Random Seed | The random seed. The value must be an integer. |
| | Feature Importance Type | The feature importance type. |
| Tuning | Cores | The number of cores. By default, the system determines the value. |
| | Memory Size per Core | The memory size of each core. Unit: MB. In most cases, the system determines the value. |
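The Trees, Learning Rate, and Global Offset parameters combine in the standard GBDT way: each tree's output is shrunk by the learning rate and added to the initial prediction. The sketch below uses made-up tree outputs, not trees trained by PS-SMART, and for simplicity it applies the offset directly in raw-score space.

```python
# Toy sketch of how an ensemble prediction is assembled from per-tree
# outputs, a learning rate (shrinkage), and a global offset.
import math

def gbdt_raw_score(tree_outputs, learning_rate, global_offset):
    # Each tree's leaf output is shrunk by the learning rate and added
    # to the initial prediction for the sample.
    return global_offset + sum(learning_rate * t for t in tree_outputs)

def sigmoid(x):
    # binary:logistic maps the raw score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

raw = gbdt_raw_score([0.8, -0.2, 0.5], learning_rate=0.3, global_offset=0.5)
prob = sigmoid(raw)
print(round(raw, 2))  # 0.83
print(round(prob, 2))
```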
Method 2: Use PAI commands
Configure the component parameters by using Platform for AI (PAI) commands. The following section describes the parameters. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
# Training
PAI -name ps_smart
-project algo_public
-DinputTableName="smart_binary_input"
-DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
-DoutputTableName="pai_temp_24515_545859_2"
-DoutputImportanceTableName="pai_temp_24515_545859_3"
-DlabelColName="label"
-DfeatureColNames="f0,f1,f2,f3,f4,f5"
-DenableSparse="false"
-Dobjective="binary:logistic"
-Dmetric="error"
-DfeatureImportanceType="gain"
-DtreeCount="5"
-DmaxDepth="5"
-Dshrinkage="0.3"
-Dl2="1.0"
-Dl1="0"
-Dlifecycle="3"
-DsketchEps="0.03"
-DsampleRatio="1.0"
-DfeatureRatio="1.0"
-DbaseScore="0.5"
-DminSplitLoss="0";
# Prediction
PAI -name prediction
-project algo_public
-DinputTableName="smart_binary_input"
-DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
-DoutputTableName="pai_temp_24515_545860_1"
-DfeatureColNames="f0,f1,f2,f3,f4,f5"
-DappendColNames="label,qid,f0,f1,f2,f3,f4,f5"
-DenableSparse="false"
-Dlifecycle="28";
| Module | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Data parameters | featureColNames | Yes | The feature columns selected from the input table for training. If data in the input table is dense, only columns of the BIGINT and DOUBLE types are supported. If data in the input table is sparse data in the key-value format and keys and values are of numeric data types, only columns of the STRING type are supported. | None |
| | labelColName | Yes | The label column in the input table. Columns of the STRING type and numeric data types are supported, but the data they contain must be numeric. For example, column values can be 0 or 1 in binary classification. | None |
| | weightCol | No | The column that contains the weight of each row of samples. Columns of numeric data types are supported. | None |
| | enableSparse | No | Specifies whether the input data is sparse. Valid values: true and false. If the input data is sparse data in the key-value format, separate key-value pairs with spaces and separate keys from values with colons (:). Example: 1:0.3 3:0.9. | false |
| | inputTableName | Yes | The name of the input table. | None |
| | modelName | Yes | The name of the output model. | None |
| | outputImportanceTableName | No | The name of the table that stores feature importance. | None |
| | inputTablePartitions | No | The partitions selected from the input table for training. Format: ds=1/pt=1. | None |
| | outputTableName | No | The MaxCompute table to which the model is written. The model is stored in a binary format that cannot be read directly and can be used only by the PS-SMART prediction component. | None |
| | lifecycle | No | The lifecycle of the output table. Unit: days. | 3 |
| Algorithm parameters | objective | Yes | The type of the objective function. For binary classification training, set this parameter to binary:logistic. | None |
| | metric | No | The evaluation metric type of the training set, which is written to stdout of the coordinator in LogView. | None |
| | treeCount | No | The number of trees. The value is proportional to the training time. | 1 |
| | maxDepth | No | The maximum depth of a tree. The value must be a positive integer. Valid values: 1 to 20. | 5 |
| | sampleRatio | No | The data sampling ratio. Valid values: (0,1]. If you set this parameter to 1.0, no sampling is performed. | 1.0 |
| | featureRatio | No | The feature sampling ratio. Valid values: (0,1]. If you set this parameter to 1.0, no sampling is performed. | 1.0 |
| | l1 | No | The L1 penalty coefficient. A larger value results in a more even distribution of leaf nodes. If overfitting occurs, increase this value. | 0 |
| | l2 | No | The L2 penalty coefficient. A larger value results in a more even distribution of leaf nodes. If overfitting occurs, increase this value. | 1.0 |
| | shrinkage | No | The learning rate. Valid values: (0,1). | 0.3 |
| | sketchEps | No | The threshold for selecting quantiles when a sketch is built. The number of bins is O(1.0/sketchEps). A smaller value produces more bins. In most cases, the default value is used. Valid values: (0,1). | 0.03 |
| | minSplitLoss | No | The minimum loss change required to split a node. A larger value makes node splitting less likely. | 0 |
| | featureNum | No | The number of features or the maximum feature ID. This value is used to estimate resource usage. If you do not specify this parameter, the system automatically runs an SQL task to calculate the value. | None |
| | baseScore | No | The initial prediction value of all samples. | 0.5 |
| | randSeed | No | The random seed. The value must be an integer. | None |
| | featureImportanceType | No | The feature importance type. | gain |
| Tuning parameters | coreNum | No | The number of cores used in computing. A larger value speeds up the algorithm. | Automatically allocated |
| | memSizePerCore | No | The memory size of each core. Unit: MB. | Automatically allocated |
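The sketchEps relationship above (the number of bins is O(1.0/sketchEps)) can be illustrated with a toy single-machine quantile sketch. The function below is a simplified stand-in for PS-SMART's distributed sketch; the real algorithm merges per-worker sketches into a global one.

```python
# Toy histogram-based approximation: pick roughly 1/sketch_eps quantile
# cut points for one feature. A smaller sketch_eps yields more bins and
# therefore more candidate split points.

def quantile_cuts(values, sketch_eps):
    n_bins = int(1.0 / sketch_eps)
    data = sorted(values)
    # One candidate cut per quantile step; the cuts partition the
    # feature range into n_bins histogram bins.
    return [data[min(len(data) - 1, int(q * len(data) / n_bins))]
            for q in range(1, n_bins)]

values = [float(i) for i in range(1000)]
cuts = quantile_cuts(values, sketch_eps=0.25)  # coarse sketch: ~4 bins
print(cuts)  # [250.0, 500.0, 750.0]
```

With the default sketchEps of 0.03, the same logic would produce on the order of 33 bins per feature.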
Example
Execute the following SQL statements on an ODPS SQL node to generate training data. In this example, dense training data is generated.
drop table if exists smart_binary_input;
create table smart_binary_input lifecycle 3 as
select * from (
    select 0.72 as f0, 0.42 as f1, 0.55 as f2, -0.09 as f3, 1.79 as f4, -1.2 as f5, 0 as label union all
    select 1.23 as f0, -0.33 as f1, -1.55 as f2, 0.92 as f3, -0.04 as f4, -0.1 as f5, 1 as label union all
    select -0.2 as f0, -0.55 as f1, -1.28 as f2, 0.48 as f3, -1.7 as f4, 1.13 as f5, 1 as label union all
    select 1.24 as f0, -0.68 as f1, 1.82 as f2, 1.57 as f3, 1.18 as f4, 0.2 as f5, 0 as label union all
    select -0.85 as f0, 0.19 as f1, -0.06 as f2, -0.55 as f3, 0.31 as f4, 0.08 as f5, 1 as label union all
    select 0.58 as f0, -1.39 as f1, 0.05 as f2, 2.18 as f3, -0.02 as f4, 1.71 as f5, 0 as label union all
    select -0.48 as f0, 0.79 as f1, 2.52 as f2, -1.19 as f3, 0.9 as f4, -1.04 as f5, 1 as label union all
    select 1.02 as f0, -0.88 as f1, 0.82 as f2, 1.82 as f3, 1.55 as f4, 0.53 as f5, 0 as label union all
    select 1.19 as f0, -1.18 as f1, -1.1 as f2, 2.26 as f3, 1.22 as f4, 0.92 as f5, 0 as label union all
    select -2.78 as f0, 2.33 as f1, 1.18 as f2, -4.5 as f3, -1.31 as f4, -1.8 as f5, 1 as label
) tmp;
The following figure shows the generated training data.
Create the pipeline shown in the following figure and run the component. For more information, see Algorithm modeling.
In the left-side component list of Machine Learning Designer, separately search for the Read Table, PS-SMART Binary Classification Training, Prediction, and Write Table components, and drag the components to the canvas on the right.
Connect nodes by drawing lines to organize the nodes into a pipeline that includes upstream and downstream relationships based on the preceding figure.
Configure the component parameters.
On the canvas, click the Read Table-1 component. On the Select Table tab in the right pane, set Table Name to smart_binary_input.
On the canvas, click the PS-SMART Binary Classification Training-1 component and configure the parameters listed in the following table in the right pane. Retain the default values for the parameters that are not listed in the table.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | Select the f0, f1, f2, f3, f4, and f5 columns as the feature columns. |
| | Label Column | Select the label column. |
| Parameters Setting | Evaluation Indicator Type | Set the evaluation metric type to AUC for Classification. |
| | Trees | Set this parameter to 5. |
On the canvas, click the Prediction-1 component. On the Field Settings tab in the right pane, select Select All for Reserved Columns. Retain the default values for the remaining parameters.
On the canvas, click the Write Table-1 component. On the Select Table tab in the right pane, set New Table Name to smart_binary_output.
After the parameter configuration is complete, click the run button to run the pipeline.
Right-click the Prediction-1 component and choose the shortcut menu option for viewing data to view the prediction results. In the prediction_detail column, 1 indicates a positive example and 0 indicates a negative example.
Right-click the PS-SMART Binary Classification Training-1 component and choose the shortcut menu option for viewing data to view the feature importance table. The table contains the following columns:
id: the ID of an input feature. In this example, the f0, f1, f2, f3, f4, and f5 features are passed in. Therefore, in the id column, 0 represents the f0 feature column and 4 represents the f4 feature column. If data in the input table is in the key-value format, the id column lists the keys of the key-value pairs.
value: the feature importance. The default importance type is gain, which indicates the sum of the information gains that the feature provides to the model.
The preceding feature importance table contains only three features, which indicates that only these three features are used to split trees. The importance of the other features can be considered 0.
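Gain-type importance, as described above, sums the loss reduction of every split that uses a feature. The following sketch computes it over a made-up list of splits; the split data is illustrative, not output of the component.

```python
# Illustrative gain-based feature importance: accumulate the gain of each
# split by the ID of the feature that the split uses. Features that never
# split a node simply do not appear, i.e. their importance is 0.
from collections import defaultdict

def gain_importance(splits):
    importance = defaultdict(float)
    for feature_id, gain in splits:
        importance[feature_id] += gain
    return dict(importance)

# Hypothetical splits: only features 0, 2, and 4 are used.
splits = [(0, 1.5), (2, 0.75), (0, 0.5), (4, 0.25)]
print(gain_importance(splits))  # {0: 2.0, 2: 0.75, 4: 0.25}
```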
PS-SMART model deployment
If you want to deploy the model generated by the PS-SMART Binary Classification Training component to EAS as an online service, you must add the Model export component as a downstream node for the PS-SMART Binary Classification Training component and configure the Model export component. For more information, see Model export.
After the Model export component is successfully run, you can deploy the generated model to EAS as an online service on the EAS-Online Model Services page. For more information, see Model service deployment by using the PAI console.
References
For more information about the components provided by Machine Learning Designer, see Overview of Machine Learning Designer.
Machine Learning Designer provides various preset algorithm components. You can select a component to process data based on your actual business scenario. For more information, see Component reference: Overview of all components.