A parameter server (PS) is used to process a large number of offline and online training jobs. Scalable Multiple Additive Regression Tree (SMART) is an iterative gradient boosting decision tree (GBDT) algorithm implemented on a PS architecture. The PS-SMART Multiclass Classification component of Platform for AI (PAI) supports training jobs with tens of billions of samples and hundreds of thousands of features, and can run training jobs on thousands of nodes. The component also supports multiple data formats and optimization technologies, such as histogram-based approximation.
Limits
The input data of the PS-SMART Multiclass Classification component must meet the following requirements:
Data in the destination columns must be of numeric data types. If the data type in the MaxCompute table is STRING, the data must be converted into a numeric data type. For example, if the classification object is a string, such as Good/Medium/Bad, you must convert the string into 0/1/2.
If the data is in the key-value format, feature IDs must be positive integers and feature values must be real numbers. If the feature IDs are of the STRING type, you must use the serialization component to serialize the data. If the feature values are categorical strings, you must perform feature engineering, such as feature discretization, to process the values.
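To make the format rules above concrete, the following is a minimal Python sketch that validates a sparse key-value feature string and maps string labels to numeric classes. The function and mapping names (`parse_kv_features`, `LABEL_MAP`) are illustrative only and are not part of PAI.

```python
# Illustrative check of the key-value format rules above: feature IDs must be
# positive integers, and feature values must be real numbers.
def parse_kv_features(kv_string):
    """Parse a sparse feature string such as '1:0.3 3:0.9' into a dict."""
    features = {}
    for pair in kv_string.split():
        key, value = pair.split(":")
        feature_id = int(key)  # raises ValueError for non-integer feature IDs
        if feature_id <= 0:
            raise ValueError("feature IDs must be positive integers")
        features[feature_id] = float(value)  # raises ValueError for non-numeric values
    return features

# Map string labels such as Good/Medium/Bad to the numeric classes 0/1/2.
LABEL_MAP = {"Good": 0, "Medium": 1, "Bad": 2}

print(parse_kv_features("1:0.3 3:0.9"))  # {1: 0.3, 3: 0.9}
print(LABEL_MAP["Medium"])               # 1
```

In practice, you would perform the equivalent conversion in MaxCompute SQL before passing the table to the component; this sketch only shows the target shape of the data.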
Usage notes
When you use the PS-SMART Multiclass Classification component, take note of the following items:
The PS-SMART Multiclass Classification component supports training jobs with hundreds of thousands of features. However, such jobs are resource-intensive and time-consuming. GBDT algorithms are suitable for scenarios in which continuous features are used for training. You can perform one-hot encoding on categorical features to filter out low-frequency features. We recommend that you do not perform feature discretization on continuous features of numeric data types.
The PS-SMART algorithm may introduce randomness in the following scenarios: data and feature sampling based on data_sample_ratio and fea_sample_ratio, histogram-based approximation in the PS-SMART optimization, and merging of local sketches into a global sketch. As a result, the structures of the trees may vary when jobs run on multiple worker nodes in distributed mode, and you may obtain different results even if you use the same data and parameters during training. However, the training effect of the resulting models is theoretically the same.
If you want to accelerate training, you can set the Cores parameter to a larger value. The PS-SMART algorithm starts training jobs after the required resources are provided. The waiting period increases with the amount of the requested resources.
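The one-hot encoding recommendation above can be sketched in a few lines of Python. This is an illustrative example only; the function name and the frequency threshold are assumptions for this sketch, not PAI APIs.

```python
from collections import Counter

# Illustrative sketch: one-hot encode a categorical feature and drop
# low-frequency categories, as recommended above.
def one_hot_filtered(values, min_count=2):
    counts = Counter(values)
    # Keep only categories that appear at least min_count times.
    kept = sorted(c for c, n in counts.items() if n >= min_count)
    index = {c: i for i, c in enumerate(kept)}
    rows = []
    for v in values:
        row = [0] * len(kept)
        if v in index:  # low-frequency categories encode as all zeros
            row[index[v]] = 1
        rows.append(row)
    return kept, rows

kept, rows = one_hot_filtered(["red", "blue", "red", "green", "blue"])
print(kept)  # ['blue', 'red'] -- 'green' appears only once and is filtered out
print(rows)  # [[0, 1], [1, 0], [0, 1], [0, 0], [1, 0]]
```

Continuous numeric features should be passed to the component as-is, without this kind of discretization.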
Configure the component
You can use one of the following methods to configure the PS-SMART Multiclass Classification component.
Method 1: Configure the component in the PAI console
Configure the component on the pipeline page of Machine Learning Designer. The following table describes the parameters.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Use Sparse Format | Specifies whether the input data is in the sparse format. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9. |
| | Feature Columns | The feature columns that are selected from the input table for training. If the data in the input table is in the dense format, only columns of the BIGINT and DOUBLE types are supported. If the data in the input table is key-value pairs in the sparse format, and keys and values are of numeric data types, only columns of the STRING type are supported. |
| | Label Column | The label column in the input table. Columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be {0,1,2,...,n-1} in multiclass classification, where n is the number of classes. |
| | Weight Column | The column that contains the weight of each row of samples. Columns of numeric data types are supported. |
| Parameters Setting | Classes | The number of classes for multiclass classification. If you set this parameter to n, the valid values of the label column are {0,1,2,...,n-1}. |
| | Evaluation Indicator Type | The evaluation metric. Valid values: Multiclass Negative Log Likelihood and Multiclass Classification Error. |
| | Trees | The number of trees. The value must be a positive integer. The training duration is proportional to the value. |
| | Maximum Decision Tree Depth | The maximum depth of a tree. Default value: 5, which indicates that a tree can have up to 32 leaf nodes. |
| | Data Sampling Ratio | The data sampling ratio that is used when trees are built. The sampled data is used to build a weak learner to accelerate training. |
| | Feature Sampling Fraction | The feature sampling ratio that is used when trees are built. The sampled features are used to build a weak learner to accelerate training. |
| | L1 Penalty Coefficient | The L1 penalty coefficient, which controls the size of leaf node weights. A larger value indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value. |
| | L2 Penalty Coefficient | The L2 penalty coefficient, which controls the size of leaf node weights. A larger value indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value. |
| | Learning Rate | The learning rate. Valid values: (0,1). |
| | Sketch-based Approximate Precision | The threshold for selecting quantiles when a sketch is built. A smaller value indicates that more bins are obtained. In most cases, the default value 0.03 is used. |
| | Minimum Split Loss Change | The minimum loss change that is required to split a node. A larger value indicates a lower probability of node splitting. |
| | Features | The number of features or the maximum feature ID. Configure this parameter if you want to assess resource usage. |
| | Global Offset | The initial prediction value of all samples. |
| | Random Seed | The random seed. The value must be an integer. |
| | Feature Importance Type | The type of feature importance. Valid values: Weight, Gain, and Cover. Weight indicates the number of times a feature is used to split nodes. Gain indicates the information gain provided by the feature. Cover indicates the number of samples covered by the feature on the split nodes. |
| Tuning | Cores | The number of cores. By default, the system determines the value. |
| | Memory Size per Core (MB) | The memory size of each core. Unit: MB. In most cases, the system determines the value. |
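Two of the parameter relationships described above can be verified with quick arithmetic: a binary tree of depth d has up to 2^d leaf nodes, and the number of histogram bins grows on the order of 1/sketchEps. A small illustrative check:

```python
# Back-of-envelope checks for two parameter relationships described above.
# A binary tree of depth d has up to 2**d leaf nodes.
max_depth = 5
max_leaves = 2 ** max_depth
print(max_leaves)  # 32

# The number of histogram bins is on the order of 1 / sketch_eps,
# so a smaller sketch precision value yields more bins.
sketch_eps = 0.03
approx_bins = int(1.0 / sketch_eps)
print(approx_bins)  # 33
```

This is why increasing Maximum Decision Tree Depth or decreasing Sketch-based Approximate Precision increases the computational cost of training.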
Method 2: Configure the component by using PAI commands
The following table describes the parameters that are used in PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
# Training
PAI -name ps_smart
-project algo_public
-DinputTableName="smart_multiclass_input"
-DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
-DoutputTableName="pai_temp_24515_545859_2"
-DoutputImportanceTableName="pai_temp_24515_545859_3"
-DlabelColName="label"
-DfeatureColNames="features"
-DenableSparse="true"
-DclassNum="3"
-Dobjective="multi:softprob"
-Dmetric="mlogloss"
-DfeatureImportanceType="gain"
-DtreeCount="5"
-DmaxDepth="5"
-Dshrinkage="0.3"
-Dl2="1.0"
-Dl1="0"
-Dlifecycle="3"
-DsketchEps="0.03"
-DsampleRatio="1.0"
-DfeatureRatio="1.0"
-DbaseScore="0.5"
-DminSplitLoss="0";
# Prediction
PAI -name prediction
-project algo_public
-DinputTableName="smart_multiclass_input"
-DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
-DoutputTableName="pai_temp_24515_545860_1"
-DfeatureColNames="features"
-DappendColNames="label,features"
-DenableSparse="true"
-DkvDelimiter=":"
-Dlifecycle="28";
| Module | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Data parameters | featureColNames | Yes | The feature columns that are selected from the input table for training. If data in the input table is in the dense format, only columns of the BIGINT and DOUBLE types are supported. If data in the input table is sparse data in the key-value format, and keys and values are of numeric data types, only columns of the STRING type are supported. | N/A |
| | labelColName | Yes | The label column in the input table. Columns of the STRING type and numeric data types are supported. However, only data of numeric data types can be stored in the columns. For example, column values can be {0,1,2,…,n-1} in multiclass classification, where n is the number of classes. | N/A |
| | weightCol | No | The column that contains the weight of each row of samples. Columns of numeric data types are supported. | N/A |
| | enableSparse | No | Specifies whether the input data is in the sparse format. Valid values: true and false. If the input data is sparse data in the key-value format, separate key-value pairs with spaces, and separate keys and values with colons (:). Example: 1:0.3 3:0.9. | false |
| | inputTableName | Yes | The name of the input table. | N/A |
| | modelName | Yes | The name of the output model. | N/A |
| | outputImportanceTableName | No | The name of the table that contains feature importance. | N/A |
| | inputTablePartitions | No | The partitions that are selected from the input table for training. Format: ds=1/pt=1. | N/A |
| | outputTableName | No | The MaxCompute table that is generated to store the output model. The table stores binary content that cannot be read directly and can be used only by the PS-SMART prediction component. | N/A |
| | lifecycle | No | The lifecycle of the output table. | 3 |
| Algorithm parameters | classNum | Yes | The number of classes for multiclass classification. If you set this parameter to n, the valid values of the label column are {0,1,2,...,n-1}. | N/A |
| | objective | Yes | The type of the objective function. For multiclass classification training, set this parameter to multi:softprob. | N/A |
| | metric | No | The evaluation metric type of the training set, which is displayed in the stdout of the coordinator in a LogView. Valid values: mlogloss (multiclass negative log likelihood) and merror (multiclass classification error). | N/A |
| | treeCount | No | The number of trees. The training duration is proportional to the value. | 1 |
| | maxDepth | No | The maximum depth of a tree. Valid values: 1 to 20. | 5 |
| | sampleRatio | No | The data sampling ratio. Valid values: (0,1]. If you set this parameter to 1.0, no data is sampled. | 1.0 |
| | featureRatio | No | The feature sampling ratio. Valid values: (0,1]. If you set this parameter to 1.0, no features are sampled. | 1.0 |
| | l1 | No | The L1 penalty coefficient. A larger value indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value. | 0 |
| | l2 | No | The L2 penalty coefficient. A larger value indicates a more even distribution of leaf nodes. If overfitting occurs, increase the parameter value. | 1.0 |
| | shrinkage | No | The learning rate. Valid values: (0,1). | 0.3 |
| | sketchEps | No | The threshold for selecting quantiles when a sketch is built. The number of bins is O(1.0/sketchEps). A smaller value indicates that more bins are obtained. In most cases, the default value is used. Valid values: (0,1). | 0.03 |
| | minSplitLoss | No | The minimum loss change that is required to split a node. A larger value indicates a lower probability of node splitting. | 0 |
| | featureNum | No | The number of features or the maximum feature ID. Configure this parameter if you want to assess resource usage. | N/A |
| | baseScore | No | The initial prediction value of all samples. | 0.5 |
| | randSeed | No | The random seed. The value must be an integer. | N/A |
| | featureImportanceType | No | The type of feature importance. Valid values: weight (the number of times a feature is used to split nodes), gain (the information gain provided by the feature), and cover (the number of samples covered by the feature on the split nodes). | gain |
| Tuning parameters | coreNum | No | The number of cores used in computing. The computing speed increases with the value of this parameter. | Automatically allocated |
| | memSizePerCore | No | The memory size of each core. Unit: MB. | Automatically allocated |
Examples
Create a table named smart_multiclass_input by using the ODPS SQL node. For more information, see Develop a MaxCompute SQL task. In this example, input data in the key-value format is generated.
drop table if exists smart_multiclass_input;
create table smart_multiclass_input lifecycle 3 as
select * from (
    select '2' as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features union all
    select '1' as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features union all
    select '1' as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features union all
    select '2' as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features union all
    select '1' as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features union all
    select '1' as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features union all
    select '0' as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features union all
    select '1' as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features union all
    select '0' as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features union all
    select '1' as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features
) tmp;
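If your source data is dense, you can generate feature strings in the format used above (1-based feature IDs, colon-separated keys and values, space-separated pairs) with a short helper. The following Python sketch is illustrative; the function name `to_kv` is an assumption for this example, not a PAI API.

```python
# Illustrative sketch: build the 'features' strings used in the table above
# from a dense vector. Zero values are skipped, which is the usual convention
# for sparse key-value data.
def to_kv(dense):
    return " ".join(f"{i + 1}:{v}" for i, v in enumerate(dense) if v != 0)

print(to_kv([0.55, -0.15, 0.82, -0.99, 0.17]))
# 1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17
```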
The following figure shows the generated data.
Create a pipeline as shown in the following figure. For more information, see Generate a model.
Configure the component parameters.
Click the Read Table -1 component on the canvas. On the Select Table tab on the right, set the Table Name parameter to smart_multiclass_input.
Configure the parameters for the PS-SMART Multiclass Classification component. The following table describes the parameters. Use the default values of the parameters that are not included in the table.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | Select the features column. |
| | Label Column | Select the label column. |
| | Use Sparse Format | Select Use Sparse Format. |
| Parameters Setting | Classes | Set the parameter to 3. |
| | Evaluation Indicator Type | Select Multiclass Negative Log Likelihood from the drop-down list. |
| | Trees | Set the parameter to 5. |
Configure the parameters for the Prediction-1 component. The following table describes the parameters. Use the default values of the parameters that are not included in the table.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | Select the features column. |
| | Reserved Columns | Select the label and features columns. |
| | Sparse Matrix | Select Sparse Matrix. |
| | KV Delimiter | Set the parameter to a colon (:). |
| | KV Pair Delimiter | Leave the field empty. A space is used as the delimiter. |
Click the Write Table -1 component on the canvas. On the Select Table tab on the right, set the Table Name parameter to smart_multiclass_output.
Click the icon on the canvas to run the pipeline.
After you run the pipeline, right-click the Prediction-1 component and choose the option to view the prediction results. Parameters:
prediction_detail: the classes used for multiclass classification. Valid values: 0, 1, and 2.
prediction_result: the class of the prediction result.
prediction_score: the probability of the class in the prediction_result column.
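The relationship among the three output columns can be illustrated for a single sample: the predicted class is the one with the highest per-class probability, and prediction_score is that probability. The values below are assumed example probabilities, not actual component output.

```python
# Illustrative relationship among the prediction output columns for one sample.
prediction_detail = [0, 1, 2]                    # the classes (from Classes = 3)
scores = {0: 0.12, 1: 0.75, 2: 0.13}             # assumed example probabilities
prediction_result = max(scores, key=scores.get)  # class with the highest probability
prediction_score = scores[prediction_result]
print(prediction_result, prediction_score)  # 1 0.75
```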
On the canvas, right-click the PS-SMART Multiclass Classification component and choose the option to view the feature importance result. Parameters:
id: the ID of a passed feature. In this example, the input data is in the key-value format. The values in the id column indicate the keys in the key-value pairs.
value: the type of feature importance. The default value is gain, which indicates the sum of information gains provided by a feature for the model.
PS-SMART model deployment
If you want to deploy the model generated by the PS-SMART Multiclass Classification component to EAS as an online service, you must add the Model export component as a downstream node of the PS-SMART Multiclass Classification component and configure the Model export component. For more information, see Model export.
After the Model export component is successfully run, you can deploy the generated model to EAS as an online service on the EAS-Online Model Services page. For more information, see Model service deployment by using the PAI console.