BST algorithm

The Behavior Sequence Transformer (BST) algorithm uses the powerful Transformer framework to capture long-term time series information from user behavior sequences. The BST algorithm can extract implicit features from behavior sequences and make predictions. The BST algorithm provides significant benefits in business scenarios related to behavior sequences, such as recommendation systems and user lifecycle value mining.

Scenarios

The BST algorithm is designed to support various prediction tasks, including classification and regression.

The input data of the BST algorithm is behavior sequences that have time series features. The input data is stored in LONGTEXT format in the database. An example of such data is the click behaviors of users over the previous seven days.
The BST algorithm outputs predictions, which are integers or floating-point numbers, such as the amount that a user is expected to pay, whether a user churns, and whether a payment is made.
Sample classification scenarios:
Predict the number of new paying users and the potential churn of regular-paying and high-paying users in gaming scenarios. For example, in a gaming operation scenario, the in-game behaviors of paying users over the previous 14 days are used to construct the behavior sequences that are input to the BST algorithm. The BST algorithm extracts the relevant features from the behavior sequences to predict whether each user will churn in the following 14 days. A user is considered to have churned if the user does not log on for 14 consecutive days.
Sample regression scenarios:
Predict the total spending of new users in a gaming scenario. For example, in a gaming operation scenario, the in-game behaviors of new users within the first 24 hours are used to construct the behavior sequences that are input to the BST algorithm. The BST algorithm extracts the relevant features from the behavior sequences to predict the total spending of the new users in the following seven days.

Limits

The BST algorithm works effectively when the class distribution of the input data is balanced. If the input data is imbalanced, for example, when the majority class has more than 20 times as many samples as a minority class, we recommend that you use the K-means clustering algorithm provided in PolarDB for AI to preprocess the over-represented class, such as the non-paying group, so that the overall data distribution across classes is balanced. For more information, see K-means clustering algorithm.
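
Before you create a model, you may want to check whether your labels exceed the 20:1 ratio mentioned above. The following is a minimal client-side sketch in Python. The function names `imbalance_ratio` and `needs_rebalancing` are hypothetical helpers and are not part of PolarDB for AI. The sketch assumes that the labels have already been read from the `target` column into a list.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


def needs_rebalancing(labels, threshold=20):
    """Return True if the majority class has more than `threshold` times
    the samples of the smallest class (hypothetical helper)."""
    return imbalance_ratio(labels) > threshold


# Example: 2,100 non-paying users (label 0) and 100 paying users (label 1).
labels = [0] * 2100 + [1] * 100
print(imbalance_ratio(labels), needs_rebalancing(labels))
```

If the check returns True, consider preprocessing the majority class as described above before you run the CREATE MODEL statement.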

Format of the data table for algorithm model creation

Column	Required	Data type	Description	Example
uid	Yes	VARCHAR	The ID of each data entry, such as the user ID or product ID.	253460731706911258
event_list	Yes	LONGTEXT	A sequence of behaviors for model creation. Each behavior in the sequence is represented by a unique integer ID. The behaviors in the sequence are separated by commas (,) and sorted in ascending order based on their timestamps.	"[183, 238, 153, 152]"
target	Yes	INT, FLOAT, or DOUBLE	The label of the sample, which is used to calculate the errors of the algorithm model during model creation.	0
val_row	No	INT	To prevent the model from overfitting, you can specify a validation set. Valid values: 0: labels the row as model creation data. 1: labels the row as model validation data. Note In most cases, this parameter is used together with the version and val_flag model creation parameters. When the version parameter is set to 1 and the val_flag parameter is set to 1, the val_row parameter takes effect. When the val_flag parameter is set to 0 and you previously defined the val_row parameter, only the data marked as val_row=0 is used as model creation data.	1
other_feature	No	INT, FLOAT, DOUBLE, or LONGTEXT	Other features of the model. To use a feature, include the column name of the feature in the x_value_cols or x_statics_cols model creation parameter. Note If the data type of the other_feature column is LONGTEXT, you can store various formats of text data in the column, including JSON strings, lists, or strings separated by commas (`,`). You can specify multiple other_feature columns, such as other_feature1 and other_feature2.	2
val_x_cols	No	LONGTEXT	A sequence of behaviors for model validation and parameter tuning. Each behavior in the sequence is represented by a unique integer ID. The behaviors in the sequence are separated by commas (,) and sorted in ascending order based on their timestamps. Note This parameter takes effect only if you set the version parameter to `0`. For more information, see the description of the version parameter in this topic.	"[183, 238, 153, 152]"
val_y_cols	No	INT, FLOAT, and DOUBLE	The label of the behavior sequence used to tune the parameters. Note This parameter takes effect only if you set the version parameter to `0`. For more information, see the description of the version parameter in this topic.	1
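
The event_list format described in the preceding table can be produced on the client side before the data is written to the table. The following Python sketch assumes that each raw event is a `(timestamp, behavior_id)` pair. This input shape and the function name `build_event_list` are assumptions made for illustration and are not defined by PolarDB for AI.

```python
import json


def build_event_list(events):
    """Build an event_list string from (timestamp, behavior_id) pairs.

    The behaviors are sorted in ascending order by timestamp and
    serialized as a comma-separated list of integer IDs.
    """
    ordered = sorted(events, key=lambda event: event[0])
    return json.dumps([int(behavior_id) for _, behavior_id in ordered])


# Example: four events that arrive out of order.
raw_events = [(3, 153), (1, 183), (4, 152), (2, 238)]
print(build_event_list(raw_events))
```

The resulting string can be stored in the LONGTEXT event_list column.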

You can execute the CREATE MODEL statement to create an algorithm model. The following table describes the configuration options for the model_parameter parameter in the CREATE MODEL statement.

Parameter	Description
model_task_type	The task type. Valid values: classification (default) and regression.
batch_size	The batch size. A small batch size can increase the risk of overfitting in a model. Default value: 16.
window_size	Used for embedded encoding of behavior IDs. The value must be greater than or equal to the maximum behavior ID value plus one. Otherwise, a parsing error occurs.
sequence_length	The length of the behavior sequence involved in algorithm model calculations. The value must not exceed 3000. If the window_size parameter is greater than 900, do not set the sequence_length parameter to a value that is excessively large.
success_id	The ID of the behavior for which the model makes a prediction.
max_epoch	The maximum number of iterations. Default value: 1.
learning_rate	The learning rate. Default value: 0.0002.
loss	The loss function. Valid values: CrossEntropyLoss (default): used for binary classification tasks. mse: used for regression tasks. mae: used for regression tasks. msle: used for regression tasks.
val_flag	Specifies whether to perform validation after each model iteration. Valid values: 0 (default): does not perform validation after each model iteration. You do not need to specify the val_metric parameter or the val_row column in the model creation data table. The algorithm model from the final iteration is saved. 1: performs validation after each model iteration. You must specify the val_metric parameter and the val_row column in the model creation data table. The algorithm model that has the best performance based on the validation metric is saved.
val_metric	The metric used for validation. Valid values: loss (default): the metric that is the same as the loss metric during model creation. This metric can be used for classification and regression tasks. f1score: the harmonic mean of precision and recall. This metric can be used for classification tasks. r2_score: the coefficient of determination. This metric can be used for regression tasks. mse: the mean square error. This metric can be used for regression tasks. mape: the mean absolute percentage error. This metric can be used for regression tasks. mape_plus: a variant of MAPE that measures only the error of positive labels. This metric can be used for regression tasks.
auto_data_statics	Specifies whether to automatically generate statistical features. Valid values: on: counts the occurrences of IDs in the sequence and generates statistical features. off (default): does not count the occurrences of IDs in the sequence.
auto_heads	Specifies whether to automatically determine the number of attention heads used by multi-head attention. Valid values: 1 (default): automatically determines the number of attention heads. 0: the number of attention heads is specified manually. Note If you set this parameter to 1, GPU memory may be insufficient. Make sure that the result of `int(sqrt(window_size)) + int(sqrt(sequence_length)) + 2` is not a prime number.
num_heads	The number of attention heads. If you set the auto_heads parameter to 0, you must specify this parameter. Default value: 4.
x_value_cols	Specifies the columns to use as numeric discrete features. The value cannot be empty. Note Example: If you specify `x_value_cols='num_events, max_level, max_viplevel'`, the `num_events`, `max_level`, and `max_viplevel` columns are used as numeric discrete features. The values in each column must be integers or floating-point numbers.
x_statics_cols	Specifies the columns to use as statistical features. The value cannot be empty. The length of data in each row of a specified column is fixed. Note Example: If you specify `x_statics_cols='stats_item_list, stats_event_list'`, the `stats_item_list` and `stats_event_list` columns are used as statistical features. The data type of each column is LONGTEXT. You can store various formats of text data in a LONGTEXT column, including JSON strings, lists, or strings separated by commas (`,`). In JSON strings, the values in the key-value pairs are used as statistical features. Example: `{"money":30,"level":21}`. In lists or strings separated by commas (`,`), the values must be of the `INT` or `FLOAT` type. Examples: `stats_event_list="[1,2,4,23,2]"` and `stats_item_list="232,23123,232,2"`.
x_seq_cols	Specifies the columns to use as sequence features. Note Example: `x_seq_cols='event_list'`. The data type of each column is LONGTEXT. You can store various formats of text data in a LONGTEXT column, including lists or strings separated by commas (`,`). Example: `"[183, 238, 153, 152]"`.
version	The model version. Valid values: 0 (default): the old version, in which the val_x_cols and val_y_cols parameters in the model creation data table take effect and the val_row parameter does not take effect. 1: the new version. We recommend that you specify the new version.
data_normalization	Specifies whether to normalize data in the columns specified by the x_value_cols parameter. Valid values: 0 (default): does not perform the normalization operation. 1: performs the normalization operation.
remove_seq_adjacent_duplicates	Specifies whether to remove adjacent duplicate values from the columns specified by the x_seq_cols parameter. Valid values: off (default) and on.
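
Several constraints in the preceding table can be checked before you submit a CREATE MODEL statement: the lower bound on window_size, the upper bound on sequence_length, and the prime-number condition for auto_heads. The following Python sketch is one way to do this on the client side. The function names `is_prime` and `check_bst_parameters` are hypothetical and are not part of PolarDB for AI. The sketch assumes that `int(sqrt(x))` means the floor of the square root, which `math.isqrt` computes for non-negative integers.

```python
import math


def is_prime(n):
    """Return True if n is a prime number."""
    if n < 2:
        return False
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True


def check_bst_parameters(window_size, sequence_length, max_behavior_id, auto_heads=1):
    """Return a list of problems found in the given parameter values.

    An empty list means that none of the checked constraints is violated.
    """
    problems = []
    # window_size must be at least the maximum behavior ID plus one.
    if window_size < max_behavior_id + 1:
        problems.append("window_size must be >= maximum behavior ID + 1")
    # sequence_length must not exceed 3000.
    if sequence_length > 3000:
        problems.append("sequence_length must not exceed 3000")
    # With auto_heads=1, the documented expression must not be prime.
    if auto_heads == 1:
        total = math.isqrt(window_size) + math.isqrt(sequence_length) + 2
        if is_prime(total):
            problems.append(f"{total} is prime; adjust window_size or sequence_length")
    return problems


# Example: window_size=900 and sequence_length=3000 give 30 + 54 + 2 = 86.
print(check_bst_parameters(window_size=900, sequence_length=3000, max_behavior_id=899))
```

The check covers only the constraints stated in the table. It does not estimate GPU memory usage.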

Format of the data table for algorithm model evaluation

Column	Required	Data type	Description	Example
uid	Yes	VARCHAR(255)	The ID of each data entry, such as the user ID or product ID.	123213
event_list	Yes	LONGTEXT	A sequence of behaviors for model evaluation. Each behavior in the sequence is represented by a unique integer ID. The behaviors in the sequence are separated by commas (,) and sorted in ascending order based on their timestamps.	"[183, 238, 153, 152]"
target	Yes	INT, FLOAT, or DOUBLE	The label of the sample used to calculate the errors of the algorithm model.	0
other_feature	No	INT, FLOAT, DOUBLE, or LONGTEXT	Other features of the model, which are the same as those in the model creation data table. To use a feature, include the column name of the feature in the x_value_cols or x_statics_cols model creation parameter. Note If the data type of the other_feature column is LONGTEXT, you can store various formats of text data in the column, including JSON strings, lists, or strings separated by commas (`,`). You can specify multiple other_feature columns, such as other_feature1 and other_feature2.	2

You can execute the EVALUATE statement to evaluate an algorithm model. The following table describes the configuration options for the metrics parameter in the EVALUATE statement.

Parameter	Description
metrics	The metric used for evaluation. Valid values:

acc: the accuracy. This metric can be used for classification tasks.
auc: the area under the ROC curve. This metric can be used for classification tasks.
Fscore: the harmonic mean of precision and recall. This metric can be used for classification tasks.
r2_score: the coefficient of determination. This metric can be used for regression tasks.
mse: the mean square error. This metric can be used for regression tasks.
mape: the mean absolute percentage error. This metric can be used for regression tasks.
mape_plus: a variant of MAPE that measures only the error of positive labels. This metric can be used for regression tasks.
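
To illustrate the difference between the mape and mape_plus metrics, the following Python sketch computes both on a small sample. The exact formulas that PolarDB for AI uses are not given in this topic. The sketch assumes the textbook definition of MAPE and assumes that mape_plus restricts the same calculation to samples whose label is greater than 0. The function names are hypothetical.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction.

    Raises ValueError if a label is 0 because the ratio is undefined.
    """
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(label == 0 for label in y_true):
        raise ValueError("MAPE is undefined when a label is 0")
    errors = [abs(label - pred) / abs(label) for label, pred in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def mape_plus(y_true, y_pred):
    """MAPE computed only over samples with a positive label (assumed definition)."""
    positive = [(label, pred) for label, pred in zip(y_true, y_pred) if label > 0]
    if not positive:
        raise ValueError("no samples with a positive label")
    labels, preds = zip(*positive)
    return mape(labels, preds)


# Example: predicted spending for three users, one of whom spent nothing.
actual = [0, 100, 200]
predicted = [5, 110, 180]
print(mape_plus(actual, predicted))
```

Under these assumptions, mape_plus remains defined when some labels are 0, such as users who spend nothing, whereas plain MAPE does not.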

Format of the data table for algorithm model prediction

Column	Required	Data type	Description	Example
uid	Yes	VARCHAR(255)	The ID of each data entry, such as the user ID or product ID.	123213
event_list	Yes	LONGTEXT	A sequence of behaviors for model prediction. Each behavior in the sequence is represented by a unique integer ID. The behaviors in the sequence are separated by commas (,) and sorted in ascending order based on their timestamps.	"[183, 238, 153, 152]"
other_feature	No	INT, FLOAT, DOUBLE, or LONGTEXT	Other features of the model, which are the same as those in the model creation data table. To use a feature, include the column name of the feature in the x_value_cols or x_statics_cols model creation parameter. Note If the data type of the other_feature column is LONGTEXT, you can store various formats of text data in the column, including JSON strings, lists, or strings separated by commas (`,`). You can specify multiple other_feature columns, such as other_feature1 and other_feature2.	2

Examples

Note

Classification tasks are used in the following examples. For more information about the task types, see the description of the model_task_type parameter in this topic.

Create a BST model

/*polar4ai*/CREATE MODEL sequential_bst WITH (
  model_class = 'bst',
  x_cols = 'event_list,other_feature1',
  y_cols = 'target',
  model_parameter = (
    batch_size = 128,
    window_size = 900,
    sequence_length = 3000,
    success_id = 900,
    max_epoch = 2,
    learning_rate = 0.0008,
    val_flag = 1,
    x_seq_cols = 'event_list',
    x_value_cols = 'other_feature1',
    val_metric = 'f1score',
    auto_data_statics = 'on',
    data_normalization = 1,
    remove_seq_adjacent_duplicates = 'on',
    version = 1
  )
) AS (SELECT * FROM sequential_train);

Note

sequential_train is a sample data table name used for algorithm model creation.

Evaluate the model

/*polar4ai*/SELECT uid, target FROM EVALUATE(MODEL sequential_bst,
SELECT * FROM sequential_eval) WITH
(x_cols = 'event_list,other_feature1', y_cols = 'target', metrics = 'Fscore');

Note

sequential_eval is a sample data table name for algorithm model evaluation.

Make a prediction by using the model

/*polar4ai*/SELECT uid, target FROM PREDICT(MODEL sequential_bst, SELECT * FROM sequential_test) WITH
(x_cols = 'event_list,other_feature1', mode = 'async');

Note

sequential_test is a sample data table name for algorithm model prediction.