The Behavior Sequence Transformer (BST) algorithm uses the Transformer model to capture long-term time series information in user behavior data. The BST algorithm can extract implicit features from behavior sequences and predict results. It is suitable for business scenarios related to behavior sequences, such as recommendations and value mining for the user lifecycle.
Scenarios
The BST algorithm supports various prediction scenarios, including classification and regression. In these scenarios, the input data is usually a behavior sequence that contains time series features and is stored in the database in text format. For example, the input data can be the click events of users in the last seven days. The output data is usually the predicted results in integer or floating-point format. For example, the output data can be the payment amount, whether the users churn, and whether the users pay.
Classification scenario
Predicts whether new users will pay and whether paid users or premium users will churn in gaming scenarios. For example, in the game operations scenario, the behaviors of paid users in the last 14 days are used as the input behavior sequence of the BST algorithm. The algorithm extracts performance features from the behavior sequence to predict whether the users will churn in the next 14 days. A user is considered churned if the user does not log on to the game for 14 consecutive days.
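The churn definition above can be turned into a training label on the client side. The following is a minimal sketch, not part of PolarDB for AI: the function name `is_churned` and its arguments are hypothetical, and it assumes that logon dates are available as `datetime.date` values and that the 14-day window starts on the day after the observation period ends.

```python
from datetime import date, timedelta


def is_churned(login_dates, observation_end, window_days=14):
    """Return 1 if the user never logs on during the window_days that
    follow observation_end, otherwise 0.

    login_dates and observation_end are hypothetical inputs used only
    for illustration.
    """
    window_start = observation_end + timedelta(days=1)
    window_end = observation_end + timedelta(days=window_days)
    logged_on = any(window_start <= d <= window_end for d in login_dates)
    return 0 if logged_on else 1
```

The returned value can be stored in the target column of the training table.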
Regression scenario
Predicts the payment amount of new users in gaming scenarios. For example, in the game operations scenario, the behaviors of new users within the last 24 hours are used as the input behavior sequence of the BST algorithm. The algorithm extracts performance features from the behavior sequence to predict the total consumption of new users in the next 7 days.
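For the regression case, the label is a sum of payments over the prediction window. The following is a minimal sketch under stated assumptions: `total_consumption` is a hypothetical helper, payments are given as `(paid_at, amount)` pairs, and the 7-day window starts immediately after the 24-hour observation period ends.

```python
from datetime import datetime, timedelta


def total_consumption(payments, observation_end, horizon_days=7):
    """Sum the amounts paid after observation_end and no later than
    horizon_days after it.

    payments is a hypothetical list of (paid_at, amount) pairs.
    """
    horizon_end = observation_end + timedelta(days=horizon_days)
    return sum(
        amount
        for paid_at, amount in payments
        if observation_end < paid_at <= horizon_end
    )
```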
Limits
The samples that are used as the input data of the BST algorithm must be evenly distributed. If the number of samples of one type is significantly larger than the number of samples of other types, for example, by over 2000%, the samples are considered unevenly distributed. In this case, use the K-means clustering algorithm provided by PolarDB for AI to cluster the majority samples and balance the sample distribution before you use the samples as the input data. For more information, see K-means clustering algorithm (K-Means).
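Before training, the label distribution can be checked against this limit. The following is a minimal sketch: the function names are hypothetical, and it interprets "higher by over 2000%" as the largest class exceeding the smallest class by more than 2000% of the size of the smallest class, which is an assumption about the wording above.

```python
from collections import Counter


def imbalance_excess_percent(labels):
    """Return by how many percent the largest class exceeds the smallest."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two sample types are required")
    largest = max(counts.values())
    smallest = min(counts.values())
    return (largest - smallest) / smallest * 100.0


def is_unevenly_distributed(labels, threshold_percent=2000.0):
    """Apply the 2000% example threshold (an interpretation, see above)."""
    return imbalance_excess_percent(labels) > threshold_percent
```

For example, 2,200 negative samples against 100 positive samples exceed the threshold, whereas 500 against 100 do not.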
Table format for the algorithm model training data
Column | Required | Type | Description | Example |
uid | Yes | VARCHAR | The ID of each data record to be processed, such as the user ID and product ID. | 253460731706911258 |
event_list | Yes | TEXT | The behavior sequence used for training in the input table. The sequence consists of values of the INT type that indicate behavior IDs. These values are separated with commas (,) and sorted by time in ascending order. | "[183, 238, 153, 152]" |
target | Yes | INT, FLOAT, DOUBLE | The label of the sample that is used to measure the algorithm model metrics. | 0 |
val_x_col | No | TEXT | The behavior sequence used for parameter tuning of the algorithm model. It is the event_list column of the validation dataset that contains data similar to that of the event_list column. The sequence consists of values of the INT type that indicate behavior IDs. These values are separated with commas (,) and sorted by time in ascending order. | "[183, 238, 153, 152]" |
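val_y_col | Yes | INT, FLOAT, DOUBLE | The label used for parameter tuning of the algorithm model. It is the label column of the validation dataset that corresponds to the val_x_col column. | 1 |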
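The event_list format described in the table can be produced from raw events as follows. This is a minimal sketch: `build_event_list` is a hypothetical helper, and it assumes that each raw event is a `(timestamp, behavior_id)` pair.

```python
import json


def build_event_list(events):
    """Sort (timestamp, behavior_id) pairs by time in ascending order and
    serialize the behavior IDs as text, for example "[183, 238, 153, 152]".
    """
    ordered_ids = [int(behavior_id) for _, behavior_id in sorted(events, key=lambda e: e[0])]
    return json.dumps(ordered_ids)
```

The same helper can be used for the val_x_col column, because it has the same format as event_list.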
The following table describes the values of the model_parameter parameter in the CREATE MODEL statement for creating an algorithm model. Configure the parameters based on your requirements. For more information about the CREATE MODEL statement, see Manage models.
Parameter | Description |
model_task_type | The task type. Valid values:
|
batch_size | The batch size. A small batch size makes the model prone to overfitting. The default value is 16. |
window_size | Used for the embedding encoding of behavior IDs. The value is an integer and must be greater than the largest behavior ID. Otherwise, parsing errors occur. |
sequence_length | The length of the behavior sequence involved in the algorithm model calculation. The value cannot be greater than 3000. If the value of window_size is greater than 900, do not set sequence_length to a value that is too large. |
success_id | The behavior ID to be processed. |
max_epoch | The maximum number of iterations. The default value is 1. |
learning_rate | The learning rate. The default value is 0.0002. |
loss | The learning task and its learning objectives. Valid values:
|
val_flag | Specifies whether to perform validation after each iteration during the training. Valid values:
|
val_x_cols | The input columns in the validation dataset. |
val_y_cols | The label columns in the validation dataset. |
val_metric | Validation metrics. Valid values:
|
auto_heads | Whether to automatically set the number of attention heads in the multi-head attention mechanism. Valid values:
Note If you set the value to 1, it is possible that the GPU memory is insufficient. In most cases, make sure that the calculation result of |
num_heads | The number of attention heads. If auto_heads is set to 0, you need to specify this parameter. The default value is 4. |
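Two of the constraints in the table above can be checked on the client side before the CREATE MODEL statement is submitted: window_size must be greater than the largest behavior ID, and sequence_length cannot be greater than 3000. The following is a minimal sketch with hypothetical helper names. It assumes that the event_list values are JSON-style lists such as "[183, 238, 153, 152]" and that at least one behavior ID is present.

```python
import json


def largest_behavior_id(event_lists):
    """Return the largest behavior ID found in the event_list strings."""
    ids = [i for text in event_lists for i in json.loads(text)]
    return max(ids)


def check_bst_parameters(event_lists, window_size, sequence_length):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    max_id = largest_behavior_id(event_lists)
    if window_size <= max_id:
        problems.append(
            f"window_size ({window_size}) must be greater than "
            f"the largest behavior ID ({max_id})"
        )
    if sequence_length > 3000:
        problems.append(
            f"sequence_length ({sequence_length}) cannot be greater than 3000"
        )
    return problems
```

The other parameters are not checked here because the table does not state hard limits for them.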
Sample statement for model creation and offline training
/*polar4ai*/CREATE MODEL sequential_bst WITH
(model_class = 'bst',
 x_cols = 'event_list',
 y_cols = 'target',
 model_parameter = (batch_size = 128, window_size = 900, sequence_length = 3000,
                    success_id = 900, max_epoch = 2, learning_rate = 0.0008,
                    val_flag = 1, val_x_cols = 'val_x_col', val_y_cols = 'val_y_col', val_metric = 'f1score'))
AS (SELECT * FROM db4ai.seqential_train);
Table format for the algorithm model evaluation data
Column | Required | Type | Description | Example |
uid | Yes | VARCHAR(255) | The ID of each data record to be processed, such as the user ID and product ID. | 123213 |
event_list | Yes | TEXT | The behavior sequence used for evaluation in the input table. The sequence consists of values of the INT type that indicate behavior IDs. These values are separated with commas (,) and sorted by time in ascending order. | "[183, 238, 153, 152]" |
target | Yes | INT, FLOAT, DOUBLE | The label of the sample that is used to measure the algorithm model error. | 0 |
The following table describes the values of the metrics parameter in the EVALUATE statement for evaluating the algorithm model. Configure the parameters based on your requirements. For more information about the EVALUATE statement, see Manage models.
Parameter | Description |
metrics | Validation metrics. Valid values:
|
Sample statement for model evaluation
/*polar4ai*/SELECT uid,target FROM evaluate(MODEL sequential_bst,SELECT * FROM db4ai.seqential_eval) WITH (x_cols = 'event_list', y_cols='target', metrics='Fscore');
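To sanity-check an evaluation result offline, an F-score can be recomputed from labels and predictions. The following is a minimal sketch that assumes the Fscore metric corresponds to the standard F1 score for binary labels (the harmonic mean of precision and recall). This document does not confirm how PolarDB for AI computes the metric, and `f1_score_binary` is a hypothetical helper.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Compute the standard binary F1 score.

    Assumption: this mirrors the usual F1 definition, not necessarily
    the exact computation performed by the EVALUATE statement.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```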
Table format for the algorithm model prediction data
Column | Required | Type | Description | Example |
uid | Yes | VARCHAR(255) | The ID of each data record to be processed, such as the user ID and product ID. | 123213 |
event_list | Yes | TEXT | The behavior sequence used for prediction in the input table. The sequence consists of values of the INT type that indicate behavior IDs. These values are separated with commas (,) and sorted by time in ascending order. | "[183, 238, 153, 152]" |
Sample statement for model prediction
/*polar4ai*/SELECT uid,target FROM PREDICT (MODEL sequential_bst, SELECT * FROM db4ai.bst_test) WITH (x_cols= 'event_list', mode='async');