
PolarDB: BST algorithm

Last Updated: Oct 30, 2024

The Behavior Sequence Transformer (BST) algorithm uses the Transformer model to capture long-term time series information in user behavior data. The BST algorithm extracts implicit features from behavior sequences and uses them to predict outcomes. It is suitable for business scenarios that involve behavior sequences, such as recommendation and user lifecycle value mining.

Scenarios

The BST algorithm supports various prediction scenarios, including classification and regression. In these scenarios, the input data is usually a behavior sequence that contains time series features and is stored in the database in text format, such as the click events of users in the last seven days. The output data is usually a predicted result in integer or floating-point format, such as a payment amount, whether a user churns, or whether a user pays.

  • Classification scenario

    Predicts new user payment, paid user churn, and premium user churn in gaming scenarios. For example, in a game operations scenario, the behaviors of paid users in the last 14 days are used as the input behavior sequence of the BST algorithm. The algorithm extracts features from the behavior sequence to predict whether the users will churn in the next 14 days. A user is considered churned if the user does not log on to the game for 14 consecutive days.

  • Regression scenario

    Predicts the payment amount of new users in gaming scenarios. For example, in a game operations scenario, the behaviors of new users in the last 24 hours are used as the input behavior sequence of the BST algorithm. The algorithm extracts features from the behavior sequence to predict the total spending of new users in the next 7 days.

Limits

The samples that are used as the input data of the BST algorithm must be evenly distributed. If the number of samples of one type is significantly higher than that of the other types, for example, by more than 2,000%, the samples are considered unevenly distributed. In this case, use the K-means clustering algorithm provided by PolarDB for AI to perform clustering on the majority samples and obtain an even sample distribution before you use the samples as input data. For more information, see K-means clustering algorithm (K-Means).
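For example, a simple aggregation query can show whether one class dominates before you start training. The following is a minimal sketch that assumes the labeled training data is already stored in the db4ai.seqential_train table used in the samples below:

-- Check how the samples are distributed across labels. If one label accounts for
-- far more rows than the others (for example, more than 20 times as many),
-- rebalance the data before training.
SELECT target, COUNT(*) AS sample_count
FROM db4ai.seqential_train
GROUP BY target
ORDER BY sample_count DESC;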

Table format for the algorithm model training data

The training data table contains the following columns:

uid (required, VARCHAR): The ID of each data record to be processed, such as a user ID or a product ID. Example: 253460731706911258.

event_list (required, TEXT): The behavior sequence used for training. The sequence consists of INT values that indicate behavior IDs. These values are separated by commas (,) and sorted by time in ascending order. Example: "[183, 238, 153, 152]".

target (required, INT, FLOAT, or DOUBLE): The label of the sample, which is used to measure the algorithm model metrics. Example: 0.

val_x_col (optional, TEXT): The behavior sequence used for parameter tuning of the algorithm model. It is the event_list column of the validation dataset and uses the same format: INT values that indicate behavior IDs, separated by commas (,) and sorted by time in ascending order. Example: "[183, 238, 153, 152]".

val_y_col (required, INT, FLOAT, or DOUBLE): The label used for parameter tuning of the algorithm model. It is the label column of the validation dataset and corresponds to the target column. Example: 1.
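The following statements are a hedged sketch of a training table that follows this format and of one way to assemble the event_list column. The raw event log raw_events and its user_id, event_id, event_time, and label columns are hypothetical and must be replaced with your own schema.

-- Training table that matches the column format described above.
CREATE TABLE db4ai.seqential_train (
    uid        VARCHAR(255) NOT NULL,
    event_list TEXT         NOT NULL,
    target     INT          NOT NULL,
    val_x_col  TEXT,
    val_y_col  INT
);

-- Build the behavior sequence per user: behavior IDs separated by commas (,)
-- and sorted by time in ascending order, wrapped in brackets as in the example values.
-- Long sequences may require a larger group_concat_max_len setting.
INSERT INTO db4ai.seqential_train (uid, event_list, target)
SELECT user_id,
       CONCAT('[', GROUP_CONCAT(event_id ORDER BY event_time SEPARATOR ', '), ']'),
       MAX(label)
FROM raw_events
GROUP BY user_id;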

The following describes the parameters that you can specify in model_parameter in the CREATE MODEL statement for creating an algorithm model. Configure the parameters based on your requirements. For more information about the CREATE MODEL statement, see Manage models.

model_task_type: The task type. Valid values:

  • classification (default)

  • regression

batch_size: The batch size. A small batch size is prone to overfitting. Default value: 16.

window_size: Used for the embedding encoding of behavior IDs. The value must be an integer greater than the largest behavior ID. Otherwise, parsing errors occur.

sequence_length: The length of the behavior sequence that is used in the algorithm model calculation. The value cannot be greater than 3000. If the value of window_size is greater than 900, do not set sequence_length to an excessively large value.

success_id: The behavior ID to be processed.

max_epoch: The maximum number of training iterations. Default value: 1.

learning_rate: The learning rate. Default value: 0.0002.

loss: The learning task and its learning objective. Valid values:

  • CrossEntropyLoss (default): the cross entropy loss. It is used for binary classification tasks.

  • mse: the mean squared error. It is used for regression tasks.

  • mae: the mean absolute error. It is used for regression tasks.

  • msle: the mean squared logarithmic error. It is used for regression tasks.

val_flag: Specifies whether to perform validation after each iteration during training. Valid values:

  • 0 (default): No validation is performed. The val_x_cols, val_y_cols, and val_metric parameters are not required. The algorithm model of the last iteration is saved.

  • 1: Validation is performed. The val_x_cols, val_y_cols, and val_metric parameters are required. The algorithm model that produces the best validation result is saved.

val_x_cols: The input columns in the validation dataset.

val_y_cols: The label columns in the validation dataset.

val_metric: The validation metric. Valid values:

  • loss (default): the same loss function that is specified by the loss parameter for training. It is used for classification and regression tasks.

  • f1score: the harmonic mean of precision and recall. It is used for classification tasks.

  • r2_score: the coefficient of determination. It is used for regression tasks.

  • mse: the mean squared error. It is used for regression tasks.

  • mape: the mean absolute percentage error. It is used for regression tasks.

  • mape_plus: the MAPE that is calculated only on positive labels. It is used for regression tasks.

auto_heads: Specifies whether to automatically set the number of attention heads in the multi-head attention mechanism. Valid values:

  • 1 (default): The number of attention heads is set automatically.

  • 0: The number of attention heads is set manually.

Note: If you set this parameter to 1, GPU memory may be insufficient. In most cases, make sure that the calculation result is not a prime number.

num_heads: The number of attention heads. If auto_heads is set to 0, you must specify this parameter. Default value: 4.

Sample statement for model creation and offline training

/*polar4ai*/CREATE MODEL sequential_bst WITH 
(model_class = 'bst', 
x_cols = 'event_list', 
y_cols='target',
model_parameter=(batch_size=128, window_size=900, sequence_length=3000, 
                 success_id=900, max_epoch=2, learning_rate=0.0008, 
                 val_flag=1, val_x_cols='val_x_col', val_y_cols='val_y_col', val_metric='f1score')) 
AS (SELECT * FROM db4ai.seqential_train);
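The statement above trains a classification model. For the regression scenario, such as predicting a payment amount, the same statement can be adapted by recombining the parameters documented above. The following is a hedged sketch in which the model name sequential_bst_reg and the source table db4ai.seqential_train_reg (with a numeric target column) are hypothetical:

/*polar4ai*/CREATE MODEL sequential_bst_reg WITH 
(model_class = 'bst', 
x_cols = 'event_list', 
y_cols='target',
model_parameter=(model_task_type='regression', batch_size=128, window_size=900, 
                 sequence_length=3000, success_id=900, max_epoch=2, learning_rate=0.0008, 
                 loss='mse', val_flag=1, val_x_cols='val_x_col', val_y_cols='val_y_col', 
                 val_metric='r2_score')) 
AS (SELECT * FROM db4ai.seqential_train_reg);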

Table format for the algorithm model evaluation data

The evaluation data table contains the following columns:

uid (required, VARCHAR(255)): The ID of each data record to be processed, such as a user ID or a product ID. Example: 123213.

event_list (required, TEXT): The behavior sequence used for evaluation. The sequence consists of INT values that indicate behavior IDs. These values are separated by commas (,) and sorted by time in ascending order. Example: "[183, 238, 153, 152]".

target (required, INT, FLOAT, or DOUBLE): The label of the sample, which is used to measure the algorithm model error. Example: 0.
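The evaluation table uses the same uid, event_list, and target columns as the training table. As a hedged sketch, a held-out sample of the labeled data can serve as the evaluation set; the 10% split below is illustrative only.

-- Hold out roughly 10% of the labeled data for evaluation.
-- In practice, exclude these rows from the training table so that the
-- evaluation data has not been seen during training.
CREATE TABLE db4ai.seqential_eval AS
SELECT uid, event_list, target
FROM db4ai.seqential_train
WHERE MOD(CRC32(uid), 10) = 0;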

The metrics parameter in the EVALUATE statement specifies the metric that is used to evaluate the algorithm model. Configure the parameter based on your requirements. For more information about the EVALUATE statement, see Manage models. Valid values:

  • acc: accuracy. It is used for classification tasks.

  • auc: AUC value, indicating the area under the ROC curve. It is used for classification tasks.

  • Fscore: the harmonic mean of precision and recall. It is used for classification tasks.

  • r2_score: coefficient of determination. It is used for regression tasks.

  • mse: the mean square error. It is used for regression tasks.

  • mape: the mean absolute percentage error. It is used for regression tasks.

  • mape_plus: MAPE for only positive number labels. It is used for regression tasks.

Sample statement for model evaluation

/*polar4ai*/SELECT uid,target FROM evaluate(MODEL sequential_bst,SELECT * FROM db4ai.seqential_eval) WITH (x_cols = 'event_list', y_cols='target', metrics='Fscore');
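For a regression model, the metrics value can be replaced with one of the regression metrics listed above. The following hedged example assumes the hypothetical sequential_bst_reg model from the earlier sketch:

/*polar4ai*/SELECT uid,target FROM evaluate(MODEL sequential_bst_reg,SELECT * FROM db4ai.seqential_eval) WITH (x_cols = 'event_list', y_cols='target', metrics='mse');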

Table format for the algorithm model prediction data

The prediction data table contains the following columns:

uid (required, VARCHAR(255)): The ID of each data record to be processed, such as a user ID or a product ID. Example: 123213.

event_list (required, TEXT): The behavior sequence used for prediction. The sequence consists of INT values that indicate behavior IDs. These values are separated by commas (,) and sorted by time in ascending order. Example: "[183, 238, 153, 152]".

Sample statement for model prediction

/*polar4ai*/SELECT uid,target FROM PREDICT (MODEL sequential_bst, SELECT * FROM db4ai.bst_test) WITH (x_cols= 'event_list', mode='async');