Use SQL to implement machine learning prediction - AnalyticDB

This topic describes how to use SQL to quickly deploy a behavior sequence transformer (BST) model in AnalyticDB for MySQL to execute machine learning jobs, including data processing and model creation, training, evaluation, and prediction.

Scenarios

You can use SQL queries to quickly deploy models to implement data preprocessing, training, and inference.

AnalyticDB for MySQL supports BST models. The BST models are suitable for scenarios in which user behavior patterns and preferences are analyzed to predict future trends and provide personalized recommendations. For example, in the gaming or e-commerce industry, long-term dependencies between user behaviors can be captured to accurately understand and predict user behavior patterns and preferences to provide personalized services and recommendations.

Introduction to BST models

A BST model can be used to process behavior sequence data. The model receives a sequence of behavior event IDs and returns 0 or 1 as the classification result.

For example, the player interactions in a game can be recorded and converted into a series of behavior events.

A player may have the following sequence of behaviors in the game: log on, receive logon rewards, accept tasks, fight, fight, fight, complete tasks, recharge, fight, and log out.

The sequence of behaviors can be transformed into the following sequence of event IDs: 0, 1, 2, 3, 3, 3, 4, 5, 3, and 6. The sequence of event IDs is provided to a BST model in the string format. The BST model analyzes the sequence and returns a classification result (0 or 1) to indicate that a specific behavior sequence belongs to a preset category.

Prerequisites

An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
The minor version of the AnalyticDB for MySQL cluster is 3.2.4.0 or later.
Note
To query the minor version of an AnalyticDB for MySQL Data Lakehouse Edition cluster, execute the SELECT adb_version(); statement. To update the minor version of a cluster, contact technical support.
The AI resource group feature is enabled. The AI resource group feature is in public preview. To enable the feature, contact technical support.

Step 1: Create an AI resource group and associate a general resource group with the AI resource group

If you want to use SQL to execute machine learning jobs, the following types of resource groups are required:

AI resource group: manages the GPU resources required for machine learning jobs to support compute-intensive operations in model training and prediction.
General resource group: processes regular SQL queries, such as generating training data or executing prediction functions.

SQL statements are first submitted to a general resource group for execution. If AI computing is required to execute the SQL statements, the statements are automatically resubmitted to the AI resource group associated with the general resource group.

Perform the following steps to create an AI resource group and associate a general resource group with the AI resource group:

Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Clusters page, click an edition tab. Find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Cluster Management > Resource Management. On the Resource Management page, click the Resource Groups tab.

In the upper-right corner of the Resource Groups tab, click Create Resource Group. In the Create Resource Group panel, configure the parameters that are described in the following table to create an AI resource group.

Parameter	Description
Resource Group Name	The name of the resource group. The resource group name must be 2 to 30 characters in length and can contain letters, digits, and underscores (_). The resource group name must start with a letter.
Job Type	The job type. Select AI from the drop-down list. Important If the AI resource group feature is disabled for your AnalyticDB for MySQL cluster, no AI option is available. In this case, contact technical support.
Specifications	The combination of GPU, CPU, and memory resources. Select ADB.MLLarge.24, ADB.MLLarge.2, or ADB.MLAdvavced.6.
Minimum Resources	The minimum number of resources.
Maximum Resources	The maximum number of resources.

Click OK. The AI resource group is created.
Find the general resource group that you want to manage and click Modify in the Actions column. In the Modify Resource Group panel, associate the general resource group with the created AI resource group in the ML Job Resubmission Rules section.

Step 2: Process data

Data processing is the process of converting raw business data into a format suitable for model training. Data processing is highly dependent on three key factors: the format of the raw data, storage location of the raw data, and input data format requirements for model training.

Table schema requirements for model training

Input feature column: The value of the column is a string of integers separated by commas (,). Each entry specifies a feature vector of a sample.
Result label column: The column uses one of two values, such as 0 and 1, to indicate the category label.

Table schema example: ('1,2,3',0), ('3,2,1',1).

Data processing method

If the format of your raw data is different from the format required for model training, use the following method to process your data:

Upload the JAR package that contains the Spark data processing program to an Object Storage Service (OSS) bucket.
When you submit a Spark job, configure the parameters required for the job. For more information, see Spark application configuration parameters.

If your raw data meets the required format for model training, you can skip this step.

Step 3: Create and train a machine learning model

Create a machine learning model, define the model parameters, specify training data, and then check the status of the model.

In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, enter the following SQL statements to create and train a model:

-- Create a model.
-- Specify a resource group.
/*+resource_group=itrain*/
CREATE MODEL bstdemo.bst-- The name of the model.
OPTIONS (
  model_type='bst_classification', -- The type of the model.
  feature_cols=(event_list),  -- The input feature columns.
  target_cols=(target),  -- The result label columns.
  hyperparameters = (  -- Other model hyperparameters.
    use_best_ckpt = 'False',
    early_stopping_patience='0'
  )
)
AS SELECT event_list, target FROM bstdemo.adb; -- The data source of the model. The results of the query are provided to the preceding variables for training.

-- Query the training status of the model. After the training is complete, the status changes to READY.
SHOW MODEL bstdemo.bst;

Step 4: Evaluate the model

Evaluate the created machine learning model to verify the model correctness.

-- Similar to the syntax for creating a model.
/*resource_group=rg1*/
EVALUATE MODEL bstdemo.bst
OPTIONS (
    feature_cols=(event_list), 
    target_cols=(target),  
)
AS SELECT event_list, target FROM bstdemo.adb01;

Step 5: Perform model prediction

Select feature columns from a table and pass the values of the columns to the trained model. Query the prediction results of the data.

-- Use the trained model to perform prediction.
-- In the ML_PREDICT() function, the first parameter specifies the name of the model, and the second parameter specifies the input feature columns.
SELECT ML_PREDICT('bstdemo.bst', event_list) FROM bstdemo.adb02;