In-database machine learning - - Alibaba Cloud Documentation Center

This topic describes the how to use Lindorm machine learning (ML) to perform machine learning in databases, including model training, model inference, and model management.

Overview

Lindorm ML is an out-of-the-box in-database machine learning service provided by LindormTSDB. The process of Lindorm ML includes the following steps: model training, model inference, and model management. After you enable Lindorm ML, you can execute SQL statements to perform the full process of machine learning within databases to explore more data value. Lindorm ML has the following benefits:

Easy-to-use: You can execute standard SQL statements to perform the full process of machine learning within databases even if you do not have professional knowledge for machine learning.
Data retention: You do not need to export data to external platforms for machine learning. This improves the efficiency of machine learning and helps your data meet regulation requirements.
Enterprise features: In Lindorm ML, models and data are both stored in databases. Therefore, you can use the enterprise features provided for databases in machine learning, such as permission management, data audit, and data encryption.

Enable Lindorm ML

For more information about how to enable Lindorm ML, see Enable Lindorm ML.

Model training

When you train a model in Lindorm ML, you must specify the type of tasks that need to be executed by the model. Lindorm ML provides different algorithms for different types of tasks. You can use the extended CREATE MODEL statement in your database to train a model.

Syntax

CREATE MODEL model_name
FROM { table_name | (select_statement) }
[ TARGET column_name ]
TASK ( TIME_SERIES_FORECAST | TIME_SERIES_ANOMALY_DETECTION )
ALGORITHM ( DEEPAR | TFT | esd | nsigma | ttest )
[ PREPROCESSORS 'string' ]
SETTINGS (
    EPOCHS integer,
    ...
)

Parameters

model_name: the name of the model. The model name is unique within a schema.
FROM { table_name | (select_statement) }: the name of table or the query statement that contains the data used to train the model. The results of the statement must contain at least two columns, and one of the columns must be the time column.
TARGET column_name: the name of the column that is used to specify the target of time series forecasting or time series anomaly detection.
TASK: the type of tasks that need to be executed by the model. The following table describes the task types supported by Lindorm ML.
Task type Keyword
Time series forecasting TIME_SERIES_FORECAST
Time series anomaly detection TIME_SERIES_ANOMALY_DETECTION

Task type	Keyword
Time series forecasting	TIME_SERIES_FORECAST
Time series anomaly detection	TIME_SERIES_ANOMALY_DETECTION

ALGORITHM: the algorithm used by the model. The following table describes the algorithms supported by Lindorm ML.


Task type	Algorithm	Description
Time series forecasting	DeepAR	The DeepAR algorithm is a deep neural network algorithm based on recurrent neural network (RNN). For more information, see related papers.
Time series forecasting	TFT	The Temporal Fusion Transformer (TFT) algorithm is a deep neural network algorithm based on the Transformer mechanism. For more information, see related papers.
Time series anomaly detection	esd	This algorithm is developed by Alibaba DAMO Academy and is applicable to spiked anomalies, such as spikes in monitoring curves and scenarios in which a small number of data points are significantly different from other data points. For more information, see Time series anomaly detection.
	nsigma	This algorithm is developed by Alibaba DAMO Academy. It is simple and easy to analyze the causes of anomalies. For more information, see Time series anomaly detection.
	ttest	This algorithm is developed by Alibaba DAMO Academy. It is used to identify whether the metrics related to time series data are abnormal because of a change in the average value. For more information, see Time series anomaly detection.

PREPROCESSORS 'string': the preprocessing operations for specific columns. This parameter is optional. In general, the preprocessing operations are specified by JSON strings.

The value of the PREPROCESSORS parameter contains two components: Columns that indicates the columns that need to be preprocessed and Transformers that indicates the preprocessing operations to be performed. Operations specified by Transformers are performed in the specified order. Each Transformers component contains two fields: Name that specifies the name of the preprocessing operation and Parameters that specifies parameters related to the preprocessing operation. The following example shows a sample value of the PREPROCESSORS parameter:

PREPROCESSORS '[
{
  "Columns":[
    "c1"
  ],
  "Transformers":[
    {
      "Name": "Imputer",
      "Parameters": {"value": 0}
    },
    {
      "Name": "StandardScaler"
    }
  ]
},
{
  "Columns":[
    "c2",
    "c3"
  ],
  "Transformers":[
    {
      "Name": "OrdinalEncoder"
    }
  ]
}
]'

Note In the value of the PREPROCESSORS parameter, the Columns component and the Parameters field are both optional.

The preprocessing operations specified during model training are automatically performed in model inference. The following table describes the preprocessing operations supported by Lindorm ML.


Preprocessing operation	Parameter	Description
OneHotEncoder	None	Encode categorical features into binary values. This operation is applicable to categorical features that are not comparable in size.
OrdinalEncoder	None	Encode categorical features into integers from 0. This operation is applicable to categorical features that can be compared in size.
Imputer	method: a string that indicates the interpolation method. Valid values: dummy, mean, median, most_frequent, roll7, and last. Default value: dummy. value: an integer that needs to be interpolated. Default value: 0. This parameter is optional.	Interpolate missing values. You can select multiple interpolation policies.
StandardScaler	None	Convert data into values that follow a standard normal distribution with a mean of 0 and a standard deviation of 1. This operation is also known as the z-score normalization.
MinMaxScaler	min: an integer that indicates the minimum value of the range. max: an integer that indicates the maximum value of the range. This parameter is optional.	Scale the data into a range (min,max). By default, the value range is (0,1).
LogTransformer	None	Converts values into their logarithms.

SETTINGS: other parameters that you can configure. You can specify different parameters for different task types. The following table describes the parameters supported by Lindorm ML.

Parameters related to time series forecasting


Parameter	Type	Description	Required
epochs	INTEGER	The number of epochs for model training. This parameter is applicable only to time series forecasting. Default value: 80	Yes
time_column	VARCHAR	The time column.	Yes
group_columns	VARCHAR	The group column. This column is used as the TAG column to determine the time series.	Yes
freq	VARCHAR	The frequency of the time series data. Example: 1D.	Yes
prediction_length	INTEGER	The step size of time series forecasting.	Yes
feat_static_columns	VARCHAR	The set of static feature columns. Separate multiple columns with commas (,).	No

Parameters related to time series anomaly detection.

For more information about the training parameters supported by time series anomaly detection, see Time series anomaly detection.

Examples

CREATE MODEL tft_model
FROM (SELECT * FROM fresh_sales WHERE `time` > '2021-02-08T00:00:00+08:00')
TARGET sales
TASK time_series_forecast
ALGORITHM tft
SETTINGS
(
  time_column 'time',
  group_columns 'id_code',
  feat_static_columns 'cate1_id,cate2_id,brand_id',
  context_length '28',
  prediction_length '6',
  epochs '5',
  freq '1D'
);

Model management

After the CREATE MODEL statement is executed, you can execute SQL statements to check whether the model is in the Ready state.

View the information about all models in the database

You can use the SHOW MODELS statement to view the basic information about all models in the database.

The following statement provides an example on how to view the information about all models in a database:

SHOW MODELS;

The following information is returned:

+-------------------+--------+--------------------+-------------------------------+-------------------------------+
|       name        | status |    sql_function    |         created_time          |          update_time          |
+-------------------+--------+--------------------+-------------------------------+-------------------------------+
| tft_model         | Ready  | forecast           | 2022-11-04T11:38:05.873+08:00 | 2022-11-04T11:39:14.046+08:00 |
+-------------------+--------+--------------------+-------------------------------+-------------------------------+
1 rows in set (524 ms)

The following table describes the returned values.


Returned value	Description
name	The name of the model.
status	The state of the model. Valid values: Init, Training, Ready, and Failed. The value Init indicates that the model is being initialized. The value Training indicates that the model is being trained. The value Ready indicates that the model is trained. The value Failed indicates that the model failed to be trained.
sql_function	The inference function.
created_time	The time when the model is created.
update_time	The time when the model is last updated.

View the information about a specified model

You can use the SHOW MODEL model_name statement to view the detailed information about a specified model.

The following statement provides an example on how to view the detailed information about a specified model:

SHOW MODEL tft_model;

The following information is returned:

+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
|   name    | status | sql_function |     task_type        | algorithm |               query                | preprocessors |                    settings                     |            metrics             |         created_time          |          update_time          |
+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
| tft_model | Ready  | forecast     | TIME_SERIES_FORECAST | TFT       | SELECT `time`, FIRST(`sales`) AS   | []            | {time_column=time, group_columns=id_code,       | {MAPE=0.35002756118774414,     | 2022-11-04T11:38:05.873+08:00 | 2022-11-04T11:39:14.046+08:00 |
|           |        |              |                      |           | `sales`, `id_code`, `cate1_id`,    |               | feat_static_columns=cate1_id,cate2_id,brand_id, | MASE=0.41281554008773325,      |                               |                               |
|           |        |              |                      |           | `cate2_id`, `brand_id` FROM        |               | context_length=28, prediction_length=6,         | MSE=456.3769938151042}         |                               |                               |
|           |        |              |                      |           | `fresh_sales` WHERE `time` >       |               | epochs=5, freq=1D,train_mode=LOCAL,             |                                |                               |                               |
|           |        |              |                      |           | '2021-02-08T00:00:00+08:00'        |               | past_length=28,                      |                                |                               |                               |
|           |        |              |                      |           | sample by 1D fill zero             |               | forecast_start=2022-07-31 08:00:00}             |                                |                               |                               |
+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
1 rows in set (334 ms)

The following table describes the returned values.


Returned value	Description
task_type	The type of tasks that need to be executed by the model.
algorithm	The name of the algorithm used by the model.
query	The query statement used to train the model.
preprocessors	The syntax for preprocessing operations.
settings	The parameters related to the model.
metrics	The metrics for the model.

Deletes a specified model

You can use the DROP MODEL model_name statement to delete a specified model.

The following statement provides an example on how to delete a specified model:

DROP MODEL tft_model;

The following information is returned:

No rows affected (0.397 seconds)

Model inference

If the model is in the Ready state, you can use system functions to perform model inference. You can use different functions for different types of tasks.

Syntax

SELECT function_name(field_name, model_name, params) FROM table_name [WHERE clause] SAMPLE BY 0;

function_name: the name of the system function. Valid values:

FORECAST: the function used for time series forecasting. For more information about how to configure the field_name, model_name, and params parameters of the FORECAST function, see Time series forecasting function.
ANOMALY_DETECT: the function used for time series anomaly detection. For more information about how to configure the field_name, model_name, and params parameters of the ANOMALY_DETECT function, see Time series anomaly detection function.

Examples

SELECT device_id, region, `time`, raw(temperature) as temperature, anomaly_detect(temperature, ad_model) as detect_result from sensor WHERE time >= '2022-01-01 00:00:00' and time < '2022-01-01 00:01:00' SAMPLE BY 0;