This topic describes the how to use Lindorm machine learning (ML) to perform machine learning in databases, including model training, model inference, and model management.
Overview
Lindorm ML is an out-of-the-box in-database machine learning service provided by LindormTSDB. The process of Lindorm ML includes the following steps: model training, model inference, and model management. After you enable Lindorm ML, you can execute SQL statements to perform the full process of machine learning within databases to explore more data value. Lindorm ML has the following benefits:
- Easy-to-use: You can execute standard SQL statements to perform the full process of machine learning within databases even if you do not have professional knowledge for machine learning.
- Data retention: You do not need to export data to external platforms for machine learning. This improves the efficiency of machine learning and helps your data meet regulation requirements.
- Enterprise features: In Lindorm ML, models and data are both stored in databases. Therefore, you can use the enterprise features provided for databases in machine learning, such as permission management, data audit, and data encryption.
Enable Lindorm ML
For more information about how to enable Lindorm ML, see Enable Lindorm ML.
Model training
When you train a model in Lindorm ML, you must specify the type of tasks that need to be executed by the model. Lindorm ML provides different algorithms for different types of tasks. You can use the extended CREATE MODEL statement in your database to train a model.
Syntax
CREATE MODEL model_name
FROM { table_name | (select_statement) }
[ TARGET column_name ]
TASK ( TIME_SERIES_FORECAST | TIME_SERIES_ANOMALY_DETECTION )
ALGORITHM ( DEEPAR | TFT | esd | nsigma | ttest )
[ PREPROCESSORS 'string' ]
SETTINGS (
EPOCHS integer,
...
)
Parameters
- model_name: the name of the model. The model name is unique within a schema.
- FROM { table_name | (select_statement) }: the name of table or the query statement that contains the data used to train the model. The results of the statement must contain at least two columns, and one of the columns must be the time column.
- TARGET column_name: the name of the column that is used to specify the target of time series forecasting or time series anomaly detection.
- TASK: the type of tasks that need to be executed by the model. The following table describes the task types supported by Lindorm ML.
Task type Keyword Time series forecasting TIME_SERIES_FORECAST Time series anomaly detection TIME_SERIES_ANOMALY_DETECTION - ALGORITHM: the algorithm used by the model. The following table describes the algorithms supported by Lindorm ML.
Task type Algorithm Description Time series forecasting DeepAR The DeepAR algorithm is a deep neural network algorithm based on recurrent neural network (RNN). For more information, see related papers. TFT The Temporal Fusion Transformer (TFT) algorithm is a deep neural network algorithm based on the Transformer mechanism. For more information, see related papers. Time series anomaly detection esd This algorithm is developed by Alibaba DAMO Academy and is applicable to spiked anomalies, such as spikes in monitoring curves and scenarios in which a small number of data points are significantly different from other data points. For more information, see Time series anomaly detection. nsigma This algorithm is developed by Alibaba DAMO Academy. It is simple and easy to analyze the causes of anomalies. For more information, see Time series anomaly detection. ttest This algorithm is developed by Alibaba DAMO Academy. It is used to identify whether the metrics related to time series data are abnormal because of a change in the average value. For more information, see Time series anomaly detection. - PREPROCESSORS 'string': the preprocessing operations for specific columns. This parameter is optional. In general, the preprocessing operations are specified by JSON strings.
The value of the PREPROCESSORS parameter contains two components: Columns that indicates the columns that need to be preprocessed and Transformers that indicates the preprocessing operations to be performed. Operations specified by Transformers are performed in the specified order. Each Transformers component contains two fields: Name that specifies the name of the preprocessing operation and Parameters that specifies parameters related to the preprocessing operation. The following example shows a sample value of the PREPROCESSORS parameter:
PREPROCESSORS '[ { "Columns":[ "c1" ], "Transformers":[ { "Name": "Imputer", "Parameters": {"value": 0} }, { "Name": "StandardScaler" } ] }, { "Columns":[ "c2", "c3" ], "Transformers":[ { "Name": "OrdinalEncoder" } ] } ]'
Note In the value of the PREPROCESSORS parameter, the Columns component and the Parameters field are both optional.The preprocessing operations specified during model training are automatically performed in model inference. The following table describes the preprocessing operations supported by Lindorm ML.
Preprocessing operation Parameter Description OneHotEncoder None Encode categorical features into binary values. This operation is applicable to categorical features that are not comparable in size. OrdinalEncoder None Encode categorical features into integers from 0. This operation is applicable to categorical features that can be compared in size. Imputer - method: a string that indicates the interpolation method. Valid values: dummy, mean, median, most_frequent, roll7, and last. Default value: dummy.
- value: an integer that needs to be interpolated. Default value: 0. This parameter is optional.
Interpolate missing values. You can select multiple interpolation policies. StandardScaler None Convert data into values that follow a standard normal distribution with a mean of 0 and a standard deviation of 1. This operation is also known as the z-score normalization. MinMaxScaler - min: an integer that indicates the minimum value of the range.
- max: an integer that indicates the maximum value of the range. This parameter is optional.
Scale the data into a range (min,max). By default, the value range is (0,1). LogTransformer None Converts values into their logarithms. - SETTINGS: other parameters that you can configure. You can specify different parameters for different task types. The following table describes the parameters supported by Lindorm ML.
Parameters related to time series forecasting
Parameter Type Description Required epochs INTEGER The number of epochs for model training. This parameter is applicable only to time series forecasting. Default value: 80 Yes time_column VARCHAR The time column. Yes group_columns VARCHAR The group column. This column is used as the TAG column to determine the time series. Yes freq VARCHAR The frequency of the time series data. Example: 1D. Yes prediction_length INTEGER The step size of time series forecasting. Yes feat_static_columns VARCHAR The set of static feature columns. Separate multiple columns with commas (,). No Parameters related to time series anomaly detection.
For more information about the training parameters supported by time series anomaly detection, see Time series anomaly detection.
Examples
CREATE MODEL tft_model
FROM (SELECT * FROM fresh_sales WHERE `time` > '2021-02-08T00:00:00+08:00')
TARGET sales
TASK time_series_forecast
ALGORITHM tft
SETTINGS
(
time_column 'time',
group_columns 'id_code',
feat_static_columns 'cate1_id,cate2_id,brand_id',
context_length '28',
prediction_length '6',
epochs '5',
freq '1D'
);
Model management
After the CREATE MODEL statement is executed, you can execute SQL statements to check whether the model is in the Ready state.
View the information about all models in the database
You can use the SHOW MODELS
statement to view the basic information about all models in the database.
The following statement provides an example on how to view the information about all models in a database:
SHOW MODELS;
The following information is returned:
+-------------------+--------+--------------------+-------------------------------+-------------------------------+
| name | status | sql_function | created_time | update_time |
+-------------------+--------+--------------------+-------------------------------+-------------------------------+
| tft_model | Ready | forecast | 2022-11-04T11:38:05.873+08:00 | 2022-11-04T11:39:14.046+08:00 |
+-------------------+--------+--------------------+-------------------------------+-------------------------------+
1 rows in set (524 ms)
The following table describes the returned values.
Returned value | Description |
---|---|
name | The name of the model. |
status | The state of the model. Valid values: Init, Training, Ready, and Failed. The value Init indicates that the model is being initialized. The value Training indicates that the model is being trained. The value Ready indicates that the model is trained. The value Failed indicates that the model failed to be trained. |
sql_function | The inference function. |
created_time | The time when the model is created. |
update_time | The time when the model is last updated. |
View the information about a specified model
You can use the SHOW MODEL model_name
statement to view the detailed information about a specified model.
The following statement provides an example on how to view the detailed information about a specified model:
SHOW MODEL tft_model;
The following information is returned:
+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
| name | status | sql_function | task_type | algorithm | query | preprocessors | settings | metrics | created_time | update_time |
+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
| tft_model | Ready | forecast | TIME_SERIES_FORECAST | TFT | SELECT `time`, FIRST(`sales`) AS | [] | {time_column=time, group_columns=id_code, | {MAPE=0.35002756118774414, | 2022-11-04T11:38:05.873+08:00 | 2022-11-04T11:39:14.046+08:00 |
| | | | | | `sales`, `id_code`, `cate1_id`, | | feat_static_columns=cate1_id,cate2_id,brand_id, | MASE=0.41281554008773325, | | |
| | | | | | `cate2_id`, `brand_id` FROM | | context_length=28, prediction_length=6, | MSE=456.3769938151042} | | |
| | | | | | `fresh_sales` WHERE `time` > | | epochs=5, freq=1D,train_mode=LOCAL, | | | |
| | | | | | '2021-02-08T00:00:00+08:00' | | past_length=28, | | | |
| | | | | | sample by 1D fill zero | | forecast_start=2022-07-31 08:00:00} | | | |
+-----------+--------+--------------+----------------------+-----------+------------------------------------+---------------+-------------------------------------------------+--------------------------------+-------------------------------+-------------------------------+
1 rows in set (334 ms)
The following table describes the returned values.
Returned value | Description |
---|---|
task_type | The type of tasks that need to be executed by the model. |
algorithm | The name of the algorithm used by the model. |
query | The query statement used to train the model. |
preprocessors | The syntax for preprocessing operations. |
settings | The parameters related to the model. |
metrics | The metrics for the model. |
Deletes a specified model
You can use the DROP MODEL model_name
statement to delete a specified model.
The following statement provides an example on how to delete a specified model:
DROP MODEL tft_model;
The following information is returned:
No rows affected (0.397 seconds)
Model inference
If the model is in the Ready state, you can use system functions to perform model inference. You can use different functions for different types of tasks.
Syntax
SELECT function_name(field_name, model_name, params) FROM table_name [WHERE clause] SAMPLE BY 0;
- FORECAST: the function used for time series forecasting. For more information about how to configure the field_name, model_name, and params parameters of the FORECAST function, see Time series forecasting function.
- ANOMALY_DETECT: the function used for time series anomaly detection. For more information about how to configure the field_name, model_name, and params parameters of the ANOMALY_DETECT function, see Time series anomaly detection function.
Examples
SELECT device_id, region, `time`, raw(temperature) as temperature, anomaly_detect(temperature, ad_model) as detect_result from sensor WHERE time >= '2022-01-01 00:00:00' and time < '2022-01-01 00:01:00' SAMPLE BY 0;