What are the algorithms and syntax that are used to detect time series anomalies - Lindorm

This topic describes the algorithms and syntax that are used to detect time series anomalies.

Applicable engines and versions

The time series anomaly detection syntax is applicable only to LindormTSDB. The time series anomaly detection syntax is supported by all versions of LindormTSDB.

Limits

The time series anomaly detection syntax must be used together with the SAMPLE BY clause.

Overview

Time series anomaly detection uses the online anomaly detection algorithms developed by DAMO Academy to detect abnormal points in a specified time series. During detection, these algorithms continuously learn the characteristics of the time series data, such as trends and periods, and use them to evaluate newly inserted data points. For example, if the value of a newly added data point differs significantly from the values of other points, the algorithm flags the point as potentially abnormal.

You can use time series anomaly detection with the SAMPLE BY clause by using the following methods:

Use the SAMPLE BY 0 clause to detect each data point in all time series. For more information about how to use the clause, see Example 1, Example 2, and Example 3.
Use the SAMPLE BY INTERVAL clause to specify the downsampling interval and use nested downsampling operators, such as MIN, MAX, AVG, COUNT, and SUM.
Important
The value of INTERVAL cannot be 0.
For more information about how to use the clause, see Example 4.
Use the SAMPLE BY 0 clause and nested downsampling operators, such as LATEST, DELTA, and RATE, to query different data. For more information about how to use the clause, see Example 5.

Syntax

select_sample_by_statement ::=  SELECT ( select_clause )
                                FROM table_identifier
                                WHERE where_clause
                                SAMPLE BY 0
select_clause              ::=  selector [ AS identifier ] ( ',' selector [ AS identifier ] )*
selector                   ::=  tag_identifier | time | anomaly_detect '(' field_identifier ',' ( algo_identifier | model_identifier ) [ ',' options ] ')'
where_clause               ::=  relation ( AND relation )* ( OR relation )*
relation                   ::=  ( field_identifier | tag_identifier ) operator term
operator                   ::=  '=' | '<' | '>' | '<=' | '>=' | '!=' | IN | CONTAINS | CONTAINS KEY

In the syntax, anomaly_detect indicates the anomaly detection function. The following table describes the parameters that you can configure.

Parameter	Description
field_identifier	The name of the field column. Note Data in the specified field column cannot be of the VARCHAR or BOOLEAN type.
algo_identifier	The name of the algorithm used to detect anomalies. The online anomaly detection algorithms developed by DAMO Academy are supported. esd: an algorithm suited to spike anomalies, such as spikes in monitoring curves, in which a small number of data points differ significantly from the other data points. nsigma: a simple algorithm whose results make the causes of anomalies easy to analyze. ttest: an algorithm that identifies whether a time series metric is abnormal because of a change in its average value. istl-esd: an algorithm suited to detecting anomalies in periodic data. Note The algo_identifier parameter applies to scenarios in which in-database machine learning is not enabled and anomalies in time series data must be detected.
model_identifier	The name of the model used to detect anomalies. Note The value of the model_identifier parameter is of the VARCHAR type. The model_identifier parameter is applicable to scenarios in which in-database machine learning is enabled and anomalies related to time series data must be detected. For more information, see In-database machine learning.
options	The options used to adjust the detection effect. This parameter is optional. Configure the options in the `key1=value1 key2=value2` format.

Parameters

You can configure parameters for the anomaly detection algorithms that you use. These parameters can be categorized into common parameters, training parameters, and inference parameters. You can specify the optional parameter options to adjust the detection performance of the anomaly detection algorithm.

Note

Common parameters, training parameters, and inference parameters are configured in the same parameter list. For example, you can configure parameters for the ttest algorithm in the following format: lenDetectWindow=100,adhoc_state=true.
For more information about how to configure parameters for ideal detection results, see Parameter tuning for statistical algorithms and Parameter tuning for decomposition algorithms.
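As a rough illustration of the single parameter list described above, the following Python sketch assembles an options string from a dictionary. The helper name `build_options` is hypothetical and is not part of Lindorm. The comma separator follows the ttest example in the note above, and lowercasing Boolean values is an assumption based on that same example.

```python
def build_options(params):
    """Join key=value pairs with commas, as in the ttest example above.

    Hypothetical helper, not a Lindorm API. Booleans are written as
    true/false (assumed from the example). None is rejected because a
    parameter without a value cannot be expressed as key=value.
    """
    parts = []
    for key, value in params.items():
        if value is None:
            raise ValueError(f"parameter {key!r} must not be None")
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return ",".join(parts)


# Common, training, and inference parameters all go into the same list.
options = build_options({"lenDetectWindow": 100, "verbose": True})
print(options)
```

Building the string from a dictionary keeps each parameter name in one place, which can help when the same settings are reused across several queries.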

Common parameters

You can configure common parameters to control the debugging, diagnosis, and other behaviors performed by algorithms during anomaly detection. Common parameters are applicable for all supported anomaly detection algorithms. The following table describes the common parameters that you can configure.

Parameter	Type	Default value	Description
verbose	BOOLEAN	FALSE	Specifies whether to return detailed information and identify the detection result of the specified columns. The returned information varies with the algorithm that you use. Valid values: TRUE FALSE If you set this parameter to `TRUE`, additional columns are displayed in the returned results to show the detailed information. For more information, see the "Detailed information returned for the verbose parameter" section.
adhoc_state	BOOLEAN	FALSE	Specifies whether the anomaly detection status of the algorithm is available only in the current query. For more information about the anomaly detection status, see Exception detection status.
direction	VARCHAR	Up	The types of anomalies that you want to detect. Valid values: Up: Only abnormal increases in the time series data are detected as anomalies. Down: Only abnormal decreases in the time series data are detected as anomalies. Both: Abnormal increases and abnormal decreases in the time series data are both detected as anomalies.

Detailed information returned for the verbose parameter

Algorithm	Additional column	Type	Valid value	Description
esd	anomaly	BOOLEAN	TRUE FALSE	TRUE: The current data point is abnormal. FALSE: The current data point is normal.
	anomalyLevel	STRING	NORMAL UNKNOW	NORMAL: No anomalies are detected for the current data point. UNKNOW: Anomalies are detected for the current data point and are not classified.
	detectedDirection	STRING	UP DOWN NONE	UP: The value of the current data point is larger than the statistical value within the window. DOWN: The value of the current data point is less than the statistical value within the window. NONE: The current data point is normal and the value in the anomaly column is FALSE.
	anomalyScore	DOUBLE	[0, Double.MAX_VALUE]	The score of the detected anomaly. A larger value indicates that the anomaly of the current data point is more obvious.
	threshold	DOUBLE	[0, Double.MAX_VALUE]	The threshold based on which the algorithm determines whether the current data point is abnormal. If the value in the anomalyScore column is larger than the value in the threshold column, the current data point is abnormal. If the value in the anomalyScore column is less than the value in the threshold column, the current data point is normal. The threshold is calculated based on the alpha and lenHistoryWindow parameters. The threshold increases when the value of alpha decreases or the value of lenHistoryWindow increases.
	upperBound	DOUBLE	All values of the DOUBLE type	The upper boundary for anomaly detection. For example, if you set the `maxAnomalyRatio` parameter to 0.3, the value in the upperBound column is the 70th percentile (1 - maxAnomalyRatio) of the ordered data within the window. In this case, data points whose values are smaller than the upperBound value are not reported as anomalies. Note The length of the window is specified by the lenHistoryWindow parameter. If the value of a data point is within the [lowerBound, upperBound] range, the algorithm determines that the data point is normal. If the value is outside the range, the algorithm calculates the anomalyScore of the data point. If the anomalyScore is larger than the value in the threshold column, the data point is determined to be abnormal.
	lowerBound	DOUBLE	All values of the DOUBLE type	The lower boundary for anomaly detection. For example, if you set the `maxAnomalyRatio` parameter to 0.3, the value in the lowerBound column is the 30th percentile of the ordered data within the window. In this case, data points whose values are larger than the lowerBound value are not reported as anomalies. Note The length of the window is specified by the lenHistoryWindow parameter. If the value of a data point is within the [lowerBound, upperBound] range, the algorithm determines that the data point is normal. If the value is outside the range, the algorithm calculates the anomalyScore of the data point. If the anomalyScore is larger than the value in the threshold column, the data point is determined to be abnormal.
	mean	DOUBLE	All values of the DOUBLE type	The average value of the data points within the window.
	median	DOUBLE	All values of the DOUBLE type	The median of the data points within the window.
	std	DOUBLE	All values of the DOUBLE type	The standard deviation of the data points within the window.
	latestTimestamp	LONG	Positive integer	The timestamp of the latest data point within the window.
	warmup	BOOLEAN	TRUE FALSE	TRUE: The algorithm is being initialized and does not detect anomalies. FALSE: The algorithm is initialized.
ttest	anomaly	BOOLEAN	TRUE FALSE	TRUE: The current data point is abnormal. FALSE: The current data point is normal.
	anomalyLevel	STRING	NORMAL UNKNOW	NORMAL: No anomalies are detected for the current data point. UNKNOW: Anomalies are detected for the current data point and are not classified.
	detectedDirection	STRING	UP DOWN NONE	UP: The value of the current data point is larger than the statistical value within the window. DOWN: The value of the current data point is less than the statistical value within the window. NONE: The current data point is normal and the value in the anomaly column is `FALSE`.
	pValue	DOUBLE	(0, 1)	The p-value that indicates how much the value of the current data point deviates from the statistical value within the window. A smaller value indicates that the value of the current data point deviates from the statistical value more significantly.
	threshold	DOUBLE	(0, 1)	The threshold based on which the algorithm determines whether the current data point is abnormal. If the value in the pValue column is less than the value in the threshold column, the current data point is abnormal. If the value of pValue is larger than the value in the threshold column, the current data point is normal.
	trendScore	DOUBLE	All values of the DOUBLE type	The degree of change in the trend of the data points. The larger the absolute value, the more obvious the trend changes. If the value in the trendScore column is larger than zero, the trend of the data points is upward. If the value in the trendScore column is less than zero, the trend of the data points is downward.
	mean	DOUBLE	All values of the DOUBLE type	The average value of the data points within the window. The length of the window is specified by the lenHistoryWindow parameter.
	std	DOUBLE	All values of the DOUBLE type	The standard deviation of the data points within the window.
	latestTimestamp	LONG	Positive integer	The timestamp of the latest data point within the window.
	warmup	BOOLEAN	TRUE FALSE	TRUE: The algorithm is being initialized and does not detect anomalies. FALSE: The algorithm is initialized.
nsigma	anomaly	BOOLEAN	TRUE FALSE	TRUE: The current data point is abnormal. FALSE: The current data point is normal.
	anomalyLevel	STRING	NORMAL UNKNOW	NORMAL: No anomalies are detected for the current data point. UNKNOW: Anomalies are detected for the current data point and are not classified.
	detectedDirection	STRING	UP DOWN NONE	UP: The value of the current data point is larger than the statistical value within the window. DOWN: The value of the current data point is less than the statistical value within the window. NONE: The current data point is normal and the value in the anomaly column is `FALSE`.
	anomalyScore	DOUBLE	[0, Double.MAX_VALUE]	The score of the detected anomaly. A larger value indicates that the anomaly of the current data point is more obvious.
	threshold	DOUBLE	[0, Double.MAX_VALUE]	The judgment threshold, which is used to determine whether the current data point is abnormal. If the value in the anomalyScore column is larger than the value in the threshold column, the current data point is abnormal. If the value in the anomalyScore column is less than the value in the threshold column, the current data point is normal.
	mean	DOUBLE	All values of the DOUBLE type	The average value of the data points within the window.
	std	DOUBLE	All values of the DOUBLE type	The standard deviation of the data points within the window.
	latestTimestamp	LONG	Positive integer	The timestamp of the latest data point within the window.
	warmup	BOOLEAN	TRUE FALSE	TRUE: The algorithm is being initialized and does not detect anomalies. FALSE: The algorithm is initialized.
istl-esd	anomaly	BOOLEAN	TRUE FALSE	TRUE: The current data point is abnormal. FALSE: The current data point is normal.
	anomalyLevel	STRING	NORMAL UNKNOW	NORMAL: No anomalies are detected for the current data point. UNKNOW: Anomalies are detected for the current data point and are not classified.
	residual	DOUBLE	All values of the DOUBLE type	The residual value of the original data after the periodic component and the trend component are removed. In the ISTL algorithm, each data point is decomposed into three components whose sum is the original value: `residual+trend+season`. If the algorithm is being initialized (value in the warmup column is `TRUE`), only the default value 0 is returned in this column.
	trend	DOUBLE	All values of the DOUBLE type	The trending component in the original data. If the algorithm is being initialized (value in the warmup column is `TRUE`), only the default value 0 is returned in this column.
	season	DOUBLE	All values of the DOUBLE type	The periodic component in the original data. If the algorithm is being initialized (value in the warmup column is `TRUE`), only the default value 0 is returned in this column.
	warmup	BOOLEAN	TRUE FALSE	TRUE: The algorithm is being initialized and does not detect anomalies. Note Four cycles of data points are required to initialize the algorithm. During initialization, values returned in the residual, trend, and season columns are invalid. The default value 0 is returned in these columns. FALSE: The algorithm is initialized.
	Other additional columns (same as the additional columns returned for the esd algorithm)	Same as the data types of the columns returned for the esd algorithm.	Same as the valid values of the columns returned for the esd algorithm.	If you specify `esd.verbose=true` when you call an anomaly detection function, the esd verbose mode is enabled. In this case, all columns for the esd and ttest algorithms (excluding the anomaly, warmup, and anomalyLevel columns) in the verbose mode are returned.
istl-nsigma	anomaly	BOOLEAN	TRUE FALSE	TRUE: The current data point is abnormal. FALSE: The current data point is normal.
	anomalyLevel	STRING	NORMAL UNKNOW	NORMAL: No anomalies are detected for the current data point. UNKNOW: Anomalies are detected for the current data point and are not classified.
	trend	DOUBLE	All values of the DOUBLE type	The trending component in the original data. If the algorithm is being initialized (value in the warmup column is `TRUE`), only the default value 0 is returned in this column.
	season	DOUBLE	All values of the DOUBLE type	The periodic component in the original data. If the algorithm is being initialized (value in the warmup column is `TRUE`), only the default value 0 is returned in this column.
	residual	DOUBLE	All values of the DOUBLE type	The residual value of the original data after the periodic component and the trend component are removed. If the algorithm is being initialized (value in the warmup column is `TRUE`), only the default value 0 is returned in this column.
	warmup	BOOLEAN	TRUE FALSE	TRUE: The algorithm is being initialized and does not detect anomalies. Note Four cycles of data points are required to initialize the algorithm. During initialization, values returned in the residual, trend, and season columns are invalid. The default value 0 is returned in these columns. FALSE: The algorithm is initialized.
	Other additional columns (same as the additional columns returned for the nsigma algorithm)	Same as the data types of the columns returned for the nsigma algorithm.	Same as the valid values of the columns returned for the nsigma algorithm	If you specify `nsigma.verbose=true` when you call an anomaly detection function, the nsigma verbose mode is enabled.
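The decision rules stated in the table for the esd and ttest columns can be sketched in Python as follows. This is only an illustration of how the verbose columns relate to the `anomaly` flag, not the DAMO Academy implementation. All function names are hypothetical, the nearest-rank percentile method is an assumption (the table only says "distribution value of the ordered data"), and the computation of `anomalyScore` is not described in this topic, so the score is passed in as an argument.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, q in (0, 1] (assumed method)."""
    if not sorted_values:
        raise ValueError("window is empty")
    # round() guards against floating-point noise such as 0.7 * 10 != 7.
    rank = max(1, math.ceil(round(q * len(sorted_values), 9)))
    return sorted_values[rank - 1]


def esd_bounds(window, max_anomaly_ratio=0.3):
    """Return (lowerBound, upperBound) for the data points in the window."""
    ordered = sorted(window)
    lower = percentile(ordered, max_anomaly_ratio)
    upper = percentile(ordered, 1 - max_anomaly_ratio)
    return lower, upper


def esd_is_anomaly(value, lower, upper, anomaly_score, threshold):
    """Two-stage rule: values inside the bounds are normal, others are scored."""
    if lower <= value <= upper:
        return False
    return anomaly_score > threshold


def ttest_is_anomaly(p_value, threshold):
    """For ttest, a pValue below the threshold marks the point as abnormal."""
    return p_value < threshold


lower, upper = esd_bounds(range(1, 11), max_anomaly_ratio=0.3)
print(lower, upper)
print(esd_is_anomaly(12, lower, upper, anomaly_score=4.2, threshold=3.5))
```

Note that the comparison direction differs between the two algorithms: esd reports an anomaly when the score exceeds the threshold, while ttest reports one when the pValue falls below it.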

Training parameters

You can specify an algorithm and configure the training parameters to determine the model used to detect anomalies. The values of training parameters are cleared after you restart LindormTSDB. In this case, you must configure the training parameters again to train the model. The model is trained in real time during detection to learn and adapt to the characteristics of the time series data.

Note

Take note of the following items when you configure training parameters:

The names of the parameters are not case-sensitive.
The values of training parameters can be numbers or strings and cannot be NULL.
The values of the parameters must be within the specific ranges.

Algorithm	Parameter	Type	Valid value	Description
esd	compression	INTEGER	A positive integer. Valid values: `(10,1000)`. Default value: 100.	The spatial complexity of the data structure in the algorithm. A larger value of this parameter indicates that the algorithm uses more memory during detection and returns more accurate results.
esd	lenHistoryWindow	INTEGER	Valid values: positive integers that are equal to or larger than 20. Default value: null.	The length of the reference time window. If you specify a short reference time window, only the recent data points within the time window are used as references during the detection. If you set this parameter to null, all data points that are inserted after the first detection are used as references.
nsigma	lenHistoryWindow	INTEGER	Valid values: positive integers that are equal to or larger than 20. Default value: null.	The length of the reference time window. If you specify a short reference time window, only the recent data points within the time window are used as references during the detection. If you set this parameter to null, all data points that are inserted after the first detection are used as references.
ttest	lenDetectWindow	INTEGER	A positive integer. Default value: 10.	The length of the most recent time window within which you want to detect anomalies.
ttest	lenHistoryWindow	INTEGER	Valid values: positive integers that are equal to or larger than 20. Default value: 100.	The length of the reference time window. If you specify a short reference time window, only the recent data points within the time window are used as references during the detection. If you set this parameter to `null`, all data points that are inserted after the first detection are used as references. Note The value of this parameter must be larger than the value of lenDetectWindow.
istl-esd	frequency	VARCHAR	A string that consists of a digit and a time unit. Examples: 5M, 24H, and 1D. Valid time units: n/ns: nanosecond. u/us: microsecond. m/ms: millisecond. s/S: second. M/min: minute. H/h: hour. D/d: day.	The frequency at which the time series data is collected. For example, if one time series data point is collected per hour, set this parameter to `1H`. Important If this parameter is not specified, the algorithm automatically calculates the frequency at which the time series data is collected. However, if a lot of values are missing in the time series data, the calculated frequency may be inaccurate. If you specify the frequency parameter, the value of this parameter must be the same as that of the INTERVAL parameter specified in the `SAMPLE BY INTERVAL` statement.
	periods	VARCHAR	A string that consists of a digit and a time unit. Examples: 5M, 24H, and 1D. Valid time units: n/ns: nanosecond. u/us: microsecond. m/ms: millisecond. s/S: second. M/min: minute. H/h: hour. D/d: day.	The total period length of the periodic data. You can use indexers to specify multiple period lengths. Example: `periods[0]=1440`;`periods[1]=1880`. Note If this parameter is not specified, the algorithm automatically calculates the period.
	esd.*	N/A	N/A	The training parameters that are required to define the esd algorithm. These parameters are the same as the training parameters described in the esd section of this table. You can add the `esd.` prefix to the training parameters of the esd algorithm to configure these parameters. Example: `esd.lenHistoryWindow=10`.
istl-nsigma	frequency	VARCHAR	A string that consists of a digit and a time unit. Examples: 5M, 24H, and 1D. Valid time units: n/ns: nanosecond. u/us: microsecond. m/ms: millisecond. s/S: second. M/min: minute. H/h: hour. D/d: day.	The frequency at which the time series data is collected. For example, if one time series data point is collected per hour, set this parameter to `1H`. Important If this parameter is not specified, the algorithm automatically calculates the frequency at which the time series data is collected. However, if a lot of values are missing in the time series data, the calculated frequency may be inaccurate. If you specify the frequency parameter, the value of this parameter must be the same as that of the INTERVAL parameter specified in the `SAMPLE BY INTERVAL` statement.
	periods	VARCHAR	A string that consists of a digit and a time unit. Examples: 5M, 24H, and 1D. Valid time units: n/ns: nanosecond. u/us: microsecond. m/ms: millisecond. s/S: second. M/min: minute. H/h: hour. D/d: day.	The total period length of the periodic data. You can use indexers to specify multiple period lengths. Example: `periods[0]=1440`;`periods[1]=1880`. Note If this parameter is not specified, the algorithm automatically calculates the period.
	nsigma.*	N/A	N/A	The training parameters that are required to define the nsigma algorithm. These parameters are the same as the training parameters described in the nsigma section of this table. You can add the `nsigma.` prefix to the training parameters of the nsigma algorithm to configure these parameters. Example: `nsigma.lenHistoryWindow=10`.
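The `frequency` values in the table combine a number with a case-sensitive unit, so `5m` (milliseconds) and `5M` (minutes) differ by several orders of magnitude. The following Python sketch converts such a string to nanoseconds using the units listed in the table. The function name `parse_duration_ns` is hypothetical, and the requirement that the number be a positive integer is an assumption. The sketch shows how the notation reads and does not describe how Lindorm parses it.

```python
import re

NS_PER_SECOND = 1_000_000_000

# Unit suffix -> nanoseconds. Case matters: "m" is millisecond, "M" is minute.
UNIT_NS = {
    "n": 1, "ns": 1,
    "u": 1_000, "us": 1_000,
    "m": 1_000_000, "ms": 1_000_000,
    "s": NS_PER_SECOND, "S": NS_PER_SECOND,
    "M": 60 * NS_PER_SECOND, "min": 60 * NS_PER_SECOND,
    "H": 3_600 * NS_PER_SECOND, "h": 3_600 * NS_PER_SECOND,
    "D": 86_400 * NS_PER_SECOND, "d": 86_400 * NS_PER_SECOND,
}

DURATION_PATTERN = re.compile(r"^(\d+)([A-Za-z]+)$")


def parse_duration_ns(text):
    """Convert a string such as '5M', '24H', or '1D' to nanoseconds."""
    match = DURATION_PATTERN.match(text.strip())
    if match is None or match.group(2) not in UNIT_NS:
        raise ValueError(f"unrecognized duration: {text!r}")
    count = int(match.group(1))
    if count <= 0:
        raise ValueError("duration must be positive")
    return count * UNIT_NS[match.group(2)]


print(parse_duration_ns("5M") // NS_PER_SECOND)  # seconds in five minutes
print(parse_duration_ns("24H") == parse_duration_ns("1D"))
```

A check like this can be useful before a query is sent, for example to confirm that a chosen `frequency` value represents the same duration as the downsampling interval.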

Inference parameters

Inference parameters take effect only during anomaly detection and are not case-sensitive.

Algorithm	Parameter	Type	Valid value	Description
esd	alpha	DOUBLE	Default value: 0.1. Valid values: `(0,1)`.	The sensitivity of anomaly detection. A larger value of this parameter indicates that the algorithm is more sensitive to anomalies and reports more anomalies.
	direction	VARCHAR	Default value: Up.	The types of anomalies that you want to detect. Up: Only abnormal increases in the time series data are detected as anomalies. Down: Only abnormal decreases in the time series data are detected as anomalies. Both: Abnormal increases and abnormal decreases in the time series data are both detected as anomalies.
	maxAnomalyRatio	DOUBLE	Default value: 0.3. Valid values: `(0,1]`. If you set this parameter to 1, no anomalies are returned.	The maximum ratio based on which anomalies are detected. For example, if you set maxAnomalyRatio to 0.3 and direction to Up, data points whose values are less than the 70th percentile are not detected as anomalies. If you set direction to Up, you can configure this parameter to prevent data points with smaller values from being detected as anomalies. If you set direction to Down, you can configure this parameter to prevent data points with larger values from being detected as anomalies.
	warmupCount	INTEGER	A positive integer. Default value: 20.	The minimum number of data points that is required for the algorithm to start to report anomalies. For example, if you set this parameter to 20, the algorithm does not report anomalies when the number of data points that need to be detected is less than 20.
nsigma	n	DOUBLE	A non-zero floating-point number. Default value: 3.0.	If you set n to a positive number, the algorithm reports an anomaly when the current value exceeds the average value by more than the product of n and the standard deviation. If you set n to a negative number, the algorithm reports an anomaly when the current value falls below the average value by more than the product of the absolute value of n and the standard deviation.
nsigma	warmupCount	INTEGER	A positive integer. Default value: 20.	The minimum number of data points that is required for the algorithm to start to report anomalies. For example, if you set this parameter to 20, the algorithm does not report anomalies when the number of data points that need to be detected is less than 20.
ttest	alpha	DOUBLE	Default value: 0.05. Valid values: `(0,1)`.	The sensitivity of anomaly detection. A larger value of this parameter indicates that the algorithm is more sensitive to anomalies and reports more anomalies.
ttest	direction	VARCHAR	Default value: Up.	The types of anomalies that you want to detect. Up: Only abnormal increases in the time series data are detected as anomalies. Down: Only abnormal decreases in the time series data are detected as anomalies. Both: Abnormal increases and abnormal decreases in the time series data are both detected as anomalies.
istl-esd	esd.*	N/A	N/A	The inference parameters that are required to define the esd algorithm. These parameters are the same as the inference parameters described in the esd section of this table. You can add the `esd.` prefix to the inference parameters of the esd algorithm to configure these parameters. Example: `esd.direction=Both`.
istl-nsigma	nsigma.*	N/A	N/A	The inference parameters that are required to define the nsigma algorithm. These parameters are the same as the inference parameters described in the nsigma section of this table. You can add the `nsigma.` prefix to the inference parameters of the nsigma algorithm to configure these parameters. Example: `nsigma.n=5`.
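To make the nsigma rule in the table concrete, here is a minimal Python sketch of an online n-sigma check. It is not the DAMO Academy implementation. The class name `NSigmaSketch` is hypothetical, and several details are assumptions: the population standard deviation is used, each point is compared against the history that precedes it, a negative `n` is applied through its absolute value, and the warm-up phase ends once `warmup_count` points have been collected.

```python
from collections import deque
from statistics import fmean, pstdev


class NSigmaSketch:
    """Illustrative online n-sigma detector (assumptions noted in the text)."""

    def __init__(self, n=3.0, warmup_count=20, len_history_window=None):
        if n == 0:
            raise ValueError("n must be non-zero")
        if warmup_count < 1:
            raise ValueError("warmup_count must be a positive integer")
        self.n = n
        self.warmup_count = warmup_count
        # maxlen=None keeps every point, mirroring lenHistoryWindow=null.
        self.history = deque(maxlen=len_history_window)

    def update(self, value):
        """Check value against the history seen so far, then store it."""
        anomaly = False
        if len(self.history) >= self.warmup_count:
            mean = fmean(self.history)
            limit = abs(self.n) * pstdev(self.history)
            if self.n > 0:
                anomaly = value - mean > limit  # upward deviation
            else:
                anomaly = mean - value > limit  # downward deviation
        self.history.append(value)
        return anomaly


detector = NSigmaSketch(n=3.0, warmup_count=4)
flags = [detector.update(v) for v in [10, 11, 9, 10, 10.5, 50]]
print(flags)  # [False, False, False, False, False, True]
```

The first four points only fill the history, 10.5 stays within three standard deviations of the mean, and 50 is flagged. Constructing the detector with `n=-3.0` would flag only values that fall far below the mean.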

Examples

Example 1: Use the esd algorithm to detect anomalies in the temperature data within a specific time range in a time series table named sensor.

SELECT device_id, region, time, anomaly_detect(temperature, 'esd') AS detect_result
FROM sensor
WHERE time >= '2022-01-01 00:00:00' AND time < '2022-01-01 00:01:00'
SAMPLE BY 0;