Regression models can be used in data analysis, forecasting, automatic monitoring, and anomaly detection. In complex system management, you can use regression models and configure thresholds and alert rules to significantly improve the timeliness and accuracy of problem identification and ensure the stability of your system. This topic describes the syntax of regression analysis functions. This topic also provides examples on how to use the functions.
Background information
The following formula is used: y = a1 × x1 + a2 × x2 + b + noise
.
Parameter | Description |
| A column of data that is collected by Simple Log Service. |
| A column of data that is collected by Simple Log Service. |
| A random variable. |
| The calculation results. |
A regression analysis function identifies the values of a1
, a2
, and b
based on x1
, x2
, and y
that are provided and weight data. Then, the function returns the calculation results. a1, a2, and b are coefficients.
The sample log for regression analysis functions contains six indexed fields. The following figure shows the indexed fields. For more information, see Create indexes.
The following code shows the sample log:
{"group_id":"A","observation_id":"S001","time_offset":"0","x1":"1","x2":"5","y":"23.91700530543459"} {"group_id":"A","observation_id":"S002","time_offset":"-1","x1":"2","x2":"2","y":"6.931858878794941"} {"group_id":"A","observation_id":"S003","time_offset":"-2","x1":"3","x2":"8","y":"16.17603801639615"} {"group_id":"A","observation_id":"S004","time_offset":"-3","x1":"4","x2":"6","y":"24.97127625789946"} {"group_id":"A","observation_id":"S005","time_offset":"-4","x1":"5","x2":"2","y":"11.933292736756384"} {"group_id":"A","observation_id":"S006","time_offset":"-5","x1":"6","x2":"8","y":"21.034262717019995"} {"group_id":"A","observation_id":"S007","time_offset":"-6","x1":"7","x2":"1","y":"25.966770392099868"} {"group_id":"A","observation_id":"S008","time_offset":"-7","x1":"8","x2":"7","y":"16.93019469603219"} {"group_id":"A","observation_id":"S009","time_offset":"-8","x1":"9","x2":"2","y":"19.967258015889847"} {"group_id":"A","observation_id":"S010","time_offset":"-9","x1":"10","x2":"3","y":"27.0277513207651"}
Functions
Function | Syntax | Description | Data type of the return value |
| Returns a regression model in the JSON format. This function is a scalar function. The input is samples aggregated by the array_agg function and a weight, which is optional. | varchar | |
linear_model_predict(varchar model_in_json, array(double) x_sample) | Performs data forecasting based on an existing regression model and the independent variable that you specify. | double | |
recent_regression(double y, array(double) x_array, double cur_sample_time_period, double cur_batch_begin_period, double cur_batch_end_period, double time_unit, double damping_weight_per_time_unit) | Updates the parameters and state variables of a regression model based on the recently collected data in online mode. The weight of a regression model is adjusted based on the sample age. The importance of a sample decays exponentially as the sample ages. | varchar | |
merge_recent_regression(varchar model_1_json, varchar model_2_json) | Merges the parameters and state variables of a regression model that are returned by the recent_regression function after the function is called twice. The results are the same as the parameters and state variables of a new regression model that is trained based on the two sets of data. | varchar | |
recent_regression_predict(varchar model_json, array(double) x_sample) | Performs data forecasting based on an adaptive regression model. | double |
Regression models with sample weights
You can specify sample weights for regression models. You can specify sample weights related to time and dependent variables. If the sample weight of a regression model decays as the samples age, the regression model focuses more on the most recent data to adapt to system changes. If the sample weight of a regression model is the reciprocal of the absolute value for the dependent variable, the regression model can minimize the relative error.
linear_model function
The linear_model function returns a regression model in the JSON format. This function is a scalar function. The input is samples aggregated by the array_agg function and a weight, which is optional. For more information, see array_agg function.
varchar linear_model(array(array(double)) x_samples, array(double) y_samples)
Or
varchar linear_model(array(array(double)) x_samples, array(double) y_samples, array(double) weights)
Parameter | Description |
| A data matrix that consists of multiple independent variable samples. Each row indicates an observation operation on the independent variable samples. |
| A vector that consists of dependent variable samples. |
| Optional. If this parameter is left empty, the same weight is specified for all variable samples. |
Examples
Query statement
* | select group_id, linear_model( array_agg(array[x1, x2]), array_agg(y) ) as model from log group by group_id
Query and analysis results
The
coefficients
parameter in the query and analysis results indicates the coefficient of the linear regression model that is trained based on the input data.In data forecasting, the return value of the linear_model function is used as an input parameter of the linear_model_predict function.
group_id
model
A
{ "coefficients": [ 0.8350068912618618, -0.741283054726383, 19.17405856472653 ], "isBuilt": true, "isBuildSuccessful": true, "sampleCount": 10, "xCount": 2, "wSum": 10.0, "ySumSquare": 3930.0, "ySum": 188.0, "xXSumProducts": [ [ 385.0, 367.0 ], [ 367.0, 475.0 ] ], "xYSumProducts": [ 1104.0, 1239.0 ], "xSums": [ 55.0, 67.0 ], "xMeans": [ 5.5, 6.7 ], "xStdDevs": [ 2.8722813232690143, 1.6155494421403511 ], "xVariances": [ 8.25, 2.6099999999999994 ], "yMean": 18.8, "yStdDev": 6.289674077406551, "yVariance": 39.559999999999945, "xCorrelations": [ [ 1.0, -0.03232540919176149 ], [ -0.03232540919176149, 1.0 ] ], "xYCorrelations": [ 0.3874743195572169, -0.202730375711539 ], "regularized": true, "regularWeight": 1.0E-6 }
linear_model_predict function
The linear_model_predict function performs data forecasting based on the regression model and input variable samples that you specify.
double linear_model_predict(varchar model_in_json, array(double) x_sample)
Parameter | Description |
| The regression model that is returned by the linear_model function. For more information, see linear_model function. |
| The new independent variable. |
Example
Query statement
* | with group_models as ( select group_id, linear_model( array_agg(array[x1, x2]), array_agg(y) ) as model from log group by group_id ) select d.group_id, d.y, linear_model_predict(m.model, array[x1, x2]) as predicted_y from group_models as m join log as d on m.group_id = d.group_id
Query and analysis results
The values of the
predicted_y
parameter are calculated based on the independent variable.group_id
observation_id
y
predicted_y
A
S001
23.91700530543459
15.68867910570816
A
S002
6.931858878794941
15.352330987812993
...
...
...
Online adaptive regression algorithm
The online adaptive regression algorithm incrementally updates a regression model with new data when the algorithm receives the new data. The algorithm supports more efficient computing and cost-effective storage than the batch algorithm in processing a large amount of data. The online adaptive regression algorithm is suitable for continuous profiling. After data processing, the algorithm discards samples, which is more practical and convenient.
The online adaptive regression algorithm automatically and exponentially decays the impacts of historical samples on statistical features when the algorithm incrementally computes the statistical features and the regression model that is used. This way, the most recent samples maintain high weights and the regression model can adapt to system changes.
recent_regression function
The recent_regression function updates the parameters and state variables of a regression model based on the recently collected data in online mode. The weight of a regression model is adjusted based on the sample age. The importance of a sample decays exponentially as the sample ages.
varchar recent_regression(double y, array(double) x_array, double sample_time, double cur_batch_begin_period, double cur_batch_end_period, double time_unit, double unit_damping_weight)
Parameter | Description |
| The dependent variable sample. |
| The array of the independent variable samples. |
| The point in time of data in the sample row. The value is converted to a digit. |
| The start time of the time range for the data that is used for model training. |
| The end time of the time range for the data that is used for model training. The time range is a closed interval, which is presented as |
| The time interval. The unit of the time interval is the same as the unit of the value specified by the |
| The exponential decay base. If you configure this parameter in together with the time_unit parameter, the sample weight varies based on time. As the value specified by the time_unit parameter increases by 1, the sample weight decays based on a fixed value specified by the unit_damping_weight parameter. You can configure the parameters to allow the sample weight to exponentially decay based on a half-life period. For example, the weight is 1 for data at the latest point in time. Then, the weight decays to 0.5 for data of one day ago, to 0.25 for data of two days ago, and to 0.125 for data of three days ago. The value of the unit_damping_weight parameter is calculated based on the following formula: unit_damping_weight = 2 ^ - (Time interval of samples/Half-life period) |
Example
Query statement
* | select group_id, recent_regression( y, array[x1, x2, 1.0], -- The dependent and independent variable samples. time_offset, -- The point in time of the sample. -4, -- The start time of the time range for the current sample batch. 0, -- The end time of the time range for the current sample batch. 1, -- The time interval. 0.999 -- The exponential decay base. ) as reg_model from log where time_offset >= -4 and time_offset <= 0 group by group_id
Query and analysis results
The
coefficients
parameter in the query and analysis results indicates the coefficient of the linear regression model that is trained based on the input data.In data forecasting, the return value of the recent_regression function is used as an input parameter of the recent_regression_predict function.
group_id
reg_model
A
{ "sampleCount": 5, "xCount": 3, "timeUnit": 1.0, "beginTimePeriod": -4.0, "endTimePeriod": 0.0, "unitDampingWeight": 0.999, "wSum": 4.990009995001, "ySumSquare": 1644.6974283836598, "ySum": 83.76770287757991, "xXSumSquares": [ [ 54.830206884025, 70.82220388003, 14.960044976005001 ], [ 70.82220388003, 173.70327985603598, 25.955043976006 ], [ 14.960044976005001, 25.955043976006, 4.990009995001 ] ], "xYSumProducts": [ 245.21187055562675, 402.5070758759011, 83.76770287757991 ], "xSums": [ 14.960044976005001, 25.955043976006, 4.990009995001 ], "xMeans": [ 2.997999000200801, 5.201401199999158, 1.0 ], "xStdDevs": [ 1.4142126422148122, 2.7848935986573244, 0.0 ], "xVariances": [ 1.9999973974002003, 7.755632355842543, 0.0 ], "yMean": 16.78708118049834, "yStdDev": 6.913170639821401, "yVariance": 47.79192829528864, "xCorrelations": [ [ 1.0, -0.35572473794248516, 0.0 ], [ -0.35572473794248516, 1.0, 0.0 ], [ 0.0, 0.0, 1.0 ] ], "xYCorrelations": [ -0.12142097167729436, -0.34560624507434407, 0.0 ], "coefficients": [ -1.3675797278475395, -1.104969989478544, 0.0, 26.634476066516903 ], "isBuilt": true, "isBuildSuccessful": true }
merge_recent_regression function
The merge_recent_regression function merges the parameters and state variables of a regression model that are returned by the recent_regression function after the function is called twice. The results are the same as the parameters and state variables of a new regression model that is trained based on the two sets of data.
varchar merge_recent_regression(varchar model_1_json, varchar model_2_json)
Parameter | Description |
| The return value of the recent_regression function. For more information, see recent_regression function. |
| The return value of the recent_regression function. For more information, see recent_regression function. |
Example
Query statement
* | with model1 as ( select group_id, recent_regression( y, array[x1, x2, 1.0], -- The dependent and independent variable samples. time_offset, -- The point in time of the sample. -4, -- The start time of the time range for the current sample batch. 0, -- The end time of the time range for the current sample batch. 1, -- The time interval. 0.999 -- The exponential decay base. ) as reg_model from log where time_offset >= -4 and time_offset <= 0 group by group_id ), model2 as ( select group_id, recent_regression(y, array[x1, x2, 1.0], time_offset, -9, -5, 1, 0.999) as reg_model from log where time_offset >= -9 and time_offset <= -5 group by group_id ) select m1.group_id, merge_recent_regression(m1.reg_model, m2.reg_model) as reg_model from model1 as m1 join model2 as m2 on m1.group_id = m2.group_id
Query and analysis results
The
coefficients
parameter in the query and analysis results indicates the coefficient of the linear regression model that is trained based on the input data.In data forecasting, the return value of the merge_recent_regression function is used as an input parameter of the recent_regression_predict function.
group_id
reg_model
A
{ "sampleCount": 10, "xCount": 3, "timeUnit": 1.0, "beginTimePeriod": -9.0, "endTimePeriod": 0.0, "unitDampingWeight": 0.999, "wSum": 9.955119790251791, "ySumSquare": 4159.2626495224, "ySum": 193.9139516502596, "xXSumSquares": [ [ 382.3684973894312, 268.46629177582946, 54.67098815430803 ], [ 268.46629177582946, 358.44803436913094, 51.78255011892536 ], [ 54.67098815430803, 51.78255011892536, 9.955119790251791 ] ], "xYSumProducts": [ 1132.090921413269, 919.4071924317548, 193.9139516502596 ], "xSums": [ 54.67098815430803, 51.78255011892536, 9.955119790251791 ], "xMeans": [ 5.4917458861562585, 5.201599901352432, 1.0 ], "xStdDevs": [ 2.8722740635191735, 2.991614845817865, 0.0 ], "xVariances": [ 8.249958295964944, 8.949759385717847, 0.0 ], "yMean": 19.478816502051856, "yStdDev": 6.1949232381571, "yVariance": 38.37707392665885, "xCorrelations": [ [ 1.0, -0.1859947674356197, 0.0 ], [ -0.1859947674356197, 1.0, 0.0 ], [ 0.0, 0.0, 1.0 ] ], "xYCorrelations": [ 0.3791693893070564, -0.4837793996174176, 0.0 ], "coefficients": [ 0.6460732812209116, -0.8864195347835274, 0.0, 20.541545982438304 ], "isBuilt": true, "isBuildSuccessful": true }
recent_regression_predict function
The recent_regression_predict function performs data forecasting based on an adaptive regression model.
double recent_regression_predict(varchar model_json, array(double) x_sample)
Parameter | Description |
model_json | The return value of the recent_regression or merge_recent_regression function. |
| The independent sample that is used to calculate the forecast value. |
Example
Query statement
* | with model1 as ( select group_id, recent_regression( y, array[x1, x2, 1.0], -- The dependent and independent variable samples. time_offset, -- The point in time of the sample. -4, -- The start time of the time range for the current sample batch. 0, -- The end time of the time range for the current sample batch. 1, -- The time interval. 0.999 -- The exponential decay base. ) as reg_model from log where time_offset >= -4 and time_offset <= 0 group by group_id ), model2 as ( select group_id, recent_regression(y, array[x1, x2, 1.0], time_offset, -9, -5, 1, 0.999) as reg_model from log where time_offset >= -9 and time_offset <= -5 group by group_id ), model as ( select m1.group_id, merge_recent_regression(m1.reg_model, m2.reg_model) as reg_model from model1 as m1 join model2 as m2 on m1.group_id = m2.group_id ), new_data as ( select 'A' as group_id, 1 as obs_id, 3.0 as x1, 5.0 as x2, 1.0 as x3 union all select 'A' as group_id, 2 as obs_id, 7.0 as x1, 8.0 as x2, 1.0 as x3 ) select m.group_id, n.obs_id, recent_regression_predict(m.reg_model, array[n.x1, n.x2, 1.0]) as predicted_value from model as m join new_data as n on m.group_id = n.group_id order by m.group_id, n.obs_id
Query and analysis results
The value of the
predicted_value
parameter indicates the forecast value.group_id
obs_id
predicted_value
A
1
17.489274877305804
A
2
22.3233353394362