Regression analysis functions

Regression models can be used in data analysis, forecasting, automatic monitoring, and anomaly detection. In complex system management, you can combine regression models with thresholds and alert rules to identify problems faster and more accurately and to help keep your system stable. This topic describes the syntax of regression analysis functions and provides examples of how to use them.

Background information

The following formula is used: y = a1 × x1 + a2 × x2 + b + noise.

Parameter	Description
`x1`	A column of data that is collected by Simple Log Service.
`x2`	A column of data that is collected by Simple Log Service.
`noise`	A random variable.
`y`	The calculation results.

A regression analysis function estimates the coefficients a1, a2, and b from the provided x1, x2, and y samples and, if specified, the sample weights. The function then returns the calculation results.
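
To make the estimation step concrete, the following Python sketch fits y = a1 × x1 + a2 × x2 + b by ordinary least squares on noise-free toy data. It is an offline illustration only, not the implementation used by Simple Log Service. The helper names `solve_linear_system` and `fit_linear` and the toy data are assumptions made for this example.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so that row operations act on both sides.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Move the row with the largest pivot into place for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_linear(x_samples, y_samples):
    """Least-squares estimate of [a1, ..., ak, b] for y = sum(ai * xi) + b."""
    rows = [list(x) + [1.0] for x in x_samples]  # constant column models b
    k = len(rows[0])
    # Normal equations: (X^T X) coef = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, y_samples)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data generated without noise from y = 2 * x1 - 1 * x2 + 3.
x_samples = [[1, 5], [2, 2], [3, 8], [4, 6], [5, 2]]
y_samples = [2 * x1 - x2 + 3 for x1, x2 in x_samples]
a1, a2, b = fit_linear(x_samples, y_samples)
print(a1, a2, b)
```

Because the toy data contains no noise, the fit recovers the generating coefficients up to floating-point rounding. With real data, the noise term means that the estimated coefficients only approximate the underlying relationship.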

The sample log for regression analysis functions contains six indexed fields: group_id, observation_id, time_offset, x1, x2, and y. For more information, see Create indexes.

The following code shows the sample log:

{"group_id":"A","observation_id":"S001","time_offset":"0","x1":"1","x2":"5","y":"23.91700530543459"}
{"group_id":"A","observation_id":"S002","time_offset":"-1","x1":"2","x2":"2","y":"6.931858878794941"}
{"group_id":"A","observation_id":"S003","time_offset":"-2","x1":"3","x2":"8","y":"16.17603801639615"}
{"group_id":"A","observation_id":"S004","time_offset":"-3","x1":"4","x2":"6","y":"24.97127625789946"}
{"group_id":"A","observation_id":"S005","time_offset":"-4","x1":"5","x2":"2","y":"11.933292736756384"}
{"group_id":"A","observation_id":"S006","time_offset":"-5","x1":"6","x2":"8","y":"21.034262717019995"}
{"group_id":"A","observation_id":"S007","time_offset":"-6","x1":"7","x2":"1","y":"25.966770392099868"}
{"group_id":"A","observation_id":"S008","time_offset":"-7","x1":"8","x2":"7","y":"16.93019469603219"}
{"group_id":"A","observation_id":"S009","time_offset":"-8","x1":"9","x2":"2","y":"19.967258015889847"}
{"group_id":"A","observation_id":"S010","time_offset":"-9","x1":"10","x2":"3","y":"27.0277513207651"}

Functions

Function	Syntax	Description	Data type of the return value
linear_model function	linear_model(array(array(double)) x_samples, array(double) y_samples); linear_model(array(array(double)) x_samples, array(double) y_samples, array(double) weights)	Returns a regression model in the JSON format. This function is a scalar function. The input is samples aggregated by the array_agg function and a weight, which is optional.	varchar
linear_model_predict function	linear_model_predict(varchar model_in_json, array(double) x_sample)	Performs data forecasting based on an existing regression model and the independent variable that you specify.	double
recent_regression function	recent_regression(double y, array(double) x_array, double sample_time, double cur_batch_begin_period, double cur_batch_end_period, double time_unit, double unit_damping_weight)	Updates the parameters and state variables of a regression model in online mode based on recently collected data. Sample weights are adjusted based on sample age: the importance of a sample decays exponentially as the sample ages.	varchar
merge_recent_regression function	merge_recent_regression(varchar model_1_json, varchar model_2_json)	Merges the parameters and state variables of two regression models that are returned by separate calls to the recent_regression function. The result is the same as the parameters and state variables of a regression model that is trained on both sets of data.	varchar
recent_regression_predict function	recent_regression_predict(varchar model_json, array(double) x_sample)	Performs data forecasting based on an adaptive regression model.	double

Regression models with sample weights

You can specify sample weights for regression models, for example, weights that depend on time or on the dependent variable. If the sample weights decay as the samples age, the regression model focuses more on the most recent data and can adapt to system changes. If the sample weight is the reciprocal of the absolute value of the dependent variable, the regression model can minimize the relative error.
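
The two weighting schemes described above can be sketched offline in Python for a single independent variable. This is an illustration under stated assumptions, not the Simple Log Service implementation: the helper names `weighted_line_fit`, `age_decay_weights`, and `reciprocal_weights`, the decay base 0.1, and the toy data are all hypothetical.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least squares for y = a * x + b; returns (a, b)."""
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    # Closed-form solution of the weighted normal equations.
    a = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    b = (swy - a * swx) / sw
    return a, b


def age_decay_weights(ages, base):
    """Weight 1 for the newest sample, multiplied by base per unit of age."""
    return [base ** age for age in ages]


def reciprocal_weights(ys):
    """Weight each sample by 1 / |y| (requires non-zero y values)."""
    return [1.0 / abs(y) for y in ys]


# Toy series whose behavior changes: older samples follow y = x + 10,
# the three newest samples follow y = 2 * x.
xs = [0, 1, 2, 3, 4, 5]
ys = [10, 11, 12, 6, 8, 10]
ages = [5, 4, 3, 2, 1, 0]  # age 0 is the newest sample

a_equal, b_equal = weighted_line_fit(xs, ys, [1.0] * len(xs))
a_recent, b_recent = weighted_line_fit(xs, ys, age_decay_weights(ages, 0.1))
a_rel, b_rel = weighted_line_fit(xs, ys, reciprocal_weights(ys))
print(a_equal, a_recent, a_rel)
```

With equal weights, the fitted slope is pulled toward the old behavior. With age-decayed weights, the slope moves much closer to 2, the slope of the newest samples. In the linear_model function, such a list would be passed as the optional weights argument.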

linear_model function

The linear_model function returns a regression model in the JSON format. This function is a scalar function. The input is samples aggregated by the array_agg function and a weight, which is optional. For more information, see array_agg function.

varchar linear_model(array(array(double)) x_samples, array(double) y_samples)

varchar linear_model(array(array(double)) x_samples, array(double) y_samples, array(double) weights)

Parameter	Description
`x_samples`	A data matrix that consists of multiple independent variable samples. Each row is one observation of the independent variables.
`y_samples`	A vector that consists of dependent variable samples.
`weights`	Optional. A vector of sample weights. If this parameter is not specified, all samples have the same weight.

Examples

Query statement

* | select group_id,
        linear_model(
            array_agg(array[x1, x2]),
            array_agg(y)
        ) as model
    from log
    group by group_id

Query and analysis results

The coefficients field in the query and analysis results contains the coefficients of the linear regression model that is trained based on the input data.

In data forecasting, the return value of the linear_model function is used as an input parameter of the linear_model_predict function.

group_id	model

{
  "coefficients": [
    0.8350068912618618,
    -0.741283054726383,
    19.17405856472653
  ],
  "isBuilt": true,
  "isBuildSuccessful": true,
  "sampleCount": 10,
  "xCount": 2,
  "wSum": 10.0,
  "ySumSquare": 3930.0,
  "ySum": 188.0,
  "xXSumProducts": [
    [
      385.0,
      367.0
    ],
    [
      367.0,
      475.0
    ]
  ],
  "xYSumProducts": [
    1104.0,
    1239.0
  ],
  "xSums": [
    55.0,
    67.0
  ],
  "xMeans": [
    5.5,
    6.7
  ],
  "xStdDevs": [
    2.8722813232690143,
    1.6155494421403511
  ],
  "xVariances": [
    8.25,
    2.6099999999999994
  ],
  "yMean": 18.8,
  "yStdDev": 6.289674077406551,
  "yVariance": 39.559999999999945,
  "xCorrelations": [
    [
      1.0,
      -0.03232540919176149
    ],
    [
      -0.03232540919176149,
      1.0
    ]
  ],
  "xYCorrelations": [
    0.3874743195572169,
    -0.202730375711539
  ],
  "regularized": true,
  "regularWeight": 1.0E-6
}

linear_model_predict function

The linear_model_predict function performs data forecasting based on the regression model and input variable samples that you specify.

double linear_model_predict(varchar model_in_json, array(double) x_sample)

Parameter	Description
`model_in_json`	The regression model that is returned by the linear_model function. For more information, see linear_model function.
`x_sample`	The new independent variable sample for which the value of the dependent variable is forecast.

Example

Query statement

* | with group_models as
(
    select group_id,
        linear_model(
            array_agg(array[x1, x2]),
            array_agg(y)
        ) as model
    from log
    group by group_id
)

select d.group_id,
    d.observation_id,
    d.y,
    linear_model_predict(m.model, array[d.x1, d.x2]) as predicted_y
from group_models as m
    join log as d
    on m.group_id = d.group_id

Query and analysis results

The values in the predicted_y column are calculated based on the independent variables.

group_id	observation_id	y	predicted_y
A	S001	23.91700530543459	15.68867910570816
A	S002	6.931858878794941	15.352330987812993
...	...	...	...

Online adaptive regression algorithm

The online adaptive regression algorithm incrementally updates a regression model as new data arrives. Compared with the batch algorithm, it computes more efficiently and requires less storage when it processes a large amount of data, which makes it suitable for continuous profiling. Because the algorithm discards samples after it processes them, you do not need to retain historical samples.

The online adaptive regression algorithm automatically and exponentially decays the impacts of historical samples on statistical features when the algorithm incrementally computes the statistical features and the regression model that is used. This way, the most recent samples maintain high weights and the regression model can adapt to system changes.
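
The incremental update with exponential decay can be sketched in Python as follows. This is a minimal illustration of the idea, not the Simple Log Service implementation: the class name `DecayedSums` and its fields are hypothetical, and only three running sums are tracked instead of the full set of model statistics. The data is taken from the five most recent rows of the sample log (time_offset, x1, y).

```python
class DecayedSums:
    """Running weighted sums whose older contributions fade exponentially."""

    def __init__(self, decay_base):
        self.decay_base = decay_base  # weight multiplier per unit of time
        self.last_time = None
        self.w_sum = 0.0
        self.x_sum = 0.0
        self.y_sum = 0.0

    def add(self, time, x, y):
        """Fold in one sample; samples must arrive in chronological order."""
        if self.last_time is not None:
            # Fade everything seen so far by the elapsed time.
            fade = self.decay_base ** (time - self.last_time)
            self.w_sum *= fade
            self.x_sum *= fade
            self.y_sum *= fade
        self.last_time = time
        # The newest sample always enters with weight 1 and is then discarded.
        self.w_sum += 1.0
        self.x_sum += x
        self.y_sum += y


samples = [  # (time_offset, x1, y), oldest first
    (-4, 5, 11.933292736756384),
    (-3, 4, 24.97127625789946),
    (-2, 3, 16.17603801639615),
    (-1, 2, 6.931858878794941),
    (0, 1, 23.91700530543459),
]

online = DecayedSums(decay_base=0.999)
for t, x1, y in samples:
    online.add(t, x1, y)

# Batch equivalent: weight each stored sample by 0.999 ** age and sum.
batch_w_sum = sum(0.999 ** (0 - t) for t, _, _ in samples)
batch_y_sum = sum(0.999 ** (0 - t) * y for t, _, y in samples)
print(online.w_sum, batch_w_sum)
print(online.y_sum, batch_y_sum)
```

The online and batch sums agree up to floating-point rounding, but the online version keeps only a fixed number of running totals regardless of how many samples it has processed.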

recent_regression function

The recent_regression function updates the parameters and state variables of a regression model in online mode based on recently collected data. Sample weights are adjusted based on sample age: the importance of a sample decays exponentially as the sample ages.

varchar recent_regression(double y, array(double) x_array, double sample_time, double cur_batch_begin_period, double cur_batch_end_period, double time_unit, double unit_damping_weight)

Parameter	Description
`y`	The dependent variable sample.
`x_array`	The array of the independent variable samples.
`sample_time`	The point in time of the data in the sample row, expressed as a number.
`cur_batch_begin_period`	The start time of the time range for the data that is used for model training.
`cur_batch_end_period`	The end time of the time range for the data that is used for model training. The time range is a closed interval, which is presented as `[cur_batch_begin_period, cur_batch_end_period]`.
`time_unit`	The time interval. The unit of the time interval is the same as the unit of the value specified by the `sample_time` parameter.
`unit_damping_weight`	The exponential decay base. If you configure this parameter together with the time_unit parameter, the sample weight varies based on time: each time the sample age increases by the value of the time_unit parameter, the sample weight is multiplied by the value of the unit_damping_weight parameter. You can configure the parameters so that the sample weight decays exponentially based on a half-life period. For example, with a half-life period of one day, the weight is 1 for data at the latest point in time, 0.5 for data of one day ago, 0.25 for data of two days ago, and 0.125 for data of three days ago. The value of the unit_damping_weight parameter is calculated based on the following formula: unit_damping_weight = 2 ^ (-(Time interval of samples / Half-life period))
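
The half-life formula can be checked with a short Python sketch. The helper names `unit_damping_weight` and `sample_weight` are hypothetical, and the sketch assumes that the age of a sample is measured back from `cur_batch_end_period`; the source does not state this explicitly.

```python
def unit_damping_weight(time_unit, half_life):
    """Decay base per time_unit so that the weight halves every half_life."""
    return 2.0 ** (-(time_unit / half_life))


def sample_weight(sample_time, batch_end, time_unit, damping):
    """Weight of a sample, assuming age is measured back from batch_end."""
    age_in_units = (batch_end - sample_time) / time_unit
    return damping ** age_in_units


# Time measured in days, half-life of one day: weights 1, 0.5, 0.25, 0.125.
daily = unit_damping_weight(time_unit=1.0, half_life=1.0)
weights_by_day = [sample_weight(-d, 0.0, 1.0, daily) for d in range(4)]

# Time measured in hours, half-life of 24 hours: one day ago weighs about 0.5.
hourly = unit_damping_weight(time_unit=1.0, half_life=24.0)
weight_one_day_ago = sample_weight(-24.0, 0.0, 1.0, hourly)

# Decay base 0.999 over time offsets 0 to -4, as in the query example.
w_sum = sum(sample_weight(t, 0.0, 1.0, 0.999) for t in [0, -1, -2, -3, -4])
print(weights_by_day, weight_one_day_ago, w_sum)
```

Under this assumption, the sum of the five weights for a decay base of 0.999 is approximately 4.990009995001, which agrees with the wSum value in the sample output of the recent_regression query.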

Example

Query statement

  * | select group_id,
        recent_regression(
          y, array[x1, x2, 1.0], -- The dependent and independent variable samples.
          time_offset, -- The point in time of the sample.
          -4,          -- The start time of the time range for the current sample batch.
          0,           -- The end time of the time range for the current sample batch.
          1,           -- The time interval.
          0.999        -- The exponential decay base.
        ) as reg_model
    from log
    where time_offset >= -4 and time_offset <= 0
    group by group_id

Query and analysis results

The coefficients field in the query and analysis results contains the coefficients of the linear regression model that is trained based on the input data.

In data forecasting, the return value of the recent_regression function is used as an input parameter of the recent_regression_predict function.

group_id	reg_model

{
  "sampleCount": 5,
  "xCount": 3,
  "timeUnit": 1.0,
  "beginTimePeriod": -4.0,
  "endTimePeriod": 0.0,
  "unitDampingWeight": 0.999,
  "wSum": 4.990009995001,
  "ySumSquare": 1644.6974283836598,
  "ySum": 83.76770287757991,
  "xXSumSquares": [
    [
      54.830206884025,
      70.82220388003,
      14.960044976005001
    ],
    [
      70.82220388003,
      173.70327985603598,
      25.955043976006
    ],
    [
      14.960044976005001,
      25.955043976006,
      4.990009995001
    ]
  ],
  "xYSumProducts": [
    245.21187055562675,
    402.5070758759011,
    83.76770287757991
  ],
  "xSums": [
    14.960044976005001,
    25.955043976006,
    4.990009995001
  ],
  "xMeans": [
    2.997999000200801,
    5.201401199999158,
    1.0
  ],
  "xStdDevs": [
    1.4142126422148122,
    2.7848935986573244,
    0.0
  ],
  "xVariances": [
    1.9999973974002003,
    7.755632355842543,
    0.0
  ],
  "yMean": 16.78708118049834,
  "yStdDev": 6.913170639821401,
  "yVariance": 47.79192829528864,
  "xCorrelations": [
    [
      1.0,
      -0.35572473794248516,
      0.0
    ],
    [
      -0.35572473794248516,
      1.0,
      0.0
    ],
    [
      0.0,
      0.0,
      1.0
    ]
  ],
  "xYCorrelations": [
    -0.12142097167729436,
    -0.34560624507434407,
    0.0
  ],
  "coefficients": [
    -1.3675797278475395,
    -1.104969989478544,
    0.0,
    26.634476066516903
  ],
  "isBuilt": true,
  "isBuildSuccessful": true
}

merge_recent_regression function

The merge_recent_regression function merges the parameters and state variables of two regression models that are returned by separate calls to the recent_regression function. The result is the same as the parameters and state variables of a regression model that is trained on both sets of data.

varchar merge_recent_regression(varchar model_1_json, varchar model_2_json)

Parameter	Description
`model_1_json`	The return value of the recent_regression function. For more information, see recent_regression function.
`model_2_json`	The return value of the recent_regression function. For more information, see recent_regression function.
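
One way to see why two models can be merged without revisiting the samples is to merge their decayed weight sums. The following Python sketch is an assumption about the mechanism, not the documented implementation: the helper names `batch_stats` and `merge_decayed` and the dictionary layout are hypothetical, and only the weight sum and sample count are tracked.

```python
def batch_stats(times, begin, end, decay_base, time_unit=1.0):
    """Decayed weight sum for one batch, with age measured back from `end`."""
    w_sum = sum(decay_base ** ((end - t) / time_unit) for t in times)
    return {"begin": begin, "end": end, "w_sum": w_sum, "count": len(times)}


def merge_decayed(older, newer, decay_base, time_unit=1.0):
    """Merge two batches; `older` must end before `newer` ends."""
    # Re-express the older batch relative to the newer end time by fading
    # it across the gap between the two end times.
    gap = (newer["end"] - older["end"]) / time_unit
    fade = decay_base ** gap
    return {
        "begin": older["begin"],
        "end": newer["end"],
        "w_sum": older["w_sum"] * fade + newer["w_sum"],
        "count": older["count"] + newer["count"],
    }


recent = batch_stats([0, -1, -2, -3, -4], begin=-4, end=0, decay_base=0.999)
earlier = batch_stats([-5, -6, -7, -8, -9], begin=-9, end=-5, decay_base=0.999)
merged = merge_decayed(earlier, recent, decay_base=0.999)

# Training once on all ten samples gives the same weight sum.
combined = batch_stats(list(range(0, -10, -1)), begin=-9, end=0, decay_base=0.999)
print(merged["w_sum"], combined["w_sum"])
```

Under these assumptions, the merged weight sum is approximately 9.955119790251791, which agrees with the wSum value in the sample output of the merge_recent_regression query, and the merged time range and sample count match its beginTimePeriod, endTimePeriod, and sampleCount values.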

Example

Query statement

* | with model1 as
(
    select group_id,
        recent_regression(
          y, array[x1, x2, 1.0], -- The dependent and independent variable samples.
          time_offset, -- The point in time of the sample.
          -4,          -- The start time of the time range for the current sample batch.
          0,           -- The end time of the time range for the current sample batch.
          1,           -- The time interval.
          0.999        -- The exponential decay base.
        ) as reg_model
    from log
    where time_offset >= -4 and time_offset <= 0
    group by group_id
),

model2 as
(
    select group_id,
        recent_regression(y, array[x1, x2, 1.0], time_offset, -9, -5, 1, 0.999) as reg_model
    from log
    where time_offset >= -9 and time_offset <= -5
    group by group_id
)

select m1.group_id,
    merge_recent_regression(m1.reg_model, m2.reg_model) as reg_model
from model1 as m1
    join model2 as m2
        on m1.group_id = m2.group_id

Query and analysis results

The coefficients field in the query and analysis results contains the coefficients of the linear regression model that is trained based on the input data.

In data forecasting, the return value of the merge_recent_regression function is used as an input parameter of the recent_regression_predict function.

group_id	reg_model

{
  "sampleCount": 10,
  "xCount": 3,
  "timeUnit": 1.0,
  "beginTimePeriod": -9.0,
  "endTimePeriod": 0.0,
  "unitDampingWeight": 0.999,
  "wSum": 9.955119790251791,
  "ySumSquare": 4159.2626495224,
  "ySum": 193.9139516502596,
  "xXSumSquares": [
    [
      382.3684973894312,
      268.46629177582946,
      54.67098815430803
    ],
    [
      268.46629177582946,
      358.44803436913094,
      51.78255011892536
    ],
    [
      54.67098815430803,
      51.78255011892536,
      9.955119790251791
    ]
  ],
  "xYSumProducts": [
    1132.090921413269,
    919.4071924317548,
    193.9139516502596
  ],
  "xSums": [
    54.67098815430803,
    51.78255011892536,
    9.955119790251791
  ],
  "xMeans": [
    5.4917458861562585,
    5.201599901352432,
    1.0
  ],
  "xStdDevs": [
    2.8722740635191735,
    2.991614845817865,
    0.0
  ],
  "xVariances": [
    8.249958295964944,
    8.949759385717847,
    0.0
  ],
  "yMean": 19.478816502051856,
  "yStdDev": 6.1949232381571,
  "yVariance": 38.37707392665885,
  "xCorrelations": [
    [
      1.0,
      -0.1859947674356197,
      0.0
    ],
    [
      -0.1859947674356197,
      1.0,
      0.0
    ],
    [
      0.0,
      0.0,
      1.0
    ]
  ],
  "xYCorrelations": [
    0.3791693893070564,
    -0.4837793996174176,
    0.0
  ],
  "coefficients": [
    0.6460732812209116,
    -0.8864195347835274,
    0.0,
    20.541545982438304
  ],
  "isBuilt": true,
  "isBuildSuccessful": true
}

recent_regression_predict function

The recent_regression_predict function performs data forecasting based on an adaptive regression model.

double recent_regression_predict(varchar model_json, array(double) x_sample)

Parameter	Description
`model_json`	The return value of the recent_regression or merge_recent_regression function.
`x_sample`	The independent variable sample that is used to calculate the forecast value.

Example

Query statement

* | with model1 as
(
    select group_id,
        recent_regression(
          y, array[x1, x2, 1.0], -- The dependent and independent variable samples.
          time_offset, -- The point in time of the sample.
          -4,          -- The start time of the time range for the current sample batch.
          0,           -- The end time of the time range for the current sample batch.
          1,           -- The time interval.
          0.999        -- The exponential decay base.
        ) as reg_model
    from log
    where time_offset >= -4 and time_offset <= 0
    group by group_id
),

model2 as
(
    select group_id,
        recent_regression(y, array[x1, x2, 1.0], time_offset, -9, -5, 1, 0.999) as reg_model
    from log
    where time_offset >= -9 and time_offset <= -5
    group by group_id
),

model as
(
    select m1.group_id,
        merge_recent_regression(m1.reg_model, m2.reg_model) as reg_model
    from model1 as m1
        join model2 as m2
            on m1.group_id = m2.group_id
),

new_data as
(
    select 'A' as group_id, 1 as obs_id, 3.0 as x1, 5.0 as x2, 1.0 as x3 union all
    select 'A' as group_id, 2 as obs_id, 7.0 as x1, 8.0 as x2, 1.0 as x3
)

select m.group_id,
    n.obs_id,
    recent_regression_predict(m.reg_model, array[n.x1, n.x2, 1.0]) as predicted_value
from model as m
    join new_data as n
        on m.group_id = n.group_id
order by m.group_id, n.obs_id

Query and analysis results

The values in the predicted_value column are the forecast values.

group_id	obs_id	predicted_value
A	1	17.489274877305804
A	2	22.3233353394362
