All Products
Search
Document Center

Simple Log Service:Regression analysis functions

Last Updated:Oct 20, 2024

Regression models can be used in data analysis, forecasting, automatic monitoring, and anomaly detection. In complex system management, you can use regression models and configure thresholds and alert rules to significantly improve the timeliness and accuracy of problem identification and ensure the stability of your system. This topic describes the syntax of regression analysis functions. This topic also provides examples on how to use the functions.

Background information

The following formula is used: y = a1 × x1 + a2 × x2 + b + noise.

Parameter

Description

x1

A column of data that is collected by Simple Log Service.

x2

A column of data that is collected by Simple Log Service.

noise

A random variable.

y

The calculation results.

A regression analysis function identifies the values of a1, a2, and b based on x1, x2, and y that are provided and weight data. Then, the function returns the calculation results. a1, a2, and b are coefficients.

  • The sample log for regression analysis functions contains six indexed fields. The following figure shows the indexed fields. For more information, see Create indexes.

    image

  • The following code shows the sample log:

    {"group_id":"A","observation_id":"S001","time_offset":"0","x1":"1","x2":"5","y":"23.91700530543459"}
    {"group_id":"A","observation_id":"S002","time_offset":"-1","x1":"2","x2":"2","y":"6.931858878794941"}
    {"group_id":"A","observation_id":"S003","time_offset":"-2","x1":"3","x2":"8","y":"16.17603801639615"}
    {"group_id":"A","observation_id":"S004","time_offset":"-3","x1":"4","x2":"6","y":"24.97127625789946"}
    {"group_id":"A","observation_id":"S005","time_offset":"-4","x1":"5","x2":"2","y":"11.933292736756384"}
    {"group_id":"A","observation_id":"S006","time_offset":"-5","x1":"6","x2":"8","y":"21.034262717019995"}
    {"group_id":"A","observation_id":"S007","time_offset":"-6","x1":"7","x2":"1","y":"25.966770392099868"}
    {"group_id":"A","observation_id":"S008","time_offset":"-7","x1":"8","x2":"7","y":"16.93019469603219"}
    {"group_id":"A","observation_id":"S009","time_offset":"-8","x1":"9","x2":"2","y":"19.967258015889847"}
    {"group_id":"A","observation_id":"S010","time_offset":"-9","x1":"10","x2":"3","y":"27.0277513207651"}

Functions

Function

Syntax

Description

Data type of the return value

linear_model function

  • linear_model(array(array(double)) x_samples, array(double) y_samples)

  • linear_model(array(array(double)) x_samples, array(double) y_samples, array(double) weights)

Returns a regression model in the JSON format. This function is a scalar function. The input is samples aggregated by the array_agg function and a weight, which is optional.

varchar

linear_model_predict function

linear_model_predict(varchar model_in_json, array(double) x_sample)

Performs data forecasting based on an existing regression model and the independent variable that you specify.

double

recent_regression function

recent_regression(double y, array(double) x_array, double cur_sample_time_period, double cur_batch_begin_period, double cur_batch_end_period, double time_unit, double damping_weight_per_time_unit)

Updates the parameters and state variables of a regression model based on the recently collected data in online mode. The weight of a regression model is adjusted based on the sample age. The importance of a sample decays exponentially as the sample ages.

varchar

merge_recent_regression function

merge_recent_regression(varchar model_1_json, varchar model_2_json)

Merges the parameters and state variables of a regression model that are returned by the recent_regression function after the function is called twice. The results are the same as the parameters and state variables of a new regression model that is trained based on the two sets of data.

varchar

recent_regression_predict function

recent_regression_predict(varchar model_json, array(double) x_sample)

Performs data forecasting based on an adaptive regression model.

double

Regression models with sample weights

You can specify sample weights for regression models. You can specify sample weights related to time and dependent variables. If the sample weight of a regression model decays as the samples age, the regression model focuses more on the most recent data to adapt to system changes. If the sample weight of a regression model is the reciprocal of the absolute value for the dependent variable, the regression model can minimize the relative error.

linear_model function

The linear_model function returns a regression model in the JSON format. This function is a scalar function. The input is samples aggregated by the array_agg function and a weight, which is optional. For more information, see array_agg function.

varchar linear_model(array(array(double)) x_samples, array(double) y_samples)

Or

varchar linear_model(array(array(double)) x_samples, array(double) y_samples, array(double) weights)

Parameter

Description

x_samples

A data matrix that consists of multiple independent variable samples. Each row indicates an observation operation on the independent variable samples.

y_samples

A vector that consists of dependent variable samples.

weights

Optional. If this parameter is left empty, the same weight is specified for all variable samples.

Examples

  • Query statement

     * |   select group_id,
            linear_model(
                array_agg(array[x1, x2]),
                array_agg(y)
            ) as model
        from log
        group by group_id
  • Query and analysis results

    The coefficients parameter in the query and analysis results indicates the coefficient of the linear regression model that is trained based on the input data.

    In data forecasting, the return value of the linear_model function is used as an input parameter of the linear_model_predict function.

    group_id

    model

    A

    {
      "coefficients": [
        0.8350068912618618,
        -0.741283054726383,
        19.17405856472653
      ],
      "isBuilt": true,
      "isBuildSuccessful": true,
      "sampleCount": 10,
      "xCount": 2,
      "wSum": 10.0,
      "ySumSquare": 3930.0,
      "ySum": 188.0,
      "xXSumProducts": [
        [
          385.0,
          367.0
        ],
        [
          367.0,
          475.0
        ]
      ],
      "xYSumProducts": [
        1104.0,
        1239.0
      ],
      "xSums": [
        55.0,
        67.0
      ],
      "xMeans": [
        5.5,
        6.7
      ],
      "xStdDevs": [
        2.8722813232690143,
        1.6155494421403511
      ],
      "xVariances": [
        8.25,
        2.6099999999999994
      ],
      "yMean": 18.8,
      "yStdDev": 6.289674077406551,
      "yVariance": 39.559999999999945,
      "xCorrelations": [
        [
          1.0,
          -0.03232540919176149
        ],
        [
          -0.03232540919176149,
          1.0
        ]
      ],
      "xYCorrelations": [
        0.3874743195572169,
        -0.202730375711539
      ],
      "regularized": true,
      "regularWeight": 1.0E-6
    }

linear_model_predict function

The linear_model_predict function performs data forecasting based on the regression model and input variable samples that you specify.

double linear_model_predict(varchar model_in_json, array(double) x_sample)

Parameter

Description

model_in_json

The regression model that is returned by the linear_model function. For more information, see linear_model function.

x_sample

The new independent variable.

Example

  • Query statement

    * | with group_models as
    (
        select group_id,
            linear_model(
                array_agg(array[x1, x2]),
                array_agg(y)
            ) as model
        from log
        group by group_id
    )
    
    select d.group_id,
        d.y,
        linear_model_predict(m.model, array[x1, x2]) as predicted_y
    from group_models as m
        join log as d
        on m.group_id = d.group_id
  • Query and analysis results

    The values of the predicted_y parameter are calculated based on the independent variable.

    group_id

    observation_id

    y

    predicted_y

    A

    S001

    23.91700530543459

    15.68867910570816

    A

    S002

    6.931858878794941

    15.352330987812993

    ...

    ...

    ...

Online adaptive regression algorithm

The online adaptive regression algorithm incrementally updates a regression model with new data when the algorithm receives the new data. The algorithm supports more efficient computing and cost-effective storage than the batch algorithm in processing a large amount of data. The online adaptive regression algorithm is suitable for continuous profiling. After data processing, the algorithm discards samples, which is more practical and convenient.

The online adaptive regression algorithm automatically and exponentially decays the impacts of historical samples on statistical features when the algorithm incrementally computes the statistical features and the regression model that is used. This way, the most recent samples maintain high weights and the regression model can adapt to system changes.

recent_regression function

The recent_regression function updates the parameters and state variables of a regression model based on the recently collected data in online mode. The weight of a regression model is adjusted based on the sample age. The importance of a sample decays exponentially as the sample ages.

varchar recent_regression(double y, array(double) x_array, double sample_time, double cur_batch_begin_period, double cur_batch_end_period, double time_unit, double unit_damping_weight)

Parameter

Description

y

The dependent variable sample.

x_array

The array of the independent variable samples.

sample_time

The point in time of data in the sample row. The value is converted to a digit.

cur_batch_begin_period

The start time of the time range for the data that is used for model training.

cur_batch_end_period

The end time of the time range for the data that is used for model training. The time range is a closed interval, which is presented as [batch_window_begin_time, batch_window_end_time].

time_unit

The time interval. The unit of the time interval is the same as the unit of the value specified by the sample_time parameter.

unit_damping_weight

The exponential decay base. If you configure this parameter in together with the time_unit parameter, the sample weight varies based on time. As the value specified by the time_unit parameter increases by 1, the sample weight decays based on a fixed value specified by the unit_damping_weight parameter.

You can configure the parameters to allow the sample weight to exponentially decay based on a half-life period. For example, the weight is 1 for data at the latest point in time. Then, the weight decays to 0.5 for data of one day ago, to 0.25 for data of two days ago, and to 0.125 for data of three days ago.

The value of the unit_damping_weight parameter is calculated based on the following formula:

unit_damping_weight = 2 ^ - (Time interval of samples/Half-life period)

Example

  • Query statement

      * | select group_id,
            recent_regression(
              y, array[x1, x2, 1.0], -- The dependent and independent variable samples.
              time_offset, -- The point in time of the sample.
              -4,          -- The start time of the time range for the current sample batch.
              0,           -- The end time of the time range for the current sample batch.
              1,           -- The time interval.
              0.999        -- The exponential decay base.
            ) as reg_model
        from log
        where time_offset >= -4 and time_offset <= 0
        group by group_id
  • Query and analysis results

    The coefficients parameter in the query and analysis results indicates the coefficient of the linear regression model that is trained based on the input data.

    In data forecasting, the return value of the recent_regression function is used as an input parameter of the recent_regression_predict function.

    group_id

    reg_model

    A

    {
      "sampleCount": 5,
      "xCount": 3,
      "timeUnit": 1.0,
      "beginTimePeriod": -4.0,
      "endTimePeriod": 0.0,
      "unitDampingWeight": 0.999,
      "wSum": 4.990009995001,
      "ySumSquare": 1644.6974283836598,
      "ySum": 83.76770287757991,
      "xXSumSquares": [
        [
          54.830206884025,
          70.82220388003,
          14.960044976005001
        ],
        [
          70.82220388003,
          173.70327985603598,
          25.955043976006
        ],
        [
          14.960044976005001,
          25.955043976006,
          4.990009995001
        ]
      ],
      "xYSumProducts": [
        245.21187055562675,
        402.5070758759011,
        83.76770287757991
      ],
      "xSums": [
        14.960044976005001,
        25.955043976006,
        4.990009995001
      ],
      "xMeans": [
        2.997999000200801,
        5.201401199999158,
        1.0
      ],
      "xStdDevs": [
        1.4142126422148122,
        2.7848935986573244,
        0.0
      ],
      "xVariances": [
        1.9999973974002003,
        7.755632355842543,
        0.0
      ],
      "yMean": 16.78708118049834,
      "yStdDev": 6.913170639821401,
      "yVariance": 47.79192829528864,
      "xCorrelations": [
        [
          1.0,
          -0.35572473794248516,
          0.0
        ],
        [
          -0.35572473794248516,
          1.0,
          0.0
        ],
        [
          0.0,
          0.0,
          1.0
        ]
      ],
      "xYCorrelations": [
        -0.12142097167729436,
        -0.34560624507434407,
        0.0
      ],
      "coefficients": [
        -1.3675797278475395,
        -1.104969989478544,
        0.0,
        26.634476066516903
      ],
      "isBuilt": true,
      "isBuildSuccessful": true
    }

merge_recent_regression function

The merge_recent_regression function merges the parameters and state variables of a regression model that are returned by the recent_regression function after the function is called twice. The results are the same as the parameters and state variables of a new regression model that is trained based on the two sets of data.

varchar merge_recent_regression(varchar model_1_json, varchar model_2_json)

Parameter

Description

model_1_json

The return value of the recent_regression function. For more information, see recent_regression function.

model_2_json

The return value of the recent_regression function. For more information, see recent_regression function.

Example

  • Query statement

    * | with model1 as
    (
        select group_id,
            recent_regression(
              y, array[x1, x2, 1.0], -- The dependent and independent variable samples.
              time_offset, -- The point in time of the sample.
              -4,          -- The start time of the time range for the current sample batch.
              0,           -- The end time of the time range for the current sample batch.
              1,           -- The time interval.
              0.999        -- The exponential decay base.
            ) as reg_model
        from log
        where time_offset >= -4 and time_offset <= 0
        group by group_id
    ),
    
    model2 as
    (
        select group_id,
            recent_regression(y, array[x1, x2, 1.0], time_offset, -9, -5, 1, 0.999) as reg_model
        from log
        where time_offset >= -9 and time_offset <= -5
        group by group_id
    )
    
    select m1.group_id,
        merge_recent_regression(m1.reg_model, m2.reg_model) as reg_model
    from model1 as m1
        join model2 as m2
            on m1.group_id = m2.group_id
  • Query and analysis results

    The coefficients parameter in the query and analysis results indicates the coefficient of the linear regression model that is trained based on the input data.

    In data forecasting, the return value of the merge_recent_regression function is used as an input parameter of the recent_regression_predict function.

    group_id

    reg_model

    A

    {
      "sampleCount": 10,
      "xCount": 3,
      "timeUnit": 1.0,
      "beginTimePeriod": -9.0,
      "endTimePeriod": 0.0,
      "unitDampingWeight": 0.999,
      "wSum": 9.955119790251791,
      "ySumSquare": 4159.2626495224,
      "ySum": 193.9139516502596,
      "xXSumSquares": [
        [
          382.3684973894312,
          268.46629177582946,
          54.67098815430803
        ],
        [
          268.46629177582946,
          358.44803436913094,
          51.78255011892536
        ],
        [
          54.67098815430803,
          51.78255011892536,
          9.955119790251791
        ]
      ],
      "xYSumProducts": [
        1132.090921413269,
        919.4071924317548,
        193.9139516502596
      ],
      "xSums": [
        54.67098815430803,
        51.78255011892536,
        9.955119790251791
      ],
      "xMeans": [
        5.4917458861562585,
        5.201599901352432,
        1.0
      ],
      "xStdDevs": [
        2.8722740635191735,
        2.991614845817865,
        0.0
      ],
      "xVariances": [
        8.249958295964944,
        8.949759385717847,
        0.0
      ],
      "yMean": 19.478816502051856,
      "yStdDev": 6.1949232381571,
      "yVariance": 38.37707392665885,
      "xCorrelations": [
        [
          1.0,
          -0.1859947674356197,
          0.0
        ],
        [
          -0.1859947674356197,
          1.0,
          0.0
        ],
        [
          0.0,
          0.0,
          1.0
        ]
      ],
      "xYCorrelations": [
        0.3791693893070564,
        -0.4837793996174176,
        0.0
      ],
      "coefficients": [
        0.6460732812209116,
        -0.8864195347835274,
        0.0,
        20.541545982438304
      ],
      "isBuilt": true,
      "isBuildSuccessful": true
    }

recent_regression_predict function

The recent_regression_predict function performs data forecasting based on an adaptive regression model.

double recent_regression_predict(varchar model_json, array(double) x_sample)

Parameter

Description

model_json

The return value of the recent_regression or merge_recent_regression function.

x_sample

The independent sample that is used to calculate the forecast value.

Example

  • Query statement

    * | with model1 as
    (
        select group_id,
            recent_regression(
              y, array[x1, x2, 1.0], -- The dependent and independent variable samples.
              time_offset, -- The point in time of the sample.
              -4,          -- The start time of the time range for the current sample batch.
              0,           -- The end time of the time range for the current sample batch.
              1,           -- The time interval.
              0.999        -- The exponential decay base.
            ) as reg_model
        from log
        where time_offset >= -4 and time_offset <= 0
        group by group_id
    ),
    
    model2 as
    (
        select group_id,
            recent_regression(y, array[x1, x2, 1.0], time_offset, -9, -5, 1, 0.999) as reg_model
        from log
        where time_offset >= -9 and time_offset <= -5
        group by group_id
    ),
    
    model as
    (
        select m1.group_id,
            merge_recent_regression(m1.reg_model, m2.reg_model) as reg_model
        from model1 as m1
            join model2 as m2
                on m1.group_id = m2.group_id
    ),
    
    new_data as
    (
        select 'A' as group_id, 1 as obs_id, 3.0 as x1, 5.0 as x2, 1.0 as x3 union all
        select 'A' as group_id, 2 as obs_id, 7.0 as x1, 8.0 as x2, 1.0 as x3
    )
    
    select m.group_id,
        n.obs_id,
        recent_regression_predict(m.reg_model, array[n.x1, n.x2, 1.0]) as predicted_value
    from model as m
        join new_data as n
            on m.group_id = n.group_id
    order by m.group_id, n.obs_id
  • Query and analysis results

    The value of the predicted_value parameter indicates the forecast value.

    group_id

    obs_id

    predicted_value

    A

    1

    17.489274877305804

    A

    2

    22.3233353394362