全部產品
Search
文件中心

Simple Log Service:多變數模式識別函數

更新時間:Nov 01, 2024

Log Service通過異常檢測功能,自動識別服務系統中的異常狀態及其根源。該功能可自動識別指標的變化是否偏離正常模式,並結合指標當前模式與機器學習技術來進行異常檢測。本文主要介紹異常檢測功能(多變數模式識別函數)。

多變數模式識別函數列表

函數名稱

文法

說明

傳回值類型

summarize函數

  • summarize(array(array(double)) data_samples)

  • summarize(array(array(double)) data_samples, array(double) weights)

通過對給定樣本及樣本權重(可選)進行統計學習,識別並返回多變數的統計模式,輸出結果為統計模式。統計模式涵蓋多種統計量與聯合統計量,例如平均值、標準差、共變數矩陣等。

varchar

merge_summary函數

  • merge_summary(varchar summary1, varchar summary2)

  • merge_summary(varchar summary1, double weight1, varchar summary2, double weight2)

將不同階段分別用summarize函數學習得到的模式進行合并,包括同一資料集在不同時段學習出的模式,或者來自兩個獨立資料集各自學習出的模式。

varchar

normalize_vector函數

normalize_vector(varchar summary, array(double) x_vector)

使用summarize函數獲得的多變數模式summary,對新樣本向量x_vector進行歸一化處理,確保其每個分量都被映射至[0, 1]的區間。

array(double)

standardize_vector函數

standardize_vector(varchar summary, array(double) x_vector)

使用summarize函數獲得的多變數模式summary,對新樣本向量x_vector進行標準化處理,將向量分量標準化為均值0和標準差1的某個值。

array(double)

mah_distance函數

mah_distance(varchar summary, array(double) x_vector)

使用summarize函數獲得的多變數模式summary,對新樣本x_vector計算其馬氏距離。馬氏距離能夠有效處理不同變數間的尺度差異問題,通過標準化給定樣本向量x_vector到向量重心的距離來進行衡量。當該距離值為1時,表示樣本向量與重心之間的距離等於所有向量到重心的平均距離。

double

standard_distance函數

standard_distance(varchar summary, double metric_value, int element_index)

使用summarize函數獲得的多變數模式summary,計算新樣本metric_value的標準化距離。標準化距離和馬氏距離類似,馬氏距離是計算由多個指標構成的一個向量到重心的標準化距離,而標準化距離是單個指標到重心的距離,其中element_index是該指標在向量中的索引,metric_value是要標準化的指標值。

double

anomaly_level函數

  • anomaly_level(varchar summary, array(double) x_vector)

  • anomaly_level(varchar summary, array(double) x_vector, int element_index)

使用summarize函數獲得的多變數模式summary,對新樣本x_vector計算其馬氏距離,接著,通過對結果進行向下取整處理,得到1、2、3、4等,來分別表示不同層級的異常機率,具體為:0.1(一級異常)、0.01(二級異常)、0.001(三級異常)、0.0001(四級異常)等。異常等級增加表明機率減小,樣本點的可疑性增加。使用者通常會對異常檢測結果設定閾值過濾,例如僅保留四級異常及以上的異常。

若指定了element_index,則僅計算向量指定索引分量的異常機率;若未指定,則計算所有分量的異常機率。

array(double)

summarize函數

通過對給定樣本及樣本權重(可選)進行統計學習,識別並返回多變數的統計模式,輸出結果為統計模式。統計模式涵蓋多種統計量與聯合統計量,例如平均值、標準差、共變數矩陣等。

varchar summarize(array(array(double)) data_samples)

varchar summarize(array(array(double)) data_samples, array(double) weights)

參數

說明

data_samples

一個二維數組,可以看成一個二維表格。每列表示一個變數,每行表示一次觀測樣本。

weights

選擇性參數。表示樣本的權重,表示為一個與data_samples第一維度長度相同的一維數組。若未指定,則預設所有樣本權重相等。

使用樣本

  • 查詢和分析語句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    )
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
  • 返回結果

    entity_group

    statistical_summary

    A

    {
      "sampleCount": 8,
      "vectorSize": 4,
      "means": [
        11.5,
        12.5,
        9.25,
        0.0
      ],
      "stdDevs": [
        6.87386354243376,
        6.87386354243376,
        7.361215932167728,
        0.0
      ],
      "variances": [
        47.25,
        47.25,
        54.1875,
        0.0
      ],
      "mins": [
        1.0,
        2.0,
        1.0,
        0.0
      ],
      "maxs": [
        22.0,
        23.0,
        21.0,
        0.0
      ],
      "covariance": [
        [
          47.25,
          47.25,
          19.125,
          0.0
        ],
        [
          47.25,
          47.25,
          19.125,
          0.0
        ],
        [
          19.125,
          19.125,
          54.1875,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          0.0
        ]
      ],
      "correlations": [
        [
          1.0,
          1.0,
          0.37796447300922725,
          0.0
        ],
        [
          1.0,
          1.0,
          0.37796447300922725,
          0.0
        ],
        [
          0.37796447300922725,
          0.37796447300922725,
          1.0,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          1.0
        ]
      ],
      "sums": [
        92.0,
        100.0,
        74.0,
        0.0
      ],
      "weightSum": 8.0,
      "sumProducts": [
        [
          1436.0,
          1528.0,
          1004.0,
          0.0
        ],
        [
          1528.0,
          1628.0,
          1078.0,
          0.0
        ],
        [
          1004.0,
          1078.0,
          1118.0,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          0.0
        ]
      ],
      "isSummarized": true
    }
    

    返回參數說明:

    參數

    說明

    sampleCount

    樣本數量。

    vectorSize

    向量長度。

    means

    所有向量各分量的平均值。

    stdDevs

    所有向量各分量的標準差。

    variances

    所有向量各分量的方差。

    mins

    所有向量各分量的最小值。

    maxs

    所有向量各分量的最大值。

    covariance

    所有向量各分量間的共變數矩陣。

    correlations

    所有向量各分量間的相關係數矩陣。

    sums

    所有向量各分量的和。

    weightSum

    所有樣本權重的和。

    sumProducts

    合并統計模式時候用到的中間結果。

    isSummarized

    模式統計計算是否正常返回。

    • true:正常返回。

    • false:非正常返回。

merge_summary函數

將不同階段分別用summarize函數學習得到的模式進行合并,包括同一資料集在不同時段學習出的模式,或者來自兩個獨立資料集各自學習出的模式。

varchar merge_summary(varchar summary1, varchar summary2)

varchar merge_summary(varchar summary1, double weight1, varchar summary2, double weight2)

參數

說明

summary1

summarize函數學習過程得到的模式。

weight1

summary1模式對應的整體權重。

summary2

summarize函數學習過程得到的模式。

weight2

summary2模式對應的整體權重。

使用樣本

  • 查詢和分析語句

    * | with data_table_01 as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features
    ),
    
    summaries_01 as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table_01
        group by entity_group
    ),
    
    data_table_02 as
    (
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries_02 as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table_02
        group by entity_group
    )
    select s1.entity_group,
        merge_summary(s1.statistical_summary, s2.statistical_summary) as statistical_summary
    from summaries_01 as s1
        join summaries_02 as s2
        on s1.entity_group = s2.entity_group
  • 返回結果

    statistical_summary為整合後的模式。

    entity_group

    statistical_summary

    2

    {
      "sampleCount": 8,
      "vectorSize": 4,
      "means": [
        11.5,
        12.5,
        9.25,
        0.0
      ],
      "stdDevs": [
        6.87386354243376,
        6.87386354243376,
        7.361215932167728,
        0.0
      ],
      "variances": [
        47.25,
        47.25,
        54.1875,
        0.0
      ],
      "mins": [
        1.0,
        2.0,
        1.0,
        0.0
      ],
      "maxs": [
        22.0,
        23.0,
        21.0,
        0.0
      ],
      "covariance": [
        [
          47.25,
          47.25,
          19.125,
          0.0
        ],
        [
          47.25,
          47.25,
          19.125,
          0.0
        ],
        [
          19.125,
          19.125,
          54.1875,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          0.0
        ]
      ],
      "correlations": [
        [
          1.0,
          1.0,
          0.37796447300922725,
          0.0
        ],
        [
          1.0,
          1.0,
          0.37796447300922725,
          0.0
        ],
        [
          0.37796447300922725,
          0.37796447300922725,
          1.0,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          1.0
        ]
      ],
      "sums": [
        92.0,
        100.0,
        74.0,
        0.0
      ],
      "weightSum": 8.0,
      "sumProducts": [
        [
          1436.0,
          1528.0,
          1004.0,
          0.0
        ],
        [
          1528.0,
          1628.0,
          1078.0,
          0.0
        ],
        [
          1004.0,
          1078.0,
          1118.0,
          0.0
        ],
        [
          0.0,
          0.0,
          0.0,
          0.0
        ]
      ],
      "isSummarized": true
    }

    返回參數說明:

    參數

    說明

    sampleCount

    樣本數量。

    vectorSize

    向量長度。

    means

    所有向量各分量的平均值。

    stdDevs

    所有向量各分量的標準差。

    variances

    所有向量各分量的方差。

    mins

    所有向量各分量的最小值。

    maxs

    所有向量各分量的最大值。

    covariance

    所有向量各分量間的共變數矩陣。

    correlations

    所有向量各分量間的相關係數矩陣。

    sums

    所有向量各分量的和。

    weightSum

    所有樣本權重的和。

    sumProducts

    合并統計模式時候用到的中間結果。

    isSummarized

    模式統計計算是否正常返回。

    • true:正常返回。

    • false:非正常返回。

normalize_vector函數

使用summarize函數獲得的多變數模式summary,對新樣本向量x_vector進行歸一化處理,確保其每個分量都被映射至[0, 1]的區間。

array(double) normalize_vector(varchar summary, array(double) x_vector)

參數

說明

summary

summarize函數學習過程得到的模式。

x_vector

一組新樣本資料。

使用樣本

  • 查詢和分析語句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
    )
    
    select t1.entity_id,
        t1.entity_group,
        normalize_vector(t2.statistical_summary, t1.features) as normalized_features
    from data_table as t1
        join summaries as t2
        on t1.entity_group = t2.entity_group
  • 返回結果

    normalized_features為輸入樣本向量x_vector歸一化處理後的結果。

    entity_id

    entity_group

    normalized_features

    2

    A

    [0.14285714285714286,0.14285714285714286,0.25,0.5]

    4

    A

    [0.42857142857142857,0.42857142857142857,0.0,0.5]

    3

    A

    [0.2857142857142857,0.2857142857142857,0.4,0.5]

    ...

    ...

    ...

standardize_vector函數

使用summarize函數獲得的多變數模式summary,對新樣本向量x_vector進行標準化處理,將向量分量標準化為均值0和標準差1的某個值。

array(double) standardize_vector(varchar summary, array(double) x_vector)

參數

說明

summary

summarize函數學習過程得到的模式。

x_vector

一組新樣本資料。

使用樣本

  • 查詢和分析語句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
    )
    
    select t1.entity_id,
        t1.entity_group,
        standardize_vector(t2.statistical_summary, t1.features) as standardized_features
    from data_table as t1
        join summaries as t2
        on t1.entity_group = t2.entity_group
  • 返回結果

    standardized_features為輸入樣本向量x_vector標準化處理後的結果。

    entity_id

    entity_group

    standardized_features

    2

    A

    [-1.0910894511799619,-1.0910894511799619,-0.4415031470273609,0.0]

    4

    A

    [-0.21821789023599237,-0.21821789023599237,-1.1207387578386854,0.0]

    3

    A

    [-0.6546536707079771,-0.6546536707079771,-0.03396178054056622,0.0]

    ...

    ...

    ...

mah_distance函數

使用summarize函數獲得的多變數模式summary,對新樣本x_vector計算其馬氏距離。馬氏距離能夠有效處理不同變數間的尺度差異問題,通過標準化樣本向量x_vector到向量重心的距離來進行衡量。當該距離值為1時,表示樣本向量與重心之間的距離等於所有向量到重心平均距離。

double mah_distance(varchar summary, array(double) x_vector)

參數

說明

summary

summarize函數學習過程得到的模式。

x_vector

一組新樣本資料。

使用樣本

  • 查詢和分析語句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
    )
    
    select t1.entity_id,
        t1.entity_group,
        mah_distance(t2.statistical_summary, t1.features) as std_distance
    from data_table as t1
        join summaries as t2
        on t1.entity_group = t2.entity_group
  • 返回結果

    std_distance為輸入樣本向量x_vector的馬氏距離。

    entity_id

    entity_group

    std_distance

    8

    A

    2.386927730244857

    7

    A

    1.6809080087793125

    1

    A

    1.5554594371997328

    ...

    ...

    ...

standard_distance函數

使用summarize函數獲得的多變數模式summary,計算新樣本metric_value的標準化距離。使用summarize函數獲得的多變數模式summary,計算新樣本metric_value的標準化距離。標準化距離和馬氏距離類似,馬氏距離是計算由多個指標構成的一個向量到重心的標準化距離,而標準化距離是單個指標到重心的距離,其中element_index是該指標在向量中的索引(分量的索引是從0開始的,而非從1開始),metric_value是要標準化的指標值。

double standard_distance(varchar summary, double metric_value, int element_index)

參數

說明

summary

summarize函數學習過程得到的模式。

metric_value

新樣本資料。

element_index

summary數組指定分量的索引。分量的索引是從0開始的,而非從1開始。

使用樣本

  • 查詢和分析語句

    * | with data_table as
    (
        select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all
        select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all
        select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all
        select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all
        select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all
        select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all
        select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all
        select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features
    ),
    
    summaries as
    (
        select entity_group,
            summarize(array_agg(features)) as statistical_summary
        from data_table
        group by entity_group
    )
    
    select t1.entity_id,
        t1.entity_group,
        standard_distance(t2.statistical_summary, 30, 1) as std_distance
    from data_table as t1
        join summaries as t2
        on t1.entity_group = t2.entity_group
    
  • 返回結果

    std_distance為輸入樣本metric_value指定索引的標準化距離。

    entity_id

    entity_group

    std_distance

    8

    A

    2.386927730244857

    7

    A

    1.6809080087793125

    1

    A

    1.5554594371997328

    ...

    ...

    ...

anomaly_level函數

使用summarize函數獲得的多變數模式summary,對新樣本x_vector計算其馬氏距離,接著,通過對結果進行向下取整處理,得到1、2、3、4等結果,來分別表示不同層級的異常機率,異常機率的層級具體為:0.1(一級異常)、0.01(二級異常)、0.001(三級異常)、0.0001(四級異常)等。異常等級增加表明機率減小,樣本點的可疑性增加。使用者通常會對異常檢測結果設定閾值過濾,例如僅保留四級異常及以上的異常。

若指定了element_index,則僅計算向量指定索引分量的異常機率;若未指定,則計算所有分量的異常機率。

double anomaly_level(varchar summary, array(double) x_vector)

double anomaly_level(varchar summary, array(double) x_vector, int element_index)

參數

說明

summary

summarize函數學習過程得到的模式。

x_vector

一組新樣本資料。

element_index

選擇性參數。x_vector數組指定索引的元素。

使用樣本

  • 查詢和分析語句

    * |
    with dummy as
    (
        select sequence(1, 1000) as seq_data,
            count(*) as record_count from log
    ),
    
    sample_data as
    (
        select 'G1' as group_id,
            s.seq_num,
             -- 產生1000個二維的隨機向量資料點,向量中心在(100, 5000)這個位置,兩個分量的標準差分別為20和500
            inverse_normal_cdf(100, 20, random()) as x1,
            inverse_normal_cdf(5000, 500, rand()) as x2
        from dummy,
            unnest(seq_data) as s(seq_num)
    ),
    
    data_summary as
    (
        select group_id,
            summarize(array_agg(array[x1, x2])) as metric_summary
        from sample_data
        group by group_id
    ),
    
    new_data as
    (
        select 'G1' as group_id, 1001 as object_id, 100.0 as x1, 5000.0 as x2
        union all select 'G1' as group_id, 1002 as object_id, 118.0 as x1, 5450.0 as x2
        union all select 'G1' as group_id, 1003 as object_id, 138.0 as x1, 5950.0 as x2
        union all select 'G1' as group_id, 1004 as object_id, 158.0 as x1, 6450.0 as x2
        union all select 'G1' as group_id, 1005 as object_id, 178.0 as x1, 6950.0 as x2
        union all select 'G1' as group_id, 1006 as object_id, 198.0 as x1, 7450.0 as x2
        union all select 'G1' as group_id, 1007 as object_id, 318.0 as x1, 10000.0 as x2
    )
    
    select n.group_id,
        json_extract(s.metric_summary, '$.means') as metric_vector_mean,
        json_extract(s.metric_summary, '$.covariance') as metric_covariance,
        n.object_id,
        n.x1,
        n.x2,
        anomaly_level(s.metric_summary, array[x1, x2]) as anomaly_level
    from data_summary as s
        join new_data as n
            on s.group_id = n.group_id
    order by n.group_id, n.object_id
    limit 100000
    
  • 返回結果

    anomaly_level為輸入樣本向量x_vector的異常機率。

    group_id

    object_id

    anomaly_level

    G1

    1007

    13.0

    G1

    1006

    5.0

    G1

    1005

    4.0

    ...

    ...

    ...