Log Service通過異常檢測功能,自動識別服務系統中的異常狀態及其根源。該功能可自動識別指標的變化是否偏離正常模式,並結合指標當前模式與機器學習技術來進行異常檢測。本文主要介紹異常檢測功能(多變數模式識別函數)。
多變數模式識別函數列表
函數名稱 | 文法 | 說明 | 傳回值類型 |
| 通過對給定樣本及樣本權重(可選)進行統計學習,識別並返回多變數的統計模式,輸出結果為統計模式。統計模式涵蓋多種統計量與聯合統計量,例如平均值、標準差、共變數矩陣等。 | varchar | |
| 將不同階段分別用summarize函數學習得到的模式進行合并,包括同一資料集在不同時段學習出的模式,或者來自兩個獨立資料集各自學習出的模式。 | varchar | |
normalize_vector(varchar summary, array(double) x_vector) | 使用summarize函數獲得的多變數模式summary,對新樣本向量 | array(double) | |
standardize_vector(varchar summary, array(double) x_vector) | 使用summarize函數獲得的多變數模式summary,對新樣本向量 | array(double) | |
mah_distance(varchar summary, array(double) x_vector) | 使用summarize函數獲得的多變數模式summary,對新樣本 | double | |
standard_distance(varchar summary, double metric_value, int element_index) | 使用summarize函數獲得的多變數模式 | double | |
| 使用summarize函數獲得的多變數模式summary,對新樣本 若指定了 | array(double) |
summarize函數
通過對給定樣本及樣本權重(可選)進行統計學習,識別並返回多變數的統計模式,輸出結果為統計模式。統計模式涵蓋多種統計量與聯合統計量,例如平均值、標準差、共變數矩陣等。
varchar summarize(array(array(double)) data_samples)
或
varchar summarize(array(array(double)) data_samples, array(double) weights)
參數 | 說明 |
| 一個二維數組,可以看成一個二維表格。每列表示一個變數,每行表示一次觀測樣本。 |
| 選擇性參數。表示樣本的權重,表示為一個與 |
使用樣本
查詢和分析語句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ) select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group
返回結果
entity_group
statistical_summary
A
{ "sampleCount": 8, "vectorSize": 4, "means": [ 11.5, 12.5, 9.25, 0.0 ], "stdDevs": [ 6.87386354243376, 6.87386354243376, 7.361215932167728, 0.0 ], "variances": [ 47.25, 47.25, 54.1875, 0.0 ], "mins": [ 1.0, 2.0, 1.0, 0.0 ], "maxs": [ 22.0, 23.0, 21.0, 0.0 ], "covariance": [ [ 47.25, 47.25, 19.125, 0.0 ], [ 47.25, 47.25, 19.125, 0.0 ], [ 19.125, 19.125, 54.1875, 0.0 ], [ 0.0, 0.0, 0.0, 0.0 ] ], "correlations": [ [ 1.0, 1.0, 0.37796447300922725, 0.0 ], [ 1.0, 1.0, 0.37796447300922725, 0.0 ], [ 0.37796447300922725, 0.37796447300922725, 1.0, 0.0 ], [ 0.0, 0.0, 0.0, 1.0 ] ], "sums": [ 92.0, 100.0, 74.0, 0.0 ], "weightSum": 8.0, "sumProducts": [ [ 1436.0, 1528.0, 1004.0, 0.0 ], [ 1528.0, 1628.0, 1078.0, 0.0 ], [ 1004.0, 1078.0, 1118.0, 0.0 ], [ 0.0, 0.0, 0.0, 0.0 ] ], "isSummarized": true }
返回參數說明:
參數
說明
sampleCount
樣本數量。
vectorSize
向量長度。
means
所有向量各分量的平均值。
stdDevs
所有向量各分量的標準差。
variances
所有向量各分量的方差。
mins
所有向量各分量的最小值。
maxs
所有向量各分量的最大值。
covariance
所有向量各分量間的共變數矩陣。
correlations
所有向量各分量間的相關係數矩陣。
sums
所有向量各分量的和。
weightSum
所有樣本權重的和。
sumProducts
合并統計模式時候用到的中間結果。
isSummarized
模式統計計算是否正常返回。
true:正常返回。
false:非正常返回。
merge_summary函數
將不同階段分別用summarize函數學習得到的模式進行合并,包括同一資料集在不同時段學習出的模式,或者來自兩個獨立資料集各自學習出的模式。
varchar merge_summary(varchar summary1, varchar summary2)
或
varchar merge_summary(varchar summary1, double weight1, varchar summary2, double weight2)
參數 | 說明 |
| summarize函數學習過程得到的模式。 |
| summary1模式對應的整體權重。 |
| summarize函數學習過程得到的模式。 |
| summary2模式對應的整體權重。 |
使用樣本
查詢和分析語句
* | with data_table_01 as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features ), summaries_01 as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table_01 group by entity_group ), data_table_02 as ( select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries_02 as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table_02 group by entity_group ) select s1.entity_group, merge_summary(s1.statistical_summary, s2.statistical_summary) as statistical_summary from summaries_01 as s1 join summaries_02 as s2 on s1.entity_group = s2.entity_group
返回結果
statistical_summary
為整合後的模式。entity_group
statistical_summary
2
{ "sampleCount": 8, "vectorSize": 4, "means": [ 11.5, 12.5, 9.25, 0.0 ], "stdDevs": [ 6.87386354243376, 6.87386354243376, 7.361215932167728, 0.0 ], "variances": [ 47.25, 47.25, 54.1875, 0.0 ], "mins": [ 1.0, 2.0, 1.0, 0.0 ], "maxs": [ 22.0, 23.0, 21.0, 0.0 ], "covariance": [ [ 47.25, 47.25, 19.125, 0.0 ], [ 47.25, 47.25, 19.125, 0.0 ], [ 19.125, 19.125, 54.1875, 0.0 ], [ 0.0, 0.0, 0.0, 0.0 ] ], "correlations": [ [ 1.0, 1.0, 0.37796447300922725, 0.0 ], [ 1.0, 1.0, 0.37796447300922725, 0.0 ], [ 0.37796447300922725, 0.37796447300922725, 1.0, 0.0 ], [ 0.0, 0.0, 0.0, 1.0 ] ], "sums": [ 92.0, 100.0, 74.0, 0.0 ], "weightSum": 8.0, "sumProducts": [ [ 1436.0, 1528.0, 1004.0, 0.0 ], [ 1528.0, 1628.0, 1078.0, 0.0 ], [ 1004.0, 1078.0, 1118.0, 0.0 ], [ 0.0, 0.0, 0.0, 0.0 ] ], "isSummarized": true }
返回參數說明:
參數
說明
sampleCount
樣本數量。
vectorSize
向量長度。
means
所有向量各分量的平均值。
stdDevs
所有向量各分量的標準差。
variances
所有向量各分量的方差。
mins
所有向量各分量的最小值。
maxs
所有向量各分量的最大值。
covariance
所有向量各分量間的共變數矩陣。
correlations
所有向量各分量間的相關係數矩陣。
sums
所有向量各分量的和。
weightSum
所有樣本權重的和。
sumProducts
合并統計模式時候用到的中間結果。
isSummarized
模式統計計算是否正常返回。
true:正常返回。
false:非正常返回。
normalize_vector函數
使用summarize函數獲得的多變數模式summary,對新樣本向量x_vector
進行歸一化處理,確保其每個分量都被映射至[0, 1]的區間。
array(double) normalize_vector(varchar summary, array(double) x_vector)
參數 | 說明 |
| summarize函數學習過程得到的模式。 |
| 一組新樣本資料。 |
使用樣本
查詢和分析語句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group ) select t1.entity_id, t1.entity_group, normalize_vector(t2.statistical_summary, t1.features) as normalized_features from data_table as t1 join summaries as t2 on t1.entity_group = t2.entity_group
返回結果
normalized_features
為輸入樣本向量x_vector
歸一化處理後的結果。entity_id
entity_group
normalized_features
2
A
[0.14285714285714286,0.14285714285714286,0.25,0.5]
4
A
[0.42857142857142857,0.42857142857142857,0.0,0.5]
3
A
[0.2857142857142857,0.2857142857142857,0.4,0.5]
...
...
...
standardize_vector函數
使用summarize函數獲得的多變數模式summary,對新樣本向量x_vector
進行標準化處理,將向量分量標準化為均值0和標準差1的某個值。
array(double) standardize_vector(varchar summary, array(double) x_vector)
參數 | 說明 |
| summarize函數學習過程得到的模式。 |
| 一組新樣本資料。 |
使用樣本
查詢和分析語句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group ) select t1.entity_id, t1.entity_group, standardize_vector(t2.statistical_summary, t1.features) as standardized_features from data_table as t1 join summaries as t2 on t1.entity_group = t2.entity_group
返回結果
standardized_features
為輸入樣本向量x_vector
標準化處理後的結果。entity_id
entity_group
standardized_features
2
A
[-1.0910894511799619,-1.0910894511799619,-0.4415031470273609,0.0]
4
A
[-0.21821789023599237,-0.21821789023599237,-1.1207387578386854,0.0]
3
A
[-0.6546536707079771,-0.6546536707079771,-0.03396178054056622,0.0]
...
...
...
mah_distance函數
使用summarize函數獲得的多變數模式summary,對新樣本x_vector計
算其馬氏距離。馬氏距離能夠有效處理不同變數間的尺度差異問題,通過標準化樣本向量x_vector
到向量重心的距離來進行衡量。當該距離值為1時,表示樣本向量與重心之間的距離等於所有向量到重心平均距離。
double mah_distance(varchar summary, array(double) x_vector)
參數 | 說明 |
| summarize函數學習過程得到的模式。 |
| 一組新樣本資料。 |
使用樣本
查詢和分析語句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group ) select t1.entity_id, t1.entity_group, mah_distance(t2.statistical_summary, t1.features) as std_distance from data_table as t1 join summaries as t2 on t1.entity_group = t2.entity_group
返回結果
std_distance
為輸入樣本向量x_vector
的馬氏距離。entity_id
entity_group
std_distance
8
A
2.386927730244857
7
A
1.6809080087793125
1
A
1.5554594371997328
...
...
...
standard_distance函數
使用summarize函數獲得的多變數模式summary
,計算新樣本metric_value
的標準化距離。使用summarize函數獲得的多變數模式summary
,計算新樣本metric_value
的標準化距離。標準化距離和馬氏距離類似,馬氏距離是計算由多個指標構成的一個向量到重心的標準化距離,而標準化距離是單個指標到重心的距離,其中element_index
是該指標在向量中的索引(分量的索引是從0開始的,而非從1開始),metric_value
是要標準化的指標值。
double standard_distance(varchar summary, double metric_value, int element_index)
參數 | 說明 |
| summarize函數學習過程得到的模式。 |
| 新樣本資料。 |
|
|
使用樣本
查詢和分析語句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group ) select t1.entity_id, t1.entity_group, standard_distance(t2.statistical_summary, 30, 1) as std_distance from data_table as t1 join summaries as t2 on t1.entity_group = t2.entity_group
返回結果
std_distance
為輸入樣本metric_value
指定索引的標準化距離。entity_id
entity_group
std_distance
8
A
2.386927730244857
7
A
1.6809080087793125
1
A
1.5554594371997328
...
...
...
anomaly_level函數
使用summarize函數獲得的多變數模式summary,對新樣本x_vector
計算其馬氏距離,接著,通過對結果進行向下取整處理,得到1、2、3、4等結果,來分別表示不同層級的異常機率,異常機率的層級具體為:0.1(一級異常)、0.01(二級異常)、0.001(三級異常)、0.0001(四級異常)等。異常等級增加表明機率減小,樣本點的可疑性增加。使用者通常會對異常檢測結果設定閾值過濾,例如僅保留四級異常及以上的異常。
若指定了element_index
,則僅計算向量指定索引分量的異常機率;若未指定,則計算所有分量的異常機率。
double anomaly_level(varchar summary, array(double) x_vector)
或
double anomaly_level(varchar summary, array(double) x_vector, int element_index)
參數 | 說明 |
| summarize函數學習過程得到的模式。 |
| 一組新樣本資料。 |
| 選擇性參數。 |
使用樣本
查詢和分析語句
* | with dummy as ( select sequence(1, 1000) as seq_data, count(*) as record_count from log ), sample_data as ( select 'G1' as group_id, s.seq_num, -- 產生1000個二維的隨機向量資料點,向量中心在(100, 5000)這個位置,兩個分量的標準差分別為20和500 inverse_normal_cdf(100, 20, random()) as x1, inverse_normal_cdf(5000, 500, rand()) as x2 from dummy, unnest(seq_data) as s(seq_num) ), data_summary as ( select group_id, summarize(array_agg(array[x1, x2])) as metric_summary from sample_data group by group_id ), new_data as ( select 'G1' as group_id, 1001 as object_id, 100.0 as x1, 5000.0 as x2 union all select 'G1' as group_id, 1002 as object_id, 118.0 as x1, 5450.0 as x2 union all select 'G1' as group_id, 1003 as object_id, 138.0 as x1, 5950.0 as x2 union all select 'G1' as group_id, 1004 as object_id, 158.0 as x1, 6450.0 as x2 union all select 'G1' as group_id, 1005 as object_id, 178.0 as x1, 6950.0 as x2 union all select 'G1' as group_id, 1006 as object_id, 198.0 as x1, 7450.0 as x2 union all select 'G1' as group_id, 1007 as object_id, 318.0 as x1, 10000.0 as x2 ) select n.group_id, json_extract(s.metric_summary, '$.means') as metric_vector_mean, json_extract(s.metric_summary, '$.covariance') as metric_covariance, n.object_id, n.x1, n.x2, anomaly_level(s.metric_summary, array[x1, x2]) as anomaly_level from data_summary as s join new_data as n on s.group_id = n.group_id order by n.group_id, n.object_id limit 100000
返回結果
anomaly_level
為輸入樣本向量x_vector
的異常機率。group_id
object_id
anomaly_level
G1
1007
13.0
G1
1006
5.0
G1
1005
4.0
...
...
...