日志服务通过异常检测功能,自动识别服务系统中的异常状态及其根源。该功能可自动识别指标的变化是否偏离正常模式,并结合指标当前模式与机器学习技术来进行异常检测。本文主要介绍异常检测功能(多变量模式识别函数)。
多变量模式识别函数列表
函数名称 | 语法 | 说明 | 返回值类型 |
| 通过对给定样本及样本权重(可选)进行统计学习,识别并返回多变量的统计模式,输出结果为统计模式。统计模式涵盖多种统计量与联合统计量,例如平均值、标准差、协方差矩阵等。 | varchar | |
| 将不同阶段分别用summarize函数学习得到的模式进行合并,包括同一数据集在不同时段学习出的模式,或者来自两个独立数据集各自学习出的模式。 | varchar | |
normalize_vector(varchar summary, array(double) x_vector) | 使用summarize函数获得的多变量模式summary,对新样本向量 | array(double) | |
standardize_vector(varchar summary, array(double) x_vector) | 使用summarize函数获得的多变量模式summary,对新样本向量 | array(double) | |
mah_distance(varchar summary, array(double) x_vector) | 使用summarize函数获得的多变量模式summary,对新样本 | double | |
standard_distance(varchar summary, double metric_value, int element_index) | 使用summarize函数获得的多变量模式 | double | |
| 使用summarize函数获得的多变量模式summary,对新样本 若指定了 | array(double) |
summarize函数
通过对给定样本及样本权重(可选)进行统计学习,识别并返回多变量的统计模式,输出结果为统计模式。统计模式涵盖多种统计量与联合统计量,例如平均值、标准差、协方差矩阵等。
varchar summarize(array(array(double)) data_samples)
或
varchar summarize(array(array(double)) data_samples, array(double) weights)
参数 | 说明 |
| 一个二维数组,可以看成一个二维表格。每列表示一个变量,每行表示一次观测样本。 |
| 可选参数。表示样本的权重,表示为一个与 |
使用示例
查询和分析语句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ) select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group
返回结果
entity_group
statistical_summary
A
{ "sampleCount": 8, "vectorSize": 4, "means": [ 11.5, 12.5, 9.25, 0.0 ], "stdDevs": [ 6.87386354243376, 6.87386354243376, 7.361215932167728, 0.0 ], "variances": [ 47.25, 47.25, 54.1875, 0.0 ], "mins": [ 1.0, 2.0, 1.0, 0.0 ], "maxs": [ 22.0, 23.0, 21.0, 0.0 ], "covariance": [ [ 47.25, 47.25, 19.125, 0.0 ], [ 47.25, 47.25, 19.125, 0.0 ], [ 19.125, 19.125, 54.1875, 0.0 ], [ 0.0, 0.0, 0.0, 0.0 ] ], "correlations": [ [ 1.0, 1.0, 0.37796447300922725, 0.0 ], [ 1.0, 1.0, 0.37796447300922725, 0.0 ], [ 0.37796447300922725, 0.37796447300922725, 1.0, 0.0 ], [ 0.0, 0.0, 0.0, 1.0 ] ], "sums": [ 92.0, 100.0, 74.0, 0.0 ], "weightSum": 8.0, "sumProducts": [ [ 1436.0, 1528.0, 1004.0, 0.0 ], [ 1528.0, 1628.0, 1078.0, 0.0 ], [ 1004.0, 1078.0, 1118.0, 0.0 ], [ 0.0, 0.0, 0.0, 0.0 ] ], "isSummarized": true }
返回参数说明:
参数
说明
sampleCount
样本数量。
vectorSize
向量长度。
means
所有向量各分量的平均值。
stdDevs
所有向量各分量的标准差。
variances
所有向量各分量的方差。
mins
所有向量各分量的最小值。
maxs
所有向量各分量的最大值。
covariance
所有向量各分量间的协方差矩阵。
correlations
所有向量各分量间的相关系数矩阵。
sums
所有向量各分量的和。
weightSum
所有样本权重的和。
sumProducts
合并统计模式时候用到的中间结果。
isSummarized
模式统计计算是否正常返回。
true:正常返回。
false:非正常返回。
merge_summary函数
将不同阶段分别用summarize函数学习得到的模式进行合并,包括同一数据集在不同时段学习出的模式,或者来自两个独立数据集各自学习出的模式。
varchar merge_summary(varchar summary1, varchar summary2)
或
varchar merge_summary(varchar summary1, double weight1, varchar summary2, double weight2)
参数 | 说明 |
| summarize函数学习过程得到的模式。 |
| summary1模式对应的整体权重。 |
| summarize函数学习过程得到的模式。 |
| summary2模式对应的整体权重。 |
使用示例
查询和分析语句
* | with data_table_01 as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features ), summaries_01 as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table_01 group by entity_group ), data_table_02 as ( select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries_02 as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table_02 group by entity_group ) select s1.entity_group, merge_summary(s1.statistical_summary, s2.statistical_summary) as statistical_summary from summaries_01 as s1 join summaries_02 as s2 on s1.entity_group = s2.entity_group
返回结果
statistical_summary
为整合后的模式。entity_group
statistical_summary
2
{ "sampleCount": 8, "vectorSize": 4, "means": [ 11.5, 12.5, 9.25, 0.0 ], "stdDevs": [ 6.87386354243376, 6.87386354243376, 7.361215932167728, 0.0 ], "variances": [ 47.25, 47.25, 54.1875, 0.0 ], "mins": [ 1.0, 2.0, 1.0, 0.0 ], "maxs": [ 22.0, 23.0, 21.0, 0.0 ], "covariance": [ [ 47.25, 47.25, 19.125, 0.0 ], [ 47.25, 47.25, 19.125, 0.0 ], [ 19.125, 19.125, 54.1875, 0.0 ], [ 0.0, 0.0, 0.0, 0.0 ] ], "correlations": [ [ 1.0, 1.0, 0.37796447300922725, 0.0 ], [ 1.0, 1.0, 0.37796447300922725, 0.0 ], [ 0.37796447300922725, 0.37796447300922725, 1.0, 0.0 ], [ 0.0, 0.0, 0.0, 1.0 ] ], "sums": [ 92.0, 100.0, 74.0, 0.0 ], "weightSum": 8.0, "sumProducts": [ [ 1436.0, 1528.0, 1004.0, 0.0 ], [ 1528.0, 1628.0, 1078.0, 0.0 ], [ 1004.0, 1078.0, 1118.0, 0.0 ], [ 0.0, 0.0, 0.0, 0.0 ] ], "isSummarized": true }
返回参数说明:
参数
说明
sampleCount
样本数量。
vectorSize
向量长度。
means
所有向量各分量的平均值。
stdDevs
所有向量各分量的标准差。
variances
所有向量各分量的方差。
mins
所有向量各分量的最小值。
maxs
所有向量各分量的最大值。
covariance
所有向量各分量间的协方差矩阵。
correlations
所有向量各分量间的相关系数矩阵。
sums
所有向量各分量的和。
weightSum
所有样本权重的和。
sumProducts
合并统计模式时候用到的中间结果。
isSummarized
模式统计计算是否正常返回。
true:正常返回。
false:非正常返回。
normalize_vector函数
使用summarize函数获得的多变量模式summary,对新样本向量x_vector
进行归一化处理,确保其每个分量都被映射至[0, 1]的区间。
array(double) normalize_vector(varchar summary, array(double) x_vector)
参数 | 说明 |
| summarize函数学习过程得到的模式。 |
| 一组新样本数据。 |
使用示例
查询和分析语句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group ) select t1.entity_id, t1.entity_group, normalize_vector(t2.statistical_summary, t1.features) as normalized_features from data_table as t1 join summaries as t2 on t1.entity_group = t2.entity_group
返回结果
normalized_features
为输入样本向量x_vector
归一化处理后的结果。entity_id
entity_group
normalized_features
2
A
[0.14285714285714286,0.14285714285714286,0.25,0.5]
4
A
[0.42857142857142857,0.42857142857142857,0.0,0.5]
3
A
[0.2857142857142857,0.2857142857142857,0.4,0.5]
...
...
...
standardize_vector函数
使用summarize函数获得的多变量模式summary,对新样本向量x_vector
进行标准化处理,将向量分量标准化为均值0和标准差1的某个值。
array(double) standardize_vector(varchar summary, array(double) x_vector)
参数 | 说明 |
| summarize函数学习过程得到的模式。 |
| 一组新样本数据。 |
使用示例
查询和分析语句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group ) select t1.entity_id, t1.entity_group, standardize_vector(t2.statistical_summary, t1.features) as standardized_features from data_table as t1 join summaries as t2 on t1.entity_group = t2.entity_group
返回结果
standardized_features
为输入样本向量x_vector
标准化处理后的结果。entity_id
entity_group
standardized_features
2
A
[-1.0910894511799619,-1.0910894511799619,-0.4415031470273609,0.0]
4
A
[-0.21821789023599237,-0.21821789023599237,-1.1207387578386854,0.0]
3
A
[-0.6546536707079771,-0.6546536707079771,-0.03396178054056622,0.0]
...
...
...
mah_distance函数
使用summarize函数获得的多变量模式summary,对新样本x_vector计
算其马氏距离。马氏距离能够有效处理不同变量间的尺度差异问题,通过标准化样本向量x_vector
到向量重心的距离来进行衡量。当该距离值为1时,表示样本向量与重心之间的距离等于所有向量到重心平均距离。
double mah_distance(varchar summary, array(double) x_vector)
参数 | 说明 |
| summarize函数学习过程得到的模式。 |
| 一组新样本数据。 |
使用示例
查询和分析语句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group ) select t1.entity_id, t1.entity_group, mah_distance(t2.statistical_summary, t1.features) as std_distance from data_table as t1 join summaries as t2 on t1.entity_group = t2.entity_group
返回结果
std_distance
为输入样本向量x_vector
的马氏距离。entity_id
entity_group
std_distance
8
A
2.386927730244857
7
A
1.6809080087793125
1
A
1.5554594371997328
...
...
...
standard_distance函数
使用summarize函数获得的多变量模式summary
,计算新样本metric_value
的标准化距离。使用summarize函数获得的多变量模式summary
,计算新样本metric_value
的标准化距离。标准化距离和马氏距离类似,马氏距离是计算由多个指标构成的一个向量到重心的标准化距离,而标准化距离是单个指标到重心的距离,其中element_index
是该指标在向量中的索引(分量的索引是从0开始的,而非从1开始),metric_value
是要标准化的指标值。
double standard_distance(varchar summary, double metric_value, int element_index)
参数 | 说明 |
| summarize函数学习过程得到的模式。 |
| 新样本数据。 |
|
|
使用示例
查询和分析语句
* | with data_table as ( select 1 as entity_id, 'A' as entity_group, cast(array[1, 2, 3, 0] as array(double)) as features union all select 2 as entity_id, 'A' as entity_group, cast(array[4, 5, 6, 0] as array(double)) as features union all select 3 as entity_id, 'A' as entity_group, cast(array[7, 8, 9, 0] as array(double)) as features union all select 4 as entity_id, 'A' as entity_group, cast(array[10, 11, 1, 0] as array(double)) as features union all select 5 as entity_id, 'A' as entity_group, cast(array[13, 14, 15, 0] as array(double)) as features union all select 6 as entity_id, 'A' as entity_group, cast(array[16, 17, 18, 0] as array(double)) as features union all select 7 as entity_id, 'A' as entity_group, cast(array[19, 20, 21, 0] as array(double)) as features union all select 8 as entity_id, 'A' as entity_group, cast(array[22, 23, 1, 0] as array(double)) as features ), summaries as ( select entity_group, summarize(array_agg(features)) as statistical_summary from data_table group by entity_group ) select t1.entity_id, t1.entity_group, standard_distance(t2.statistical_summary, 30, 1) as std_distance from data_table as t1 join summaries as t2 on t1.entity_group = t2.entity_group
返回结果
std_distance
为输入样本metric_value
指定索引的标准化距离。entity_id
entity_group
std_distance
8
A
2.386927730244857
7
A
1.6809080087793125
1
A
1.5554594371997328
...
...
...
anomaly_level函数
使用summarize函数获得的多变量模式summary,对新样本x_vector
计算其马氏距离,接着,通过对结果进行向下取整处理,得到1、2、3、4等结果,来分别表示不同级别的异常概率,异常概率的级别具体为:0.1(一级异常)、0.01(二级异常)、0.001(三级异常)、0.0001(四级异常)等。异常等级增加表明概率减小,样本点的可疑性增加。用户通常会对异常检测结果设置阈值过滤,例如仅保留四级异常及以上的异常。
若指定了element_index
,则仅计算向量指定索引分量的异常概率;若未指定,则计算所有分量的异常概率。
double anomaly_level(varchar summary, array(double) x_vector)
或
double anomaly_level(varchar summary, array(double) x_vector, int element_index)
参数 | 说明 |
| summarize函数学习过程得到的模式。 |
| 一组新样本数据。 |
| 可选参数。 |
使用示例
查询和分析语句
* | with dummy as ( select sequence(1, 1000) as seq_data, count(*) as record_count from log ), sample_data as ( select 'G1' as group_id, s.seq_num, -- 产生1000个二维的随机向量数据点,向量中心在(100, 5000)这个位置,两个分量的标准差分别为20和500 inverse_normal_cdf(100, 20, random()) as x1, inverse_normal_cdf(5000, 500, rand()) as x2 from dummy, unnest(seq_data) as s(seq_num) ), data_summary as ( select group_id, summarize(array_agg(array[x1, x2])) as metric_summary from sample_data group by group_id ), new_data as ( select 'G1' as group_id, 1001 as object_id, 100.0 as x1, 5000.0 as x2 union all select 'G1' as group_id, 1002 as object_id, 118.0 as x1, 5450.0 as x2 union all select 'G1' as group_id, 1003 as object_id, 138.0 as x1, 5950.0 as x2 union all select 'G1' as group_id, 1004 as object_id, 158.0 as x1, 6450.0 as x2 union all select 'G1' as group_id, 1005 as object_id, 178.0 as x1, 6950.0 as x2 union all select 'G1' as group_id, 1006 as object_id, 198.0 as x1, 7450.0 as x2 union all select 'G1' as group_id, 1007 as object_id, 318.0 as x1, 10000.0 as x2 ) select n.group_id, json_extract(s.metric_summary, '$.means') as metric_vector_mean, json_extract(s.metric_summary, '$.covariance') as metric_covariance, n.object_id, n.x1, n.x2, anomaly_level(s.metric_summary, array[x1, x2]) as anomaly_level from data_summary as s join new_data as n on s.group_id = n.group_id order by n.group_id, n.object_id limit 100000
返回结果
anomaly_level
为输入样本向量x_vector
的异常概率。group_id
object_id
anomaly_level
G1
1007
13.0
G1
1006
5.0
G1
1005
4.0
...
...
...