全部產品
Search
文件中心

Simple Log Service:差異模式挖掘(表格式資料下鑽分析)函數

更新時間:Sep 24, 2025

差異模式挖掘函數基於給定的多屬性欄位樣本,在給定的判別條件下,分析出影響該條件劃分的差異化模式集合,協助您快速診斷導致當前判別條件差異的原因。

diff_patterns

用於比較兩個具有相同結構的資料集,並找出表格這兩個資料集之間差異的離散屬性(維度)模式。專指對於表格式資料的模板(頻繁集)比較。

文法

diff_patterns($TABLE, $HEADER, $SPLIT_COLUMN, $TEST_VAL, $CONTROL_VAL, $METRIC_COLUMN, $WEIGHT_COLUMN, $PARAM)

參數說明

重要

在使用差異模式統計函數時,必須確保 $SPLIT_COLUMN 中包含 $TEST_VAL 和 $CONTROL_VAL 的行,即測試組和對照組均有資料存在。缺少任意一組的資料將導致無法進行比較。

參數

資料類型

是否必選

說明

$TABLE

row<array<T>, array<E>, ..., array<F>>

diff_patterns 待下鑽的輸入資料的表格,每一列是一個待下鑽的維度列。

$HEADER

array<varchar>

列名,和$TABLE對應,header名字數和table列的數量一致。

$SPLIT_COLUMN

varchar

標註列的列名,用於區分不同的分組的標籤列名稱,標籤類目前只支援string類型。

$TEST_VAL

varchar

測試組資料對應的 $SPLIT_COLUMN 中的值。

需要保證 $TEST_VAL 在 $SPLIT_COLUMN 出現過,不然無法區分兩組。通常來說,$TEST_VAL對應於異常值。

$CONTROL_VAL

varchar

對照組資料對應的 $SPLIT_COLUMN 中的值。

需要保證 $TEST_VAL 在 $SPLIT_COLUMN 出現過,不然無法區分兩組。$CONTROL_VAL對應於正常值。

$METRIC_COLUMN

double/int

指標列,預設為 '' (沒有指標列)。

如果我們不僅關注資料的分組,還關注不同維度組合對於某個指標的影響,比如說'qps'這一列記錄了我們關注的指標,那麼我們通過增加指標列,來進一步讓演算法能計算出不同維度組合在指標列上面的差異。

$WEIGHT_COLUMN

varchar

權重列,預設為 '' (沒有權重列)。

認為各行的權重相同。如果設定了的話,我們的全部計算過程都會和權重有關。比如說某一行的權重為2而其他行的權重為1,那麼這一行相當於出現了兩次。

$PARAM

varchar

參見param參數說明

param參數說明

參數名

參數解釋

參數類型

是否必填

預設值

取值範圍

minimum_support_fraction

輸出的pattern在測試組的最低的支援度。比如某個pattern在測試組出現的頻率是0.1,那麼我們說這個pattern的支援度是0.1。這個參數可以通過控制pattern的敏感度控制pattern的數量。

double

0.05

(0, 1)

allow_sample

是否允許採樣計算。如果不允許採樣計算的話

bool

true

[true, false]

agg_op

一個pattern會篩選出很多行資料,我們會計算這些行在METRIC_COLUMN的彙總值。這個參數是指定我們考慮的彙總operation。如果METRIC_COLUMN沒有取值的話,預設測試組取1,對照組取0。

string

"avg"

["sum", "avg", "max", "min", "count", "p95"]

shapley_value_min_fraction

某個條件的最低的shapley value佔比。這個參數可以控制一個pattern裡麵條件的數量

double

0.1

(0, 1)

樣本

  • 查詢分析:

    關於"set session enable_remote_functions=true ; ", 目前功能處於公測階段,需手動添加該 flag。後續版本將移除此要求,實現自動化支援。
    * | set session enable_remote_functions=true ;
    with t0 as (select  JSON_EXTRACT_SCALAR(entity, '$.platform') AS platform,  JSON_EXTRACT_SCALAR(entity, '$.region') AS region, cast(value as double) as value, if((value > 100), 'true', 'false') as anomaly_label from log), 
    t1 as ( select array_agg(platform) as platform, array_agg(region) as region, array_agg(anomaly_label) as anomaly_label, array_agg(value) as value from t0),
    t2 as (select row(platform, region, anomaly_label, value) as table_row from t1),
    t3 as (select diff_patterns(table_row, ARRAY['platform', 'region', 'anomaly_label', 'value'],'anomaly_label', 'true', 'false', 'value') as ret from t2)
    select * from t3
  • 輸出結果:

    [["\"platform\"='Amazon Prime'","\"platform\"='Console'","\"platform\"='VR'","\"platform\"='Mobile'","\"platform\"='PC'","\"platform\"='Hulu'","\"platform\"='Netflix' AND \"region\"='Asia'","\"platform\"='YouTube'","\"platform\"='Disney+' AND \"region\"='South America'"],[18720,56921,44516,47520,44640,8640,4320,15840,4320],[0,4999,7324,5760,4320,0,0,1440,0],[0.05248902272841978,0.15960083668399478,0.12481844742405635,0.13324136538752713,0.1251661311216164,0.024225702797732206,0.012112851398866103,0.04441378846250904,0.012112851398866103],[0.0,0.0032988991351195829,0.004833194091941553,0.003801092022062172,0.0028508190165466289,0.0,0.0,0.000950273005515543,0.0],[0.05248902272841978,0.1563019375488752,0.1199852533321148,0.12944027336546497,0.12231531210506977,0.024225702797732206,0.012112851398866103,0.0434635154569935,0.012112851398866103],[608.6946996258738,552.6152340898742,501.3011076154948,525.0147121416072,580.8151715760738,409.282718625793,631.6115865793741,603.7699053925991,526.960199868643],[0.0,58.46518308219622,55.396758003518488,49.12926473747681,44.68946999270535,0.0,0.0,20.76262974302821,0.0],null]

返回結果說明

參數

類型

說明

對應樣本

$RET.patterns

array<varchar>

表格模板,頻繁集。

["\"platform\"='Amazon Prime'","\"platform\"='Console'","\"platform\"='VR'","\"platform\"='Mobile'","\"platform\"='PC'","\"platform\"='Hulu'","\"platform\"='Netflix' AND \"region\"='Asia'","\"platform\"='YouTube'","\"platform\"='Disney+' AND \"region\"='South America'"]

$RET.test_supports

array<bigint>

測試組中對應模板出現的頻次。

[18720,56921,44516,47520,44640,8640,4320,15840,4320],[0,4999,7324,5760,4320,0,0,1440,0]

$RET.control_supports

array<bigint>

對照組中對應模板出現的頻次。

通過對比,diff_patterns返回結果中靠前的結果在測試組 (異常組)裡面的出現的頻次遠高於在對照組(正常組)中出現的頻次。

0,4999,7324,5760,4320,0,0,1440,0]

$RET.test_ratio

array<double>

測試組(label=1)中對應模板占測試集資料的比例。

[0.05248902272841978,0.15960083668399478,0.12481844742405635,0.13324136538752713,0.1251661311216164,0.024225702797732206,0.012112851398866103,0.04441378846250904,0.012112851398866103]

$RET.control_ratio

array<double>

對照組(label=0)中對應模板占驗證集資料的比例。

[0.0,0.0032988991351195829,0.004833194091941553,0.003801092022062172,0.0028508190165466289,0.0,0.0,0.000950273005515543,0.0]

$RET.difference

array<double>

測試組資料相較對照組資料的差異。

[0.05248902272841978,0.1563019375488752,0.1199852533321148,0.12944027336546497,0.12231531210506977,0.024225702797732206,0.012112851398866103,0.0434635154569935,0.012112851398866103]

$RET.test_average_metric

array<double>

滿足pattern的資料中 METRIC_COLUMN的平均值。如果METRIC_COLUMN為'',這個傳回值不具備參考意義。

[608.6946996258738,552.6152340898742,501.3011076154948,525.0147121416072,580.8151715760738,409.282718625793,631.6115865793741,603.7699053925991,526.960199868643]

$RET.control_average_metric

array<double>

不滿足pattern的資料中 METRIC_COLUMN的平均值。如果METRIC_COLUMN為'',這個傳回值不具備參考意義。

[0.0,58.46518308219622,55.396758003518488,49.12926473747681,44.68946999270535,0.0,0.0,20.76262974302821,0.0]

$RET.error_msg

varchar/null

報錯資訊。如果沒有報錯,則傳回值為null

null