The pattern mining function analyzes samples with multiple attributes based on specific conditions. It finds distinct patterns that affect data partitioning, helping you quickly identify the causes of differences between the conditions.
diff_patterns
This function compares two datasets with the same structure to identify patterns of differing discrete attributes (dimensions). It is designed to compare templates (frequent itemsets) in tabular data.
Syntax
diff_patterns($TABLE, $HEADER, $SPLIT_COLUMN, $TEST_VAL, $CONTROL_VAL, $METRIC_COLUMN, $WEIGHT_COLUMN, $PARAM)Parameters
When you use the pattern difference statistics function, ensure that the $SPLIT_COLUMN contains rows for both $TEST_VAL and $CONTROL_VAL. The function cannot perform a comparison if data for the test group or the control group is missing.
Parameter | Data type | Required | Description |
$TABLE | row<array<T>, array<E>, ..., array<F>> | Yes | The input table for the diff_patterns function. Each column is a dimension column for drill-down analysis. |
$HEADER | array<varchar> | Yes | The column names. These names correspond to the columns in $TABLE. The number of header names must match the number of table columns. |
$SPLIT_COLUMN | varchar | Yes | The name of the target column used to distinguish between groups. The tag category currently supports only the string type. |
$TEST_VAL | varchar | Yes | The value in $SPLIT_COLUMN that corresponds to the test group. Ensure that $TEST_VAL exists in $SPLIT_COLUMN. Otherwise, the two groups cannot be distinguished. Typically, $TEST_VAL corresponds to an abnormal value. |
$CONTROL_VAL | varchar | Yes | The value in $SPLIT_COLUMN that corresponds to the control group. Ensure that $CONTROL_VAL exists in $SPLIT_COLUMN. Otherwise, the two groups cannot be distinguished. $CONTROL_VAL corresponds to a normal value. |
$METRIC_COLUMN | double/int | No | The metric column. The default value is If you want to analyze the impact of dimension combinations on a specific metric in addition to the data groups, specify a metric column. For example, if a 'qps' column records your metric, the algorithm can calculate the differences in this metric for various dimension combinations. |
$WEIGHT_COLUMN | varchar | No | The weight column. The default value is '' , which means no weight column is specified. If this parameter is not set, all rows are assumed to have the same weight. If a weight column is set, all calculations consider the weights. For example, a row with a weight of 2 is treated as if it appeared twice compared to rows with a weight of 1. |
$PARAM | varchar | No | For more information, see param parameter details. |
param parameter details
Parameter | Description | Type | Required | Default | Valid values |
minimum_support_fraction | The minimum support of an output pattern in the test group. For example, if a pattern appears with a frequency of 0.1 in the test group, its support is 0.1. This parameter controls the number of output patterns by adjusting the sensitivity. | double | No | 0.05 | (0, 1) |
allow_sample | Specifies whether to allow sampling for calculations. | bool | No | true | [true, false] |
agg_op | A pattern filters multiple rows of data. The function then calculates the aggregate value of these rows in the METRIC_COLUMN. This parameter specifies the aggregation operation to use. If METRIC_COLUMN is not specified, the default value is 1 for the test group and 0 for the control group. | string | No | "avg" | ["sum", "avg", "max", "min", "count", "p95"] |
shapley_value_min_fraction | The minimum Shapley value fraction for a condition. This parameter controls the number of conditions in a pattern. | double | No | 0.1 | (0, 1) |
Examples
Query analysis:
The `set session enable_remote_functions=true` flag is required because this feature is in public preview. You must add this flag manually. This requirement will be removed in future versions.
* | set session enable_remote_functions=true ; with t0 as (select JSON_EXTRACT_SCALAR(entity, '$.platform') AS platform, JSON_EXTRACT_SCALAR(entity, '$.region') AS region, cast(value as double) as value, if((value > 100), 'true', 'false') as anomaly_label from log), t1 as ( select array_agg(platform) as platform, array_agg(region) as region, array_agg(anomaly_label) as anomaly_label, array_agg(value) as value from t0), t2 as (select row(platform, region, anomaly_label, value) as table_row from t1), t3 as (select diff_patterns(table_row, ARRAY['platform', 'region', 'anomaly_label', 'value'],'anomaly_label', 'true', 'false', 'value') as ret from t2) select * from t3Outputs:
[["\"platform\"='Amazon Prime'","\"platform\"='Console'","\"platform\"='VR'","\"platform\"='Mobile'","\"platform\"='PC'","\"platform\"='Hulu'","\"platform\"='Netflix' AND \"region\"='Asia'","\"platform\"='YouTube'","\"platform\"='Disney+' AND \"region\"='South America'"],[18720,56921,44516,47520,44640,8640,4320,15840,4320],[0,4999,7324,5760,4320,0,0,1440,0],[0.05248902272841978,0.15960083668399478,0.12481844742405635,0.13324136538752713,0.1251661311216164,0.024225702797732206,0.012112851398866103,0.04441378846250904,0.012112851398866103],[0.0,0.0032988991351195829,0.004833194091941553,0.003801092022062172,0.0028508190165466289,0.0,0.0,0.000950273005515543,0.0],[0.05248902272841978,0.1563019375488752,0.1199852533321148,0.12944027336546497,0.12231531210506977,0.024225702797732206,0.012112851398866103,0.0434635154569935,0.012112851398866103],[608.6946996258738,552.6152340898742,501.3011076154948,525.0147121416072,580.8151715760738,409.282718625793,631.6115865793741,603.7699053925991,526.960199868643],[0.0,58.46518308219622,55.396758003518488,49.12926473747681,44.68946999270535,0.0,0.0,20.76262974302821,0.0],null]
Return value details
Parameter | Type | Description | Example |
$RET.patterns | array<varchar> | The table templates, which are frequent itemsets. |
|
$RET.test_supports | array<bigint> | The frequency of the corresponding template in the test group. |
|
$RET.control_supports | array<bigint> | The frequency of the corresponding template in the control group. By comparison, the results that appear earlier in the diff_patterns output have a much higher frequency in the test group (abnormal group) than in the control group (normal group). |
|
$RET.test_ratio | array<double> | The ratio of the corresponding template in the test group (label=1) to the total test dataset. |
|
$RET.control_ratio | array<double> | The ratio of the corresponding template in the control group (label=0) to the total validation set. |
|
$RET.difference | array<double> | The difference between the test group data and the control group data. |
|
$RET.test_average_metric | array<double> | The average value of METRIC_COLUMN for data that matches the pattern. If METRIC_COLUMN is '', this return value is not meaningful. |
|
$RET.control_average_metric | array<double> | The average value of METRIC_COLUMN for data that does not match the pattern. If METRIC_COLUMN is '', this return value is not meaningful. |
|
$RET.error_msg | varchar/null | The error message. If no error occurs, the return value is |
|