在商业运作中,聚类分析可以帮助我们发现相似的对象如用户、商品、市场等,从而进行针对性的营销策略,提高效率和收益。本文介绍聚合分类函数的基本语法和示例。
背景信息
日志示例和索引示例(clustering_centroids函数和to_cluster_label函数)
索引如图所示。更多信息,请参见创建索引。
日志样例如下:
{PushBack"entity_group":"A","entity_id":"0","x0":"9.43755123272417","x1":"10.9239183048056"} {PushBack"entity_group":"A","entity_id":"1","x0":"9.90934789781198","x1":"10.0410638670737"} {PushBack"entity_group":"A","entity_id":"2","x0":"10.0237761750608","x1":"11.5427618678773"} {PushBack"entity_group":"A","entity_id":"3","x0":"8.76413639285441","x1":"9.91100696065114"} {PushBack"entity_group":"A","entity_id":"4","x0":"9.72866019535702","x1":"10.9289903697879"} {PushBack"entity_group":"A","entity_id":"5","x0":"9.31732230036184","x1":"9.70753937768216"} {PushBack"entity_group":"A","entity_id":"6","x0":"10.25881289615","x1":"9.74090182514911"} {PushBack"entity_group":"A","entity_id":"7","x0":"10.7476586051507","x1":"9.68058142945478"} {PushBack"entity_group":"A","entity_id":"8","x0":"10.4225378902342","x1":"11.2940460054908"} {PushBack"entity_group":"A","entity_id":"9","x0":"9.60791396236535","x1":"9.74099863764865"} {PushBack"entity_group":"A","entity_id":"10","x0":"-9.88083328783069","x1":"9.96483490201365"} {PushBack"entity_group":"A","entity_id":"11","x0":"-9.2566893228908","x1":"10.4709996742353"} {PushBack"entity_group":"A","entity_id":"12","x0":"-9.42075011656993","x1":"10.2225398224866"} {PushBack"entity_group":"A","entity_id":"13","x0":"-11.0476084082138","x1":"9.6602175401245"} {PushBack"entity_group":"A","entity_id":"14","x0":"-10.4835290932454","x1":"10.0468761053371"} {PushBack"entity_group":"A","entity_id":"15","x0":"-10.0712229897491","x1":"10.7961735513211"} {PushBack"entity_group":"A","entity_id":"16","x0":"-9.20579985976076","x1":"11.3595617847332"} {PushBack"entity_group":"A","entity_id":"17","x0":"-10.2257991823914","x1":"11.5241122586671"} {PushBack"entity_group":"A","entity_id":"18","x0":"-9.53462858712671","x1":"10.4203828737972"} {PushBack"entity_group":"A","entity_id":"19","x0":"-9.32911152937254","x1":"8.64624234507702"} {PushBack"entity_group":"A","entity_id":"20","x0":"10.0383094216576","x1":"-10.8386382322694"}
日志示例和索引示例(clustering函数)
索引如图所示。更多信息,请参见创建索引。
日志样例如下:
1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126,83,77,4200,None 2,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None 3,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None 4,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea 5,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea 6,Male,28,Software Engineer,5.9,4,30,8,Obese,140,90,85,3000,Insomnia 7,Male,29,Teacher,6.3,6,40,7,Obese,140,90,82,3500,Insomnia 8,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 9,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 10,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 11,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None 12,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 13,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None 14,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 15,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 16,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 17,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Sleep Apnea 18,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,Sleep Apnea 19,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Insomnia 20,Male,30,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 21,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 22,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 23,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 24,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 25,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 26,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 27,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 28,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 29,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 30,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 31,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Sleep Apnea 32,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Insomnia 33,Female,31,Nurse,7.9,8,75,4,Normal Weight,117,76,69,6800,None 34,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 35,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 36,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 37,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 38,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 39,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 40,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 41,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 42,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 43,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 44,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 45,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 46,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 47,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 48,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 49,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 50,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,Sleep Apnea 51,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None 52,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None 53,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 54,Male,32,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 55,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 56,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 57,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 58,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 59,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 60,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 61,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 62,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 63,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 64,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 65,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 66,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 67,Male,32,Accountant,7.2,8,50,6,Normal Weight,118,76,68,7000,None 68,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,Insomnia 69,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None 70,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None 71,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 72,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 73,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 74,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 75,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 76,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 77,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 78,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 79,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 80,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 81,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea 82,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea 83,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None 84,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None 85,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None 86,Female,35,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None 87,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None 88,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None 89,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 90,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 91,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 92,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 93,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None 94,Male,35,Lawyer,7.4,7,60,5,Obese,135,88,84,3300,Sleep Apnea 95,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,Insomnia 96,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None 97,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None 98,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None 99,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None 100,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None 101,Female,36,Teacher,7.2,8,60,4,Normal,115,75,68,7000,None
聚类分析函数列表
聚类分析是一种数据挖掘技术,它可以将数据集中的对象自动地分为不同的组,使得同一组内的对象具有较高的相似性,而不同组之间的对象具有较低的相似性。
函数名称 | 语法 | 说明 | 返回值类型 |
clustering_centroids(array(array(double)) samples, integer num_of_clusters) | 通过聚类分析对输入的对象属性样本数据进行分类,用户可以指定希望聚成的类组数量。在完成聚类处理后,函数将输出每个类组数据的平均值(类心)。 | array(array(double)) | |
to_cluster_label(array(array(double)) centroids, array(double) attribute_vector) | 通过对象的属性向量及clustering_centroids函数计算的类中心,将对象分配至与其最接近的类组,并返回该类组的数字标识。 | long | |
array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters) | 聚合函数,通过对提供的样本进行聚类分析,并返回一个二维字符串数组,其中包含各对象及其对应的分类编号。 | array(array(varchar)) |
clustering_centroids函数
聚类是一种无监督学习方法,将数据集分成若干个互不重叠的类,使得同一个类内的数据点相似度较高,而不同类的数据点相似度较低。clustering_centroids函数从输入的样本二维数组和聚类个数中,返回类组数据的平均值(类心)。
array(array(double)) clustering_centroids(array(array(double)) samples, long num_of_clusters)
参数 | 说明 |
| 一个二维数组,可以看成一个二维表格,每一行数据构成一个一维数据,表示一个对象的属性向量。 |
| 期望聚和成类的个数。 参数返回各个聚类的类心。 |
使用示例
查询和分析语句
* | select entity_group, clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids from log group by entity_group
返回结果
返回由各个聚类的类心组成的二维数组,每个聚类的类心是描述一个属性向量的一维数组。
entity_group
cluster_centroids
A
[[-9.845597237715113,10.311194085779278],[-9.512707125089744,-9.890571794654335],[9.821771754807048,10.351180864562114],[10.66747750273295,-9.081622556974239],[9.361759158361564,-9.91604553609393]]
to_cluster_label函数
根据clustering_centroids函数获取类心后,使用to_cluster_label函数将对象归类到最接近的类中,to_cluster_label函数的返回值是对象的类组的数字编号。
long to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)
参数 | 说明 |
| 函数clustering_centroids返回的类心(二维数组)中。 |
| 被归类的数据集的属性向量(每个数据点的一组特征或属性)。 |
使用示例
查询和分析语句
* | with centroids as ( select entity_group, clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids from log group by entity_group ) select t1.entity_id, t1.entity_group, to_cluster_label(t2.cluster_centroids, array[t1.x0, t1.x1]) as cluster_label from log as t1 join centroids as t2 on t1.entity_group = t2.entity_group order by entity_id limit 100000
返回结果
cluster_label
字段为类组的数字编号。entity_id
entity_group
cluster_label
0
A
3
1
A
3
2
A
3
...
...
...
clustering函数
聚合函数,通过自动对提供的样本进行聚类分析,并返回一个二维字符串数组,其中包含各对象及其对应的分类编号。
array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)
参数 | 说明 |
| 将一个对象的不同属性转换为字符串后,拼装成一个一维数组。属性字段的原始类型可以是数字,也可以是离散文本。 |
| 将属性字段名组合成一维数组,对象属性需统一转换为数字类型并进行归一化处理,以便在各阶段中使用。 |
| 将属性字段类型整合为一维数组,其中对象属性统一转换为数字类型并进行归一化处理,这些步骤为后续阶段的分析提供必要信息。 属性字段类型:
重要
|
| 可选参数。聚合的类数,如果不指定则会自动选择。 |
使用示例
查询和分析语句
用clustering聚合函数进行分析,然后再用unnest函数展开结果。更多信息,请参见UNNEST子句。
* | with clustering_agg as ( select group_id, clustering( array[cast(person_id as varchar), cast(gender as varchar), cast(age as varchar), cast(occupation as varchar), cast(sleep_duration as varchar), cast(quality_of_sleep as varchar), cast(physical_activity_level as varchar), cast(stress_level as varchar), cast(bmi_category as varchar), cast(blood_pressure_systolic as varchar), cast(blood_pressure_diastolic as varchar), cast(heart_rate as varchar), cast(daily_steps as varchar)], -- 对象属性内容数组,也包括对象ID字段在内,方便后面unnest函数展开以后再提取这个字段出来 array['person_id', 'gender', 'age', 'occupation', 'sleep_duration', 'quality_of_sleep', 'physical_activity_level', 'stress_level', 'bmi_category', 'blood_pressure_systolic', 'blood_pressure_diastolic', 'heart_rate', 'daily_steps'], -- 对象属性字段名称 array['ID_NUM', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC'], -- 对象属性字段的类型 5 -- 要聚类成为多少个分组 ) as clustering_outcome from ( select 'G1' as group_id, -- 加一个group_id,为了展示这是一个聚合函数 * from log )as sleep_health_group_data group by group_id ) select ca.group_id, t. person_cluster[1] as person_id, t. person_cluster[2] as cluster_id from clustering_agg as ca cross join unnest(clustering_outcome) as t(person_cluster)
返回结果
group_id
person_id
cluster_id
G1
266
1
G1
268
1
...
...
...