在商業運作中,群集可以協助我們發現相似的對象如使用者、商品、市場等,從而進行針對性的營銷策略,提高效率和收益。本文介紹彙總分類函數的基本文法和樣本。
背景資訊
日誌樣本和索引樣本(clustering_centroids函數和to_cluster_label函數)
索引如圖所示。更多資訊,請參見建立索引。
日誌範例如下:
{PushBack"entity_group":"A","entity_id":"0","x0":"9.43755123272417","x1":"10.9239183048056"} {PushBack"entity_group":"A","entity_id":"1","x0":"9.90934789781198","x1":"10.0410638670737"} {PushBack"entity_group":"A","entity_id":"2","x0":"10.0237761750608","x1":"11.5427618678773"} {PushBack"entity_group":"A","entity_id":"3","x0":"8.76413639285441","x1":"9.91100696065114"} {PushBack"entity_group":"A","entity_id":"4","x0":"9.72866019535702","x1":"10.9289903697879"} {PushBack"entity_group":"A","entity_id":"5","x0":"9.31732230036184","x1":"9.70753937768216"} {PushBack"entity_group":"A","entity_id":"6","x0":"10.25881289615","x1":"9.74090182514911"} {PushBack"entity_group":"A","entity_id":"7","x0":"10.7476586051507","x1":"9.68058142945478"} {PushBack"entity_group":"A","entity_id":"8","x0":"10.4225378902342","x1":"11.2940460054908"} {PushBack"entity_group":"A","entity_id":"9","x0":"9.60791396236535","x1":"9.74099863764865"} {PushBack"entity_group":"A","entity_id":"10","x0":"-9.88083328783069","x1":"9.96483490201365"} {PushBack"entity_group":"A","entity_id":"11","x0":"-9.2566893228908","x1":"10.4709996742353"} {PushBack"entity_group":"A","entity_id":"12","x0":"-9.42075011656993","x1":"10.2225398224866"} {PushBack"entity_group":"A","entity_id":"13","x0":"-11.0476084082138","x1":"9.6602175401245"} {PushBack"entity_group":"A","entity_id":"14","x0":"-10.4835290932454","x1":"10.0468761053371"} {PushBack"entity_group":"A","entity_id":"15","x0":"-10.0712229897491","x1":"10.7961735513211"} {PushBack"entity_group":"A","entity_id":"16","x0":"-9.20579985976076","x1":"11.3595617847332"} {PushBack"entity_group":"A","entity_id":"17","x0":"-10.2257991823914","x1":"11.5241122586671"} {PushBack"entity_group":"A","entity_id":"18","x0":"-9.53462858712671","x1":"10.4203828737972"} {PushBack"entity_group":"A","entity_id":"19","x0":"-9.32911152937254","x1":"8.64624234507702"} {PushBack"entity_group":"A","entity_id":"20","x0":"10.0383094216576","x1":"-10.8386382322694"}
日誌樣本和索引樣本(clustering函數)
索引如圖所示。更多資訊,請參見建立索引。
日誌範例如下:
1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126,83,77,4200,None 2,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None 3,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None 4,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea 5,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea 6,Male,28,Software Engineer,5.9,4,30,8,Obese,140,90,85,3000,Insomnia 7,Male,29,Teacher,6.3,6,40,7,Obese,140,90,82,3500,Insomnia 8,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 9,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 10,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 11,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None 12,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 13,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None 14,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 15,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 16,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 17,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Sleep Apnea 18,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,Sleep Apnea 19,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Insomnia 20,Male,30,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 21,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 22,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 23,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 24,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 25,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 26,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 27,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 28,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 29,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 30,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 31,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Sleep Apnea 32,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Insomnia 33,Female,31,Nurse,7.9,8,75,4,Normal Weight,117,76,69,6800,None 34,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 35,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 36,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 37,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 38,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 39,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 40,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 41,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 42,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 43,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 44,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 45,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 46,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 47,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 48,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 49,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 50,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,Sleep Apnea 51,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None 52,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None 53,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 54,Male,32,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 55,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 56,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 57,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 58,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 59,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 60,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 61,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 62,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 63,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 64,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 65,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 66,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 67,Male,32,Accountant,7.2,8,50,6,Normal Weight,118,76,68,7000,None 68,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,Insomnia 69,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None 70,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None 71,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 72,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 73,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 74,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 75,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 76,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 77,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 78,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 79,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 80,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 81,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea 82,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea 83,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None 84,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None 85,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None 86,Female,35,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None 87,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None 88,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None 89,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 90,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 91,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 92,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 93,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None 94,Male,35,Lawyer,7.4,7,60,5,Obese,135,88,84,3300,Sleep Apnea 95,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,Insomnia 96,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None 97,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None 98,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None 99,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None 100,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None 101,Female,36,Teacher,7.2,8,60,4,Normal,115,75,68,7000,None
群集函數列表
群集是一種資料採礦技術,它可以將資料集中的對象自動地分為不同的組,使得同一組內的對象具有較高的相似性,而不同組之間的對象具有較低的相似性。
函數名稱 | 文法 | 說明 | 傳回值類型 |
clustering_centroids(array(array(double)) samples, integer num_of_clusters) | 通過群集對輸入的對象屬性樣本資料進行分類,使用者可以指定希望聚成的類組數量。在完成聚類處理後,函數將輸出每個類組資料的平均值(類心)。 | array(array(double)) | |
to_cluster_label(array(array(double)) centroids, array(double) attribute_vector) | 通過對象的屬性向量及clustering_centroidsFunction Compute的類中心,將對象分配至與其最接近的類組,並返回該類組的數位識別碼。 | long | |
array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters) | 彙總函式,通過對提供的樣本進行群集,並返回一個二維字串數組,其中包含各對象及其對應的分類編號。 | array(array(varchar)) |
clustering_centroids函數
聚類是一種無監督學習方法,將資料集分成若干個互不重疊的類,使得同一個類內的資料點相似性較高,而不同類的資料點相似性較低。clustering_centroids函數從輸入的樣本二維數組和聚類個數中,返回類組資料的平均值(類心)。
array(array(double)) clustering_centroids(array(array(double)) samples, long num_of_clusters)
參數 | 說明 |
| 一個二維數組,可以看成一個二維表格,每一行資料構成一個一維資料,表示一個對象的屬性向量。 |
| 期望聚和成類的個數。 參數返回各個聚類的類心。 |
使用樣本
查詢和分析語句
* | select entity_group, clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids from log group by entity_group
返回結果
返回由各個聚類的類心組成的二維數組,每個聚類的類心是描述一個屬性向量的一維數組。
entity_group
cluster_centroids
A
[[-9.845597237715113,10.311194085779278],[-9.512707125089744,-9.890571794654335],[9.821771754807048,10.351180864562114],[10.66747750273295,-9.081622556974239],[9.361759158361564,-9.91604553609393]]
to_cluster_label函數
根據clustering_centroids函數擷取類心後,使用to_cluster_label函數將對象歸類到最接近的類中,to_cluster_label函數的傳回值是對象的類組的數字編號。
long to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)
參數 | 說明 |
| 函數clustering_centroids返回的類心(二維數組)中。 |
| 被歸類的資料集的屬性向量(每個資料點的一組特徵或屬性)。 |
使用樣本
查詢和分析語句
* | with centroids as ( select entity_group, clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids from log group by entity_group ) select t1.entity_id, t1.entity_group, to_cluster_label(t2.cluster_centroids, array[t1.x0, t1.x1]) as cluster_label from log as t1 join centroids as t2 on t1.entity_group = t2.entity_group order by entity_id limit 100000
返回結果
cluster_label
欄位為類組的數字編號。entity_id
entity_group
cluster_label
0
A
3
1
A
3
2
A
3
...
...
...
clustering函數
彙總函式,通過自動對提供的樣本進行群集,並返回一個二維字串數組,其中包含各對象及其對應的分類編號。
array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)
參數 | 說明 |
| 將一個對象的不同屬性轉換為字串後,拼裝成一個一維數組。屬性欄位的原始類型可以是數字,也可以是離散文本。 |
| 將屬性欄位名組合成一維數組,對象屬性需統一轉換為數字類型並進行歸一化處理,以便在各階段中使用。 |
| 將屬性欄位類型整合為一維數組,其中對象屬性統一轉換為數字類型並進行歸一化處理,這些步驟為後續階段的分析提供必要資訊。 屬性欄位類型:
重要
|
| 選擇性參數。彙總的類數,如果不指定則會自動選擇。 |
使用樣本
查詢和分析語句
用clustering彙總函式進行分析,然後再用unnest函數展開結果。更多資訊,請參見UNNEST子句。
* | with clustering_agg as ( select group_id, clustering( array[cast(person_id as varchar), cast(gender as varchar), cast(age as varchar), cast(occupation as varchar), cast(sleep_duration as varchar), cast(quality_of_sleep as varchar), cast(physical_activity_level as varchar), cast(stress_level as varchar), cast(bmi_category as varchar), cast(blood_pressure_systolic as varchar), cast(blood_pressure_diastolic as varchar), cast(heart_rate as varchar), cast(daily_steps as varchar)], -- 對象屬性內容數組,也包括對象ID欄位在內,方便後面unnest函數展開以後再提取這個欄位出來 array['person_id', 'gender', 'age', 'occupation', 'sleep_duration', 'quality_of_sleep', 'physical_activity_level', 'stress_level', 'bmi_category', 'blood_pressure_systolic', 'blood_pressure_diastolic', 'heart_rate', 'daily_steps'], -- 對象屬性欄位名稱 array['ID_NUM', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC'], -- 對象屬性欄位的類型 5 -- 要聚類成為多少個分組 ) as clustering_outcome from ( select 'G1' as group_id, -- 加一個group_id,為了展示這是一個彙總函式 * from log )as sleep_health_group_data group by group_id ) select ca.group_id, t. person_cluster[1] as person_id, t. person_cluster[2] as cluster_id from clustering_agg as ca cross join unnest(clustering_outcome) as t(person_cluster)
返回結果
group_id
person_id
cluster_id
G1
266
1
G1
268
1
...
...
...