全部產品
Search
文件中心

Simple Log Service:群集函數

更新時間:Sep 21, 2024

在商業運作中,群集可以協助我們發現相似的對象如使用者、商品、市場等,從而進行針對性的營銷策略,提高效率和收益。本文介紹彙總分類函數的基本文法和樣本。

背景資訊

  • 日誌樣本和索引樣本(clustering_centroids函數和to_cluster_label函數)

    • 索引如圖所示。更多資訊,請參見建立索引

      image

    • 日誌範例如下:

      {PushBack"entity_group":"A","entity_id":"0","x0":"9.43755123272417","x1":"10.9239183048056"}
      {PushBack"entity_group":"A","entity_id":"1","x0":"9.90934789781198","x1":"10.0410638670737"}
      {PushBack"entity_group":"A","entity_id":"2","x0":"10.0237761750608","x1":"11.5427618678773"}
      {PushBack"entity_group":"A","entity_id":"3","x0":"8.76413639285441","x1":"9.91100696065114"}
      {PushBack"entity_group":"A","entity_id":"4","x0":"9.72866019535702","x1":"10.9289903697879"}
      {PushBack"entity_group":"A","entity_id":"5","x0":"9.31732230036184","x1":"9.70753937768216"}
      {PushBack"entity_group":"A","entity_id":"6","x0":"10.25881289615","x1":"9.74090182514911"}
      {PushBack"entity_group":"A","entity_id":"7","x0":"10.7476586051507","x1":"9.68058142945478"}
      {PushBack"entity_group":"A","entity_id":"8","x0":"10.4225378902342","x1":"11.2940460054908"}
      {PushBack"entity_group":"A","entity_id":"9","x0":"9.60791396236535","x1":"9.74099863764865"}
      {PushBack"entity_group":"A","entity_id":"10","x0":"-9.88083328783069","x1":"9.96483490201365"}
      {PushBack"entity_group":"A","entity_id":"11","x0":"-9.2566893228908","x1":"10.4709996742353"}
      {PushBack"entity_group":"A","entity_id":"12","x0":"-9.42075011656993","x1":"10.2225398224866"}
      {PushBack"entity_group":"A","entity_id":"13","x0":"-11.0476084082138","x1":"9.6602175401245"}
      {PushBack"entity_group":"A","entity_id":"14","x0":"-10.4835290932454","x1":"10.0468761053371"}
      {PushBack"entity_group":"A","entity_id":"15","x0":"-10.0712229897491","x1":"10.7961735513211"}
      {PushBack"entity_group":"A","entity_id":"16","x0":"-9.20579985976076","x1":"11.3595617847332"}
      {PushBack"entity_group":"A","entity_id":"17","x0":"-10.2257991823914","x1":"11.5241122586671"}
      {PushBack"entity_group":"A","entity_id":"18","x0":"-9.53462858712671","x1":"10.4203828737972"}
      {PushBack"entity_group":"A","entity_id":"19","x0":"-9.32911152937254","x1":"8.64624234507702"}
      {PushBack"entity_group":"A","entity_id":"20","x0":"10.0383094216576","x1":"-10.8386382322694"}
  • 日誌樣本和索引樣本(clustering函數)

    • 索引如圖所示。更多資訊,請參見建立索引

      image

    • 日誌範例如下:

      1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126,83,77,4200,None
      2,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None
      3,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None
      4,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea
      5,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea
      6,Male,28,Software Engineer,5.9,4,30,8,Obese,140,90,85,3000,Insomnia
      7,Male,29,Teacher,6.3,6,40,7,Obese,140,90,82,3500,Insomnia
      8,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      9,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      10,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      11,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None
      12,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      13,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None
      14,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      15,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      16,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      17,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Sleep Apnea
      18,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,Sleep Apnea
      19,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Insomnia
      20,Male,30,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      21,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      22,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      23,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      24,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      25,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      26,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      27,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      28,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      29,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      30,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      31,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Sleep Apnea
      32,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Insomnia
      33,Female,31,Nurse,7.9,8,75,4,Normal Weight,117,76,69,6800,None
      34,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      35,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      36,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      37,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      38,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      39,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      40,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      41,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      42,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      43,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      44,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      45,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      46,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      47,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      48,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      49,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      50,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,Sleep Apnea
      51,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None
      52,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None
      53,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      54,Male,32,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      55,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      56,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      57,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      58,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      59,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      60,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      61,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      62,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      63,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      64,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      65,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      66,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      67,Male,32,Accountant,7.2,8,50,6,Normal Weight,118,76,68,7000,None
      68,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,Insomnia
      69,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None
      70,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None
      71,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      72,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      73,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      74,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      75,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      76,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      77,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      78,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      79,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      80,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      81,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea
      82,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea
      83,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None
      84,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None
      85,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None
      86,Female,35,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None
      87,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None
      88,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None
      89,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      90,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      91,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      92,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      93,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None
      94,Male,35,Lawyer,7.4,7,60,5,Obese,135,88,84,3300,Sleep Apnea
      95,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,Insomnia
      96,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None
      97,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None
      98,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None
      99,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None
      100,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None
      101,Female,36,Teacher,7.2,8,60,4,Normal,115,75,68,7000,None

群集函數列表

群集是一種資料採礦技術,它可以將資料集中的對象自動地分為不同的組,使得同一組內的對象具有較高的相似性,而不同組之間的對象具有較低的相似性。

函數名稱

文法

說明

傳回值類型

clustering_centroids函數

clustering_centroids(array(array(double)) samples, integer num_of_clusters)

通過群集對輸入的對象屬性樣本資料進行分類,使用者可以指定希望聚成的類組數量。在完成聚類處理後,函數將輸出每個類組資料的平均值(類心)。

array(array(double))

to_cluster_label函數

to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)

通過對象的屬性向量及clustering_centroidsFunction Compute的類中心,將對象分配至與其最接近的類組,並返回該類組的數位識別碼。

long

clustering函數

array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)

彙總函式,通過對提供的樣本進行群集,並返回一個二維字串數組,其中包含各對象及其對應的分類編號。

array(array(varchar))

clustering_centroids函數

聚類是一種無監督學習方法,將資料集分成若干個互不重疊的類,使得同一個類內的資料點相似性較高,而不同類的資料點相似性較低。clustering_centroids函數從輸入的樣本二維數組和聚類個數中,返回類組資料的平均值(類心)。

array(array(double)) clustering_centroids(array(array(double)) samples, long num_of_clusters)

參數

說明

samples

一個二維數組,可以看成一個二維表格,每一行資料構成一個一維資料,表示一個對象的屬性向量。

num_of_clusters

期望聚和成類的個數。

參數返回各個聚類的類心。

使用樣本

  • 查詢和分析語句

    * | select entity_group,
        clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids
    from log
    group by entity_group
  • 返回結果

    返回由各個聚類的類心組成的二維數組,每個聚類的類心是描述一個屬性向量的一維數組。

    entity_group

    cluster_centroids

    A

    [[-9.845597237715113,10.311194085779278],[-9.512707125089744,-9.890571794654335],[9.821771754807048,10.351180864562114],[10.66747750273295,-9.081622556974239],[9.361759158361564,-9.91604553609393]]

to_cluster_label函數

根據clustering_centroids函數擷取類心後,使用to_cluster_label函數將對象歸類到最接近的類中,to_cluster_label函數的傳回值是對象的類組的數字編號。

long to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)

參數

說明

centroids

函數clustering_centroids返回的類心(二維數組)中。

attribute_vector

被歸類的資料集的屬性向量(每個資料點的一組特徵或屬性)。

使用樣本

  • 查詢和分析語句

    * | with centroids as
    (
        select entity_group,
            clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids
        from log
        group by entity_group
    )
    select t1.entity_id,
        t1.entity_group,
        to_cluster_label(t2.cluster_centroids, array[t1.x0, t1.x1]) as cluster_label
    from log as t1
        join centroids as t2
        on t1.entity_group = t2.entity_group
    order by entity_id
    limit 100000
  • 返回結果

    cluster_label欄位為類組的數字編號。

    entity_id

    entity_group

    cluster_label

    0

    A

    3

    1

    A

    3

    2

    A

    3

    ...

    ...

    ...

clustering函數

彙總函式,通過自動對提供的樣本進行群集,並返回一個二維字串數組,其中包含各對象及其對應的分類編號。

array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)

參數

說明

attribute_vector

將一個對象的不同屬性轉換為字串後,拼裝成一個一維數組。屬性欄位的原始類型可以是數字,也可以是離散文本。

attribute_name_vector

將屬性欄位名組合成一維數組,對象屬性需統一轉換為數字類型並進行歸一化處理,以便在各階段中使用。

attribute_data_type_vector

將屬性欄位類型整合為一維數組,其中對象屬性統一轉換為數字類型並進行歸一化處理,這些步驟為後續階段的分析提供必要資訊。

屬性欄位類型:

  • ID_STR:對應欄位是聚類對象的字串類型的ID。

  • ID_NUM:對應欄位是聚類對象的數字類型的ID。

  • X_STR_CATEGORICAL:對應欄位是字串類型的離散型資料,例如某一欄位的值僅限於'Male'、'Female'或'Unknown'這幾個固定選項。

  • X_NUM_CATEGORICAL:對應欄位是離散型的資料,但是資料類型是數字離散型,例如某一欄位的值僅限於1, 2, 3, 4, 5這幾個固定選項。

  • X_NUMERIC:對應欄位是連續的數字類型資料。

重要

attribute_vectorattribute_name_vectorattribute_data_type_vector三個數組的長度必須保持一致。例如有10個索引欄位用於群集,10個索引欄位的值構成attribute_vector,10個索引欄位的名字構成attribute_name_vector,10個索引欄位對應的類型描述構成attribute_data_type_vector

num_of_clusters

選擇性參數。彙總的類數,如果不指定則會自動選擇。

使用樣本

  • 查詢和分析語句

    用clustering彙總函式進行分析,然後再用unnest函數展開結果。更多資訊,請參見UNNEST子句

    * | with clustering_agg as
    (
      select group_id,
      clustering(
        array[cast(person_id as varchar),
              cast(gender as varchar),
              cast(age as varchar),
              cast(occupation as varchar),
              cast(sleep_duration as varchar),
              cast(quality_of_sleep as varchar),
              cast(physical_activity_level as varchar),
              cast(stress_level as varchar),
              cast(bmi_category as varchar),
              cast(blood_pressure_systolic as varchar),
              cast(blood_pressure_diastolic as varchar),
              cast(heart_rate as varchar),
              cast(daily_steps as varchar)], -- 對象屬性內容數組,也包括對象ID欄位在內,方便後面unnest函數展開以後再提取這個欄位出來
        array['person_id', 'gender', 'age', 'occupation', 'sleep_duration',
              'quality_of_sleep', 'physical_activity_level', 'stress_level',
              'bmi_category', 'blood_pressure_systolic', 'blood_pressure_diastolic',
              'heart_rate', 'daily_steps'], -- 對象屬性欄位名稱
        array['ID_NUM',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC'],  -- 對象屬性欄位的類型
        5  -- 要聚類成為多少個分組
      ) as clustering_outcome
      from (
        select 'G1' as group_id, -- 加一個group_id,為了展示這是一個彙總函式
            * from  log
      )as sleep_health_group_data
      group by group_id
    )
    
    select ca.group_id,
    	t. person_cluster[1] as person_id,
    	t. person_cluster[2] as cluster_id
    from clustering_agg as ca
    	cross join unnest(clustering_outcome) as t(person_cluster)
  • 返回結果

    group_id

    person_id

    cluster_id

    G1

    266

    1

    G1

    268

    1

    ...

    ...

    ...