All Products
Search
Document Center

Simple Log Service:Cluster analysis functions

Last Updated:Oct 18, 2024

In business operations, cluster analysis helps identify similar objects such as users, commodities, and markets. You can formulate marketing strategies based on cluster analysis results to improve efficiency and profitability. This topic describes the syntax of cluster analysis functions. This topic also provides examples on how to use the functions.

Background information

  • Sample log and sample indexes for the clustering_centroids and to_cluster_label functions

    • The following figure shows the sample indexes. For more information, see Create indexes.

      image

    • The following code shows the sample log:

      {PushBack"entity_group":"A","entity_id":"0","x0":"9.43755123272417","x1":"10.9239183048056"}
      {PushBack"entity_group":"A","entity_id":"1","x0":"9.90934789781198","x1":"10.0410638670737"}
      {PushBack"entity_group":"A","entity_id":"2","x0":"10.0237761750608","x1":"11.5427618678773"}
      {PushBack"entity_group":"A","entity_id":"3","x0":"8.76413639285441","x1":"9.91100696065114"}
      {PushBack"entity_group":"A","entity_id":"4","x0":"9.72866019535702","x1":"10.9289903697879"}
      {PushBack"entity_group":"A","entity_id":"5","x0":"9.31732230036184","x1":"9.70753937768216"}
      {PushBack"entity_group":"A","entity_id":"6","x0":"10.25881289615","x1":"9.74090182514911"}
      {PushBack"entity_group":"A","entity_id":"7","x0":"10.7476586051507","x1":"9.68058142945478"}
      {PushBack"entity_group":"A","entity_id":"8","x0":"10.4225378902342","x1":"11.2940460054908"}
      {PushBack"entity_group":"A","entity_id":"9","x0":"9.60791396236535","x1":"9.74099863764865"}
      {PushBack"entity_group":"A","entity_id":"10","x0":"-9.88083328783069","x1":"9.96483490201365"}
      {PushBack"entity_group":"A","entity_id":"11","x0":"-9.2566893228908","x1":"10.4709996742353"}
      {PushBack"entity_group":"A","entity_id":"12","x0":"-9.42075011656993","x1":"10.2225398224866"}
      {PushBack"entity_group":"A","entity_id":"13","x0":"-11.0476084082138","x1":"9.6602175401245"}
      {PushBack"entity_group":"A","entity_id":"14","x0":"-10.4835290932454","x1":"10.0468761053371"}
      {PushBack"entity_group":"A","entity_id":"15","x0":"-10.0712229897491","x1":"10.7961735513211"}
      {PushBack"entity_group":"A","entity_id":"16","x0":"-9.20579985976076","x1":"11.3595617847332"}
      {PushBack"entity_group":"A","entity_id":"17","x0":"-10.2257991823914","x1":"11.5241122586671"}
      {PushBack"entity_group":"A","entity_id":"18","x0":"-9.53462858712671","x1":"10.4203828737972"}
      {PushBack"entity_group":"A","entity_id":"19","x0":"-9.32911152937254","x1":"8.64624234507702"}
      {PushBack"entity_group":"A","entity_id":"20","x0":"10.0383094216576","x1":"-10.8386382322694"}
  • Sample log and sample indexes for the clustering function

    • The following figure shows the sample indexes. For more information, see Create indexes.

      image

    • The following code shows the sample log:

      1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126,83,77,4200,None
      2,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None
      3,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None
      4,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea
      5,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea
      6,Male,28,Software Engineer,5.9,4,30,8,Obese,140,90,85,3000,Insomnia
      7,Male,29,Teacher,6.3,6,40,7,Obese,140,90,82,3500,Insomnia
      8,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      9,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      10,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      11,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None
      12,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      13,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None
      14,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      15,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      16,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      17,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Sleep Apnea
      18,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,Sleep Apnea
      19,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Insomnia
      20,Male,30,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      21,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      22,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      23,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      24,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      25,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      26,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      27,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      28,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      29,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      30,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      31,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Sleep Apnea
      32,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Insomnia
      33,Female,31,Nurse,7.9,8,75,4,Normal Weight,117,76,69,6800,None
      34,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      35,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      36,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      37,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      38,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      39,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      40,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      41,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      42,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      43,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      44,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      45,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      46,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      47,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      48,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      49,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      50,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,Sleep Apnea
      51,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None
      52,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None
      53,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      54,Male,32,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      55,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      56,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      57,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      58,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      59,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      60,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      61,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      62,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      63,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      64,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      65,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      66,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      67,Male,32,Accountant,7.2,8,50,6,Normal Weight,118,76,68,7000,None
      68,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,Insomnia
      69,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None
      70,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None
      71,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      72,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      73,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      74,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      75,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      76,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      77,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      78,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      79,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      80,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      81,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea
      82,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea
      83,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None
      84,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None
      85,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None
      86,Female,35,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None
      87,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None
      88,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None
      89,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      90,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      91,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      92,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      93,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None
      94,Male,35,Lawyer,7.4,7,60,5,Obese,135,88,84,3300,Sleep Apnea
      95,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,Insomnia
      96,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None
      97,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None
      98,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None
      99,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None
      100,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None
      101,Female,36,Teacher,7.2,8,60,4,Normal,115,75,68,7000,None

Functions

Cluster analysis is a data mining technique that automatically classifies objects in a dataset into different classes. This way, objects of the same class have a high degree of similarity, and objects of different classes have a low degree of similarity.

Function

Syntax

Description

Data type of the return value

clustering_centroids function

clustering_centroids(array(array(double)) samples, integer num_of_clusters)

Classifies the input sample data that describes object attributes. You can specify the number of classes that you want to obtain. After the sample data is classified, the clustering_centroids function returns the average value of each class. The average value of a class is called the centroid of the class.

array(array(double))

to_cluster_label function

to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)

Assigns an object to the nearest class based on the attribute vector of the object and the centroids of existing classes and returns the ID of the class to which the object is assigned. The centroids are returned by the clustering_centroids function.

long

clustering function

array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)

Clusters and analyzes the input sample data and returns a two-dimensional string array that consists of objects and the class IDs of the objects.

array(array(varchar))

clustering_centroids function

Clustering is an unsupervised learning method that you can use to classify objects in a dataset into non-overlapping classes. This way, objects of the same class have a high degree of similarity, and objects of different classes have a low degree of similarity. The clustering_centroids function returns the centroids of classes based on the input sample data and the number of classes that you specify. The sample data is a two-dimensional array.

array(array(double)) clustering_centroids(array(array(double)) samples, long num_of_clusters)

Parameter

Description

samples

A two-dimensional array. The array can be used as a two-dimensional table. Each row of data in the table is a one-dimensional array that specifies the attribute vector of an object.

num_of_clusters

The number of classes that you want to obtain.

Example

  • Query statement

    * | select entity_group,
        clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids
    from log
    group by entity_group
  • Query and analysis results

    The function returns a two-dimensional array that consists of the centroids of classes. Each centroid of a class is a one-dimensional array that specifies the attribute vector of an object.

    entity_group

    cluster_centroids

    A

    [[-9.845597237715113,10.311194085779278],[-9.512707125089744,-9.890571794654335],[9.821771754807048,10.351180864562114],[10.66747750273295,-9.081622556974239],[9.361759158361564,-9.91604553609393]]

to_cluster_label function

The to_cluster_label function assigns an object to the nearest class based on the centroids of existing classes and returns the ID of the class to which the object is assigned. The centroids are returned by the clustering_centroids function.

long to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)

Parameter

Description

centroids

The centroids of classes returned by the clustering_centroids function. The centroids constitute a two-dimensional array.

attribute_vector

The attribute vector of the object that you manage. Each attribute vector specifies a set of characteristics or attributes of an object.

Example

  • Query statement

    * | with centroids as
    (
        select entity_group,
            clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids
        from log
        group by entity_group
    )
    select t1.entity_id,
        t1.entity_group,
        to_cluster_label(t2.cluster_centroids, array[t1.x0, t1.x1]) as cluster_label
    from log as t1
        join centroids as t2
        on t1.entity_group = t2.entity_group
    order by entity_id
    limit 100000
  • Query and analysis results

    The cluster_label field in the query and analysis results indicates the ID of a class.

    entity_id

    entity_group

    cluster_label

    0

    A

    3

    1

    A

    3

    2

    A

    3

    ...

    ...

    ...

clustering function

The clustering function clusters and analyzes the input sample data and returns a two-dimensional string array that consists of objects and the class IDs of the objects.

array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)

Parameter

Description

attribute_vector

A one-dimensional array that consists of multiple attribute fields of an object. Make sure that the field types are converted to string. The original field types can be numeric or discrete text.

attribute_name_vector

A one-dimensional array that consists of the names of the attribute fields. Make sure that the field types are converted to numeric and normalized. This facilitates subsequent data analysis.

attribute_data_type_vector

A one-dimensional array that consists of the data types of the attribute fields. Make sure that the field types are converted to numeric and normalized. This facilitates subsequent data analysis.

Supported data types of the attribute fields:

  • ID_STR: the ID of the string type for the object.

  • ID_NUM: the ID of the numeric type for the object.

  • X_STR_CATEGORICAL: the discrete data of the string type. For example, the valid values of a field are Male, Female, and Unknown.

  • X_NUM_CATEGORICAL: the discrete data of the numeric type. For example, the valid values of a field are 1, 2, 3, 4, and 5.

  • X_NUMERIC: the continuous data of the numeric type.

Important

The lengths of arrays specified by the attribute_vector, attribute_name_vector, and attribute_data_type_vector parameters must be the same. For example, if 10 indexed fields are used for cluster analysis, the attribute_vector parameter value consists of the values of the 10 indexed fields, the attribute_name_vector parameter value consists of the names of the 10 indexed fields, and the attribute_data_type_vector parameter value consists of the data types of the 10 indexed fields.

num_of_clusters

Optional. The number of classes that you want to obtain. If you do not specify a value for this parameter, the value is automatically specified.

Example

  • Query statement

    In the query statement, the clustering function clusters and analyzes the input sample data and the UNNEST clause expands the results of the function. For more information, see UNNEST clause.

    * | with clustering_agg as
    (
      select group_id,
      clustering(
        array[cast(person_id as varchar),
              cast(gender as varchar),
              cast(age as varchar),
              cast(occupation as varchar),
              cast(sleep_duration as varchar),
              cast(quality_of_sleep as varchar),
              cast(physical_activity_level as varchar),
              cast(stress_level as varchar),
              cast(bmi_category as varchar),
              cast(blood_pressure_systolic as varchar),
              cast(blood_pressure_diastolic as varchar),
              cast(heart_rate as varchar),
              cast(daily_steps as varchar)], -- The attribute fields of the object. The value is an array that contains the object ID field. You can extract the object ID field after the UNNEST clause expands the clustering results.
        array['person_id', 'gender', 'age', 'occupation', 'sleep_duration',
              'quality_of_sleep', 'physical_activity_level', 'stress_level',
              'bmi_category', 'blood_pressure_systolic', 'blood_pressure_diastolic',
              'heart_rate', 'daily_steps'], -- The names of the attribute fields.
        array['ID_NUM',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC'],  -- The data types of the attribute fields.
        5  -- The number of classes that you want to obtain.
      ) as clustering_outcome
      from (
        select 'G1' as group_id, -- The group_id field indicates that an aggregate function is used.
            * from  log
      )as sleep_health_group_data
      group by group_id
    )
    
    select ca.group_id,
    	t. person_cluster[1] as person_id,
    	t. person_cluster[2] as cluster_id
    from clustering_agg as ca
    	cross join unnest(clustering_outcome) as t(person_cluster)
  • Query and analysis results

    group_id

    person_id

    cluster_id

    G1

    266

    1

    G1

    268

    1

    ...

    ...

    ...