In business operations, cluster analysis helps identify similar objects such as users, commodities, and markets. You can formulate marketing strategies based on cluster analysis results to improve efficiency and profitability. This topic describes the syntax of cluster analysis functions. This topic also provides examples on how to use the functions.
Background information
Sample log and sample indexes for the clustering_centroids and to_cluster_label functions
The following figure shows the sample indexes. For more information, see Create indexes.
The following code shows the sample log:
{PushBack"entity_group":"A","entity_id":"0","x0":"9.43755123272417","x1":"10.9239183048056"} {PushBack"entity_group":"A","entity_id":"1","x0":"9.90934789781198","x1":"10.0410638670737"} {PushBack"entity_group":"A","entity_id":"2","x0":"10.0237761750608","x1":"11.5427618678773"} {PushBack"entity_group":"A","entity_id":"3","x0":"8.76413639285441","x1":"9.91100696065114"} {PushBack"entity_group":"A","entity_id":"4","x0":"9.72866019535702","x1":"10.9289903697879"} {PushBack"entity_group":"A","entity_id":"5","x0":"9.31732230036184","x1":"9.70753937768216"} {PushBack"entity_group":"A","entity_id":"6","x0":"10.25881289615","x1":"9.74090182514911"} {PushBack"entity_group":"A","entity_id":"7","x0":"10.7476586051507","x1":"9.68058142945478"} {PushBack"entity_group":"A","entity_id":"8","x0":"10.4225378902342","x1":"11.2940460054908"} {PushBack"entity_group":"A","entity_id":"9","x0":"9.60791396236535","x1":"9.74099863764865"} {PushBack"entity_group":"A","entity_id":"10","x0":"-9.88083328783069","x1":"9.96483490201365"} {PushBack"entity_group":"A","entity_id":"11","x0":"-9.2566893228908","x1":"10.4709996742353"} {PushBack"entity_group":"A","entity_id":"12","x0":"-9.42075011656993","x1":"10.2225398224866"} {PushBack"entity_group":"A","entity_id":"13","x0":"-11.0476084082138","x1":"9.6602175401245"} {PushBack"entity_group":"A","entity_id":"14","x0":"-10.4835290932454","x1":"10.0468761053371"} {PushBack"entity_group":"A","entity_id":"15","x0":"-10.0712229897491","x1":"10.7961735513211"} {PushBack"entity_group":"A","entity_id":"16","x0":"-9.20579985976076","x1":"11.3595617847332"} {PushBack"entity_group":"A","entity_id":"17","x0":"-10.2257991823914","x1":"11.5241122586671"} {PushBack"entity_group":"A","entity_id":"18","x0":"-9.53462858712671","x1":"10.4203828737972"} {PushBack"entity_group":"A","entity_id":"19","x0":"-9.32911152937254","x1":"8.64624234507702"} {PushBack"entity_group":"A","entity_id":"20","x0":"10.0383094216576","x1":"-10.8386382322694"}
Sample log and sample indexes for the clustering function
The following figure shows the sample indexes. For more information, see Create indexes.
The following code shows the sample log:
1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126,83,77,4200,None 2,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None 3,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None 4,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea 5,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea 6,Male,28,Software Engineer,5.9,4,30,8,Obese,140,90,85,3000,Insomnia 7,Male,29,Teacher,6.3,6,40,7,Obese,140,90,82,3500,Insomnia 8,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 9,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 10,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 11,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None 12,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 13,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None 14,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 15,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 16,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None 17,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Sleep Apnea 18,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,Sleep Apnea 19,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Insomnia 20,Male,30,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 21,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 22,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 23,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 24,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 25,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 26,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 27,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 28,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 29,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 30,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None 31,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Sleep Apnea 32,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Insomnia 33,Female,31,Nurse,7.9,8,75,4,Normal Weight,117,76,69,6800,None 34,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 35,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 36,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 37,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 38,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 39,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 40,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 41,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 42,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 43,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 44,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 45,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 46,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 47,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 48,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None 49,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 50,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,Sleep Apnea 51,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None 52,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None 53,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 54,Male,32,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None 55,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 56,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 57,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 58,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 59,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 60,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None 61,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 62,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None 63,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 64,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 65,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 66,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None 67,Male,32,Accountant,7.2,8,50,6,Normal Weight,118,76,68,7000,None 68,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,Insomnia 69,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None 70,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None 71,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 72,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 73,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 74,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None 75,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 76,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 77,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 78,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 79,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 80,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None 81,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea 82,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea 83,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None 84,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None 85,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None 86,Female,35,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None 87,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None 88,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None 89,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 90,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 91,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 92,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None 93,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None 94,Male,35,Lawyer,7.4,7,60,5,Obese,135,88,84,3300,Sleep Apnea 95,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,Insomnia 96,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None 97,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None 98,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None 99,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None 100,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None 101,Female,36,Teacher,7.2,8,60,4,Normal,115,75,68,7000,None
Functions
Cluster analysis is a data mining technique that automatically classifies objects in a dataset into different classes. This way, objects of the same class have a high degree of similarity, and objects of different classes have a low degree of similarity.
Function | Syntax | Description | Data type of the return value |
clustering_centroids(array(array(double)) samples, integer num_of_clusters) | Classifies the input sample data that describes object attributes. You can specify the number of classes that you want to obtain. After the sample data is classified, the clustering_centroids function returns the average value of each class. The average value of a class is called the centroid of the class. | array(array(double)) | |
to_cluster_label(array(array(double)) centroids, array(double) attribute_vector) | Assigns an object to the nearest class based on the attribute vector of the object and the centroids of existing classes and returns the ID of the class to which the object is assigned. The centroids are returned by the clustering_centroids function. | long | |
array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters) | Clusters and analyzes the input sample data and returns a two-dimensional string array that consists of objects and the class IDs of the objects. | array(array(varchar)) |
clustering_centroids function
Clustering is an unsupervised learning method that you can use to classify objects in a dataset into non-overlapping classes. This way, objects of the same class have a high degree of similarity, and objects of different classes have a low degree of similarity. The clustering_centroids function returns the centroids of classes based on the input sample data and the number of classes that you specify. The sample data is a two-dimensional array.
array(array(double)) clustering_centroids(array(array(double)) samples, long num_of_clusters)
Parameter | Description |
| A two-dimensional array. The array can be used as a two-dimensional table. Each row of data in the table is a one-dimensional array that specifies the attribute vector of an object. |
| The number of classes that you want to obtain. |
Example
Query statement
* | select entity_group, clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids from log group by entity_group
Query and analysis results
The function returns a two-dimensional array that consists of the centroids of classes. Each centroid of a class is a one-dimensional array that specifies the attribute vector of an object.
entity_group
cluster_centroids
A
[[-9.845597237715113,10.311194085779278],[-9.512707125089744,-9.890571794654335],[9.821771754807048,10.351180864562114],[10.66747750273295,-9.081622556974239],[9.361759158361564,-9.91604553609393]]
to_cluster_label function
The to_cluster_label function assigns an object to the nearest class based on the centroids of existing classes and returns the ID of the class to which the object is assigned. The centroids are returned by the clustering_centroids function.
long to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)
Parameter | Description |
| The centroids of classes returned by the clustering_centroids function. The centroids constitute a two-dimensional array. |
| The attribute vector of the object that you manage. Each attribute vector specifies a set of characteristics or attributes of an object. |
Example
Query statement
* | with centroids as ( select entity_group, clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids from log group by entity_group ) select t1.entity_id, t1.entity_group, to_cluster_label(t2.cluster_centroids, array[t1.x0, t1.x1]) as cluster_label from log as t1 join centroids as t2 on t1.entity_group = t2.entity_group order by entity_id limit 100000
Query and analysis results
The
cluster_label
field in the query and analysis results indicates the ID of a class.entity_id
entity_group
cluster_label
0
A
3
1
A
3
2
A
3
...
...
...
clustering function
The clustering function clusters and analyzes the input sample data and returns a two-dimensional string array that consists of objects and the class IDs of the objects.
array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)
Parameter | Description |
| A one-dimensional array that consists of multiple attribute fields of an object. Make sure that the field types are converted to string. The original field types can be numeric or discrete text. |
| A one-dimensional array that consists of the names of the attribute fields. Make sure that the field types are converted to numeric and normalized. This facilitates subsequent data analysis. |
| A one-dimensional array that consists of the data types of the attribute fields. Make sure that the field types are converted to numeric and normalized. This facilitates subsequent data analysis. Supported data types of the attribute fields:
Important The lengths of arrays specified by the |
| Optional. The number of classes that you want to obtain. If you do not specify a value for this parameter, the value is automatically specified. |
Example
Query statement
In the query statement, the clustering function clusters and analyzes the input sample data and the UNNEST clause expands the results of the function. For more information, see UNNEST clause.
* | with clustering_agg as ( select group_id, clustering( array[cast(person_id as varchar), cast(gender as varchar), cast(age as varchar), cast(occupation as varchar), cast(sleep_duration as varchar), cast(quality_of_sleep as varchar), cast(physical_activity_level as varchar), cast(stress_level as varchar), cast(bmi_category as varchar), cast(blood_pressure_systolic as varchar), cast(blood_pressure_diastolic as varchar), cast(heart_rate as varchar), cast(daily_steps as varchar)], -- The attribute fields of the object. The value is an array that contains the object ID field. You can extract the object ID field after the UNNEST clause expands the clustering results. array['person_id', 'gender', 'age', 'occupation', 'sleep_duration', 'quality_of_sleep', 'physical_activity_level', 'stress_level', 'bmi_category', 'blood_pressure_systolic', 'blood_pressure_diastolic', 'heart_rate', 'daily_steps'], -- The names of the attribute fields. array['ID_NUM', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_STR_CATEGORICAL', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC', 'X_NUMERIC'], -- The data types of the attribute fields. 5 -- The number of classes that you want to obtain. ) as clustering_outcome from ( select 'G1' as group_id, -- The group_id field indicates that an aggregate function is used. * from log )as sleep_health_group_data group by group_id ) select ca.group_id, t. person_cluster[1] as person_id, t. person_cluster[2] as cluster_id from clustering_agg as ca cross join unnest(clustering_outcome) as t(person_cluster)
Query and analysis results
group_id
person_id
cluster_id
G1
266
1
G1
268
1
...
...
...