Clustering Model Evaluation

The Clustering Model Evaluation component is used to evaluate clustering models and generate evaluation metrics based on raw data and clustering results.

Limits

The report of this component is available only in the Machine Learning Studio console.

Background information

The Calinski-Harabasz index, also known as the variance ratio criterion (VRC), is calculated by using the following formula: $\mathrm{VRC} = \frac{SS_B}{SS_W} \times \frac{N - k}{k - 1}$. The following table describes the terms in the formula.

Parameter	Description
SS_B	The variance between clusters: $SS_B = \sum_{i=1}^{k} n_i \lVert m_i - m \rVert^2$, where k is the number of cluster center points, n_i is the number of data points in the ith cluster, m_i is the center point of the ith cluster, and m is the mean value of the input data.
SS_W	The variance within clusters: $SS_W = \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - m_i \rVert^2$, where k is the number of cluster center points, x is a data point, c_i is the ith cluster, and m_i is the center point of the ith cluster.
N	The total number of records.
k	The number of cluster center points.
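
The calculation can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard definition of the Calinski-Harabasz index, in which each cluster's contribution to SS_B is weighted by the number of points in that cluster. The function name calinski_harabasz and the sample points are hypothetical and are not part of the component.

```python
def calinski_harabasz(points, labels):
    """Return the VRC for points (sequences of numbers) grouped by labels."""
    n = len(points)
    dim = len(points[0])

    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the VRC is defined only for 2 <= k < N")

    def mean(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    m = mean(points)  # mean of the input data
    ss_b = 0.0        # variance between clusters
    ss_w = 0.0        # variance within clusters
    for rows in clusters.values():
        m_i = mean(rows)  # center point of this cluster
        ss_b += len(rows) * squared_distance(m_i, m)
        ss_w += sum(squared_distance(x, m_i) for x in rows)

    if ss_w == 0.0:
        return float("inf")  # every point coincides with its cluster center
    return (ss_b / (k - 1)) / (ss_w / (n - k))


# Hypothetical data: two well-separated clusters of two points each.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = [0, 0, 1, 1]
print(calinski_harabasz(points, labels))  # 100.0
```

In this sample, SS_B is 50 and SS_W is 1, so the VRC is (50 / 1) / (1 / 2) = 100.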

Configure the component

You can use the following methods to configure the component parameters:

Method 1: Use Machine Learning Designer

Configure the component parameters on the pipeline configuration tab of Machine Learning Designer.

Tab	Parameter	Description
Fields Setting	Evaluation Columns	The columns that are selected from the input table for evaluation. The value of this parameter must be consistent with the feature columns in the model.
	Input Sparse Format	Specifies whether the input data is in sparse format. Sparse data is represented as key-value pairs.
	KV Pair Delimiter	The delimiter that is used to separate key-value pairs. By default, commas (,) are used.
	KV Delimiter	The delimiter that is used to separate keys and values. By default, colons (:) are used.
Tuning	Cores	The number of cores. This parameter must be used together with the Memory Size per Core parameter. The value of this parameter must be a positive integer.
Tuning	Memory Size per Core	The memory size of each core. This parameter must be used together with the Cores parameter. Unit: MB.
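
The sparse key-value format described in the preceding table can be illustrated with a short Python sketch. It assumes that a sparse row is a single string such as 0:1.0,3:4.5, in which each key identifies a feature. The function name parse_sparse_row and the sample row are hypothetical, and the component's own parsing may differ.

```python
def parse_sparse_row(text, item_delimiter=",", kv_delimiter=":"):
    """Split a sparse row into a dict that maps each key to a float value."""
    features = {}
    for item in text.split(item_delimiter):
        item = item.strip()
        if not item:
            continue  # ignore empty items, for example after a trailing delimiter
        key, separator, value = item.partition(kv_delimiter)
        if not separator:
            raise ValueError("missing key-value delimiter in %r" % item)
        features[key.strip()] = float(value)
    return features


# Default delimiters: commas between pairs, colons between keys and values.
print(parse_sparse_row("0:1.0,3:4.5"))

# Custom delimiters (hypothetical): semicolons between pairs, equal signs inside pairs.
print(parse_sparse_row("0=1.0;3=4.5", item_delimiter=";", kv_delimiter="="))
```

Both calls produce the same mapping of keys 0 and 3 to the values 1.0 and 4.5.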

Method 2: Use PAI commands

Configure the parameters of this component by using a Machine Learning Platform for AI (PAI) command. You can use the SQL Script component to call PAI commands. For more information, see SQL Script. The following sample command is followed by a table that describes its parameters.

PAI -name cluster_evaluation
    -project algo_public
    -DinputTableName=pai_cluster_evaluation_test_input
    -DselectedColNames=f0,f3
    -DmodelName=pai_kmeans_test_model
    -DoutputTableName=pai_ft_cluster_evaluation_out;

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	N/A
selectedColNames	No	The names of the columns that are selected from the input table for evaluation. Separate multiple columns with commas (,). The value of this parameter must be consistent with the feature columns in the model.	All columns
inputTablePartitions	No	The partitions that are selected from the input table for evaluation. Specify this parameter in one of the following formats: partition_name=value for a single partition, or name1=value1/name2=value2 for multi-level partitions. If you specify multiple partitions, separate them with commas (,).	Full table
enableSparse	No	Specifies whether the input data is sparse. Valid values: true and false.	false
itemDelimiter	No	The delimiter that is used to separate sparse key-value pairs.	,
kvDelimiter	No	The delimiter that is used to separate sparse keys and values.	:
modelName	Yes	The name of the input clustering model.	N/A
outputTableName	Yes	The name of the output table.	N/A
lifecycle	No	The lifecycle of the output table.	N/A
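
As a convenience, the command text can be assembled from a dictionary of parameters before it is pasted into an SQL Script component. The following Python sketch only formats a string and checks the parameters that the preceding table marks as required. The function name build_cluster_evaluation_command is hypothetical, and the sketch does not submit anything to PAI.

```python
REQUIRED_PARAMETERS = ("inputTableName", "modelName", "outputTableName")


def build_cluster_evaluation_command(params, project="algo_public"):
    """Format a cluster_evaluation PAI command from a dict of -D parameters."""
    missing = [name for name in REQUIRED_PARAMETERS if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    lines = ["PAI -name cluster_evaluation", "    -project " + project]
    for name, value in params.items():
        lines.append("    -D{}={}".format(name, value))
    # Terminate the statement with a semicolon, as in the sample command.
    return "\n".join(lines) + ";"


print(build_cluster_evaluation_command({
    "inputTableName": "pai_cluster_evaluation_test_input",
    "selectedColNames": "f0,f3",
    "modelName": "pai_kmeans_test_model",
    "outputTableName": "pai_ft_cluster_evaluation_out",
}))
```

With the values shown, the printed text matches the sample command above.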

Example

Execute the following SQL statements to generate test data:

create table if not exists pai_cluster_evaluation_test_input as
select * from
(
  select 1 as id, 1 as f0,2 as f3
  union all
  select 2 as id, 1 as f0,3 as f3
  union all
  select 3 as id, 1 as f0,4 as f3
  union all
  select 4 as id, 0 as f0,3 as f3
  union all
  select 5 as id, 0 as f0,4 as f3
)tmp;

Run the following PAI command to build a clustering model. A k-means clustering model is built in this example.

PAI -name kmeans
    -project algo_public
    -DinputTableName=pai_cluster_evaluation_test_input
    -DselectedColNames=f0,f3
    -DcenterCount=3
    -Dloop=10
    -Daccuracy=0.00001
    -DdistanceType=euclidean
    -DinitCenterMethod=random
    -Dseed=1
    -DmodelName=pai_kmeans_test_model
    -DidxTableName=pai_kmeans_test_idx;

Run the following PAI command to submit the parameters configured for the Clustering Model Evaluation component:

PAI -name cluster_evaluation
    -project algo_public
    -DinputTableName=pai_cluster_evaluation_test_input
    -DselectedColNames=f0,f3
    -DmodelName=pai_kmeans_test_model
    -DoutputTableName=pai_ft_cluster_evaluation_out;

View the output evaluation table pai_ft_cluster_evaluation_out and the visualized report. The following table describes the fields that are displayed.

Field	Description
count	The total number of returned entries.
centerCount	The number of cluster centers.
calinhara	The VRC.