Platform for AI: Best practice for running the K-means Clustering component

Last Updated: Nov 15, 2024

This topic describes how to run the K-means Clustering and Clustering Model Evaluation components of Platform for AI (PAI) in a hyperparameter tuning experiment that uses MaxCompute resources. The experiment searches for the optimal hyperparameter combination for the K-means Clustering algorithm.

Step 1: Prepare data

You can prepare test data and evaluation data by referring to the examples in the Clustering Model Evaluation topic.

The sample tables used in this example, pai_online_project.pai_kmeans_test_input and pai_online_project.pai_cluster_evaluation_test_input, come from an open source data source and can be used directly.

Step 2: Create an experiment

  1. Go to the Create Experiment page. For more information, see Create an experiment.

  2. On the Create Experiment page, configure the parameters. The following tables describe the key parameters. For information about other parameters, see Create an experiment.

    • Execution Configurations

      Parameter

      Description

      Metric Type

      Select MaxCompute.

      Command

      Configure the following commands and run the commands in sequence:

      • Command 1: Run the K-means Clustering component to build a clustering model by using the prepared test data. For information about how to configure the parameters, see the "Method 2: Run PAI commands" section in the K-means Clustering topic.

        pai -name kmeans
            -project algo_public
            -DinputTableName=pai_online_project.pai_kmeans_test_input
            -DselectedColNames=f0,f1
            -DappendColNames=f0,f1
            -DcenterCount=${centerCount}
            -Dloop=10
            -Daccuracy=0.01
            -DdistanceType=${distanceType}
            -DinitCenterMethod=random
            -Dseed=1
            -DmodelName=pai_kmeans_test_output_model_${exp_id}_${trial_id}
            -DidxTableName=pai_kmeans_test_output_idx_${exp_id}_${trial_id}
            -DclusterCountTableName=pai_kmeans_test_output_couter_${exp_id}_${trial_id}
            -DcenterTableName=pai_kmeans_test_output_center_${exp_id}_${trial_id};

        In the preceding code, ${centerCount} and ${distanceType} are the hyperparameter variables that you can define.

      • Command 2: Run the Clustering Model Evaluation component based on the clustering result generated by Command 1 to evaluate the performance of the clustering model. For information about how to configure the parameters, see the "Method 2: Use PAI commands" section in the Clustering Model Evaluation topic.

        pai -name cluster_evaluation
            -project algo_public
            -DinputTableName=pai_online_project.pai_cluster_evaluation_test_input
            -DselectedColNames=f0,f1
            -DmodelName=pai_kmeans_test_output_model_${exp_id}_${trial_id}
            -DoutputTableName=pai_ft_cluster_evaluation_out_${exp_id}_${trial_id};

      Hyperparameter

      The following section lists the constraint type and valid values of the hyperparameters:

      • centerCount:

        • Constraint Type: choice.

        • Valid Values: Add the following enumeration values: 2, 3, 4, and 5.

      • distanceType:

        • Constraint Type: choice.

        • Valid Values: Add the following enumeration values: euclidean, cosine, and cityblock.

      The preceding configuration defines 12 candidate hyperparameter combinations (4 centerCount values × 3 distanceType values). The system creates one trial for each combination that the search algorithm selects, up to the Maximum Trials value in Search Configurations. In each trial, the system runs the K-means Clustering component and then the Clustering Model Evaluation component with that combination.
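
The count of 12 follows from the two choice lists. As a rough illustration only, and not a description of how PAI performs the substitution internally, the following Python sketch enumerates the candidate combinations and fills the ${centerCount} and ${distanceType} placeholders of an abbreviated Command 1 by using string.Template. The function names and the exp_id and trial_id values are hypothetical; in a real experiment the system supplies the IDs.

```python
from itertools import product
from string import Template

# Enumeration values configured for the two hyperparameters in this topic.
CENTER_COUNTS = [2, 3, 4, 5]
DISTANCE_TYPES = ["euclidean", "cosine", "cityblock"]

# Abbreviated copy of Command 1; only the templated options are kept.
COMMAND_1 = Template(
    "pai -name kmeans -project algo_public "
    "-DcenterCount=${centerCount} "
    "-DdistanceType=${distanceType} "
    "-DmodelName=pai_kmeans_test_output_model_${exp_id}_${trial_id};"
)


def candidate_combinations():
    """Return every (centerCount, distanceType) pair as a dict."""
    return [
        {"centerCount": k, "distanceType": d}
        for k, d in product(CENTER_COUNTS, DISTANCE_TYPES)
    ]


def render_command(params, exp_id, trial_id):
    """Substitute one hyperparameter combination into the command template."""
    return COMMAND_1.substitute(exp_id=exp_id, trial_id=trial_id, **params)


combos = candidate_combinations()
print(len(combos))
# exp_id and trial_id below are made-up values for illustration.
print(render_command(combos[0], exp_id="exp1", trial_id="t1"))
```

Because every output name in the commands carries the ${exp_id}_${trial_id} suffix, each trial writes to its own model and tables.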

    • Trial Configuration

      Parameter

      Description

      Metric Type

      Select table.

      Method

      Select best.

      Metric Weight

      • Key: vrc

      • Value: 1

      Metric Source

      Set the parameter to the following SQL statement: select GET_JSON_OBJECT(summary, '$.calinhara') as vrc from pai_ft_cluster_evaluation_out_${exp_id}_${trial_id};

      Optimization

      Select Maximize.

      Model Name

      Set the parameter to pai_kmeans_test_output_model_${exp_id}_${trial_id}.
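
The Metric Source query reads the calinhara field from the JSON text in the summary column and returns it under the alias vrc, which matches the Metric Weight key. The following Python sketch mirrors that lookup on a single string so that the mapping is easier to see. The function name and the sample summary value are made up for illustration; the real column may contain other fields.

```python
import json


def extract_vrc(summary):
    """Return the value under the top-level "calinhara" key, or None."""
    try:
        parsed = json.loads(summary)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON.
        return None
    if not isinstance(parsed, dict):
        return None
    return parsed.get("calinhara")


# Hypothetical summary text; both the value and the extra field are invented.
sample_summary = '{"calinhara": 123.5, "other_field": 1}'
print(extract_vrc(sample_summary))
```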

    • Search Configurations

      Parameter

      Description

      Search Algorithm

      Select TPE.

      Maximum Trials

      Set the parameter to 6.

      Maximum Concurrent Trials

      Set the parameter to 3.
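
With these settings, at most 6 of the 12 candidate combinations are tried, and no more than 3 trials run at the same time. The following Python sketch illustrates only those two limits. It uses random sampling without replacement as a stand-in for TPE, which is a different and more informed search strategy, and run_trial is a hypothetical placeholder for a trial that would run Command 1 and Command 2.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product

MAX_TRIALS = 6
MAX_CONCURRENT_TRIALS = 3

# The 12 candidate combinations defined in Execution Configurations.
SEARCH_SPACE = list(product([2, 3, 4, 5], ["euclidean", "cosine", "cityblock"]))


def run_trial(params):
    """Hypothetical stand-in for one trial; it only echoes its parameters."""
    center_count, distance_type = params
    return {"centerCount": center_count, "distanceType": distance_type}


def run_experiment(seed=0):
    rng = random.Random(seed)
    # Random sampling stands in for TPE here; it is not the TPE algorithm.
    chosen = rng.sample(SEARCH_SPACE, MAX_TRIALS)
    # The pool size caps how many trials are in flight at the same time.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRIALS) as pool:
        return list(pool.map(run_trial, chosen))


trials = run_experiment()
print(f"{len(trials)} of {len(SEARCH_SPACE)} combinations were tried")
```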

  3. Click Submit.

    The system starts creating the experiment. You can view the experiment on the AutoML page.

Step 3: View the experiment details and results

  1. On the AutoML page, click the name of the experiment to go to the Experiment Details page.

    On the Experiment Details page, you can view the execution progress and status of each trial.

    In this example, the system creates six trials based on the search algorithm and the maximum number of trials that you specified.

  2. On the Trials tab, you can view the trials that the system generated. You can also view the execution status, final metric, and hyperparameter combination of each trial.

    In this example, the Optimization parameter is set to Maximize, so the optimal hyperparameter combination is the one with the largest Final Metric value, which is 59089. Optimal combination: centerCount: 2, distanceType: cityblock.
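
The vrc key and the calinhara field suggest that the metric is the Calinski-Harabasz index, also called the variance ratio criterion. That reading is an assumption here. Under it, a larger value means clusters that are tighter and farther apart, which is why the metric is maximized. The following pure-Python sketch computes the index with squared Euclidean distance on toy points that are not taken from the sample tables; the exact computation in the PAI component may differ.

```python
def calinski_harabasz(points, labels):
    """Variance ratio criterion for labeled points (squared Euclidean)."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0  # dispersion of cluster centers around the overall mean
    within = 0.0   # dispersion of points around their own cluster center
    for members in clusters.values():
        center = mean(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)
    if within == 0:
        return float("inf")  # every point sits exactly on its cluster center
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two well-separated pairs of points.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = calinski_harabasz(toy_points, [0, 0, 1, 1])  # groups nearby points
bad = calinski_harabasz(toy_points, [0, 1, 0, 1])   # mixes the two pairs
print(good > bad)
```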