All Products
Search
Document Center

Platform For AI:Chi-square Goodness of Fit Test

Last Updated:May 17, 2024

This topic describes the Chi-square Goodness of Fit Test component provided by Machine Learning Designer. The Chi-square Goodness of Fit Test component is used in scenarios in which categorical variables are used. This component is used to determine the difference between the observed frequency and expected frequency for each classification of a single multiclass categorical variable. The null hypothesis assumes that the observed frequency and expected frequency are the same.

Configure the component

You can configure the Chi-square Goodness of Fit Test component by using one of the following methods:

Method 1: Configure the component in Machine Learning Designer

Configure the component on the pipeline configuration tab of Machine Learning Designer in the Machine Learning Platform for AI console.

Parameter

Description

Input Column

The column on which you want to perform a chi-square test.

Class Probability

The class probability configuration. Specify this parameter in the Class:Probability format. The sum of all probabilities is 1.

Method 2: Run Machine Learning Platform for AI commands

Configure the component parameters by using a Machine Learning Platform for AI command. You can use the SQL Script component to run Machine Learning Platform for AI commands. For more information, see SQL Script. The following table describes the parameters of the command that is used to configure this component.

PAI -name chisq_test
    -project algo_public
    -DinputTableName=pai_chisq_test_input
    -DcolName=f0
    -DprobConfig=0:0.3,1:0.7
    -DoutputTableName=pai_chisq_test_output0
    -DoutputDetailTableName=pai_chisq_test_output0_detail

Parameter

Required

Description

Default value

inputTableName

Yes

The name of the input table.

None.

colName

Yes

The name of the column.

None.

outputTableName

Yes

The name of the output table.

None.

outputDetailTableName

Yes

The name of the output details table.

None.

inputTablePartitions

No

The partition that is selected from the input table for training. The following formats are supported:

  • Partition_name=value

  • name1=value1/name2=value2: multi-level partitions

Note

If you specify multiple partitions, separate them with commas (,).

By default, this parameter is left empty.

probConfig

No

The class probability configuration. Specify this parameter in the Class:Probability format. The sum of all probabilities is 1.

By default, this parameter is not specified, and all the probability values are the same.

Example

  • Test data

    create table pai_chisq_test_input as
    select * from
    (
      select '1' as f0,'2' as f1
      union all
      select '1' as f0,'3' as f1
      union all
      select '1' as f0,'4' as f1
      union all
      select '0' as f0,'3' as f1
      union all
      select '0' as f0,'4' as f1
    )tmp;
  • PAI command

    PAI -name chisq_test
        -project algo_public
        -DinputTableName=pai_chisq_test_input
        -DcolName=f0
        -DprobConfig=0:0.3,1:0.7
        -DoutputTableName=pai_chisq_test_output0
        -DoutputDetailTableName=pai_chisq_test_output0_detail
  • Output description

    • The output table that is specified by the outputTableName parameter is in the JSON format. The table contains only one row and one column.

      {
          "Chi-Square": {
              "comment": "Pearson's chi-square test",
              "df": 1,
              "p-value": 0.75,
              "value": 0.2380952380952381
          }
      }
    • The following table describes the columns in the output detail table that is specified by the outputDetailTableName parameter.

      column name

      comment

      colName

      The data source class.

      observed

      The observed frequency.

      expected

      The expected frequency.

      residuals

      The standard residuals, which are calculated by using the following expression: (Standard residuals = (Observed frequency - Expected frequency)/sqrt(Expected frequency).

    • Generated dataChi-square test