Chi-square Goodness of Fit Test - Platform For AI - Alibaba Cloud Documentation Center

This topic describes the Chi-square Goodness of Fit Test component provided by Machine Learning Designer. The Chi-square Goodness of Fit Test component is used in scenarios in which categorical variables are used. This component is used to determine the difference between the observed frequency and expected frequency for each classification of a single multiclass categorical variable. The null hypothesis assumes that the observed frequency and expected frequency are the same.

Configure the component

You can configure the Chi-square Goodness of Fit Test component by using one of the following methods:

Method 1: Configure the component in Machine Learning Designer

Configure the component on the pipeline configuration tab of Machine Learning Designer in the Machine Learning Platform for AI console.

Parameter	Description
Input Column	The column on which you want to perform a chi-square test.
Class Probability	The class probability configuration. Specify this parameter in the `Class:Probability` format. The sum of all probabilities is 1.

Method 2: Run Machine Learning Platform for AI commands

Configure the component parameters by using a Machine Learning Platform for AI command. You can use the SQL Script component to run Machine Learning Platform for AI commands. For more information, see SQL Script. The following table describes the parameters of the command that is used to configure this component.

PAI -name chisq_test
    -project algo_public
    -DinputTableName=pai_chisq_test_input
    -DcolName=f0
    -DprobConfig=0:0.3,1:0.7
    -DoutputTableName=pai_chisq_test_output0
    -DoutputDetailTableName=pai_chisq_test_output0_detail

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	None.
colName	Yes	The name of the column.	None.
outputTableName	Yes	The name of the output table.	None.
outputDetailTableName	Yes	The name of the output details table.	None.
inputTablePartitions	No	The partition that is selected from the input table for training. The following formats are supported: Partition_name=value name1=value1/name2=value2: multi-level partitions Note If you specify multiple partitions, separate them with commas (,).	By default, this parameter is left empty.
probConfig	No	The class probability configuration. Specify this parameter in the `Class:Probability` format. The sum of all probabilities is 1.	By default, this parameter is not specified, and all the probability values are the same.

Example

Test data

create table pai_chisq_test_input as
select * from
(
  select '1' as f0,'2' as f1
  union all
  select '1' as f0,'3' as f1
  union all
  select '1' as f0,'4' as f1
  union all
  select '0' as f0,'3' as f1
  union all
  select '0' as f0,'4' as f1
)tmp;

PAI command

PAI -name chisq_test
    -project algo_public
    -DinputTableName=pai_chisq_test_input
    -DcolName=f0
    -DprobConfig=0:0.3,1:0.7
    -DoutputTableName=pai_chisq_test_output0
    -DoutputDetailTableName=pai_chisq_test_output0_detail

Output description

The output table that is specified by the outputTableName parameter is in the JSON format. The table contains only one row and one column.

{
    "Chi-Square": {
        "comment": "Pearson's chi-square test",
        "df": 1,
        "p-value": 0.75,
        "value": 0.2380952380952381
    }
}

The following table describes the columns in the output detail table that is specified by the outputDetailTableName parameter.

column name	comment
colName	The data source class.
observed	The observed frequency.
expected	The expected frequency.
residuals	The standard residuals, which are calculated by using the following expression: `(Standard residuals = (Observed frequency - Expected frequency)/sqrt(Expected frequency)`.

Generated data