Normality Test

A normality test is a goodness-of-fit hypothesis test that uses observed samples to determine whether a population follows a normal distribution. This topic describes the Normality Test component provided by Machine Learning Studio.

The Normality Test component supports three methods: the Anderson-Darling test, the Kolmogorov-Smirnov test, and the Q-Q plot. You can select one or more methods based on your business requirements.

An Anderson-Darling test compares the empirical distribution function of the sample data with the expected normal distribution. If the difference is large, the test rejects the hypothesis that the population is normally distributed.
A Kolmogorov-Smirnov test compares the empirical distribution function of the sample data with a reference distribution, which is the normal distribution for this component. The test statistic is the largest difference between the two cumulative distribution functions.
A Q-Q plot checks the distribution of the data by comparing the quantiles of the sample data with the quantiles of a known distribution. If the input contains more than 1,000 samples, the system samples the data for calculation and plotting, so the data points in the plot do not necessarily cover all the samples.
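The two test statistics described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the component. It assumes that the normal distribution is fitted with the sample mean and the sample (n - 1) standard deviation, and it computes only the statistics, not the p values. The function names are hypothetical. Under these assumptions, the statistics for the values 1 through 10 come out close to the test values listed in the sample output of this topic, which suggests, but does not confirm, that the component works in a similar way.

```python
import math
from statistics import NormalDist, mean, stdev


def fitted_normal_cdf(values):
    """Sort the sample and evaluate a fitted normal CDF at each point."""
    xs = sorted(values)
    # Assumption: fit the normal distribution with the sample mean and
    # the sample (n - 1) standard deviation.
    dist = NormalDist(mean(xs), stdev(xs))
    return [dist.cdf(x) for x in xs]


def anderson_darling_statistic(values):
    """A^2 = -n - (1/n) * sum((2i - 1) * (ln F(x_i) + ln(1 - F(x_{n+1-i}))))."""
    cdf = fitted_normal_cdf(values)
    n = len(cdf)
    total = sum(
        (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


def kolmogorov_smirnov_statistic(values):
    """D = largest gap between the empirical CDF and the fitted normal CDF."""
    cdf = fitted_normal_cdf(values)
    n = len(cdf)
    # The empirical CDF jumps at each point, so check both sides of the step.
    d_plus = max((i + 1) / n - f for i, f in enumerate(cdf))
    d_minus = max(f - i / n for i, f in enumerate(cdf))
    return max(d_plus, d_minus)


sample = list(range(1, 11))  # the values 1..10 used in the example of this topic
ad_stat = anderson_darling_statistic(sample)
ks_stat = kolmogorov_smirnov_statistic(sample)
print(f"Anderson-Darling A^2 = {ad_stat:.6f}")
print(f"Kolmogorov-Smirnov D = {ks_stat:.6f}")
```

A small statistic means that the empirical distribution stays close to the fitted normal distribution. Converting a statistic to a p value requires the distribution of that statistic, which this sketch leaves out.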

Configure the component

You can configure the component by using one of the following methods:

Machine Learning Platform for AI (PAI) console

Tab	Parameter	Description
Fields Setting	Columns	N/A
Parameters Setting	Anderson-Darling Test	Valid values: Yes and No. Default value: Yes.
	Kolmogorov-Smirnov Test	Valid values: Yes and No. Default value: Yes.
	Use Q-Q Plot	Valid values: Yes and No. Default value: Yes.
Tuning	Computing Cores	The number of cores used in computing. The value must be a positive integer.
Tuning	Memory Size per Core (Unit: MB)	The memory size of each core.

PAI command

PAI -name normality_test
    -project algo_public
    -DinputTableName=test
    -DoutputTableName=test_out
    -DselectedColNames=col1,col2
    -Dlifecycle=1;

Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	No default value
outputTableName	Yes	The name of the output table.	No default value
selectedColNames	No	The columns selected from the input table. You can select multiple columns of the DOUBLE or BIGINT type.	No default value
inputTablePartitions	No	The name of the partition of the input table.	""
enableQQplot	No	Specifies whether to use a Q-Q plot. Valid values: true and false.	true
enableADtest	No	Specifies whether to perform an Anderson-Darling test. Valid values: true and false.	true
enableKStest	No	Specifies whether to perform a Kolmogorov-Smirnov test. Valid values: true and false.	true
lifecycle	No	The lifecycle of the output table. The value is an integer that is greater than or equal to -1.	-1
coreNum	No	The number of cores used in computing. This parameter is used with memSizePerCore. The value must be a positive integer. By default, the system calculates the number of instances based on the amount of input data.	-1
memSizePerCore	No	The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: (100, 64 × 1024). By default, the system calculates the memory size based on the amount of input data.	-1
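When the command is generated from a script, the constraints in the preceding table can be checked before submission. The following Python sketch is only an illustration: build_normality_test_command is a hypothetical helper, not part of any PAI SDK, and it reads the range (100, 64 × 1024) as an open interval, which is an assumption. It only assembles the command text and does not submit it.

```python
def build_normality_test_command(input_table, output_table, **options):
    """Assemble a PAI normality_test command string (hypothetical helper)."""
    # Validate only the options that the caller sets explicitly.
    lifecycle = options.get("lifecycle")
    if lifecycle is not None and lifecycle < -1:
        raise ValueError("lifecycle must be an integer >= -1")
    core_num = options.get("coreNum")
    if core_num is not None and core_num < 1:
        raise ValueError("coreNum must be a positive integer")
    mem = options.get("memSizePerCore")
    # Assumption: (100, 64 x 1024) is an open interval, in MB.
    if mem is not None and not 100 < mem < 64 * 1024:
        raise ValueError("memSizePerCore must be in (100, 64 * 1024)")

    parts = [
        "PAI -name normality_test",
        "-project algo_public",
        f"-DinputTableName={input_table}",
        f"-DoutputTableName={output_table}",
    ]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # switches use true/false
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)  # column lists are comma-separated
        parts.append(f"-D{name}={value}")
    return "\n    ".join(parts) + ";"


command = build_normality_test_command(
    "normality_test_input",
    "normality_test_output",
    selectedColNames=["x"],
    enableQQplot=False,
    lifecycle=1,
)
print(command)
```

Options that are not passed are omitted from the command, so the defaults listed in the table apply.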

Example

Input data

    drop table if exists normality_test_input;
    create table normality_test_input as
    select
      *
    from
    (
      select 1 as x
        union all
      select 2 as x
        union all
      select 3 as x
        union all
      select 4 as x
        union all
      select 5 as x
        union all
      select 6 as x
        union all
      select 7 as x
        union all
      select 8 as x
        union all
      select 9 as x
        union all
      select 10 as x
    ) tmp;

PAI command

PAI -name normality_test
    -project algo_public
    -DinputTableName=normality_test_input
    -DoutputTableName=normality_test_output
    -DselectedColNames=x
    -Dlifecycle=1;

Input description

Input format: Select the columns required for calculation. You can select multiple columns. The data type must be DOUBLE or BIGINT.

Output description

Output format: The component generates a diagram and a result table. The result table has two partitions:

The partition p=test stores the results of the Anderson-Darling and Kolmogorov-Smirnov tests. It contains data if the enableADtest or enableKStest parameter is set to true.
The partition p=plot stores the data points of the Q-Q plot. It contains data if the enableQQplot parameter is set to true. This partition reuses the columns of the p=test partition: the testvalue column records the original observation (x-axis of the Q-Q plot), and the pvalue column records the expected value under a normal distribution (y-axis of the Q-Q plot).

The following table describes the fields in the result table.

Column	Data type	Description
colName	STRING	The column name.
testname	STRING	The test name.
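testvalue	DOUBLE	The test statistic in the p=test partition, or the x-axis value of the Q-Q plot in the p=plot partition.
pvalue	DOUBLE	The p value in the p=test partition, or the y-axis value of the Q-Q plot in the p=plot partition.
p	STRING	The partition name. Valid values: test and plot.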
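Because the testvalue and pvalue columns have a different meaning in each partition, code that reads the result table needs to branch on the p column. The following Python sketch shows one way to do this. The ResultRow type and the split_result function are hypothetical, and how the rows are fetched from the output table is out of scope. The rows are copied from the sample output in this topic.

```python
from typing import NamedTuple, Optional


class ResultRow(NamedTuple):
    colname: str
    testname: Optional[str]
    testvalue: float
    pvalue: float
    p: str


def split_result(rows):
    """Separate test results from Q-Q plot points based on the partition."""
    tests = {}      # (column, test name) -> (test statistic, p value)
    qq_points = {}  # column -> list of (observation, expected normal value)
    for row in rows:
        if row.p == "test":
            tests[(row.colname, row.testname)] = (row.testvalue, row.pvalue)
        elif row.p == "plot":
            # In this partition the two columns hold plot coordinates.
            qq_points.setdefault(row.colname, []).append(
                (row.testvalue, row.pvalue)
            )
    return tests, qq_points


rows = [
    ResultRow("x", None, 1.0, 0.8173291742279805, "plot"),
    ResultRow("x", None, 2.0, 2.470864450785345, "plot"),
    ResultRow("x", "Anderson_Darling_Test",
              0.1411092332197832, 0.9566579606430077, "test"),
    ResultRow("x", "Kolmogorov_Smirnov_Test",
              0.09551932503797644, 0.9999888659426232, "test"),
]
tests, qq_points = split_result(rows)
print(tests[("x", "Anderson_Darling_Test")])
print(qq_points["x"])
```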

Output table

+---------+-------------------------+---------------------+--------------------+------+
| colname | testname                | testvalue           | pvalue             | p    |
+---------+-------------------------+---------------------+--------------------+------+
| x       | NULL                    | 1.0                 | 0.8173291742279805 | plot |
| x       | NULL                    | 2.0                 | 2.470864450785345  | plot |
| x       | NULL                    | 3.0                 | 3.5156067948020056 | plot |
| x       | NULL                    | 4.0                 | 4.3632330349313095 | plot |
| x       | NULL                    | 5.0                 | 5.128868067945126  | plot |
| x       | NULL                    | 6.0                 | 5.871131932054874  | plot |
| x       | NULL                    | 7.0                 | 6.6367669650686905 | plot |
| x       | NULL                    | 8.0                 | 7.4843932051979944 | plot |
| x       | NULL                    | 9.0                 | 8.529135549214654  | plot |
| x       | NULL                    | 10.0                | 10.182670825772018 | plot |
| x       | Anderson_Darling_Test   | 0.1411092332197832  | 0.9566579606430077 | test |
| x       | Kolmogorov_Smirnov_Test | 0.09551932503797644 | 0.9999888659426232 | test |
+---------+-------------------------+---------------------+--------------------+------+
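
The plot rows in the output table pair each observation with the value expected at the same rank under a fitted normal distribution. The following Python sketch shows one way to compute such pairs. It is an illustration under assumptions, not the documented algorithm of the component: it fits the normal distribution with the sample mean and the sample (n - 1) standard deviation, and it uses the plotting positions (i - 0.375) / (n + 0.25), known as Blom positions. The function name is hypothetical. For the values 1 through 10, these choices give expected values close to the pvalue column of the plot rows above.

```python
from statistics import NormalDist, mean, stdev


def qq_expected_values(values):
    """Pair each sorted observation with its expected normal quantile."""
    xs = sorted(values)
    n = len(xs)
    # Assumption: fit with the sample mean and sample (n - 1) standard deviation.
    dist = NormalDist(mean(xs), stdev(xs))
    # Assumption: Blom plotting positions (i - 0.375) / (n + 0.25).
    return [
        (x, dist.inv_cdf((i - 0.375) / (n + 0.25)))
        for i, x in enumerate(xs, start=1)
    ]


points = qq_expected_values(range(1, 11))
for observed, expected in points:
    print(f"{observed:>4} {expected:.6f}")
```

If the points lie close to the line y = x, the sample quantiles agree with the fitted normal distribution.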