The Two Sample T Test component tests whether the means of two populations differ significantly, based on one sample drawn from each population. This topic describes how to configure the Two Sample T Test component provided by Machine Learning Designer (formerly known as Machine Learning Studio) and provides an example of how to use it.
Configure the component
You can use one of the following methods to configure the Two Sample T Test component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Two Sample T Test component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Sample 1 Column | The column that contains Sample 1. |
 | Sample 2 Column | The column that contains Sample 2. |
Parameters Setting | T Test Type | The type of T test to perform. |
 | Alternative Hypothesis Type | The type of alternative hypothesis. The corresponding PAI command parameter, alternative, accepts two.sided, less, and greater. |
 | Confidence Level | The confidence level of the test result. Valid values: 0.8, 0.9, 0.95, 0.99, 0.995, and 0.999. |
 | Hypothesized Mean | The hypothesized mean. Default value: 0. |
 | Variances of Two Populations Are Equal | Specifies whether the variances of the two populations are equal. Valid values: true and false. |
 | Cores | The number of cores. The value must be a positive integer from 1 to 9999. This parameter must be used with the Memory Size Per Core parameter. |
 | Memory Size Per Core | The memory size of each core. Unit: MB. The value must be a positive integer from 1024 to 65536. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
pai -name t_test
-project algo_public
-DxTableName=pai_t_test_all_type
-DxColName=col1_double
-DxTablePartitions=ds=2010/dt=1
-DyTableName=pai_t_test_all_type
-DyColName=col1_double
-DyTablePartitions=ds=2010/dt=1
-DoutputTableName=pai_t_test_out
-Dalternative=less
-Dmu=47
-DconfidenceLevel=0.95
-Dpaired=false
-DvarEqual=true
Parameter | Required | Description | Default value |
xTableName | Yes | The name of Input Table x. | N/A |
xTablePartitions | No | The partitions in Input Table x that are used in the T test, in a format such as ds=2010/dt=1. If you specify multiple partitions, separate them with commas (,). | All partitions |
xColName | Yes | The column in Input Table x that is used in the T test. The value must be of the DOUBLE or INT type. | N/A |
yTableName | Yes | The name of Input Table y. | N/A |
yTablePartitions | No | The partitions in Input Table y that are used in the T test, in a format such as ds=2010/dt=1. If you specify multiple partitions, separate them with commas (,). | All partitions |
yColName | Yes | The column in Input Table y that is used in the T test. The value must be of the DOUBLE or INT type. | N/A |
paired | No | Specifies whether to perform a paired T test. Valid values: true and false. | false |
alternative | No | The type of alternative hypothesis. Valid values: two.sided, less, and greater. | two.sided |
mu | No | The hypothesized mean. The value must be of the DOUBLE type. | 0 |
varEqual | No | Specifies whether the variances of two populations are equal. Valid values: true and false. | false |
confidenceLevel | No | The confidence level of the test result. Valid values: 0.8, 0.9, 0.95, 0.99, 0.995, and 0.999. | 0.95 |
coreNum | No | The number of cores. The value must be a positive integer. This parameter must be used with the memSizePerCore parameter. Valid values: 1 to 9999. | Determined by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: 1024 to 65536. | Determined by the system |
lifecycle | No | The lifecycle of the output table. | N/A |
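To make the roles of mu, paired, and varEqual concrete, the following minimal sketch computes the textbook t statistic and degrees of freedom with the Python standard library. It is not the component's implementation: the function name two_sample_t is hypothetical, and taking the difference as x minus y is an assumption. The p-value and confidence interval are left out because the standard library has no t distribution.

```python
import math
from statistics import mean, variance


def two_sample_t(x, y, mu=0.0, paired=False, var_equal=False):
    """Return (t, df) for testing mean(x) - mean(y) against mu.

    Hypothetical helper for illustration only.
    """
    if paired:
        if len(x) != len(y):
            raise ValueError("paired samples must have the same length")
        # Paired test: one-sample t test on the per-row differences.
        d = [a - b for a, b in zip(x, y)]
        n = len(d)
        se = math.sqrt(variance(d) / n)
        return (mean(d) - mu) / se, n - 1

    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)
    diff = mean(x) - mean(y)
    if var_equal:
        # Pooled variance (Student's t test).
        df = nx + ny - 2
        pooled = ((nx - 1) * vx + (ny - 1) * vy) / df
        se = math.sqrt(pooled * (1 / nx + 1 / ny))
    else:
        # Welch's t test with Welch-Satterthwaite degrees of freedom.
        a, b = vx / nx, vy / ny
        se = math.sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1))
    return (diff - mu) / se, df


t, df = two_sample_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6], var_equal=True)
print(t, df)
```

The alternative parameter does not change t or df. It only determines which tail of the t distribution is used for the p-value.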
If the input tables are non-partitioned tables, we recommend that you do not set the coreNum and memSizePerCore parameters and instead use the values determined by the system. If computing resources are limited, you can use the following code to estimate the number of cores and the memory size per core:
def CalcCoreNumAndMem(row, centerCount, kOneCoreDataSize=1024):
    """Calculate the number of cores and the memory size of each core.

    Args:
        row: the number of rows in an input table.
        centerCount: the number of columns in an input table. Not used in the calculation below.
        kOneCoreDataSize: the amount of data that can be computed by each core. Unit: MB. The value must be a positive integer. Default value: 1024.
    Returns:
        coreNum, memSizePerCore
    Example:
        coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Number of cores: data size in MB (row * 2 * 8 bytes) divided by the data size per core.
    coreNum = max(1, int(row * 2 * 8 / kMBytes / kOneCoreDataSize))
    # Memory size per core: twice the data size per core, with a minimum of 1024 MB.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
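As a usage sketch, the sizing rule above can be turned into the two command-line flags. The helper names calc_core_num_and_mem and resource_flags are hypothetical, and clamping the results to the documented valid ranges (1 to 9999 cores, 1024 to 65536 MB) is an added assumption rather than part of the original function.

```python
def calc_core_num_and_mem(row, k_one_core_data_size=1024):
    """Mirror the sizing rule above; returns (core_num, mem_size_per_core)."""
    k_mbytes = 1024.0 * 1024.0
    core_num = max(1, int(row * 2 * 8 / k_mbytes / k_one_core_data_size))
    mem_size_per_core = max(1024, int(k_one_core_data_size * 2))
    return core_num, mem_size_per_core


def resource_flags(row, k_one_core_data_size=1024):
    """Build -DcoreNum and -DmemSizePerCore flags, clamped to the valid ranges."""
    core_num, mem = calc_core_num_and_mem(row, k_one_core_data_size)
    core_num = min(core_num, 9999)   # documented upper bound for coreNum
    mem = min(mem, 65536)            # documented upper bound for memSizePerCore
    return [f"-DcoreNum={core_num}", f"-DmemSizePerCore={mem}"]


# One billion rows with the default 1024 MB of data per core.
print(resource_flags(1_000_000_000))
# → ['-DcoreNum=14', '-DmemSizePerCore=2048']
```

For small tables the rule always yields one core and 1024 or 2048 MB, which is why the system defaults are usually sufficient.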
Example
Test data
create table pai_test_input as
select * from (
  select 1 as f0, 2 as f1
  union all select 1 as f0, 3 as f1
  union all select 1 as f0, 4 as f1
  union all select 0 as f0, 3 as f1
  union all select 0 as f0, 4 as f1
) tmp;
PAI command
pai -name t_test -project algo_public -DxTableName=pai_test_input -DxColName=f0 -DyTableName=pai_test_input -DyColName=f1 -DoutputTableName=pai_t_test_out -Dalternative=less -Dmu=47 -DconfidenceLevel=0.95 -Dpaired=false -DvarEqual=true
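If you generate such commands from a script, a small builder can check the documented valid values before the command is submitted. The following is a minimal sketch: the function build_t_test_command and its argument names are hypothetical, and optional parameters such as partitions, coreNum, memSizePerCore, and lifecycle are omitted.

```python
ALTERNATIVES = {"two.sided", "less", "greater"}
CONFIDENCE_LEVELS = {0.8, 0.9, 0.95, 0.99, 0.995, 0.999}


def build_t_test_command(x_table, x_col, y_table, y_col, output_table,
                         alternative="two.sided", mu=0,
                         confidence_level=0.95, paired=False, var_equal=False):
    """Return a t_test PAI command string (hypothetical helper)."""
    if alternative not in ALTERNATIVES:
        raise ValueError(f"alternative must be one of {sorted(ALTERNATIVES)}")
    if confidence_level not in CONFIDENCE_LEVELS:
        raise ValueError(f"confidenceLevel must be one of {sorted(CONFIDENCE_LEVELS)}")
    args = [
        "pai -name t_test",
        "-project algo_public",
        f"-DxTableName={x_table}",
        f"-DxColName={x_col}",
        f"-DyTableName={y_table}",
        f"-DyColName={y_col}",
        f"-DoutputTableName={output_table}",
        f"-Dalternative={alternative}",
        f"-Dmu={mu}",
        f"-DconfidenceLevel={confidence_level}",
        # PAI expects lowercase true/false rather than Python's True/False.
        f"-Dpaired={str(paired).lower()}",
        f"-DvarEqual={str(var_equal).lower()}",
    ]
    return " ".join(args)


print(build_t_test_command("pai_test_input", "f0", "pai_test_input", "f1",
                           "pai_t_test_out", alternative="less", mu=47,
                           var_equal=True))
```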
Output
The output table contains only one row and one column. The value is a JSON string.
{
  "AlternativeHypthesis": "difference in means not equals to 0",
  "ConfidenceInterval": "(-2.5465, -0.4535)",
  "ConfidenceLevel": 0.95,
  "alpha": 0.05000000000000004,
  "df": 19,
  "mean of the differences": -1.5,
  "p": 0.008000000000000007,
  "t": -3
}
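Because the result is a single JSON string, downstream code usually has to parse it before it can act on the test. The following minimal sketch parses the sample output shown above. The function name summarize is hypothetical, and reading the cell value from the output table is outside the scope of the sketch.

```python
import json

# The sample output from this topic, copied verbatim (including the key spelling).
raw = ('{ "AlternativeHypthesis": "difference in means not equals to 0", '
       '"ConfidenceInterval": "(-2.5465, -0.4535)", "ConfidenceLevel": 0.95, '
       '"alpha": 0.05000000000000004, "df": 19, "mean of the differences": -1.5, '
       '"p": 0.008000000000000007, "t": -3 }')


def summarize(raw_json):
    """Extract the decision-relevant fields from the JSON result string."""
    result = json.loads(raw_json)
    # ConfidenceInterval is a string such as "(-2.5465, -0.4535)".
    low, high = (float(v) for v in result["ConfidenceInterval"].strip("()").split(","))
    return {
        # Reject the null hypothesis when the p-value is below alpha.
        "reject_null": result["p"] < result["alpha"],
        "ci": (low, high),
        "t": result["t"],
        "df": result["df"],
    }


summary = summarize(raw)
print(summary["reject_null"], summary["ci"])
# → True (-2.5465, -0.4535)
```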