The Whole Table Statistics component is used to collect statistics of data in a table or only selected columns.
Configure the component
You can use one of the following methods to configure the Whole Table Statistics component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Whole Table Statistics component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer is formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description |
Fields Setting | Input Columns (All Columns by Default) | Click Select Column to select input columns. By default, all columns are selected. |
Tuning | Cores | The number of cores. |
Memory Size | The memory size of each core. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name stat_summary
-project algo_public
-DinputTableName=test_data
-DoutputTableName=test_summary_out
-DinputTablePartitions="ds='20160101'"
-DselectColNames=col0,col1,col2
-Dlifecycle=1
Parameter | Description | Required |
inputTableName | The name of the input table. | Yes |
outputTableName | The name of the output table. | Yes |
inputTablePartitions | The partitions in the input table. By default, all partitions are selected. | No |
selectColNames | The names of the columns that are selected for statistics. | No |
lifecycle | The lifecycle of the output table. By default, the output table has no lifecycle. Note The value must be a positive integer. | No |
coreNum | The number of cores. Note The value must be a positive integer. | No |
memSizePerCore | The memory size of each core. Unit: MB. Valid values: [1024,64 x 1024]. Note The value must be a positive integer. | No |
Output format
The following table describes all the columns in the output table.
Column | Description |
colname | The name of the column. |
datatype | The data type. |
totalcount | The total number. |
count | The number of non-NULL values. |
missingcount | The number of NULL values. |
nancount | The number of NaN values. |
positiveinfinitycount | The number of positive infinity values. |
negativeinfinitycount | The number of negative infinity values. |
min | The minimum value. |
max | The maximum value. |
mean | The average value. |
variance | The variance. |
standarddeviation | The standard deviation. |
standarderror | The standard error. |
skewness | The skewness. |
kurtosis | The kurtosis. |
moment2 | The second moment. |
moment3 | The third moment. |
moment4 | The fourth moment. |
centralmoment2 | The second central moment. |
centralmoment3 | The third central moment. |
centralmoment4 | The fourth central moment. |
sum | The sum. |
sum2 | The sum of squares. |
sum3 | The sum of cubes. |
sum4 | The sum of the fourth powers. |
Example
Input data
drop table if exists summary_test_input; create table summary_test_input as select * from ( select 'a' as col1, 1 as col2, 0.001 as col3 union all select 'b' as col1, 2 as col2, 100.01 as col3 ) tmp;
PAI command
PAI -name stat_summary -project algo_public -DinputTableName=summary_test_input -DoutputTableName=summary_test_input_out -DselectColNames=col1,col2,col3 -Dlifecycle=1;
Output result
| colname | datatype | totalcount | count | missingcount | nancount | positiveinfinitycount | negativeinfinitycount | min | max | mean | variance | standarddeviation | standarderror | skewness | kurtosis | moment2 | moment3 | moment4 | centralmoment2 | centralmoment3 | centralmoment4 | sum | sum2 | sum3 | sum4 | | col1 | string | 2 | 2 | 0 | 0 | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | | col2 | bigint | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 2 | 1.5 | 0.5 | 0.7071067811865476 | 0.5 | 0 | -2 | 2.5 | 4.5 | 8.5 | 0.25 | 0 | 0.0625 | 3 | 5 | 9 | 17 | | col3 | double | 2 | 2 | 0 | 0 | 0 | 0 | 0.001 | 100.01 | 50.0055 | 5000.900040500001 | 70.71704207968544 | 50.00450000000001 | 2.327677906939552e-16 | -1.999999999999999 | 5001.000050500001 | 500150.0150005006 | 50020003.00020002 | 2500.45002025 | 2.91038304567337e-11 | 6252250.303768232 | 100.011 | 10002.000101 | 1000300.030001001 | 100040006.0004 |