All Products
Search
Document Center

Platform For AI:Whole Table Statistics

Last Updated:May 17, 2024

The Whole Table Statistics component is used to collect statistics of data in a table or only selected columns.

Configure the component

You can use one of the following methods to configure the Whole Table Statistics component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Whole Table Statistics component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer is formerly known as Machine Learning Studio. The following table describes the parameters.

Tab

Parameter

Description

Fields Setting

Input Columns (All Columns by Default)

Click Select Column to select input columns. By default, all columns are selected.

Tuning

Cores

The number of cores.

Memory Size

The memory size of each core.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

 PAI -name stat_summary
-project algo_public
-DinputTableName=test_data
-DoutputTableName=test_summary_out
-DinputTablePartitions="ds='20160101'"
-DselectColNames=col0,col1,col2
-Dlifecycle=1

Parameter

Description

Required

inputTableName

The name of the input table.

Yes

outputTableName

The name of the output table.

Yes

inputTablePartitions

The partitions in the input table. By default, all partitions are selected.

No

selectColNames

The names of the columns that are selected for statistics.

No

lifecycle

The lifecycle of the output table. By default, the output table has no lifecycle.

Note

The value must be a positive integer.

No

coreNum

The number of cores.

Note

The value must be a positive integer.

No

memSizePerCore

The memory size of each core. Unit: MB. Valid values: [1024,64 x 1024].

Note

The value must be a positive integer.

No

Output format

The following table describes all the columns in the output table.

Column

Description

colname

The name of the column.

datatype

The data type.

totalcount

The total number.

count

The number of non-NULL values.

missingcount

The number of NULL values.

nancount

The number of NaN values.

positiveinfinitycount

The number of positive infinity values.

negativeinfinitycount

The number of negative infinity values.

min

The minimum value.

max

The maximum value.

mean

The average value.

variance

The variance.

standarddeviation

The standard deviation.

standarderror

The standard error.

skewness

The skewness.

kurtosis

The kurtosis.

moment2

The second moment.

moment3

The third moment.

moment4

The fourth moment.

centralmoment2

The second central moment.

centralmoment3

The third central moment.

centralmoment4

The fourth central moment.

sum

The sum.

sum2

The sum of squares.

sum3

The sum of cubes.

sum4

The sum of the fourth powers.

Example

  • Input data

    drop table if exists summary_test_input;
    create table summary_test_input as
    select
      *
    from
    (
      select 'a' as col1, 1 as col2, 0.001 as col3
        union all
      select 'b' as col1, 2 as col2, 100.01 as col3
    ) tmp;
  • PAI command

    PAI -name stat_summary
    -project algo_public
    -DinputTableName=summary_test_input
    -DoutputTableName=summary_test_input_out
    -DselectColNames=col1,col2,col3
    -Dlifecycle=1;
  • Output result

        | colname    | datatype   | totalcount | count      | missingcount | nancount   | positiveinfinitycount | negativeinfinitycount | min        | max        | mean       | variance   | standarddeviation | standarderror | skewness   | kurtosis   | moment2    | moment3    | moment4    | centralmoment2 | centralmoment3 | centralmoment4 | sum        | sum2       | sum3       | sum4       |
        | col1       | string     | 2          | 2          | 0            | 0          | 0                     | 0                     | NULL       | NULL       | NULL       | NULL       | NULL              | NULL          | NULL       | NULL       | NULL       | NULL       | NULL       | NULL           | NULL           | NULL           | NULL       | NULL       | NULL       | NULL       |
        | col2       | bigint     | 2          | 2          | 0            | 0          | 0                     | 0                     | 1          | 2          | 1.5        | 0.5        | 0.7071067811865476 | 0.5           | 0          | -2         | 2.5        | 4.5        | 8.5        | 0.25           | 0              | 0.0625         | 3          | 5          | 9          | 17         |
        | col3       | double     | 2          | 2          | 0            | 0          | 0                     | 0                     | 0.001      | 100.01     | 50.0055    | 5000.900040500001 | 70.71704207968544 | 50.00450000000001 | 2.327677906939552e-16 | -1.999999999999999 | 5001.000050500001 | 500150.0150005006 | 50020003.00020002 | 2500.45002025  | 2.91038304567337e-11 | 6252250.303768232 | 100.011    | 10002.000101 | 1000300.030001001 | 100040006.0004 |