All Products
Search
Document Center

Platform For AI:Whole Table Statistics

Last Updated:Dec 30, 2024

The Whole Table Statistics component is a method of basic statistical analysis on an entire dataset. It calculates and summarizes statistical metrics such as the mean, median, standard deviation, minimum, and maximum, and the number of missing values. The component helps quickly assess the overall characteristics and quality of the data. This provides basic information for further data cleansing, feature engineering, and model development.

Configure the component

You can use one of the following methods to configure the Whole Table Statistics component.

Method 1: Configure the component on the pipeline page

On the pipeline details page in Machine Learning Designer, add the Whole Table Statistics component to the pipeline and configure the parameters described in the following table.

Tab

Parameter

Description

Fields Setting

Input Columns (All Columns by Default)

Click Select Fields to select input columns for statistical analysis. By default, all columns are selected.

Tuning

Cores

The number of cores.

Memory Size

The memory size of each core.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.

 PAI -name stat_summary
-project algo_public
-DinputTableName=test_data
-DoutputTableName=test_summary_out
-DinputTablePartitions="ds='20160101'"
-DselectColNames=col0,col1,col2
-Dlifecycle=1

Parameter

Required

Description

inputTableName

Yes

The name of the input table.

outputTableName

Yes

The name of the output table.

inputTablePartitions

No

The partitions in the input table. By default, all partitions are selected.

selectColNames

No

The names of the columns that are selected for statistics.

lifecycle

No

The lifecycle of the output table. By default, the output table has no lifecycle.

Note

The value must be a positive integer.

coreNum

No

The number of cores.

Note

The value must be a positive integer.

memSizePerCore

No

The memory size of each core. Unit: MB. Valid values: [1024,64 x 1024].

Note

The value must be a positive integer.

Example

  1. On the pipeline details page, add the SQL Script component to the pipeline and click the component. On the Parameters Setting tab, clear Use Script Mode and Whether the system adds a create table statement, and enter the following SQL statements in the SQL Script text editor:

    drop table if exists summary_test_input;
    create table summary_test_input as
    select
      *
    from
    (
      select 'a' as col1, 1 as col2, 0.001 as col3
        union all
      select 'b' as col1, 2 as col2, 100.01 as col3
    ) tmp;
  2. Add the SQL Script component to the pipeline and click the component. On the Parameters Setting tab, clear Use Script Mode and Whether the system adds a create table statement, and enter the following SQL statements in the SQL Script text editor. Connect this component with the component added in Step 1.

    drop table if exists ${o1};
    PAI -name stat_summary
        -project algo_public
        -DinputTableName=summary_test_input
        -DoutputTableName=${o1}
        -DselectColNames=col1,col2,col3
        -Dlifecycle=1;
  3. In the upper-left corner of the pipeline details page, Click the image icon.

  4. After the pipeline is run, right-click the component added in Step 2, click View data, and then select SQL Script Output.

    | colname | datatype | totalcount | count | missingcount | nancount | positiveinfinitycount | negativeinfinitycount | min   | max    | mean    | variance          | standarddeviation  | standarderror     | skewness              | kurtosis           | moment2           | moment3           | moment4           | centralmoment2 | centralmoment3       | centralmoment4    | sum     | sum2         | sum3              | sum4           |
    | ------- | -------- | ---------- | ----- | ------------ | -------- | --------------------- | --------------------- | ----- | ------ | ------- | ----------------- | ------------------ | ----------------- | --------------------- | ------------------ | ----------------- | ----------------- | ----------------- | -------------- | -------------------- | ----------------- | ------- | ------------ | ----------------- | -------------- |
    | col1    | string   | 2          | 2     | 0            | 0        | 0                     | 0                     |       |        |         |                   |                    |                   |                       |                    |                   |                   |                   |                |                      |                   |         |              |                   |                |
    | col2    | bigint   | 2          | 2     | 0            | 0        | 0                     | 0                     | 1     | 2      | 1.5     | 0.5               | 0.7071067811865476 | 0.5               | 0                     | -2                 | 2.5               | 4.5               | 8.5               | 0.25           | 0                    | 0.0625            | 3       | 5            | 9                 | 17             |
    | col3    | double   | 2          | 2     | 0            | 0        | 0                     | 0                     | 0.001 | 100.01 | 50.0055 | 5000.900040500001 | 70.71704207968544  | 50.00450000000001 | 2.327677906939552e-16 | -1.999999999999999 | 5001.000050500001 | 500150.0150005006 | 50020003.00020002 | 2500.45002025  | 2.91038304567337e-11 | 6252250.303768232 | 100.011 | 10002.000101 | 1000300.030001001 | 100040006.0004 |

    Column

    Description

    colname

    The name of the column.

    datatype

    The data type.

    totalcount

    The total number.

    count

    The number of non-NULL values.

    missingcount

    The number of NULL values.

    nancount

    The number of NaN values.

    positiveinfinitycount

    The number of positive infinity values.

    negativeinfinitycount

    The number of negative infinity values.

    min

    The minimum value.

    max

    The maximum value.

    mean

    The average value.

    variance

    The variance.

    standarddeviation

    The standard deviation.

    standarderror

    The standard error.

    skewness

    The skewness.

    kurtosis

    The kurtosis.

    moment2

    The second moment.

    moment3

    The third moment.

    moment4

    The fourth moment.

    centralmoment2

    The second central moment.

    centralmoment3

    The third central moment.

    centralmoment4

    The fourth central moment.

    sum

    The sum.

    sum2

    The sum of squares.

    sum3

    The sum of cubes.

    sum4

    The sum of the fourth powers.