The Whole Table Statistics component is a method of basic statistical analysis on an entire dataset. It calculates and summarizes statistical metrics such as the mean, median, standard deviation, minimum, and maximum, and the number of missing values. The component helps quickly assess the overall characteristics and quality of the data. This provides basic information for further data cleansing, feature engineering, and model development.
Configure the component
You can use one of the following methods to configure the Whole Table Statistics component.
Method 1: Configure the component on the pipeline page
On the pipeline details page in Machine Learning Designer, add the Whole Table Statistics component to the pipeline and configure the parameters described in the following table.
Tab | Parameter | Description |
Fields Setting | Input Columns (All Columns by Default) | Click Select Fields to select input columns for statistical analysis. By default, all columns are selected. |
Tuning | Cores | The number of cores. |
Memory Size | The memory size of each core. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.
PAI -name stat_summary
-project algo_public
-DinputTableName=test_data
-DoutputTableName=test_summary_out
-DinputTablePartitions="ds='20160101'"
-DselectColNames=col0,col1,col2
-Dlifecycle=1
Parameter | Required | Description |
inputTableName | Yes | The name of the input table. |
outputTableName | Yes | The name of the output table. |
inputTablePartitions | No | The partitions in the input table. By default, all partitions are selected. |
selectColNames | No | The names of the columns that are selected for statistics. |
lifecycle | No | The lifecycle of the output table. By default, the output table has no lifecycle. Note The value must be a positive integer. |
coreNum | No | The number of cores. Note The value must be a positive integer. |
memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: [1024,64 x 1024]. Note The value must be a positive integer. |
Example
On the pipeline details page, add the SQL Script component to the pipeline and click the component. On the Parameters Setting tab, clear Use Script Mode and Whether the system adds a create table statement, and enter the following SQL statements in the SQL Script text editor:
drop table if exists summary_test_input; create table summary_test_input as select * from ( select 'a' as col1, 1 as col2, 0.001 as col3 union all select 'b' as col1, 2 as col2, 100.01 as col3 ) tmp;
Add the SQL Script component to the pipeline and click the component. On the Parameters Setting tab, clear Use Script Mode and Whether the system adds a create table statement, and enter the following SQL statements in the SQL Script text editor. Connect this component with the component added in Step 1.
drop table if exists ${o1}; PAI -name stat_summary -project algo_public -DinputTableName=summary_test_input -DoutputTableName=${o1} -DselectColNames=col1,col2,col3 -Dlifecycle=1;
In the upper-left corner of the pipeline details page, Click the icon.
After the pipeline is run, right-click the component added in Step 2, click View data, and then select SQL Script Output.
| colname | datatype | totalcount | count | missingcount | nancount | positiveinfinitycount | negativeinfinitycount | min | max | mean | variance | standarddeviation | standarderror | skewness | kurtosis | moment2 | moment3 | moment4 | centralmoment2 | centralmoment3 | centralmoment4 | sum | sum2 | sum3 | sum4 | | ------- | -------- | ---------- | ----- | ------------ | -------- | --------------------- | --------------------- | ----- | ------ | ------- | ----------------- | ------------------ | ----------------- | --------------------- | ------------------ | ----------------- | ----------------- | ----------------- | -------------- | -------------------- | ----------------- | ------- | ------------ | ----------------- | -------------- | | col1 | string | 2 | 2 | 0 | 0 | 0 | 0 | | | | | | | | | | | | | | | | | | | | col2 | bigint | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 2 | 1.5 | 0.5 | 0.7071067811865476 | 0.5 | 0 | -2 | 2.5 | 4.5 | 8.5 | 0.25 | 0 | 0.0625 | 3 | 5 | 9 | 17 | | col3 | double | 2 | 2 | 0 | 0 | 0 | 0 | 0.001 | 100.01 | 50.0055 | 5000.900040500001 | 70.71704207968544 | 50.00450000000001 | 2.327677906939552e-16 | -1.999999999999999 | 5001.000050500001 | 500150.0150005006 | 50020003.00020002 | 2500.45002025 | 2.91038304567337e-11 | 6252250.303768232 | 100.011 | 10002.000101 | 1000300.030001001 | 100040006.0004 |
Column
Description
colname
The name of the column.
datatype
The data type.
totalcount
The total number.
count
The number of non-NULL values.
missingcount
The number of NULL values.
nancount
The number of NaN values.
positiveinfinitycount
The number of positive infinity values.
negativeinfinitycount
The number of negative infinity values.
min
The minimum value.
max
The maximum value.
mean
The average value.
variance
The variance.
standarddeviation
The standard deviation.
standarderror
The standard error.
skewness
The skewness.
kurtosis
The kurtosis.
moment2
The second moment.
moment3
The third moment.
moment4
The fourth moment.
centralmoment2
The second central moment.
centralmoment3
The third central moment.
centralmoment4
The fourth central moment.
sum
The sum.
sum2
The sum of squares.
sum3
The sum of cubes.
sum4
The sum of the fourth powers.