A percentile is a statistical measure that indicates the value below which a given percentage of the data falls. The Percentile component calculates the percentiles of the data in the columns of a data table. When a set of data is sorted in ascending order and divided into 100 groups, the nth percentile is the value below which n percent of the data falls.
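To make the definition concrete, the following Python sketch computes percentiles 0 through 100 for a single column by using the nearest-rank method. The choice of method and the function name nearest_rank_percentiles are assumptions made for illustration. This topic does not state which rule the component applies internally.

```python
def nearest_rank_percentiles(values):
    """Return a list of 101 values: the 0th through 100th percentiles.

    Assumption: nearest-rank method. The p-th percentile is the value at
    rank ceil(p * n / 100) in the sorted data, with rank 0 clamped to 1.
    """
    # Ignore missing values (None) before sorting.
    data = sorted(v for v in values if v is not None)
    if not data:
        raise ValueError("column has no non-empty values")

    n = len(data)
    result = []
    for p in range(101):
        # Integer ceiling of p * n / 100, which avoids float rounding.
        rank = max(1, (p * n + 99) // 100)
        result.append(data[rank - 1])
    return result


percentiles = nearest_rank_percentiles(range(1000))
print(percentiles[1], percentiles[50], percentiles[100])
```

For the integers 0 to 999, this sketch returns 9 for the 1st percentile, 499 for the 50th percentile, and 999 for the 100th percentile.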
Background information
- The system can calculate only the percentiles of data of the BIGINT, DOUBLE, or DATETIME type.
- Empty columns are skipped when the percentile is calculated. If all of the columns are empty, an error is returned.
- You can specify multiple columns of data in the colName parameter.
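The rules above can be sketched as a pre-check in Python. The helper select_columns and its arguments are hypothetical and are not part of PAI. The sketch also assumes that a column of an unsupported type causes an error, which this topic does not confirm.

```python
SUPPORTED_TYPES = {"BIGINT", "DOUBLE", "DATETIME"}


def select_columns(schema, data, col_names=None):
    """Return the names of the columns whose percentiles would be calculated.

    schema:    dict that maps a column name to its type name, such as "double"
    data:      dict that maps a column name to a list of values (None = missing)
    col_names: optional list of column names; all columns are used by default
    """
    names = list(col_names) if col_names else list(schema)
    selected = []
    for name in names:
        col_type = schema[name].upper()
        if col_type not in SUPPORTED_TYPES:
            # Assumption: unsupported types are rejected rather than skipped.
            raise TypeError(f"column {name} has unsupported type {col_type}")
        if all(v is None for v in data[name]):
            continue  # an empty column is skipped
        selected.append(name)
    if not selected:
        # All selected columns are empty, so an error is returned.
        raise ValueError("all selected columns are empty")
    return selected


schema = {"col0": "double", "col1": "bigint", "col2": "datetime"}
data = {"col0": [962.0, 218.0], "col1": [None, None], "col2": [0, 2000000]}
print(select_columns(schema, data))
```

In this usage example, col1 contains only missing values, so it is skipped and the remaining two columns are returned.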
Configure the component
You can use one of the following methods to configure the Percentile component.
Method 1: Configure the component on the pipeline page
You can configure the parameters of the Percentile component on the pipeline page of Machine Learning Designer of Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description |
---|---|---|
Parameters Setting | Input Columns | Click Select Column to select input columns. |
Tuning | Number of Cores | The number of cores. |
Tuning | Memory Size per Core | The memory size of each core. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name Percentile
-project algo_public
-DinputTableName=maple_test_percentile_3col_input
-DcolName=col0,col1,col2
-DoutputTableName=maple_test_percentile_3col_output;
Parameter | Description | Required |
---|---|---|
inputTableName | The name of the input table. | Yes |
outputTableName | The name of the output table. | Yes |
colName | The names of the columns whose percentiles you want to calculate. By default, all columns are selected. Note: Separate the names of multiple columns with commas (,). | No |
inputPartitions | The partitions in the input table. By default, all partitions are selected. | No |
predictInputTableName | The name of the prediction table. After you set this parameter, the prediction result can be generated. | No |
predictInputTablePartitions | The partitions in the input prediction table. | No |
predictSelectedColNames | The names of the columns selected from the prediction table. By default, all the columns in the prediction table are selected. The column names must be the same as the column names in a training table. | No |
predictSelectedOriginalColNames | The names of the columns whose data you want to retain. By default, all columns are selected. Separate the names of multiple columns with commas (,). | No |
predictOutputTableName | The name of the output prediction table. This parameter is used with the predictInputTableName parameter. | No |
lifecycle | The lifecycle of the output table. By default, the output table has no lifecycle. Note: The value must be a positive integer. | No |
coreNum | The number of cores. Valid values: [1,9999]. This parameter is used with the memSizePerCore parameter. Note: The value must be a positive integer. | No |
memSizePerCore | The memory size of each core. Unit: MB. Valid values: [1024,64 × 1024]. Note: The value must be a positive integer. | No |
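As a sketch of how these parameters fit together, the following Python helper assembles a PAI command string and checks the documented value ranges before the command is submitted. The function build_percentile_command is hypothetical and only formats text. It does not call PAI, and it covers only a subset of the parameters in the preceding table.

```python
def build_percentile_command(input_table, output_table, col_names=None,
                             lifecycle=None, core_num=None,
                             mem_size_per_core=None):
    """Format a PAI Percentile command from the given parameter values."""
    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
    }
    if col_names:
        # Multiple column names are separated with commas.
        params["colName"] = ",".join(col_names)
    if lifecycle is not None:
        if lifecycle < 1:
            raise ValueError("lifecycle must be a positive integer")
        params["lifecycle"] = lifecycle
    if core_num is not None:
        if not 1 <= core_num <= 9999:
            raise ValueError("coreNum must be in the range [1,9999]")
        params["coreNum"] = core_num
    if mem_size_per_core is not None:
        if not 1024 <= mem_size_per_core <= 64 * 1024:
            raise ValueError("memSizePerCore must be in the range [1024,65536]")
        params["memSizePerCore"] = mem_size_per_core

    args = " ".join(f"-D{key}={value}" for key, value in params.items())
    return f"PAI -name Percentile -project algo_public {args};"


print(build_percentile_command(
    "maple_test_percentile_3col_input",
    "maple_test_percentile_3col_output",
    col_names=["col0", "col1", "col2"],
))
```

The resulting string can be pasted into the SQL Script component. Validating the ranges locally is a design choice of this sketch so that invalid values are caught before a job is submitted.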
Example
- Input table
col0:double (1000 rows) | col1:bigint (100 rows) | col2:datetime (300 rows) |
---|---|---|
962 | 88 | Tue Oct 15 00:26:40 CST 1974 |
218 | 99 | Thu Jan 04 20:53:20 CST 1973 |
565 | 44 | Sat Mar 09 02:40:00 CST 1974 |
314 | 68 | Mon Aug 11 22:40:00 CST 1975 |
583 | 13 | Sat Aug 23 12:26:40 CST 1975 |
615 | 87 | Tue May 25 14:13:20 CST 1971 |
70 | 53 | Fri Mar 23 09:20:00 CST 1979 |
929 | 63 | Mon Jul 03 16:26:40 CST 1972 |
249 | 48 | Thu Mar 15 07:33:20 CST 1973 |
428 | 62 | Wed Mar 17 03:33:20 CST 1971 |
119 | 1 | Thu Jun 26 15:33:20 CST 1975 |
756 | 27 | Mon Jan 30 17:20:00 CST 1978 |
490 | 75 | Wed Dec 11 21:20:00 CST 1974 |
957 | 12 | Sun Jul 05 12:26:40 CST 1970 |
80 | 22 | Wed Oct 04 06:40:00 CST 1972 |
681 | 57 | Wed Nov 03 15:06:40 CST 1971 |
13 | 95 | Sat Sep 12 23:06:40 CST 1970 |
- PAI command
PAI -name Percentile -project algo_public -DinputTableName=maple_test_percentile_3col_input -DcolName=col0,col1,col2 -DoutputTableName=maple_test_percentile_3col_output;
- Output table
quantile:bigint | col0:double | col1:bigint | col2:datetime |
---|---|---|---|
0 | 0.0 | 0 | Thu Jan 01 08:00:00 CST 1970 |
1 | 9.0 | 0 | Sat Jan 24 11:33:20 CST 1970 |
2 | 19.0 | 1 | Sat Feb 28 04:53:20 CST 1970 |
3 | 29.0 | 2 | Fri Apr 03 22:13:20 CST 1970 |
4 | 39.0 | 3 | Fri May 08 15:33:20 CST 1970 |
5 | 49.0 | 4 | Fri Jun 12 08:53:20 CST 1970 |
6 | 59.0 | 5 | Fri Jul 17 02:13:20 CST 1970 |
7 | 69.0 | 6 | Thu Aug 20 19:33:20 CST 1970 |
8 | 79.0 | 7 | Thu Sep 24 12:53:20 CST 1970 |
9 | 89.0 | 8 | Thu Oct 29 06:13:20 CST 1970 |
10 | 99.0 | 9 | Wed Dec 02 23:33:20 CST 1970 |
11 | 109.0 | 10 | Wed Jan 06 16:53:20 CST 1971 |
12 | 119.0 | 11 | Wed Feb 10 10:13:20 CST 1971 |
13 | 129.0 | 12 | Wed Mar 17 03:33:20 CST 1971 |
14 | 139.0 | 13 | Tue Apr 20 20:53:20 CST 1971 |
15 | 149.0 | 14 | Tue May 25 14:13:20 CST 1971 |
16 | 159.0 | 15 | Tue Jun 29 07:33:20 CST 1971 |
... | ... | ... | ... |
84 | 839.0 | 83 | Thu Dec 15 10:13:20 CST 1977 |
85 | 849.0 | 84 | Thu Jan 19 03:33:20 CST 1978 |
86 | 859.0 | 85 | Wed Feb 22 20:53:20 CST 1978 |
87 | 869.0 | 86 | Wed Mar 29 14:13:20 CST 1978 |
88 | 879.0 | 87 | Wed May 03 07:33:20 CST 1978 |
89 | 889.0 | 88 | Wed Jun 07 00:53:20 CST 1978 |
90 | 899.0 | 89 | Tue Jul 11 18:13:20 CST 1978 |
91 | 909.0 | 90 | Tue Aug 15 11:33:20 CST 1978 |
92 | 919.0 | 91 | Tue Sep 19 04:53:20 CST 1978 |
93 | 929.0 | 92 | Mon Oct 23 22:13:20 CST 1978 |
94 | 939.0 | 93 | Mon Nov 27 15:33:20 CST 1978 |
95 | 949.0 | 94 | Mon Jan 01 08:53:20 CST 1979 |
96 | 959.0 | 95 | Mon Feb 05 02:13:20 CST 1979 |
97 | 969.0 | 96 | Sun Mar 11 19:33:20 CST 1979 |
98 | 979.0 | 97 | Sun Apr 15 12:53:20 CST 1979 |
99 | 989.0 | 98 | Sun May 20 06:13:20 CST 1979 |
100 | 999.0 | 99 | Sat Jun 23 23:33:20 CST 1979 |