A percentile is a statistical measure. The Percentile component calculates the percentiles of the data in the columns of a table. When a dataset is sorted in ascending order and divided into 100 equal groups, the p-th percentile is the value below which p percent of the data falls.
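The idea of dividing sorted data into 100 groups can be illustrated with a minimal sketch that uses the `statistics.quantiles` function from the Python standard library. The sample values are made up for illustration, and the interpolation method shown here (`inclusive`) is an assumption: this topic does not state which method the Percentile component uses, so results may differ from the component's output.

```python
from statistics import quantiles

# Illustrative data only: the integers 0, 1, ..., 100.
data = list(range(101))

# n=100 splits the sorted data into 100 groups and returns the
# 99 cut points between them (the 1st through 99th percentiles).
cuts = quantiles(data, n=100, method="inclusive")

print(len(cuts))   # 99
print(cuts[49])    # the 50th percentile (the median), 50.0 for this data
```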

Background information

  • The system can calculate only the percentiles of data of the BIGINT, DOUBLE, or DATETIME type.
  • Empty columns are skipped when the percentile is calculated. If all of the columns are empty, an error is returned.
  • You can specify multiple columns of data in the colName parameter.
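The rules above can be sketched as a small column filter. This is only an illustration under stated assumptions: the helper name `usable_columns` and the table representation are hypothetical, an "empty" column is assumed to mean a column in which every value is missing, and rejecting unsupported types with an error is an assumption because this topic does not describe how the component reacts to them.

```python
SUPPORTED_TYPES = {"BIGINT", "DOUBLE", "DATETIME"}

def usable_columns(table):
    """Return the names of the columns a percentile could be computed for.

    `table` maps a column name to a (type name, list of values) pair,
    where None marks a missing value. Hypothetical representation.
    """
    usable = []
    for name, (type_name, values) in table.items():
        if type_name.upper() not in SUPPORTED_TYPES:
            # Assumption: unsupported types are rejected outright.
            raise TypeError(f"unsupported type for {name}: {type_name}")
        if all(v is None for v in values):
            continue  # empty column: skipped
        usable.append(name)
    if not usable:
        # Mirrors the documented behavior: all columns empty is an error.
        raise ValueError("all selected columns are empty")
    return usable

example = {
    "col0": ("DOUBLE", [1.0, None, 3.5]),
    "col1": ("BIGINT", [None, None, None]),
}
print(usable_columns(example))  # ['col0']
```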

Configure the component

You can use one of the following methods to configure the Percentile component.

Method 1: Configure the component on the pipeline page

You can configure the parameters of the Percentile component on the pipeline page of Machine Learning Designer in Machine Learning Platform for AI (PAI). Machine Learning Designer was formerly known as Machine Learning Studio. The following table describes the parameters.
Tab | Parameter | Description
Parameters Setting | Input Columns | Click Select Column to select input columns.
Tuning | Number of Cores | The number of cores.
Tuning | Memory Size per Core | The memory size of each core.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name Percentile
     -project algo_public
     -DinputTableName=maple_test_percentile_3col_input
     -DcolName=col0,col1,col2
     -DoutputTableName=maple_test_percentile_3col_output;
Parameter | Description | Required
inputTableName | The name of the input table. | Yes
outputTableName | The name of the output table. | Yes
colName | The names of the columns to be calculated. By default, all columns are selected. Note: Separate the names of multiple columns with commas (,). | No
inputPartitions | The partitions in the input table. By default, all partitions are selected. Specify a single partition in the format of partition_name=value. Specify multiple partitions in the format of name1=value1,name2=value2, separated with commas (,). Specify multi-level partitions in the format of name1=value1/name2=value2. | No
predictInputTableName | The name of the prediction table. After you set this parameter, the prediction result can be generated. | No
predictInputTablePartitions | The partitions in the input prediction table. | No
predictSelectedColNames | The names of the columns selected from the prediction table. By default, all columns in the prediction table are selected. The column names must be the same as the column names in the training table. | No
predictSelectedOriginalColNames | The names of the columns whose data you want to retain. By default, all columns are selected. Separate the names of multiple columns with commas (,). | No
predictOutputTableName | The name of the output prediction table. This parameter is used with the predictInputTableName parameter. | No
lifecycle | The lifecycle of the output table. By default, the output table has no lifecycle. Note: The value must be a positive integer. | No
coreNum | The number of cores. Valid values: [1,9999]. This parameter is used with the memSizePerCore parameter. Note: The value must be a positive integer. | No
memSizePerCore | The memory size of each core. Unit: MB. Valid values: [1024,64 × 1024]. Note: The value must be a positive integer. | No
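The parameter rules above can be sketched as a small helper that assembles a PAI command string and checks the documented value ranges before the command is submitted. The function `build_percentile_command` and its argument names are hypothetical and are not part of PAI. The helper only builds text and does not run anything.

```python
def build_percentile_command(input_table, output_table, col_names=None,
                             input_partitions=None, lifecycle=None,
                             core_num=None, mem_size_per_core=None):
    """Assemble a PAI command for the Percentile component (hypothetical helper)."""
    params = {"inputTableName": input_table, "outputTableName": output_table}
    if col_names:
        # Multiple column names are separated with commas.
        params["colName"] = ",".join(col_names)
    if input_partitions:
        params["inputPartitions"] = input_partitions
    if lifecycle is not None:
        if lifecycle < 1:
            raise ValueError("lifecycle must be a positive integer")
        params["lifecycle"] = lifecycle
    if core_num is not None:
        if not 1 <= core_num <= 9999:
            raise ValueError("coreNum must be in [1, 9999]")
        params["coreNum"] = core_num
    if mem_size_per_core is not None:
        if not 1024 <= mem_size_per_core <= 64 * 1024:
            raise ValueError("memSizePerCore must be in [1024, 65536] MB")
        params["memSizePerCore"] = mem_size_per_core
    lines = ["PAI -name Percentile", "    -project algo_public"]
    lines += [f"    -D{key}={value}" for key, value in params.items()]
    return "\n".join(lines) + ";"

cmd = build_percentile_command(
    "maple_test_percentile_3col_input",
    "maple_test_percentile_3col_output",
    col_names=["col0", "col1", "col2"],
    core_num=4,
    mem_size_per_core=2048,
)
print(cmd)
```

Validating the ranges on the client side is a design choice of this sketch. The component may report invalid values in its own way.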

Example

  • Input table
    col0:double (1000 rows) | col1:bigint (100 rows) | col2:datetime (300 rows)
    96288Tue Oct 15 00:26:40 CST 1974
    21899Thu Jan 04 20:53:20 CST 1973
    56544Sat Mar 09 02:40:00 CST 1974
    31468Mon Aug 11 22:40:00 CST 1975
    58313Sat Aug 23 12:26:40 CST 1975
    61587Tue May 25 14:13:20 CST 1971
    7053Fri Mar 23 09:20:00 CST 1979
    92963Mon Jul 03 16:26:40 CST 1972
    24948Thu Mar 15 07:33:20 CST 1973
    42862Wed Mar 17 03:33:20 CST 1971
    1191Thu Jun 26 15:33:20 CST 1975
    75627Mon Jan 30 17:20:00 CST 1978
    49075Wed Dec 11 21:20:00 CST 1974
    95712Sun Jul 05 12:26:40 CST 1970
    8022Wed Oct 04 06:40:00 CST 1972
    68157Wed Nov 03 15:06:40 CST 1971
    1395Sat Sep 12 23:06:40 CST 1970
  • PAI command
     PAI -name Percentile
         -project algo_public
         -DinputTableName=maple_test_percentile_3col_input
         -DcolName=col0,col1,col2
         -DoutputTableName=maple_test_percentile_3col_output;
  • Output table
    quantile:bigint | col0:double | col1:bigint | col2:datetime
    0 | 0.0 | 0 | Thu Jan 01 08:00:00 CST 1970
    1 | 9.0 | 0 | Sat Jan 24 11:33:20 CST 1970
    2 | 19.0 | 1 | Sat Feb 28 04:53:20 CST 1970
    3 | 29.0 | 2 | Fri Apr 03 22:13:20 CST 1970
    4 | 39.0 | 3 | Fri May 08 15:33:20 CST 1970
    5 | 49.0 | 4 | Fri Jun 12 08:53:20 CST 1970
    6 | 59.0 | 5 | Fri Jul 17 02:13:20 CST 1970
    7 | 69.0 | 6 | Thu Aug 20 19:33:20 CST 1970
    8 | 79.0 | 7 | Thu Sep 24 12:53:20 CST 1970
    9 | 89.0 | 8 | Thu Oct 29 06:13:20 CST 1970
    10 | 99.0 | 9 | Wed Dec 02 23:33:20 CST 1970
    11 | 109.0 | 10 | Wed Jan 06 16:53:20 CST 1971
    12 | 119.0 | 11 | Wed Feb 10 10:13:20 CST 1971
    13 | 129.0 | 12 | Wed Mar 17 03:33:20 CST 1971
    14 | 139.0 | 13 | Tue Apr 20 20:53:20 CST 1971
    15 | 149.0 | 14 | Tue May 25 14:13:20 CST 1971
    16 | 159.0 | 15 | Tue Jun 29 07:33:20 CST 1971
    ... | ... | ... | ...
    84 | 839.0 | 83 | Thu Dec 15 10:13:20 CST 1977
    85 | 849.0 | 84 | Thu Jan 19 03:33:20 CST 1978
    86 | 859.0 | 85 | Wed Feb 22 20:53:20 CST 1978
    87 | 869.0 | 86 | Wed Mar 29 14:13:20 CST 1978
    88 | 879.0 | 87 | Wed May 03 07:33:20 CST 1978
    89 | 889.0 | 88 | Wed Jun 07 00:53:20 CST 1978
    90 | 899.0 | 89 | Tue Jul 11 18:13:20 CST 1978
    91 | 909.0 | 90 | Tue Aug 15 11:33:20 CST 1978
    92 | 919.0 | 91 | Tue Sep 19 04:53:20 CST 1978
    93 | 929.0 | 92 | Mon Oct 23 22:13:20 CST 1978
    94 | 939.0 | 93 | Mon Nov 27 15:33:20 CST 1978
    95 | 949.0 | 94 | Mon Jan 01 08:53:20 CST 1979
    96 | 959.0 | 95 | Mon Feb 05 02:13:20 CST 1979
    97 | 969.0 | 96 | Sun Mar 11 19:33:20 CST 1979
    98 | 979.0 | 97 | Sun Apr 15 12:53:20 CST 1979
    99 | 989.0 | 98 | Sun May 20 06:13:20 CST 1979
    100 | 999.0 | 99 | Sat Jun 23 23:33:20 CST 1979
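The col0 and col1 values in the output table are consistent with a nearest-rank calculation on sorted data, which the following minimal sketch reproduces. Two assumptions apply. First, the full input table is not shown, so the sketch assumes that col0 holds the values 0 through 999 and col1 holds the values 0 through 99. Second, this topic does not state which method the component uses, so nearest rank is an inference from the example and not a documented behavior. The function name `nearest_rank` is hypothetical.

```python
def nearest_rank(sorted_values, p):
    """Return the p-th percentile (0 <= p <= 100) by the nearest-rank method."""
    n = len(sorted_values)
    rank = (p * n + 99) // 100          # integer form of ceil(p * n / 100)
    return sorted_values[max(rank - 1, 0)]

# Assumed input data (not taken from the source table).
col0 = [float(v) for v in range(1000)]
col1 = list(range(100))

# One row per percentile: (quantile, col0 value, col1 value).
rows = [(p, nearest_rank(col0, p), nearest_rank(col1, p)) for p in range(101)]

for row in rows[:3] + rows[-1:]:
    print(row)
# (0, 0.0, 0)
# (1, 9.0, 0)
# (2, 19.0, 1)
# (100, 999.0, 99)
```

The rank is computed with integer arithmetic to avoid floating-point rounding when p * n / 100 is a whole number.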