This topic describes the Box Plot component provided by Machine Learning Designer.
A box plot shows the distribution of a set of data. You can use it to inspect the distribution features of raw data and to compare the distribution features of multiple sets of data.
Limits
The visualized report of this component is available only in Machine Learning Studio.
Configure the component
You can configure the component by using one of the following methods:
Method 1: Using the Machine Learning Platform for AI console
Tab | Parameter | Description |
---|---|---|
Field Setting | Continuous Features | The column to represent the continuous feature. |
 | Enumeration Feature | The column to represent the enumeration feature. Note: Machine Learning Studio allows you to select only one field, whereas Machine Learning Designer allows you to select multiple fields. |
 | Stratified Samples | The number of adopted stratified samples. |
Method 2: Using Machine Learning Platform for AI (PAI) commands
PAI -name box_plot -project algo_public
-DinputTable="boxplot"
-DcontinueCols="age"
-DcategoryCol="y"
-DoutputTable="pai_temp_6075_97181_1"
-DsampleSize="1000"
-Dlifecycle="7";
Parameter | Required | Description | Default value |
---|---|---|---|
inputTable | Yes | The name of the input table. | N/A |
inputTablePartitions | No | The partitions that are selected from the input table for training. Note: If you specify multiple partitions, separate them with commas (,). | N/A |
outputTable | Yes | The name of the output table that stores the box plot chart and samples. | N/A |
continueCols | Yes | The column to represent the continuous feature. | N/A |
categoryCol | Yes | The column to represent the enumeration feature. | N/A |
sampleSize | No | The number of samples used to draw the disturbance points of each feature. | 1000 |
lifecycle | No | The lifecycle of the output table. Unit: days. | 28 |
coreNum | No | The number of cores that are used in computing. The value of this parameter must be a positive integer. | Automatically allocated |
memSizePerCore | No | The memory size of each core. Valid values: 1 to 65536. Unit: MB. | Automatically allocated |
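The parameter table above can be read as a small set of validation rules. The following Python sketch assembles the command text from keyword arguments and checks the required parameters and the documented value ranges. The helper name `build_box_plot_command` is hypothetical and is not part of PAI. The sketch only builds a string and does not submit anything.

```python
# Hypothetical helper: renders a box_plot PAI command as text.
# Parameter names and constraints are taken from the table above.
REQUIRED = ("inputTable", "outputTable", "continueCols", "categoryCol")


def build_box_plot_command(**params):
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    if "coreNum" in params and int(params["coreNum"]) <= 0:
        raise ValueError("coreNum must be a positive integer")
    if "memSizePerCore" in params and not 1 <= int(params["memSizePerCore"]) <= 65536:
        raise ValueError("memSizePerCore must be between 1 and 65536 (MB)")
    lines = ["PAI -name box_plot -project algo_public"]
    # Keyword arguments keep their insertion order, so the output order is stable.
    lines += [f'-D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"


command = build_box_plot_command(
    inputTable="boxplot",
    continueCols="age",
    categoryCol="y",
    outputTable="pai_temp_6075_97181_1",
    sampleSize=1000,
    lifecycle=7,
)
print(command)
```

Optional parameters that are omitted are left out of the command, so the defaults listed in the table apply.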
Examples
- Input data
create table boxplot as select age, y from bank_data limit 100;
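age | y |
---|---|
50 | 0 |
53 | 0 |
28 | 1 |
39 | 0 |
55 | 1 |
30 | 0 |
37 | 0 |
39 | 0 |
36 | 1 |
27 | 0 |
34 | 0 |
41 | 0 |
55 | 1 |
33 | 0 |
26 | 0 |
52 | 0 |
35 | 1 |
27 | 1 |
28 | 0 |
26 | 0 |
41 | 0 |
35 | 0 |
40 | 0 |
32 | 0 |
41 | 0 |
34 | 0 |
49 | 0 |
37 | 0 |
35 | 0 |
38 | 0 |
47 | 0 |
46 | 0 |
27 | 0 |
29 | 1 |
32 | 0 |
36 | 0 |
29 | 0 |
47 | 0 |
44 | 0 |
54 | 0 |
36 | 0 |
42 | 0 |
44 | 0 |
72 | 1 |
48 | 0 |
36 | 0 |
35 | 0 |
43 | 0 |
56 | 0 |
42 | 0 |
31 | 0 |
32 | 0 |
33 | 0 |
31 | 0 |
39 | 0 |
30 | 1 |
24 | 0 |
24 | 0 |
38 | 0 |
26 | 0 |
41 | 0 |
34 | 0 |
30 | 0 |
37 | 0 |
68 | 0 |
31 | 0 |
48 | 0 |
33 | 0 |
59 | 0 |
44 | 0 |
28 | 0 |
50 | 0 |
33 | 0 |
45 | 0 |
40 | 0 |
45 | 0 |
43 | 0 |
54 | 0 |
53 | 0 |
35 | 0 |
30 | 0 |
25 | 0 |
35 | 0 |
54 | 1 |
30 | 0 |
38 | 0 |
35 | 0 |
47 | 0 |
32 | 0 |
27 | 0 |
40 | 1 |
31 | 0 |
42 | 0 |
40 | 0 |
31 | 0 |
57 | 0 |
38 | 1 |
39 | 0 |
37 | 0 |
44 | 0 |
- Parameter settings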
Specify the age column as the continuous feature column, and the y column as the enumeration feature column. Retain the default values of other parameters.
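To make these settings concrete, the following Python sketch computes the five values a box plot conventionally draws (minimum, first quartile, median, third quartile, maximum) for the age column, grouped by the y column. It uses only the first ten rows of the sample input. The quartile method (`statistics.quantiles` with `method="inclusive"`) is an assumption. This topic does not state how the component computes percentiles, so the values reported by the component may differ.

```python
from collections import defaultdict
from statistics import quantiles

# First ten (age, y) rows of the sample input table.
rows = [(50, 0), (53, 0), (28, 1), (39, 0), (55, 1),
        (30, 0), (37, 0), (39, 0), (36, 1), (27, 0)]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # Assumption: linear interpolation between data points ("inclusive" method).
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def summarize_by_category(rows):
    """Group the continuous values by category and summarize each group."""
    groups = defaultdict(list)
    for value, category in rows:
        groups[category].append(value)
    return {cat: five_number_summary(vals) for cat, vals in sorted(groups.items())}


summary = summarize_by_category(rows)
for category, stats in summary.items():
    print(category, stats)
```

Each key of `summary` corresponds to one box in the chart: one box per distinct value of the enumeration feature.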
- Output
- Output description
Right-click the Box Plot component and choose the option to view the output. The output contains the following fields:
- percent_points: the calculated percentiles.
- percent_count: the number of data entries in each interval. The intervals are delimited by the percentiles.
- sample_list: the samples that are selected from each stratum based on the sampling rate. The sampling rate is calculated by using the following formula: Sampling rate = Number of stratified samples/Total number of data entries. If the sampling rate is so low that the number of data entries in a stratum multiplied by the sampling rate is less than 10, the sampling rate is recalculated.
- The following figure shows a box plot chart.
- The following figure shows the distribution of disturbance points.
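The sampling rate formula given for sample_list can be sketched in Python as follows. The input rows are made-up illustration data, and the function name `stratified_sample` is hypothetical. This topic does not describe how the sampling rate is recalculated when a stratum would yield fewer than 10 samples, so the sketch only applies the base formula and does not attempt that step.

```python
import random
from collections import defaultdict


def stratified_sample(rows, sample_size, seed=0):
    """Sample from each stratum at rate = sample_size / total number of rows.

    rows is a list of (value, category) pairs. The recalculation rule for
    small strata is not specified in this topic and is not implemented here.
    """
    rate = min(1.0, sample_size / len(rows))
    strata = defaultdict(list)
    for value, category in rows:
        strata[category].append(value)
    rng = random.Random(seed)  # fixed seed so that repeated runs match
    samples = {}
    for category, values in strata.items():
        k = round(len(values) * rate)  # number of samples for this stratum
        samples[category] = rng.sample(values, k)
    return rate, samples


# Made-up data: 80 rows in category 0 and 20 rows in category 1.
rows = [(i, 0) for i in range(80)] + [(i, 1) for i in range(20)]
rate, samples = stratified_sample(rows, sample_size=50)
print(rate, {category: len(values) for category, values in samples.items()})
```

Because every stratum is sampled at the same rate, the category proportions of the input are preserved in the sample.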