This topic describes the Split component provided by Machine Learning Designer. This component randomly splits data to generate datasets for training and testing.
Configure the component
You can configure the component by using one of the following methods:
Method 1: Use the Machine Learning Platform for AI (PAI) console
Configure the component on the pipeline configuration page of Machine Learning Designer in the PAI console.
Tab | Parameter | Description |
---|---|---|
Parameters Setting | Splitting Method | The method used to split the data, for example Split by Ratio or splitting by threshold. |
Parameters Setting | Splitting Fraction | The fraction of the data that is allocated to output table 1. Valid values: (0,1). |
Parameters Setting | Random Seed | The random seed, which is automatically generated. |
Parameters Setting | ID Column (Do Not Split Columns with the Same ID) | The ID column. Rows that share the same ID are not split across the output tables. Instead, they are randomly allocated together to output table 1 or output table 2. Note: This parameter is displayed only if you select Advanced Options. Only a single ID column can be selected. |
Parameters Setting | Threshold Column | The threshold column. The data is split based on a threshold applied to the values in this column. STRING-typed columns cannot be selected. |
Parameters Setting | Threshold | The threshold used to split the column specified by Threshold Column. Rows whose value is less than the threshold go to output table 1. Rows whose value is greater than or equal to the threshold go to output table 2. Important: If you want to split data by threshold, clear the settings that apply when Splitting Method is set to Split by Ratio, such as Splitting Fraction. |
Tuning | Cores | The number of cores. The system automatically allocates cores used for training based on the volume of input data. |
Tuning | Memory Size per Core | The memory size of each core. The system automatically allocates the memory based on the volume of input data. Unit: MB. |
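The splitting behaviors described in the table can be illustrated with a small Python sketch. This is not the component's implementation: the function names `split_by_ratio` and `split_by_threshold` are hypothetical, and the per-group random draw is an assumed sampling scheme, since the actual algorithm is not documented here. The sketch only shows the documented rules: rows with the same ID stay together, and a threshold sends smaller values to output table 1.

```python
import random
from collections import defaultdict


def split_by_ratio(rows, fraction, seed=None, id_key=None):
    """Randomly split rows into (table1, table2).

    Assumption: each group of rows is sent to table 1 with probability
    `fraction`. Rows that share the same `id_key` value form one group.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be in the open interval (0, 1)")
    rng = random.Random(seed)

    # Group rows so that each group is allocated as a unit.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = row[id_key] if id_key is not None else index
        groups[key].append(row)

    table1, table2 = [], []
    for key in groups:
        target = table1 if rng.random() < fraction else table2
        target.extend(groups[key])
    return table1, table2


def split_by_threshold(rows, column, threshold):
    """Rows below the threshold go to table 1, all others to table 2."""
    table1 = [r for r in rows if r[column] < threshold]
    table2 = [r for r in rows if r[column] >= threshold]
    return table1, table2
```

With a fixed seed the ratio split is reproducible, which mirrors the purpose of the Random Seed parameter. Because allocation here is a random draw per group, the share of rows in table 1 only approximates the requested fraction.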
Method 2: Run a PAI command
Configure the component by running a PAI command. You can use the SQL Script component to run PAI commands. For more information, see SQL Script.
PAI -name split -project algo_public
-DinputTableName=wbpc
-Doutput1TableName=wpbc_split1
-Doutput2TableName=wpbc_split2
-Dfraction=0.25;
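If the command is generated by a script rather than typed by hand, a helper along these lines can assemble it. This is a minimal sketch: `build_split_command` is a hypothetical helper, not part of PAI, and it only formats text in the `-Dname=value` style shown in the command above. It does not submit anything.

```python
def build_split_command(params, project="algo_public"):
    """Assemble a PAI split command string from a parameter mapping."""
    lines = [f"PAI -name split -project {project}"]
    for name, value in params.items():
        if value is None:
            continue  # Optional parameters left unset are omitted.
        lines.append(f"-D{name}={value}")
    return "\n".join(lines) + ";"


command = build_split_command({
    "inputTableName": "wbpc",
    "output1TableName": "wpbc_split1",
    "output2TableName": "wpbc_split2",
    "fraction": 0.25,
})
print(command)
```

The printed text matches the sample command, and it can be pasted into an SQL Script component.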
Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | None |
inputTablePartitions | No | The partitions that are selected from the input table for training. Note: If you specify multiple partitions, separate them with commas (,). | All partitions |
output1TableName | Yes | The name of output table 1. | None |
output1TablePartition | No | The name of the partition in output table 1. | Non-partitioned table |
output2TableName | Yes | The name of output table 2. | None |
output2TablePartition | No | The name of the partition in output table 2. | Non-partitioned table |
fraction | No | The fraction of the data that is allocated to output table 1. Valid values: (0,1). | None |
randomSeed | No | The random seed. The value must be a positive integer. | Determined by the system |
idColName | No | The ID column. Rows that share the same ID are not split across the output tables. | None |
thresholdColName | No | The threshold column. STRING-typed columns cannot be selected. | None |
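threshold | No | The threshold used to split the column specified by thresholdColName. Rows whose value is less than the threshold go to output table 1. Rows whose value is greater than or equal to the threshold go to output table 2. | None |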
lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | None |
coreNum | No | The number of cores. | Determined by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: (1,65536). | Determined by the system |
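The value ranges in the table can be checked before a command is submitted. The following sketch uses a hypothetical function, `validate_split_params`, that applies only the constraints listed above. The rule that thresholdColName and threshold must be set together is an assumption drawn from their descriptions, not a documented requirement.

```python
def validate_split_params(params):
    """Return a list of problems found in a split parameter mapping."""
    problems = []
    for name in ("inputTableName", "output1TableName", "output2TableName"):
        if not params.get(name):
            problems.append(f"{name} is required")

    fraction = params.get("fraction")
    if fraction is not None and not 0 < fraction < 1:
        problems.append("fraction must be in (0,1)")

    seed = params.get("randomSeed")
    if seed is not None and not (isinstance(seed, int) and seed > 0):
        problems.append("randomSeed must be a positive integer")

    lifecycle = params.get("lifecycle")
    if lifecycle is not None and not 1 <= lifecycle <= 3650:
        problems.append("lifecycle must be in [1,3650]")

    mem = params.get("memSizePerCore")
    if mem is not None and not 1 < mem < 65536:
        problems.append("memSizePerCore must be in (1,65536)")

    # Assumption: a threshold split needs both the column and the value.
    has_column = params.get("thresholdColName") is not None
    has_value = params.get("threshold") is not None
    if has_column != has_value:
        problems.append("thresholdColName and threshold must be set together")
    return problems
```

An empty list means that none of these checks failed. The server may still reject the command for reasons this sketch does not cover.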