The Stratified Sampling component stratifies the input data based on the values of a stratification column and randomly samples data in each stratum.
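To make the behavior concrete, the following minimal Python sketch shows the general technique: rows are grouped by the stratification column, and a random sample is drawn from each group. It illustrates the concept only and is not the component's implementation. The function name `stratified_sample`, its parameters, sampling without replacement, and the truncating rounding rule for fractions are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata_col, sample_size=None, sample_ratio=None, seed=1007):
    """Group rows by strata_col and randomly sample each group.

    sample_size: an int applied to every stratum, or a dict of stratum -> count.
    sample_ratio: a float in (0, 1) applied to every stratum.
    Requiring exactly one of the two is an assumption of this sketch,
    not a documented rule of the component.
    """
    if (sample_size is None) == (sample_ratio is None):
        raise ValueError("specify exactly one of sample_size or sample_ratio")

    # Bucket the rows by the value of the stratification column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[strata_col]].append(row)

    rng = random.Random(seed)  # a fixed seed makes the result reproducible
    sampled = []
    for key in sorted(strata):  # sorted order keeps runs deterministic
        group = strata[key]
        if sample_ratio is not None:
            k = int(len(group) * sample_ratio)  # assumed rounding: truncate
        elif isinstance(sample_size, dict):
            k = sample_size.get(key, 0)
        else:
            k = sample_size
        k = min(k, len(group))  # never request more rows than the stratum holds
        sampled.extend(rng.sample(group, k))  # sampling without replacement
    return sampled


# 30 rows: 20 in stratum "A", 10 in stratum "B".
rows = [{"id": i, "label": "A" if i % 3 else "B"} for i in range(30)]
picked = stratified_sample(rows, "label", sample_size={"A": 5, "B": 2})
print(len(picked))  # 7
```

Passing the same seed twice returns the same rows, which mirrors the purpose of the Random Seed parameter described below.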
Configure the component
You can use one of the following methods to configure the Stratified Sampling component.
Method 1: Configure the component on the pipeline page
Configure the component parameters on the pipeline page of Machine Learning Designer.
Tab | Parameter | Description |
Fields Setting | Stratification Column | The column that is used for stratification. |
Parameters Setting | Sample Size | The value must be a positive integer. |
Parameters Setting | Sampling Fraction | The value must be a floating-point number. Valid values: (0,1). |
Parameters Setting | Random Seed | The value is automatically generated by the system. The default value is 1234567. |
Tuning | Cores | The value must be a positive integer. By default, the system determines the value. |
Tuning | Memory Size per Core | The value must be a positive integer. Valid values: (1,65536). Unit: MB. By default, the system determines the value. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name StratifiedSample
-project algo_public
-DinputTableName="test_input"
-DoutputTableName="test_output"
-DstrataColName="label"
-DsampleSize="A:200,B:300,C:500"
-DrandomSeed=1007
-Dlifecycle=30;
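The `sampleSize` value in the command above is a comma-separated list of `stratum:count` pairs. When the per-stratum counts are computed elsewhere, the string can be assembled programmatically. The sketch below uses a hypothetical helper, `format_sample_size`, which is not part of PAI; the output format is copied from the example command.

```python
def format_sample_size(counts):
    """Render {"A": 200, ...} as "A:200,B:300,C:500"."""
    for stratum, n in counts.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(n, int) or isinstance(n, bool) or n <= 0:
            raise ValueError(f"sample size for {stratum!r} must be a positive integer")
    # Dictionaries preserve insertion order, so the pairs keep the given order.
    return ",".join(f"{stratum}:{n}" for stratum, n in counts.items())


print(format_sample_size({"A": 200, "B": 300, "C": 500}))  # A:200,B:300,C:500
```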
Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | No default value |
inputTablePartitions | No | The partitions that are selected from the input table for training. The following formats are supported:
Note: Separate multiple partitions with commas (,). | All partitions |
outputTableName | Yes | The name of the output table. | No default value |
strataColName | Yes | The name of the column that is used as the key for stratification. | No default value |
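sampleSize | No | The number of samples. To set a size for each stratum, use comma-separated stratum:count pairs, as in the example command (A:200,B:300,C:500). | No default value |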
sampleRatio | No | The sampling proportion. | No default value |
randomSeed | No | The random seed. The value must be a positive integer. | 123456 |
lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | No default value |
coreNum | No | The number of cores used in computing. The value must be a positive integer. | Determined by the system |
memSizePerCore | No | The memory size of each core. Valid values: (1,65536). Unit: MB. | Determined by the system |
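The ranges in the table can be checked before a command is submitted. The following is a minimal sketch, not a PAI feature: `validate_params` is a hypothetical helper, and the (0,1) bound on `sampleRatio` is borrowed from the Sampling Fraction parameter on the pipeline page, on the assumption that the two parameters correspond.

```python
def validate_params(params):
    """Return a list of problems found in a dict of PAI parameter values."""
    problems = []

    # Parameters marked "Yes" in the Required column.
    for name in ("inputTableName", "outputTableName", "strataColName"):
        if not params.get(name):
            problems.append(f"{name} is required")

    ratio = params.get("sampleRatio")
    if ratio is not None and not (0 < ratio < 1):  # assumed open interval (0,1)
        problems.append("sampleRatio must be in (0,1)")

    lifecycle = params.get("lifecycle")
    if lifecycle is not None and not (1 <= lifecycle <= 3650):
        problems.append("lifecycle must be in [1,3650]")

    mem = params.get("memSizePerCore")
    if mem is not None and not (1 < mem < 65536):
        problems.append("memSizePerCore must be in (1,65536)")

    for name in ("randomSeed", "coreNum"):
        value = params.get(name)
        if value is not None and value <= 0:
            problems.append(f"{name} must be a positive integer")

    return problems


example = {
    "inputTableName": "test_input",
    "outputTableName": "test_output",
    "strataColName": "label",
    "randomSeed": 1007,
    "lifecycle": 30,
}
print(validate_params(example))  # []
```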