Stratified sampling divides a dataset into strata (groups) based on the values of the specified stratification column, and then draws a random sample independently within each stratum. Because every stratum is sampled separately, each one is represented in the result, which makes the sample more representative of the full dataset, particularly when the data is imbalanced. A more representative sample in turn helps improve the accuracy and stability of model training.
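The grouping and per-group sampling described above can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name `stratified_sample`, the row format (a list of dicts), the rounding rule for fractions, and the seed value are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata_key, fraction=None, size=None, seed=1007):
    """Sample independently within each stratum (illustrative sketch only).

    rows       -- list of dicts, one per record
    strata_key -- name of the stratification column
    fraction   -- float in (0, 1), applied to every stratum
    size       -- int (same count per stratum) or dict {stratum: count}
    """
    if (fraction is None) == (size is None):
        raise ValueError("give exactly one of fraction or size")
    if fraction is not None and not 0 < fraction < 1:
        raise ValueError("fraction must be in (0, 1)")

    # Step 1: group rows by the value of the stratification column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[strata_key]].append(row)

    # Step 2: draw a random sample inside each group, without replacement.
    rng = random.Random(seed)
    sample = []
    for value in sorted(strata):  # fixed order keeps results reproducible
        group = strata[value]
        if fraction is not None:
            k = round(len(group) * fraction)  # assumed rounding rule
        elif isinstance(size, dict):
            k = size.get(value, 0)
        else:
            k = size
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample


# Imbalanced toy data: 90 rows labeled "A", 10 rows labeled "B".
rows = [{"id": i, "label": "A" if i < 90 else "B"} for i in range(100)]
picked = stratified_sample(rows, "label", fraction=0.1)
```

With a fraction of 0.1, the sketch keeps 9 rows from stratum A and 1 row from stratum B, so the minority stratum is still present in the sample.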
Configure the component
Method 1: Configure the component on the pipeline page
Add a Stratified Sampling component on the pipeline page and configure the following parameters:
Category | Parameter | Description |
Fields Setting | Stratification Column | The column that is used for stratification. |
Parameters Setting | Sample Size | The value must be a positive integer. |
Parameters Setting | Sampling Fraction | The value must be a floating-point number. Valid values: (0,1). |
Parameters Setting | Random Seed | The value is automatically generated by the system. The default value is 1234567. |
Tuning | Cores | The value must be a positive integer. By default, the system determines the value. |
Tuning | Memory Size per Core | The value must be a positive integer. Valid values: (1,65536). Unit: MB. By default, the system determines the value. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name StratifiedSample
-project algo_public
-DinputTableName="test_input"
-DoutputTableName="test_output"
-DstrataColName="label"
-DsampleSize="A:200,B:300,C:500"
-DrandomSeed=1007
-Dlifecycle=30;
Parameter | Required | Default value | Description |
inputTableName | Yes | None | The name of the input table. |
inputTablePartitions | No | All partitions | The partitions that are selected from the input table for sampling. Note: Separate multiple partitions with commas (,). For example, name1=value1,value2. |
outputTableName | Yes | None | The name of the output table. |
strataColName | Yes | None | The name of the column that is used as the key for stratification. |
sampleSize | No | None | The number of samples. The sample command above sets a size for each stratum in the form A:200,B:300,C:500. |
sampleRatio | No | None | The sampling proportion. |
randomSeed | No | 123456 | The random seed. The value must be a positive integer. |
lifecycle | No | None | The lifecycle of the output table. Valid values: [1,3650]. |
coreNum | No | Determined by the system | The number of cores used in computing. The value must be a positive integer. |
memSizePerCore | No | Determined by the system | The memory size of each core. Valid values: (1,65536). Unit: MB. |
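When the command is generated from a script, it can help to check values against the ranges in the tables before the command is submitted. The following Python sketch assembles the command text and applies those checks. The helper `build_stratified_sample_command` is hypothetical and not part of PAI. It also assumes that sampleRatio shares the (0,1) range listed for Sampling Fraction, and that string values are quoted while numeric values are not, as in the sample command.

```python
def build_stratified_sample_command(input_table, output_table, strata_col,
                                    sample_size=None, sample_ratio=None,
                                    random_seed=None, lifecycle=None,
                                    core_num=None, mem_size_per_core=None):
    """Hypothetical helper: build a PAI StratifiedSample command string."""
    # Range checks taken from the parameter tables in this topic.
    if sample_ratio is not None and not 0 < sample_ratio < 1:
        raise ValueError("sampleRatio must be in (0,1)")  # assumed range
    if lifecycle is not None and not 1 <= lifecycle <= 3650:
        raise ValueError("lifecycle must be in [1,3650]")
    if mem_size_per_core is not None and not 1 < mem_size_per_core < 65536:
        raise ValueError("memSizePerCore must be in (1,65536)")
    for name, value in (("randomSeed", random_seed), ("coreNum", core_num)):
        if value is not None and value <= 0:
            raise ValueError(f"{name} must be a positive integer")

    # Per-stratum sizes are serialized as "A:200,B:300", as in the sample command.
    if isinstance(sample_size, dict):
        sample_size = ",".join(f"{k}:{v}" for k, v in sample_size.items())

    params = [
        ("inputTableName", input_table),
        ("outputTableName", output_table),
        ("strataColName", strata_col),
        ("sampleSize", sample_size),
        ("sampleRatio", sample_ratio),
        ("randomSeed", random_seed),
        ("lifecycle", lifecycle),
        ("coreNum", core_num),
        ("memSizePerCore", mem_size_per_core),
    ]
    lines = ["PAI -name StratifiedSample", "-project algo_public"]
    for name, value in params:
        if value is None:
            continue  # optional parameter not set; omit it from the command
        # Assumption: quote strings, leave numbers bare.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"-D{name}={rendered}")
    return "\n".join(lines) + ";"


cmd = build_stratified_sample_command(
    "test_input", "test_output", "label",
    sample_size={"A": 200, "B": 300, "C": 500},
    random_seed=1007, lifecycle=30,
)
print(cmd)
```

Called with the values from the sample command, the helper produces the same parameter lines. An out-of-range value such as a lifecycle of 0 raises a ValueError before anything is sent to PAI.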