Platform For AI: Stratified Sampling

Last Updated: Nov 28, 2024

Stratified Sampling is a data sampling method that divides a dataset into strata (groups) based on the values of a specified stratification column, and then performs random sampling independently within each stratum. Because every stratum is sampled separately, each group is represented in the sample. This matters most for imbalanced data, where simple random sampling can under-represent or miss small groups. A more representative sample helps improve the accuracy and stability of model training.
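The idea can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name stratified_sample, the toy rows, and the rounding rule used to turn a fraction into a per-stratum count are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata_col, fraction, seed=1007):
    """Draw the same fraction of rows from every stratum (illustrative only)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable

    # Group rows by the value of the stratification column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata_col]].append(row)

    sample = []
    for key in sorted(groups):  # sorted so the output order is deterministic
        members = groups[key]
        # Assumption: round to the nearest whole row; the real component's
        # rounding rule is not documented here.
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # sample without replacement
    return sample


# Imbalanced toy data: 90 rows labelled "A" and 10 rows labelled "B".
rows = [{"id": i, "label": "A" if i < 90 else "B"} for i in range(100)]
picked = stratified_sample(rows, "label", 0.2)

counts = {}
for row in picked:
    counts[row["label"]] = counts.get(row["label"], 0) + 1
print(counts)  # {'A': 18, 'B': 2}
```

Both labels appear in the sample in the same 9:1 proportion as in the source data, whereas a simple random draw of 20 rows could contain few or no "B" rows.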

Configure the component

Method 1: Configure the component on the pipeline page

Add a Stratified Sampling component on the pipeline page and configure the following parameters:

| Category | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Stratification Column | The column that is used for stratification. |
| Parameters Setting | Sample Size | The value must be a positive integer. |
| Parameters Setting | Sampling Fraction | The value must be a floating-point number. Valid values: (0,1). |
| Parameters Setting | Random Seed | The value is automatically generated by the system. The default value is 1234567. |
| Tuning | Cores | The value must be a positive integer. By default, the system determines the value. |
| Tuning | Memory Size per Core | The value must be a positive integer. Valid values: (1,65536). By default, the system determines the value. |

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name StratifiedSample
    -project algo_public
    -DinputTableName="test_input"
    -DoutputTableName="test_output"
    -DstrataColName="label"
    -DsampleSize="A:200,B:300,C:500"
    -DrandomSeed=1007
    -Dlifecycle=30;
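When the per-stratum sizes come from code rather than being typed by hand, the command text can be assembled programmatically. The sketch below is a hypothetical helper, not part of PAI: the names format_strata_spec and build_command are invented for illustration, and the quoting simply mirrors the sample command above.

```python
def format_strata_spec(value):
    """Render a per-stratum mapping as 'A:200,B:300'; pass scalars through."""
    if isinstance(value, dict):
        return ",".join(f"{name}:{amount}" for name, amount in value.items())
    return str(value)


def build_command(input_table, output_table, strata_col,
                  sample_size=None, sample_ratio=None,
                  random_seed=None, lifecycle=None):
    """Assemble the PAI command text (hypothetical helper, not part of PAI)."""
    # At least one of the two sampling parameters must be set.
    if sample_size is None and sample_ratio is None:
        raise ValueError("specify sample_size or sample_ratio")

    lines = [
        "PAI -name StratifiedSample",
        "-project algo_public",
        f'-DinputTableName="{input_table}"',
        f'-DoutputTableName="{output_table}"',
        f'-DstrataColName="{strata_col}"',
    ]
    # Optional parameters are emitted only when a value is supplied.
    if sample_size is not None:
        lines.append(f'-DsampleSize="{format_strata_spec(sample_size)}"')
    if sample_ratio is not None:
        lines.append(f'-DsampleRatio="{format_strata_spec(sample_ratio)}"')
    if random_seed is not None:
        lines.append(f"-DrandomSeed={random_seed}")
    if lifecycle is not None:
        lines.append(f"-Dlifecycle={lifecycle}")
    return "\n    ".join(lines) + ";"


command = build_command("test_input", "test_output", "label",
                        sample_size={"A": 200, "B": 300, "C": 500},
                        random_seed=1007, lifecycle=30)
print(command)
```

With these arguments the helper reproduces the sample command shown above. The resulting text would still be run through the SQL Script component.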

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The name of the input table. |
| inputTablePartitions | No | All partitions | The partitions to read from the input table. Supported formats: partition_name=value for a single partition, and name1=value1/name2=value2 for multi-level partitions. Note: Separate multiple partitions with commas (,). For example, name1=value1,value2. |
| outputTableName | Yes | None | The name of the output table. |
| strataColName | Yes | None | The name of the column that is used as the key for stratification. |
| sampleSize | No | None | The number of samples. If the value is a positive integer, it is the number of samples drawn from each stratum. If the value is a string, it must be in the format strata0:n0,strata1:n1, where the value after each colon (:) is the number of samples to draw from the stratum named before the colon. Note: If both the sampleSize and sampleRatio parameters are empty, an error is returned. If both are specified, the sampleSize parameter takes precedence. |
| sampleRatio | No | None | The sampling proportion. If the value is a number, it must be a floating-point number between 0 and 1, and it is the sampling proportion applied to each stratum. If the value is a string, it must be in the format strata0:r0,strata1:r1, where the value after each colon (:) is the sampling proportion for the stratum named before the colon. |
| randomSeed | No | 123456 | The random seed. The value must be a positive integer. |
| lifecycle | No | None | The lifecycle of the output table. Valid values: [1,3650]. |
| coreNum | No | Determined by the system | The number of cores used in computing. The value must be a positive integer. |
| memSizePerCore | No | Determined by the system | The memory size of each core. Valid values: (1,65536). Unit: MB. |
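The rules for sampleSize and sampleRatio (a single value or a per-stratum string, an error when both are empty, and sampleSize taking precedence when both are set) can be summarized in a short Python sketch. This models the documented behavior only and is not the component's code: the names parse_strata_spec and resolve_sampling are hypothetical, and treating the ratio bounds as exclusive is an assumption based on the (0,1) range listed for Sampling Fraction.

```python
def parse_strata_spec(value, cast):
    """Parse 'A:200,B:300' into a dict; a bare value applies to every stratum."""
    text = str(value).strip()
    if ":" not in text:
        return cast(text)  # one value shared by all strata
    spec = {}
    for item in text.split(","):
        name, _, amount = item.partition(":")
        spec[name.strip()] = cast(amount)
    return spec


def resolve_sampling(sample_size=None, sample_ratio=None):
    """Return ('size', spec) or ('ratio', spec) following the documented rules."""
    if sample_size is None and sample_ratio is None:
        raise ValueError("sampleSize and sampleRatio are both empty")

    # sampleSize takes precedence when both parameters are specified.
    if sample_size is not None:
        sizes = parse_strata_spec(sample_size, int)
        values = sizes.values() if isinstance(sizes, dict) else [sizes]
        if any(n <= 0 for n in values):
            raise ValueError("sample sizes must be positive integers")
        return "size", sizes

    ratios = parse_strata_spec(sample_ratio, float)
    values = ratios.values() if isinstance(ratios, dict) else [ratios]
    # Assumption: bounds are exclusive, matching the (0,1) range in the UI table.
    if any(not 0 < r < 1 for r in values):
        raise ValueError("sampling ratios must be between 0 and 1")
    return "ratio", ratios


print(resolve_sampling(sample_size="A:200,B:300,C:500", sample_ratio=0.1))
# ('size', {'A': 200, 'B': 300, 'C': 500})
print(resolve_sampling(sample_ratio="A:0.1,B:0.5"))
# ('ratio', {'A': 0.1, 'B': 0.5})
```

In the first call the ratio of 0.1 is ignored because a size specification is also present.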