Platform For AI: Random Sampling

Last Updated: Nov 28, 2024

The Random Sampling component extracts a subset from an input dataset by randomly selecting samples according to a specified ratio or quantity. Each draw is independent: every sample has the same probability of being selected, and selecting one sample does not influence the selection of any other. Random sampling is commonly used to create training and test datasets so that model evaluation is unbiased and representative, and it is well suited to large-scale data processing.
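The idea can be illustrated outside PAI with a minimal Python sketch. It is not the component's implementation: the function name `random_sample`, the rounding of the ratio to a row count, and the use of Python's `random` module are all assumptions made for illustration.

```python
import random


def random_sample(rows, sample_size=None, sample_ratio=None, seed=None):
    """Draw a uniform random subset of rows without replacement."""
    if sample_size is None:
        if sample_ratio is None:
            raise ValueError("specify sample_size or sample_ratio")
        # Assumption: the ratio is converted to a count by rounding.
        sample_size = round(len(rows) * sample_ratio)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(rows, sample_size)


# Split 1,000 row IDs into an 80% training set and a 20% test set.
rows = list(range(1000))
train = random_sample(rows, sample_ratio=0.8, seed=42)
train_ids = set(train)
test = [r for r in rows if r not in train_ids]
```

Running the function again with the same seed returns the same subset, which is the role the Random Seed parameter plays for the component.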

Configure the component

Method 1: Configure the component on the pipeline page

Add a Random Sampling component on the pipeline page and configure the following parameters:

| Category | Parameter | Description |
| --- | --- | --- |
| Parameters Setting | Sample Size | The value must be a positive integer. |
| Parameters Setting | Sampling Fraction | The value must be a floating-point number. Valid values: (0,1). |
| Parameters Setting | Sampling with Replacement | During the random sampling process, each selected sample is returned to the original dataset, allowing that sample to be selected again in subsequent samplings. |
| Parameters Setting | Random Seed | By default, the system determines the value. |
| Tuning | Cores | The value must be a positive integer. By default, the system determines the value. |
| Tuning | Memory Size per Core | The value must be a positive integer. Unit: MB. Valid values: (1,65536). By default, the system determines the value. |
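The difference that Sampling with Replacement makes can be seen in a small Python sketch. This only illustrates the general technique with the standard `random` module. The toy dataset and seed are made up, and nothing here describes how the component itself is implemented.

```python
import random

rows = ["a", "b", "c", "d", "e"]
rng = random.Random(7)

# Without replacement: every row appears at most once, so the sample
# cannot be larger than the dataset.
without_replacement = rng.sample(rows, 5)

# With replacement: each draw is made from the full dataset, so the
# same row can be selected more than once.
with_replacement = rng.choices(rows, k=8)

print(without_replacement)
print(with_replacement)
```

In this sketch, the draw with replacement returns 8 values from only 5 distinct rows, so at least one row must repeat.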

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.

PAI -name RandomSample
    -project algo_public
    -Dlifecycle="28"
    -DoutputTableName="test2"
    -Dreplace="false"
    -DsampleSize="500"
    -DinputPartitions="pt=20150501"
    -DinputTableName="bank_data_partition";

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The name of the input table. |
| inputTablePartitions | No | None | The partitions that are selected from the input table for sampling. The following formats are supported: partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: Separate multiple partitions with commas (,). For example, name1=value1,value2. |
| outputTableName | Yes | None | The name of the output table. |
| sampleSize | No | None | The number of samples. Note: If both the sampleSize and sampleRatio parameters are empty, an error is returned. If both the sampleSize and sampleRatio parameters are specified, the sampleSize parameter takes precedence. |
| sampleRatio | No | None | The sampling proportion. The value must be a floating-point number. Valid values: (0,1). |
| replace | No | false | Specifies whether to enable sampling with replacement. The value must be of the BOOLEAN type. |
| randomSeed | No | Determined by the system | The random seed. The value must be a positive integer. |
| lifecycle | No | None | The lifecycle of the output table. Valid values: [1,3650]. |
| coreNum | No | Determined by the system | The number of cores used in computing. The value must be a positive integer. |
| memSizePerCore | No | Determined by the system | The memory size of each core. Valid values: (1,65536). Unit: MB. |
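The interaction between sampleSize and sampleRatio described in the parameter list can be written out as a short Python sketch. The function `resolve_sample_count` is hypothetical, and rounding the ratio to a row count is an assumption. Only the error case, the precedence rule, and the (0,1) range come from the descriptions above.

```python
def resolve_sample_count(total_rows, sample_size=None, sample_ratio=None):
    """Return the number of rows to draw under the documented rules."""
    if sample_size is None and sample_ratio is None:
        # Both parameters empty: the documentation says an error is returned.
        raise ValueError("either sampleSize or sampleRatio must be set")
    if sample_size is not None:
        # sampleSize takes precedence when both parameters are specified.
        if not isinstance(sample_size, int) or sample_size <= 0:
            raise ValueError("sampleSize must be a positive integer")
        return sample_size
    if not 0 < sample_ratio < 1:
        raise ValueError("sampleRatio must be in the open interval (0, 1)")
    # Assumption: the ratio is converted to a count by rounding.
    return round(total_rows * sample_ratio)


# sampleSize wins over sampleRatio when both are given.
count_both = resolve_sample_count(1000, sample_size=500, sample_ratio=0.1)
# With only sampleRatio, the count is derived from the table size.
count_ratio = resolve_sample_count(1000, sample_ratio=0.25)
```

Checking these rules before submitting a command can catch a missing or out-of-range value early, without waiting for the job to fail.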