Random Sampling extracts a subset of samples from an input dataset. Samples are selected at random according to a specified ratio or count. Each draw is independent: every sample has an equal probability of being selected, and selecting one sample does not influence the selection of others. This method is commonly used to create training and testing datasets so that model evaluation is unbiased and representative, and it is particularly suitable for large-scale data processing.
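The idea of sampling by quantity or by ratio can be sketched in a few lines of Python. This is an illustration of the concept only, not the component's implementation: the function name `random_sample`, its arguments, the rounding of the ratio to a count, and the rule that exactly one of the two options must be given are all assumptions of this sketch.

```python
import random

def random_sample(rows, sample_size=None, sample_ratio=None, seed=None):
    """Return a uniformly random subset of rows, drawn without replacement."""
    # Assumption of this sketch: exactly one of the two options is given.
    if (sample_size is None) == (sample_ratio is None):
        raise ValueError("specify exactly one of sample_size or sample_ratio")
    if sample_ratio is not None:
        if not 0 < sample_ratio < 1:
            raise ValueError("sample_ratio must lie in the open interval (0, 1)")
        # Convert the ratio to a count; rounding is an assumption of this sketch.
        sample_size = round(len(rows) * sample_ratio)
    if not 0 <= sample_size <= len(rows):
        raise ValueError("sample_size must be between 0 and len(rows)")
    rng = random.Random(seed)
    # random.Random.sample gives every row the same chance of being picked.
    return rng.sample(rows, sample_size)

data = list(range(1000))
by_size = random_sample(data, sample_size=50, seed=42)
by_ratio = random_sample(data, sample_ratio=0.1, seed=42)
```

Here `by_size` holds 50 distinct rows and `by_ratio` holds 100, each drawn uniformly from `data`.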
Configure the component
Method 1: Configure the component on the pipeline page
Add a Random Sampling component on the pipeline page and configure the following parameters:
Category | Parameter | Description |
Parameters Setting | Sample Size | The value must be a positive integer. |
Parameters Setting | Sampling Fraction | The value must be a floating-point number. Valid values: (0,1). |
Parameters Setting | Sampling with Replacement | During the random sampling process, each selected sample is returned to the original dataset, allowing that sample to be selected again in subsequent samplings. |
Parameters Setting | Random Seed | By default, the system determines the value. |
Tuning | Cores | The value must be a positive integer. By default, the system determines the value. |
Tuning | Memory Size per Core | The value must be a positive integer. Unit: MB. Valid values: (1,65536). By default, the system determines the value. |
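The effect of the Sampling with Replacement and Random Seed parameters can be illustrated with Python's standard `random` module. This is a conceptual sketch under stated assumptions, not the component's code: the helper name `draw` and its arguments are hypothetical.

```python
import random

def draw(rows, k, replace=False, seed=None):
    """Draw k rows, with or without replacement, using a seeded generator."""
    rng = random.Random(seed)
    if replace:
        # Each draw is independent, so the same row can appear more than once
        # and k may exceed the number of rows.
        return rng.choices(rows, k=k)
    # Without replacement every row appears at most once,
    # so k cannot exceed len(rows).
    return rng.sample(rows, k)

rows = ["a", "b", "c", "d"]
with_repl = draw(rows, 10, replace=True, seed=7)
without_repl = draw(rows, 3, replace=False, seed=7)
same_seed_again = draw(rows, 10, replace=True, seed=7)
```

With replacement, ten draws from four rows must contain repeats. Reusing the same seed reproduces the same sample, which is the usual reason to fix a seed instead of leaving it to the system.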
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name RandomSample
-project algo_public
-Dlifecycle="28"
-DoutputTableName="test2"
-Dreplace="false"
-DsampleSize="500"
-DinputPartitions="pt=20150501"
-DinputTableName="bank_data_partition";
Parameter | Required | Default value | Description |
inputTableName | Yes | None | The name of the input table. |
inputTablePartitions | No | None | The partitions that are selected from the input table for training. Note: Separate multiple partitions with commas (,). For example, name1=value1,value2. |
outputTableName | Yes | None | The name of the output table. |
sampleSize | No | None | The number of samples. The value must be a positive integer. |
sampleRatio | No | None | The sampling proportion. The value must be a floating-point number. Valid values: (0,1). |
replace | No | false | Specifies whether to enable sampling with replacement. The value must be of the BOOLEAN type. |
randomSeed | No | Determined by the system | The random seed. The value must be a positive integer. |
lifecycle | No | None | The lifecycle of the output table. Valid values: [1,3650]. |
coreNum | No | Determined by the system | The number of cores used in computing. The value must be a positive integer. |
memSizePerCore | No | Determined by the system | The memory size of each core. Valid values: (1,65536). Unit: MB. |
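The value ranges in the table above can be checked before a command is submitted. The following Python sketch assembles a command string in the same shape as the example and validates a few parameters against the documented ranges. The helper `build_random_sample_command` is hypothetical and is not part of any PAI SDK; it only produces text that could be pasted into an SQL Script component.

```python
def build_random_sample_command(input_table, output_table, sample_size=None,
                                sample_ratio=None, replace=False, lifecycle=None):
    """Build a RandomSample PAI command string (hypothetical helper)."""
    # Range checks mirror the parameter table above.
    if sample_size is not None and (not isinstance(sample_size, int) or sample_size <= 0):
        raise ValueError("sampleSize must be a positive integer")
    if sample_ratio is not None and not 0 < sample_ratio < 1:
        raise ValueError("sampleRatio must lie in the open interval (0, 1)")
    if lifecycle is not None and not 1 <= lifecycle <= 3650:
        raise ValueError("lifecycle must lie in [1, 3650]")

    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "replace": "true" if replace else "false",
    }
    # Optional parameters are emitted only when the caller sets them.
    if sample_size is not None:
        params["sampleSize"] = sample_size
    if sample_ratio is not None:
        params["sampleRatio"] = sample_ratio
    if lifecycle is not None:
        params["lifecycle"] = lifecycle

    lines = ["PAI -name RandomSample", "-project algo_public"]
    lines += [f'-D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"

command = build_random_sample_command(
    "bank_data_partition", "test2", sample_size=500, lifecycle=28
)
```

Printing `command` yields a multi-line PAI command ending in a semicolon. Partition selection is left out of the sketch on purpose, so add that parameter by hand if the input table is partitioned.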