The Feature Anomaly Smoothing component smooths anomalous features in input data to a specific interval. Both sparse and dense data are supported.
Background information
- Z-Score
If a feature follows a normal distribution, noise is assumed to lie outside the range of -3×alpha to 3×alpha around the mean, where alpha is the standard deviation. Z-Score smooths the noise to the range of [-3×alpha, 3×alpha].
For example, assume that a feature follows a normal distribution with a mean of 0 and a standard deviation of 3. The feature value -10 is identified as anomalous and corrected to -3 × 3 + 0 = -9 based on the smoothing rule of Z-Score. In the same way, the feature value 10 is corrected to 3 × 3 + 0 = 9.
- Percentile smoothing
Percentile smoothing is used to smooth the data distributed outside the range of [minPer, maxPer] to the minPer or maxPer quantile.
For example, assume that the values of the age feature range from 0 to 200 and are evenly distributed. Set minPer to 0 and maxPer to 50%. The corresponding quantiles are 0 and 100, so feature values outside the range of 0 to 100 are corrected to 0 or 100.
- Threshold smoothing
Threshold smoothing is used to smooth the data distributed outside the range of [minThresh, maxThresh] to the minThresh or maxThresh data point.
For example, assume that the values of the age feature range from 0 to 200. Set minThresh to 10 and maxThresh to 80. Feature values outside the range of 10 to 80 are corrected to 10 or 80.
- Boxplot smoothing
Boxplot smoothing uses the first quartile (q1) and the third quartile (q3) to smooth data to the range of [minThresh, maxThresh], where minThresh = q1 - 1.5 × (q3 - q1) and maxThresh = q3 + 1.5 × (q3 - q1).
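The four smoothing rules described above can be sketched in a few lines of Python. This is only an illustration of the rules, not the component's implementation: the function names are hypothetical, alpha is taken to be the standard deviation (as in the Z-Score example), and the quantile interpolation is an assumption, because the text does not say how the component computes quantiles.

```python
import statistics


def clip(values, low, high):
    """Pull every value outside [low, high] back to the nearest bound."""
    return [min(max(v, low), high) for v in values]


def zscore_bounds(values, k=3.0):
    """Bounds at mean - k*std and mean + k*std (k = 3 as in the Z-Score rule)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return mean - k * std, mean + k * std


def percentile_bounds(values, min_per, max_per):
    """Bounds at the min_per and max_per quantiles, given as fractions in [0, 1]."""
    ordered = sorted(values)

    def quantile(p):
        # Linear interpolation between neighboring ranks (an assumption).
        pos = p * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    return quantile(min_per), quantile(max_per)


def boxplot_bounds(values):
    """Bounds at q1 - 1.5*(q3 - q1) and q3 + 1.5*(q3 - q1)."""
    q1, q3 = percentile_bounds(values, 0.25, 0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Z-Score example from the text: mean 0 and standard deviation 3 give [-9, 9].
print(clip([-10, 2, 10], -9, 9))  # [-9, 2, 9]

# Threshold smoothing with minThresh = 10 and maxThresh = 80.
print(clip([5, 50, 120], 10, 80))  # [10, 50, 80]

# Percentile smoothing on evenly spread ages 0..200 with minPer = 0, maxPer = 0.5.
ages = list(range(201))
print(percentile_bounds(ages, 0.0, 0.5))  # (0.0, 100.0)
```

Each method differs only in how the bounds are derived. Once the bounds are known, the correction is the same clipping step.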
Configure the component
You can use one of the following methods to configure the Feature Anomaly Smoothing component.
Method 1: Configure the component on the pipeline page
Tab | Parameter | Description |
---|---|---|
Fields Setting | Smoothed Feature Columns | The feature columns that you want to smooth. |
Fields Setting | Label Column | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and the objective variables can be viewed. |
Parameters Setting | Smoothing Method | The method that is used for smoothing. Valid values: Z-Score, Percentile, Threshold Smoothing, and Box Plot. |
Parameters Setting | Confidence Interval | The confidence level. This parameter is required when the Smoothing Method parameter is set to Z-Score. |
Parameters Setting | Minimum Threshold | The minimum threshold. The default value is -9999, which indicates that no minimum threshold is set. This parameter is required when the Smoothing Method parameter is set to Threshold Smoothing. |
Parameters Setting | Maximum Threshold | The maximum threshold. The default value is -9999, which indicates that no maximum threshold is set. This parameter is required when the Smoothing Method parameter is set to Threshold Smoothing. |
Parameters Setting | Minimum Percentile | The minimum percentile. This parameter is required when the Smoothing Method parameter is set to Percentile or Box Plot. |
Parameters Setting | Maximum Percentile | The maximum percentile. This parameter is required when the Smoothing Method parameter is set to Percentile or Box Plot. |
Method 2: Use PAI commands
PAI -name fe_soften_runner -project algo_public
-DminThresh=5000
-Dlifecycle=28
-DsoftenMethod=min-max-thresh
-DsoftenCols=nr_employed
-DmaxThresh=6000
-DinputTable=pai_dense_10_1
-DoutputTable=pai_temp_2262_20381_1;
Parameter | Required | Description | Default value |
---|---|---|---|
inputTable | Yes | The name of the input table. | N/A |
inputTablePartitions | No | The partitions that are selected from the input table for training. Specify this parameter in the Partition_name=value format. Multi-level partitions can also be specified. If you specify multiple partitions, separate them with commas (,). | All partitions in the input table |
outputTable | Yes | The output table after smoothing. | N/A |
labelCol | No | The label column. If this parameter is specified, the x-y histogram that displays the relationship between the features and the objective variables can be viewed. | Empty string |
categoryCols | No | The selected fields that are processed as enumerated features. | Empty string |
softenCols | Yes | The features that you want to smooth. Sparse features are automatically displayed by the system. | N/A |
softenMethod | No | The method that is used for smoothing. Valid values: ZScore, min-max-per, min-max-thresh, and boxplot. | ZScore |
softenTopN | No | If you do not set the softenCols parameter, the system automatically selects the top N features that require smoothing. The value must be a positive integer. | 10 |
cl | No | The confidence level. This parameter is required when the softenMethod parameter is set to ZScore. | 10 |
minPer | No | The minimum percentile. This parameter is required when the softenMethod parameter is set to min-max-per or boxplot. | 0.0 |
maxPer | No | The maximum percentile. This parameter is required when the softenMethod parameter is set to min-max-per or boxplot. | 1.0 |
minThresh | No | The minimum threshold. This parameter is required when the softenMethod parameter is set to min-max-thresh. | -9999 |
maxThresh | No | The maximum threshold. This parameter is required when the softenMethod parameter is set to min-max-thresh. | -9999 |
isSparse | No | Specifies whether features are sparse features in the key-value format. Valid values: true and false. The default value is false, which indicates that features are dense. | false |
itemSpliter | No | The delimiter that is used to separate sparse key-value pairs. | , |
kvSpliter | No | The delimiter that is used to separate sparse keys and values. | : |
lifecycle | No | The lifecycle of the output table. The value must be a positive integer. | 7 |
coreNum | No | The number of cores. This parameter is used together with the memSizePerCore parameter. The value must be a positive integer. Valid values: [1,9999]. | Determined by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. The value must be a positive integer. Valid values: [2048,64 × 1024]. | Determined by the system |
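
To make the sparse key-value format concrete, the following Python sketch parses one sparse row with the default itemSpliter (`,`) and kvSpliter (`:`) and clips each value to a threshold range. The function name and the sample row are hypothetical, and how the component itself processes sparse rows is not described here, so treat this only as an illustration of the format.

```python
def smooth_sparse_row(row, low, high, item_spliter=",", kv_spliter=":"):
    """Clip the values of a sparse 'key:value,key:value' string to [low, high]."""
    if not row:
        return row
    items = []
    for item in row.split(item_spliter):
        # Split only on the first key-value delimiter so the value stays intact.
        key, value = item.split(kv_spliter, 1)
        clipped = min(max(float(value), float(low)), float(high))
        items.append(f"{key}{kv_spliter}{clipped}")
    return item_spliter.join(items)


# Hypothetical sparse row: feature 1 has value 4991.6, feature 7 has value 5228.1.
print(smooth_sparse_row("1:4991.6,7:5228.1", 5000, 6000))  # 1:5000.0,7:5228.1
```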
Examples
- Input data
create table if not exists pai_dense_10_1 as select nr_employed from bank_data limit 10;
nr_employed |
---|
5228.1 |
5195.8 |
4991.6 |
5099.1 |
5076.2 |
5228.1 |
5099.1 |
5099.1 |
5076.2 |
5099.1 |
- Parameter settings
On the Fields Setting tab, set Smoothed Feature Columns to nr_employed. On the Parameters Setting tab, set Smoothing Method to Threshold Smoothing, Minimum Threshold to 5000, and Maximum Threshold to 6000. The following figure shows the configurations on the Parameters Setting tab.
- Execution results
nr_employed |
---|
5228.1 |
5195.8 |
5000.0 |
5099.1 |
5076.2 |
5228.1 |
5099.1 |
5099.1 |
5076.2 |
5099.1 |
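
The execution results can be reproduced locally with a short Python sketch of threshold smoothing. The function name is hypothetical, and treating -9999 as "no threshold set" follows the documented default but is only an assumption about how the component interprets that value.

```python
NO_THRESHOLD = -9999  # documented default, taken here to mean "no threshold set"


def threshold_smooth(values, min_thresh=NO_THRESHOLD, max_thresh=NO_THRESHOLD):
    """Correct values below min_thresh or above max_thresh to the violated bound."""
    result = []
    for v in values:
        if min_thresh != NO_THRESHOLD and v < min_thresh:
            v = float(min_thresh)
        elif max_thresh != NO_THRESHOLD and v > max_thresh:
            v = float(max_thresh)
        result.append(v)
    return result


# The nr_employed column from the input data above.
nr_employed = [5228.1, 5195.8, 4991.6, 5099.1, 5076.2,
               5228.1, 5099.1, 5099.1, 5076.2, 5099.1]

# Same settings as the example: minThresh = 5000, maxThresh = 6000.
smoothed = threshold_smooth(nr_employed, min_thresh=5000, max_thresh=6000)
print(smoothed)
# [5228.1, 5195.8, 5000.0, 5099.1, 5076.2, 5228.1, 5099.1, 5099.1, 5076.2, 5099.1]
```

Only 4991.6 falls outside the range of 5000 to 6000, so it is the only value that changes.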