Isolation Forest (iForest) is a widely used anomaly detection algorithm. It builds each tree from a small random sub-sample of the data, which keeps its computational complexity low while still identifying anomalous points in a dataset. This topic describes how to configure the iForest component to detect anomalies.
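The isolation idea behind iForest can be illustrated with a minimal pure-Python sketch. It is not the implementation used by the component: it handles a single numeric feature only, and the helper names `path_length` and `mean_path_length` are hypothetical. The sketch counts how many random splits are needed to separate a point from a sub-sample and averages that count over many trees. Anomalous points tend to be isolated after fewer splits.

```python
import random

def path_length(x, sample, rng, depth=0, max_depth=16):
    """Count the random splits needed to isolate x from a 1-D sample."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(sample), max(sample)
    if lo == hi:
        # All remaining values are identical, so no further split is possible.
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as x.
    side = [v for v in sample if (v < split) == (x < split)]
    return path_length(x, side, rng, depth + 1, max_depth)

def mean_path_length(x, data, num_trees=100, subsampling_size=256, seed=0):
    """Average the path length of x over num_trees random sub-samples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_trees):
        sample = rng.sample(data, min(subsampling_size, len(data)))
        total += path_length(x, sample, rng)
    return total / num_trees

# Illustrative data: ten values close together plus one obvious outlier (5.0).
data = [0.73, 0.24, 0.63, 0.55, 0.73, 0.41, 0.50, 0.60, 0.45, 0.58, 5.0]
inlier = mean_path_length(0.55, data)
outlier = mean_path_length(5.0, data)
print("mean path length of 0.55:", inlier)
print("mean path length of 5.0:", outlier)
```

The value 5.0 is usually separated by the first split, so its mean path length is shorter than that of 0.55, which sits among close neighbors. The arguments `num_trees` and `subsampling_size` play the same role as the numTrees and subsamplingSize parameters of the component.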
Configure the component
You can use one of the following methods to configure the parameters of iForest for anomaly detection.
Method 1: Configure the component in Machine Learning Designer
Configure the component parameters on the pipeline page of Machine Learning Designer.
Tab | Parameter | Description |
Field Setting | featureCols | The feature columns that are used for training. This parameter is greyed out if vectorCol or tensorCol is set. Note: The featureCols, tensorCol, and vectorCol parameters are mutually exclusive. You can use only one of them to describe the input features of the algorithm. |
Field Setting | groupCols | The names of the group columns. |
Field Setting | tensorCol | The name of the tensor column. This parameter is greyed out if vectorCol or featureCols is set. Note: The featureCols, tensorCol, and vectorCol parameters are mutually exclusive. You can use only one of them to describe the input features of the algorithm. |
Field Setting | vectorCol | The name of the vector column. This parameter is greyed out if tensorCol or featureCols is set. Note: The featureCols, tensorCol, and vectorCol parameters are mutually exclusive. You can use only one of them to describe the input features of the algorithm. |
Parameter Setting | predictionCol | The name of the prediction result column. |
Parameter Setting | maxOutlierNumPerGroup | The maximum number of outliers per group. |
Parameter Setting | maxOutlierRatio | The maximum ratio of outliers that are detected by the algorithm. |
Parameter Setting | maxSampleNumPerGroup | The maximum number of samples per group. |
Parameter Setting | numTrees | The number of trees in the model. Default value: 100. |
Parameter Setting | outlierThreshold | If the score of a data point exceeds this threshold, the data point is considered anomalous. |
Parameter Setting | Column name of detail prediction information (predictionDetailCol) | The name of the prediction details column. |
Parameter Setting | subsamplingSize | The number of rows that are sampled for each tree. The value must be a positive integer. Valid values: 2 to 100000. Default value: 256. |
Parameter Setting | numThreads | The number of threads of the component. Default value: 1. |
Execution Tuning | Number of Workers | The number of nodes. The value must be a positive integer. This parameter must be used with the Memory per worker parameter. Valid values: 1 to 9999. |
Execution Tuning | Memory per worker, unit MB | The memory size of each node. Unit: MB. The value must be a positive integer. Valid values: 1024 to 65536. |
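The constraints in the preceding table can be checked before a pipeline is run. The following is a minimal sketch of such a check and is not part of Machine Learning Designer. The function `validate_iforest_params` and the dictionary keys `numWorkers` and `memoryPerWorkerMB` are hypothetical names chosen for illustration. The ranges and the mutual-exclusivity rule are taken from the table.

```python
def validate_iforest_params(params):
    """Return a list of problems found in a dict of iForest settings."""
    problems = []

    # featureCols, tensorCol, and vectorCol are mutually exclusive.
    inputs = [k for k in ("featureCols", "tensorCol", "vectorCol") if params.get(k)]
    if len(inputs) > 1:
        problems.append("mutually exclusive: " + ", ".join(inputs))

    # Valid ranges from the table. numWorkers and memoryPerWorkerMB are
    # hypothetical keys standing in for the Execution Tuning fields.
    ranges = {
        "subsamplingSize": (2, 100000),
        "numWorkers": (1, 9999),
        "memoryPerWorkerMB": (1024, 65536),
    }
    for name, (lo, hi) in ranges.items():
        value = params.get(name)
        if value is None:
            continue  # The setting was not specified, so nothing to check.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not lo <= value <= hi:
            problems.append(f"{name} must be an integer from {lo} to {hi}")
    return problems

print(validate_iforest_params({"featureCols": ["val"], "subsamplingSize": 256}))
print(validate_iforest_params({"featureCols": ["val"], "vectorCol": "vec"}))
```

The first call returns an empty list because the settings satisfy every rule. The second call reports the conflict between featureCols and vectorCol.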
Method 2: Use Python code
Configure the component parameters by using the PyAlink Script component, which lets you call Python code. For more information, see PyAlink Script.
Parameter | Required | Description | Default value |
predictionCol | Yes | The name of the prediction results column. | N/A |
featureCols | No | The names of the feature columns that are used for training. | All columns |
groupCols | No | The name of the group column. You can specify multiple columns. | N/A |
maxOutlierNumPerGroup | No | The maximum number of outliers per group. | N/A |
maxOutlierRatio | No | The maximum ratio of outliers that are detected by the algorithm. | N/A |
maxSampleNumPerGroup | No | The maximum number of samples per group. | N/A |
numTrees | No | The number of trees in the model. | 100 |
outlierThreshold | No | If the score exceeds the specified threshold, the data point is considered an anomalous point. | N/A |
predictionDetailCol | Yes | The name of the prediction details column. | N/A |
tensorCol | No | The name of the tensor column. | N/A |
vectorCol | No | The name of the vector column. | N/A |
subsamplingSize | No | The number of rows that are sampled in each tree. The value must be a positive integer. Valid values: 2 to 100000. | 256 |
numThreads | No | The number of threads of the component. | 1 |
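The effect of outlierThreshold and maxOutlierRatio can be illustrated with a minimal sketch. It is not the logic of the component: the function `flag_outliers` is a hypothetical helper, the scores are made-up illustrative values, and the way the ratio cap is applied (keeping the highest-scoring candidates) is an assumption rather than documented behavior.

```python
import math

def flag_outliers(scores, outlier_threshold, max_outlier_ratio=None):
    """Return the sorted indices of the points flagged as outliers."""
    # Rule from the table: a score above the threshold marks an anomalous point.
    candidates = [i for i, s in enumerate(scores) if s > outlier_threshold]
    if max_outlier_ratio is not None:
        # Assumption: the ratio caps the share of flagged points, and the
        # points with the highest scores are kept.
        limit = math.floor(len(scores) * max_outlier_ratio)
        ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
        candidates = ranked[:limit]
    return sorted(candidates)

# Illustrative scores only; real scores come from the component output.
scores = [0.42, 0.91, 0.47, 0.88, 0.40, 0.95]
print(flag_outliers(scores, outlier_threshold=0.8))
print(flag_outliers(scores, outlier_threshold=0.8, max_outlier_ratio=0.34))
```

With the threshold alone, the points at indices 1, 3, and 5 are flagged. With the ratio of 0.34, at most two of the six points can be flagged, so only the two highest-scoring points (indices 1 and 5) remain.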
The following Python code provides an example.
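from pyalink.alink import *
import pandas as pd

# Build a small dataset with one feature column (val) and one label column.
df = pd.DataFrame([
    [0.73, 0],
    [0.24, 0],
    [0.63, 0],
    [0.55, 0],
    [0.73, 0],
    [0.41, 0]
])
dataOp = BatchOperator.fromDataframe(df, schemaStr='val double, label int')

# Configure the iForest outlier detection operator.
outlierOp = IForestOutlierBatchOp()\
    .setFeatureCols(["val"])\
    .setOutlierThreshold(3.0)\
    .setPredictionCol("pred")\
    .setPredictionDetailCol("pred_detail")

# Link the input data to the operator, then print the prediction results.
dataOp.link(outlierOp).print()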