Use iForest Outlier to detect anomalies - Platform For AI

Isolation Forest (iForest) detects anomalies by sub-sampling: each tree is built from a small random sample of the data, which keeps the computational complexity low while still isolating anomalous points in a dataset. iForest is widely used for anomaly detection. This topic describes how to configure the iForest Outlier component to detect anomalies.
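To make the idea concrete, the following is a minimal pure-Python sketch of the textbook isolation forest procedure for one-dimensional data. It is not the implementation used by the component. The function names (c_factor, build_tree, path_length, anomaly_scores) are hypothetical, and the score formula follows the commonly published iForest definition, so the score scale may differ from the one the component uses for outlierThreshold.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(values, depth, max_depth, rng):
    """Recursively split 1-D values at random cut points until they are isolated."""
    if len(values) <= 1 or depth >= max_depth:
        return {"size": len(values)}
    low, high = min(values), max(values)
    if low == high:
        return {"size": len(values)}
    cut = rng.uniform(low, high)
    left = [v for v in values if v < cut]
    right = [v for v in values if v >= cut]
    return {
        "cut": cut,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node, depth=0):
    """Number of splits needed to reach the leaf that holds x, with a leaf-size correction."""
    if "cut" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x < node["cut"] else node["right"]
    return path_length(x, child, depth + 1)


def anomaly_scores(data, num_trees=100, subsampling_size=256, seed=0):
    """Score each point; shorter average paths give scores closer to 1."""
    n = min(subsampling_size, len(data))
    if n < 2:
        raise ValueError("at least two data points are required")
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(n))
    trees = [build_tree(rng.sample(data, n), 0, max_depth, rng) for _ in range(num_trees)]
    norm = c_factor(n)
    return [
        2.0 ** (-sum(path_length(x, tree) for tree in trees) / (num_trees * norm))
        for x in data
    ]


values = [0.73, 0.24, 0.63, 0.55, 0.73, 0.41, 9.0]
scores = anomaly_scores(values)
most_anomalous = values[scores.index(max(scores))]
print(most_anomalous)
```

In this sketch, the defaults for num_trees and subsampling_size mirror the numTrees and subsamplingSize defaults listed in this topic. A point that lies far from the rest of the data is separated after very few random cuts, so it receives the highest score.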

Configure the component

You can use one of the following methods to configure the parameters of iForest for anomaly detection.

Method 1: Configure the component in Machine Learning Designer

Configure the component parameters on the pipeline page of Machine Learning Designer.

Tab	Parameter	Description
Field Setting	featureCols	The feature columns that are used for training. This parameter is greyed out if the vectorCol or tensorCol parameter is set. Note: The featureCols, tensorCol, and vectorCol parameters are mutually exclusive. Use only one of them to describe the input features of the algorithm.
	groupCols	The names of the group columns. You can specify multiple columns.
	tensorCol	The name of the tensor column. This parameter is greyed out if the vectorCol or featureCols parameter is set. Note: The featureCols, tensorCol, and vectorCol parameters are mutually exclusive. Use only one of them to describe the input features of the algorithm.
	vectorCol	The name of the vector column. This parameter is greyed out if the tensorCol or featureCols parameter is set. Note: The featureCols, tensorCol, and vectorCol parameters are mutually exclusive. Use only one of them to describe the input features of the algorithm.
Parameter Setting	predictionCol	The name of the prediction result column.
	maxOutlierNumPerGroup	The maximum number of outliers per group.
	maxOutlierRatio	The maximum ratio of outliers that are detected by the algorithm.
	maxSampleNumPerGroup	The maximum number of samples per group.
	numTrees	The number of trees in the model. The default value is 100.
	outlierThreshold	The score threshold. A data point whose score exceeds this threshold is considered anomalous.
	Column name of detail prediction information	The name of the prediction details column.
	subsamplingSize	The number of rows that are sampled in each tree. The value must be a positive integer. Valid values: 2 to 100000. Default value: 256.
	numThreads	The number of threads of the component. Default value: 1.
Execution Tuning	Number of Workers	The number of nodes. The value must be a positive integer. Valid values: 1 to 9999. This parameter must be used together with the Memory per worker parameter.
	Memory per worker, unit MB	The memory size of each node. Unit: MB. The value must be a positive integer. Valid values: 1024 to 65536.
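The constraints in the preceding table can be checked before a pipeline is run. The following is a minimal sketch of such a check, not part of the component or of PyAlink. The function validate_iforest_params and the dictionary keys numWorkers and memoryPerWorkerMB are hypothetical names chosen for illustration. The value ranges and the mutual-exclusion rule are taken from the table.

```python
# Hypothetical key names for the two Execution Tuning fields.
WORKERS = "numWorkers"
MEMORY = "memoryPerWorkerMB"

# Inclusive value ranges, as listed in the parameter table.
RANGES = {
    "subsamplingSize": (2, 100000),
    WORKERS: (1, 9999),
    MEMORY: (1024, 65536),
}


def validate_iforest_params(params):
    """Return a list of problems found in a dict of iForest parameters."""
    errors = []

    # featureCols, tensorCol, and vectorCol are mutually exclusive.
    inputs = [k for k in ("featureCols", "tensorCol", "vectorCol") if params.get(k)]
    if len(inputs) > 1:
        errors.append("mutually exclusive parameters set: " + ", ".join(inputs))

    # Range checks for integer parameters that are present.
    for name, (low, high) in RANGES.items():
        if name not in params:
            continue
        value = params[name]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            errors.append(f"{name} must be an integer from {low} to {high}")

    # Number of Workers must be used together with Memory per worker.
    if (WORKERS in params) != (MEMORY in params):
        errors.append(f"{WORKERS} and {MEMORY} must be set together")

    return errors


problems = validate_iforest_params({
    "featureCols": ["val"],
    "vectorCol": "vec",
    "subsamplingSize": 1,
})
for problem in problems:
    print(problem)
```

Returning a list instead of raising on the first problem lets a caller report every misconfiguration at once.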

Method 2: Use Python code

Configure the component parameters by using the PyAlink Script component, which lets you call Python code. For more information, see PyAlink Script.

Parameter	Required	Description	Default value
predictionCol	Yes	The name of the prediction result column.	N/A
featureCols	No	The names of the feature columns.	Select All
groupCols	No	The names of the group columns. You can specify multiple columns.	N/A
maxOutlierNumPerGroup	No	The maximum number of outliers per group.	N/A
maxOutlierRatio	No	The maximum ratio of outliers that are detected by the algorithm.	N/A
maxSampleNumPerGroup	No	The maximum number of samples per group.	N/A
numTrees	No	The number of trees in the model.	100
outlierThreshold	No	The score threshold. A data point whose score exceeds this threshold is considered anomalous.	N/A
predictionDetailCol	Yes	The name of the prediction details column.	N/A
tensorCol	No	The name of the tensor column.	N/A
vectorCol	No	The name of the vector column.	N/A
subsamplingSize	No	The number of rows that are sampled in each tree. The value must be a positive integer. Valid values: 2 to 100000.	256
numThreads	No	The number of threads of the component.	1

The following Python code provides an example:

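from pyalink.alink import *
import pandas as pd

# Sample data: one numeric feature (val) and a label column.
df = pd.DataFrame([
    [0.73, 0],
    [0.24, 0],
    [0.63, 0],
    [0.55, 0],
    [0.73, 0],
    [0.41, 0]
])

# Convert the pandas DataFrame into an Alink batch source.
dataOp = BatchOperator.fromDataframe(df, schemaStr='val double, label int')

# Configure the iForest outlier detector.
outlierOp = IForestOutlierBatchOp()\
    .setFeatureCols(["val"])\
    .setOutlierThreshold(3.0)\
    .setPredictionCol("pred")\
    .setPredictionDetailCol("pred_detail")

# Link the data source to the detector, then run it and print the predictions.
dataOp.link(outlierOp).print()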