Platform For AI: Use iForest Outlier to detect anomalies

Last Updated: Dec 11, 2024

Isolation Forest (iForest) is a widely used anomaly detection algorithm. It builds each tree from a sub-sample of the data, which keeps the computational complexity low while still identifying anomalous points in a dataset. This topic describes how to configure the iForest Outlier component to detect anomalies.
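To make the sub-sampling idea concrete, the following is a minimal sketch of isolation-based scoring for one-dimensional data, written in plain Python. It is not the component's implementation: the function names (`c_factor`, `path_length`, `anomaly_score`) are hypothetical, and the score formula is the commonly cited iForest normalization, assumed here for illustration. The `num_trees` and `subsampling_size` arguments only mirror the numTrees and subsamplingSize parameters described later in this topic.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def path_length(x, sample, rng, depth=0, max_depth=8):
    """Depth at which x is separated from the sample by random splits."""
    if depth >= max_depth or len(sample) <= 1:
        return depth + c_factor(len(sample))
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth + c_factor(len(sample))
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as x.
    side = [v for v in sample if (v < split) == (x < split)]
    return path_length(x, side, rng, depth + 1, max_depth)


def anomaly_score(x, data, num_trees=100, subsampling_size=256, seed=0):
    """Score in (0, 1); points that are isolated quickly score higher."""
    size = min(subsampling_size, len(data))
    if size < 2:
        raise ValueError("at least two data points are required")
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(size))
    total = 0.0
    for _ in range(num_trees):
        sample = rng.sample(data, size)  # each tree sees its own sub-sample
        total += path_length(x, sample, rng, max_depth=max_depth)
    mean_depth = total / num_trees
    return 2.0 ** (-mean_depth / c_factor(size))


data = [0.73, 0.24, 0.63, 0.55, 0.73, 0.41, 10.0]
print(anomaly_score(10.0, data), anomaly_score(0.55, data))
```

In this sketch the value 10.0 is usually separated from the rest after a single split, so its average path is shorter and its score is higher than that of a value inside the cluster.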

Configure the component

You can use one of the following methods to configure the parameters of iForest for anomaly detection.

Method 1: Configure the component in Machine Learning Designer

Configure the component parameters on the pipeline page of Machine Learning Designer.

| Tab | Parameter | Description |
| --- | --- | --- |
| Field Setting | featureCols | The feature columns that are used for training. If you have set the vectorCol or tensorCol parameter, this parameter is greyed out. |
| Field Setting | groupCols | The group columns, specified as an array. |
| Field Setting | tensorCol | The name of the tensor column. If you have set the vectorCol or featureCols parameter, this parameter is greyed out. |
| Field Setting | vectorCol | The name of the vector column. If you have set the tensorCol or featureCols parameter, this parameter is greyed out. |
| Parameter Setting | predictionCol | The name of the prediction result column. |
| Parameter Setting | maxOutlierNumPerGroup | The maximum number of outliers per group. |
| Parameter Setting | maxOutlierRatio | The maximum ratio of outliers that are detected by the algorithm. |
| Parameter Setting | maxSampleNumPerGroup | The maximum number of samples per group. |
| Parameter Setting | numTrees | The number of trees in the model. Default value: 100. |
| Parameter Setting | outlierThreshold | If the score exceeds the specified threshold, the data point is considered an anomalous point. |
| Parameter Setting | Column name of detail prediction information | The name of the prediction details column. |
| Parameter Setting | subsamplingSize | The number of rows that are sampled in each tree. The value must be a positive integer. Valid values: 2 to 100000. Default value: 256. |
| Parameter Setting | numThreads | The number of threads of the component. Default value: 1. |
| Execution Tuning | Number of Workers | The number of nodes. The value must be a positive integer. This parameter must be used with the Memory per worker parameter. Valid values: 1 to 9999. |
| Execution Tuning | Memory per worker, unit MB | The memory size of each node. Unit: MB. The value must be a positive integer. Valid values: 1024 to 65536. |

Note

The featureCols, tensorCol, and vectorCol parameters are mutually exclusive. You can use only one of them to describe the input features of the algorithm.
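The constraints in this table (mutually exclusive input columns and the documented value ranges) can be checked before a pipeline is run. The following is a minimal sketch, assuming the settings are held in a plain Python dict. The function name `validate_iforest_settings` and the keys `numWorkers` and `memoryPerWorkerMB` are hypothetical names chosen for illustration; they are not part of Machine Learning Designer or PyAlink.

```python
def validate_iforest_settings(settings):
    """Return a list of problems found in a dict of component settings."""
    problems = []

    # featureCols, tensorCol, and vectorCol are mutually exclusive.
    input_keys = ("featureCols", "tensorCol", "vectorCol")
    chosen = [key for key in input_keys if settings.get(key)]
    if len(chosen) > 1:
        problems.append("set only one of: " + ", ".join(chosen))

    # Documented valid ranges, both ends inclusive.
    ranges = {
        "subsamplingSize": (2, 100000),
        "numWorkers": (1, 9999),             # hypothetical key for "Number of Workers"
        "memoryPerWorkerMB": (1024, 65536),  # hypothetical key for "Memory per worker"
    }
    for key, (low, high) in ranges.items():
        value = settings.get(key)
        if value is None:
            continue
        if not isinstance(value, int) or not low <= value <= high:
            problems.append(f"{key} must be an integer from {low} to {high}")

    # Number of Workers must be used together with Memory per worker.
    if ("numWorkers" in settings) != ("memoryPerWorkerMB" in settings):
        problems.append("numWorkers and memoryPerWorkerMB must be set together")

    return problems


print(validate_iforest_settings({"featureCols": ["val"], "vectorCol": "vec"}))
```

Returning a list of problems instead of raising on the first one lets a caller report every misconfiguration in a single pass.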

Method 2: Use Python code

Configure the component parameters by using the PyAlink Script component, which lets you call Python code. For more information, see PyAlink Script.

| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| predictionCol | Yes | The name of the prediction results column. | N/A |
| featureCols | No | The names of the feature columns, specified as an array. | Select All |
| groupCols | No | The name of the group column. You can specify multiple columns. | N/A |
| maxOutlierNumPerGroup | No | The maximum number of outliers per group. | N/A |
| maxOutlierRatio | No | The maximum ratio of outliers that are detected by the algorithm. | N/A |
| maxSampleNumPerGroup | No | The maximum number of samples per group. | N/A |
| numTrees | No | The number of trees in the model. | 100 |
| outlierThreshold | No | If the score exceeds the specified threshold, the data point is considered an anomalous point. | N/A |
| predictionDetailCol | Yes | The name of the prediction details column. | N/A |
| tensorCol | No | The name of the tensor column. | N/A |
| vectorCol | No | The name of the vector column. | N/A |
| subsamplingSize | No | The number of rows that are sampled in each tree. The value must be a positive integer. Valid values: 2 to 100000. | 256 |
| numThreads | No | The number of threads of the component. | 1 |

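The following sample code shows how to call the component in Python.

from pyalink.alink import *
import pandas as pd

# Sample data: one numeric feature column and one label column.
df = pd.DataFrame([
    [0.73, 0],
    [0.24, 0],
    [0.63, 0],
    [0.55, 0],
    [0.73, 0],
    [0.41, 0]
])

# Convert the DataFrame into an Alink batch source.
dataOp = BatchOperator.fromDataframe(df, schemaStr='val double, label int')

# Configure the iForest outlier detector.
outlierOp = IForestOutlierBatchOp()\
    .setFeatureCols(["val"])\
    .setOutlierThreshold(3.0)\
    .setPredictionCol("pred")\
    .setPredictionDetailCol("pred_detail")

# Link the data source to the detector, then print the predictions.
dataOp.link(outlierOp).print()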
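The outlierThreshold description states that a point is anomalous when its score exceeds the threshold, and maxOutlierRatio caps the share of detected outliers. The following plain-Python sketch illustrates one way these two settings could combine. The function name `flag_outliers`, the example scores, and the rule that the cap keeps the highest-scoring points (rounded down) are assumptions made for illustration; the component's actual behavior may differ.

```python
import math


def flag_outliers(scores, outlier_threshold, max_outlier_ratio=None):
    """Return a list of booleans, True where a score is flagged as an outlier."""
    # Rank row indices from the highest score to the lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # A point is a candidate only if its score exceeds the threshold.
    ranked = [i for i in ranked if scores[i] > outlier_threshold]
    if max_outlier_ratio is not None:
        # Assumed cap: keep at most floor(ratio * n) of the highest scores.
        limit = math.floor(max_outlier_ratio * len(scores))
        ranked = ranked[:limit]
    flagged = set(ranked)
    return [i in flagged for i in range(len(scores))]


scores = [0.42, 0.91, 0.47, 0.88, 0.45]  # hypothetical anomaly scores
print(flag_outliers(scores, 0.8))
print(flag_outliers(scores, 0.8, max_outlier_ratio=0.3))
```

With the threshold alone, both high-scoring rows are flagged. Adding a ratio of 0.3 over five rows allows only one flagged row under the assumed rounding rule, so only the highest score remains.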