The LOF Outlier component of Platform for AI (PAI) identifies samples as outliers based on the Local Outlier Factor (LOF) algorithm. This topic describes how to configure the LOF Outlier component.
Limits
The LOF Outlier component can run only on MaxCompute computing resources.
Configure the component
You can use one of the following methods to configure the LOF Outlier component.
Method 1: Configure the component in the PAI console
Configure the component on the pipeline page of Machine Learning Designer. The following table describes the parameters.
Tab | Parameter | Description |
Field Setting | featureCols | An array of the names of feature columns. |
Field Setting | groupCols | An array of the names of group columns. |
Field Setting | tensorCol | The name of the tensor column. |
Field Setting | vectorCol | The name of the vector column. |
Parameter Setting | Prediction Result Column | The name of the prediction result column. |
Parameter Setting | Distance Measurement Method | The distance metric used to measure the distance between samples. Default value: EUCLIDEAN. |
Parameter Setting | maxOutlierNumPerGroup | The maximum number of outliers per group. |
Parameter Setting | maxOutlierRatio | The maximum ratio of outliers that are detected by the LOF algorithm. |
Parameter Setting | maxSampleNumPerGroup | The maximum number of samples per group. |
Parameter Setting | numNeighbors | The number of nearest neighbors used to build the neighbor graph for LOF. Default value: 5. |
Parameter Setting | outlierThreshold | The score threshold. A sample whose score exceeds this value is identified as an outlier. |
Parameter Setting | Column name of detail prediction information | The name of the prediction details column. |
Parameter Setting | numThreads | The number of threads of the LOF Outlier component. Default value: 1. |
Execute Tuning | Number of Workers | The number of worker nodes. The value must be a positive integer from 1 to 9999. This parameter must be used together with the Memory per worker parameter. |
Execute Tuning | Memory per worker | The memory size of each worker node. Unit: MB. The value must be a positive integer from 1024 to 65536. |
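The documented ranges for Number of Workers and Memory per worker can be checked before a pipeline is submitted. The following is a minimal sketch, not part of the component: the function name `validate_tuning` and its arguments are hypothetical, and it assumes that "used together" means both values are either set or left empty.

```python
def validate_tuning(num_workers=None, memory_mb=None):
    """Return a list of problems with the Execute Tuning values (empty if none).

    Hypothetical helper: the ranges come from the parameter table above;
    the "both or neither" rule is an assumed reading of "used together".
    """
    errors = []
    # Assumption: setting only one of the two parameters is an error.
    if (num_workers is None) != (memory_mb is None):
        errors.append("Number of Workers and Memory per worker must be set together")
    # Number of Workers: positive integer from 1 to 9999.
    if num_workers is not None:
        if not isinstance(num_workers, int) or not 1 <= num_workers <= 9999:
            errors.append("Number of Workers must be an integer from 1 to 9999")
    # Memory per worker: positive integer from 1024 to 65536, in MB.
    if memory_mb is not None:
        if not isinstance(memory_mb, int) or not 1024 <= memory_mb <= 65536:
            errors.append("Memory per worker must be an integer from 1024 to 65536 (MB)")
    return errors


print(validate_tuning(4, 8192))   # both values in range
print(validate_tuning(4, None))   # memory missing
print(validate_tuning(0, 512))    # both values out of range
```

An empty list means that the values fall within the documented ranges. The check does not tell you whether the values suit a particular dataset.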
Method 2: Configure the component by using Python code
Use the PyAlink Script component to call Python code that configures the LOF Outlier component. For more information, see the PyAlink script documentation. The following table describes the parameters.
Parameter | Required | Description | Default value |
predictionCol | Yes | The name of the prediction result column. | N/A |
distanceType | No | The distance metric used to measure the distance between samples. | EUCLIDEAN |
featureCols | No | An array of the names of feature columns. | All columns |
groupCols | No | The name of the group column. You can specify multiple columns. | N/A |
maxOutlierNumPerGroup | No | The maximum number of outliers per group. | N/A |
maxOutlierRatio | No | The maximum ratio of outliers that are detected by the LOF algorithm. | N/A |
maxSampleNumPerGroup | No | The maximum number of samples per group. | N/A |
outlierThreshold | No | The score threshold. A sample whose score exceeds this value is identified as an outlier. | N/A |
predictionDetailCol | No | The name of the prediction details column. | N/A |
tensorCol | No | The name of the tensor column. | N/A |
vectorCol | No | The name of the vector column. | N/A |
numNeighbors | No | The number of nearest neighbors used to build the neighbor graph for LOF. | 5 |
numThreads | No | The number of threads of the LOF Outlier component. | 1 |
Sample Python code:
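from pyalink.alink import *  # provides BatchOperator, LofOutlierBatchOp, EvalOutlierBatchOp
import pandas as pd

# Sample data: one feature value and one label per row.
df = pd.DataFrame([
    [0.73, 0],
    [0.24, 0],
    [0.63, 0],
    [0.55, 0],
    [0.73, 0],
    [0.41, 0]
])
dataOp = BatchOperator.fromDataframe(df, schemaStr='val double, label int')

# Score each row with LOF. Rows whose score exceeds 3.0 are identified as outliers.
outlierOp = LofOutlierBatchOp()\
    .setFeatureCols(["val"])\
    .setOutlierThreshold(3.0)\
    .setPredictionCol("pred")\
    .setPredictionDetailCol("pred_detail")

# Evaluate the predictions against the label column. The label value "1" marks an outlier.
evalOp = EvalOutlierBatchOp()\
    .setLabelCol("label")\
    .setPredictionDetailCol("pred_detail")\
    .setOutlierValueStrings(["1"])

# Chain the operators and collect the evaluation metrics.
metrics = dataOp\
    .link(outlierOp)\
    .link(evalOp)\
    .collectMetrics()
print(metrics)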
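The roles of numNeighbors and outlierThreshold are easier to see with the LOF computation written out. The following is a minimal pure-Python sketch of the textbook LOF score with Euclidean distance. It is not the component's implementation, and the component's scores may differ. The function name `lof_scores` and the sample values are hypothetical.

```python
import math


def lof_scores(points, k):
    """Textbook LOF with Euclidean distance, brute force; illustrative only."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neighbors = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # Neighborhood: every point within the k-distance (ties included).
        neighbors.append([j for j in order if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)

    # LOF: mean density of the neighbors divided by the point's own density.
    scores = []
    for i in range(n):
        if math.isinf(lrd[i]):
            scores.append(1.0)  # point sits on duplicates; treat it as an inlier
            continue
        mean_neighbor_lrd = sum(lrd[j] for j in neighbors[i]) / len(neighbors[i])
        scores.append(mean_neighbor_lrd / lrd[i])
    return scores


points = [(0,), (1,), (2,), (3,), (4,), (50,)]  # hypothetical sample values
scores = lof_scores(points, k=2)                 # k plays the role of numNeighbors
threshold = 3.0                                  # plays the role of outlierThreshold
flagged = [i for i, s in enumerate(scores) if s > threshold]
print(flagged)  # prints [5]
```

A score close to 1 means that a point is about as dense as its neighbors. The isolated value 50 scores about 31, far above the threshold, while the five clustered values score at most 1.25.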