Platform for AI: DBSCAN Prediction

Last Updated: Nov 21, 2024

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm. It defines a cluster as a maximal set of density-connected points, treats high-density regions as clusters, and can detect clusters of arbitrary shape in spatial data that contains noise. The DBSCAN Prediction component uses a trained DBSCAN model to predict the clusters to which new points belong. This topic describes how to configure the DBSCAN Prediction component.
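
This topic does not state how the component assigns a new point to a cluster. One common approach, shown here only as a minimal sketch and not as the component's actual implementation, is to give the new point the label of the nearest core point found during training, provided that the core point lies within the neighborhood radius. The names predict_cluster, core_points, core_labels, eps, and the noise label -1 are assumptions made for illustration.

```python
import math

NOISE = -1  # assumed label for points that do not join any cluster


def predict_cluster(point, core_points, core_labels, eps):
    """Return the label of the nearest core point within eps, else NOISE."""
    best_label, best_dist = NOISE, eps
    for core, label in zip(core_points, core_labels):
        dist = math.dist(point, core)  # Euclidean distance
        if dist <= best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical trained model: core points of two clusters.
core_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
core_labels = [0, 0, 1]

print(predict_cluster((0.2, 0.5), core_points, core_labels, eps=1.0))   # near cluster 0
print(predict_cluster((9.5, 10.0), core_points, core_labels, eps=1.0))  # near cluster 1
print(predict_cluster((5.0, 5.0), core_points, core_labels, eps=1.0))   # no core point in range
```

A point that has no core point within eps is reported as noise in this sketch, which mirrors how DBSCAN treats low-density points during training.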

Computing resources

The DBSCAN Prediction component supports the following computing resources:

  • MaxCompute

  • Flink

  • DLC

Configure the component in the PAI console

You can configure the parameters of the DBSCAN Prediction component in the Platform for AI (PAI) console.

| Tab | Parameter | Description |
| --- | --- | --- |
| Field Setting | reservedCols | The names of the original columns to retain in the output. |
| Parameter Setting | predictionCol | The name of the prediction column. |
| | predictionDetailCol | The name of the prediction details column. |
| | numThreads | The number of threads used for DBSCAN clustering. |
| Execution Tuning | Choose Running Mode: MaxCompute or Flink | Use MaxCompute or Flink computing resources. For more information about how to configure the number of workers and the memory of each worker, see Appendix: How to estimate resource usage. |
| | Choose Running Mode: DLC | Use DLC computing resources. Configure the resources based on the instructions on the page. |
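
The effect of the field and prediction column parameters can be illustrated with a small sketch. It is not the component's code. It assumes that the output row consists of the retained input columns plus the prediction column and, if a name is given, the prediction details column, and that all input columns are kept when no retained columns are listed. The function build_output_row, the column names, and the detail string are hypothetical.

```python
def build_output_row(input_row, cluster_id, detail, reserved_cols=None,
                     prediction_col="pred", prediction_detail_col=None):
    """Combine retained input columns with the prediction columns."""
    # Assumption: when reserved_cols is None, every input column is kept.
    keep = list(input_row) if reserved_cols is None else reserved_cols
    row = {name: input_row[name] for name in keep}
    row[prediction_col] = cluster_id
    # The details column is written only when a name is configured.
    if prediction_detail_col is not None:
        row[prediction_detail_col] = detail
    return row


# Hypothetical input row and prediction result.
sample = {"id": 1, "x": 0.2, "y": 0.5}
print(build_output_row(sample, 0, "nearest core point", reserved_cols=["id"],
                       prediction_col="cluster_id",
                       prediction_detail_col="cluster_detail"))
```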

Appendix: How to estimate resource usage

You can refer to the following section to estimate resource usage.

  • How do I estimate the memory to be used by each node?

    The memory used by each node is approximately the model size times 30.

    For example, if the input model size is 1 GB, the memory of each node can be set to 30 GB.

  • How do I estimate the number of nodes that I need?

    As the number of nodes increases, a distributed task first speeds up and then slows down because of communication overhead. When the task starts to slow down, stop adding nodes and use the node count that gave the shortest run time.
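
The two rules of thumb above can be sketched as follows. The factor of 30 comes from this appendix. The function names and the measured run times are hypothetical, and the node selection assumes that you have already timed the task at several node counts.

```python
MEMORY_FACTOR = 30  # rule of thumb from this appendix: memory = model size x 30


def estimate_node_memory_gb(model_size_gb):
    """Estimate the memory of each node from the input model size."""
    return model_size_gb * MEMORY_FACTOR


def pick_node_count(runtimes_by_nodes):
    """Return the last node count before the measured run time stops improving."""
    best_nodes, best_time = None, float("inf")
    for nodes in sorted(runtimes_by_nodes):
        runtime = runtimes_by_nodes[nodes]
        if runtime >= best_time:
            break  # adding nodes no longer helps, so stop here
        best_nodes, best_time = nodes, runtime
    return best_nodes


# Hypothetical run times in seconds, keyed by node count.
measured = {1: 120.0, 2: 70.0, 4: 45.0, 8: 52.0, 16: 80.0}

print(estimate_node_memory_gb(1))  # 30
print(pick_node_count(measured))   # 4
```

With these sample timings the run time rises again at 8 nodes, so the sketch settles on 4 nodes.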