Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm. A cluster is defined as a maximal set of density-connected points. The algorithm can identify clusters of arbitrary shape in spatial datasets that contain noise. You can use the DBSCAN component to create clustering models. This topic describes how to configure the DBSCAN component.
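For intuition, the following is a minimal local sketch of density-based clustering on arbitrarily shaped data, using scikit-learn's DBSCAN as a stand-in for the PAI component; the dataset and parameter values are illustrative assumptions, not values from this topic.

```python
# A quick local illustration (scikit-learn, not the PAI component) that DBSCAN
# recovers arbitrarily shaped clusters and labels outliers as noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("clusters:", len(set(labels) - {-1}))   # expect the 2 half-moon clusters
print("noise points:", (labels == -1).sum())  # scikit-learn marks noise as -1
```

Because membership is decided by density connectivity rather than distance to a centroid, DBSCAN can separate the two interleaved half-moons, which centroid-based methods such as k-means cannot.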
Limits
The DBSCAN component can be used only in Machine Learning Designer of Machine Learning Platform for AI (PAI).
The supported computing engines are MaxCompute and Apache Flink.
Configure the component in the PAI console
You can configure parameters for the DBSCAN component in the PAI console.
| Tab | Parameter | Description |
| --- | --- | --- |
| Field Setting | idCol | The name of the ID column. |
|  | vectorCol | The name of the vector column. |
| Parameter Setting | epsilon | The maximum distance between two data points for them to be considered neighbors. For more information, see the "Appendix 2: How to configure parameters" section of this topic. |
|  | minPoints | The minimum number of data points within the epsilon neighborhood of a point that is required for the point to be considered a core point. For more information, see the "Appendix 2: How to configure parameters" section of this topic. |
|  | predictionCol | The name of the prediction result column. |
|  | distanceType | The distance measure used for clustering. Default value: EUCLIDEAN. |
| Execution Tuning | Number of Workers | The number of workers. This parameter must be used together with the Memory per worker, unit MB parameter. The value must be a positive integer. Valid values: [1, 9999]. For more information, see the "Appendix: How to estimate resource usage" section of this topic. |
|  | Memory per worker, unit MB | The memory size of each worker. Unit: MB. Valid values: 1024 to 65536 (64 × 1024). For more information, see the "Appendix: How to estimate resource usage" section of this topic. |
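The following is a minimal local sketch of what the field and parameter settings correspond to, using scikit-learn's DBSCAN in place of the PAI component; the column names, data, and values are illustrative assumptions, with scikit-learn's eps and min_samples playing the roles of epsilon and minPoints.

```python
# Illustrative mapping of the component parameters onto scikit-learn's DBSCAN;
# column names and values are assumptions, not the PAI component's API.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],                          # idCol
    "vec": ["0.10 0.20", "0.15 0.22", "0.90 0.80",  # vectorCol (space-separated)
            "0.88 0.79", "5.00 5.00"],
})

# Parse the vector column into a numeric matrix.
X = np.array([list(map(float, v.split())) for v in df["vec"]])

model = DBSCAN(
    eps=0.2,              # epsilon
    min_samples=2,        # minPoints
    metric="euclidean",   # distanceType=EUCLIDEAN
)
df["pred"] = model.fit_predict(X)  # predictionCol
print(df)
```

Note that scikit-learn marks noise points with -1, whereas the PAI component uses the 2147483648 marker described later in this topic.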
Appendix: How to estimate resource usage
Refer to the following guidelines to estimate resource usage.
How do I estimate the appropriate memory size for each worker?
You can calculate the appropriate memory size for each worker by using the following formula: Memory per worker = Input data size × 15.
For example, if the input data size is 1 GB, the memory of each worker can be set to 15 GB.
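A minimal sketch of this sizing rule; the factor of 15 and the 1024 to 65536 MB bounds come from this topic, and the helper function name is illustrative.

```python
# Per-worker memory estimate: Input data size x 15, clamped to the valid
# range of the "Memory per worker, unit MB" parameter (1024 to 64 x 1024).
def estimate_worker_memory_mb(input_size_mb: float) -> int:
    estimate = input_size_mb * 15
    return int(min(max(estimate, 1024), 64 * 1024))

print(estimate_worker_memory_mb(1024))  # 1 GB input -> 15360 MB (15 GB)
```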
How do I estimate the appropriate worker quantity?
A larger number of workers incurs higher overhead for cross-worker communication. As you increase the number of workers, the distributed training task speeds up at first but slows down again beyond a certain number of workers. Tune this parameter to find the optimal number.
How do I estimate the maximum amount of data that can be supported by the algorithm?
We recommend that you input no more than 1 million data records, each with no more than 200 dimensions.
Note: If you want to perform clustering on a larger data volume, we recommend that you divide the data into groups and run the DBSCAN algorithm on each group.
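A minimal local sketch of this group-and-cluster approach, using scikit-learn's DBSCAN as a stand-in for the PAI component; the group column, data, and parameter values are illustrative assumptions.

```python
# Illustrative only: scikit-learn's DBSCAN stands in for the PAI DBSCAN
# component, and the "group" column and parameter values are assumptions.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.integers(0, 3, size=3000),  # hypothetical pre-grouping key
    "x": rng.normal(size=3000),
    "y": rng.normal(size=3000),
})

# Run DBSCAN independently on each group so that no single run exceeds the
# recommended data volume.
for key, part in df.groupby("group"):
    labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(part[["x", "y"]])
    print(key, len(set(labels) - {-1}), "clusters")  # -1 marks noise here
```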
Why is the cluster ID of a data point 2147483648?
A cluster ID of 2147483648 indicates that the data point is an outlier (noise point) that does not belong to any cluster.
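The following is a minimal sketch of separating noise points from clustered points in the prediction results, assuming the output has been loaded into a pandas DataFrame; the column names are hypothetical.

```python
# Hypothetical column names; 2147483648 is the noise marker described above.
import pandas as pd

NOISE_ID = 2147483648

result = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "pred": [0, 1, NOISE_ID, 0],  # hypothetical DBSCAN prediction column
})

clustered = result[result["pred"] != NOISE_ID]  # points assigned to a cluster
noise = result[result["pred"] == NOISE_ID]      # outliers
print(len(clustered), "clustered,", len(noise), "noise")
```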
Appendix 2: How to configure parameters
The DBSCAN component has two frequently used parameters: minPoints and epsilon.
If the number of observed clusters is too large and you want to reduce the number, we recommend that you prioritize increasing the minPoints value over decreasing the epsilon value.
If the number of observed clusters is too small and you want to increase the number, we recommend that you prioritize decreasing the minPoints value over increasing the epsilon value.
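To see how these two parameters interact, the following sketch sweeps minPoints at a fixed epsilon, using scikit-learn's DBSCAN as a local stand-in for the PAI component; the dataset and parameter values are illustrative assumptions.

```python
# Illustrative sweep using scikit-learn's DBSCAN as a stand-in for the
# PAI component; data and parameter values are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=0)

# Higher minPoints (min_samples) prunes sparse clusters into noise, so the
# cluster count tends to drop as minPoints grows at a fixed epsilon (eps).
for min_pts in (3, 10, 30):
    labels = DBSCAN(eps=0.5, min_samples=min_pts).fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # -1 marks noise in scikit-learn
    print(f"minPoints={min_pts}: {n_clusters} clusters")
```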