Annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.
def dbscan(
      dataframe: DataFrame,
      epsilon: Double,
      minPts: Int,
      geometry: String = null,
      includeOutliers: Boolean = true,
      useSpheroid: Boolean = false,
      isCoreColumnName: String = "isCore",
      clusterColumnName: String = "cluster"): DataFrame =

Parameters

dataframe
DataFrame
dataframe to cluster. Must contain at least one GeometryType column
epsilon
Double
distance parameter of the DBSCAN algorithm: the maximum distance at which two points are considered neighbors
minPts
Int
minimum number of points parameter of the DBSCAN algorithm: the number of points required within epsilon of a point for it to be a core point
geometry
String
name of the geometry column. The dataframe should contain at least one GeometryType column. Rows must be unique. If exactly one geometry column is present, it will be used automatically. If more than one is present, the one named 'geometry' will be used. If more than one is present and none is named 'geometry', the column name must be provided. Default is null. The new cluster column will be named 'cluster' unless clusterColumnName is set.
includeOutliers
Boolean
whether to include outliers in the output. Default is true, per the signature above
useSpheroid
Boolean
whether to use a spheroidal (true) or cartesian (false) distance calculation. Default is false
isCoreColumnName
String
name of the output column indicating whether the row is a core point. Default is “isCore”
clusterColumnName
String
name of the output column holding the cluster id. Default is “cluster”
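
To make the roles of epsilon and minPts concrete, here is a minimal brute-force sketch of the core-point test in plain Scala. It is not Sedona's implementation: the `Point` class, the `CorePointSketch` object, and the choice to count a point as its own neighbor are assumptions made only for illustration, and Sedona's exact counting convention is not stated in this section.

```scala
// Illustrative only: a brute-force core-point check, not Sedona's implementation.
object CorePointSketch {
  // Hypothetical 2D point type used only by this sketch.
  final case class Point(x: Double, y: Double)

  // Cartesian distance between two points.
  def distance(a: Point, b: Point): Double =
    math.hypot(a.x - b.x, a.y - b.y)

  // A point is treated as core when at least minPts points lie within
  // epsilon of it. Assumption of this sketch: the point counts itself.
  def isCore(p: Point, all: Seq[Point], epsilon: Double, minPts: Int): Boolean =
    all.count(q => distance(p, q) <= epsilon) >= minPts

  def main(args: Array[String]): Unit = {
    val points = Seq(Point(0.0, 0.0), Point(0.1, 0.0), Point(0.0, 0.1), Point(10.0, 10.0))
    points.foreach { p =>
      println(s"$p core=${isCore(p, points, epsilon = 0.5, minPts = 3)}")
    }
  }
}
```

Under these assumptions the three points near the origin pass the test, while the isolated point at (10, 10) does not and would be reported as an outlier.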

Returns

The input DataFrame with the cluster label added to each row. Outliers, if included, have a cluster value of -1.
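
Because outliers are marked with a cluster value of -1, the returned DataFrame can be post-processed with ordinary Spark operations. The sketch below drops outliers and counts rows per cluster. The `ClusterSummary` object and `clusterSizes` helper are hypothetical names, not part of Sedona; only standard Spark DataFrame calls are used, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object ClusterSummary {
  // Hypothetical helper: removes outliers (cluster == -1) and returns
  // one row per cluster with its size, largest cluster first.
  def clusterSizes(result: DataFrame, clusterColumnName: String = "cluster"): DataFrame =
    result
      .filter(col(clusterColumnName) =!= -1)
      .groupBy(col(clusterColumnName))
      .count()
      .orderBy(col("count").desc)
}
```

Pass a different `clusterColumnName` if the clustering call overrode the default column name.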

Usage Example

import org.apache.sedona.stats.clustering.DBSCAN

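// Example usage: `dataframe` is a DataFrame with a GeometryType column;
// `epsilon` (Double) and `minPts` (Int) are the DBSCAN parameters described above.
val result = DBSCAN.dbscan(dataframe, epsilon, minPts)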
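
The optional parameters can be overridden with Scala named arguments. The following sketch relies only on the signature shown above. The wrapper name, the column names "location" and "cluster_id", and the numeric values are made up for illustration and should be replaced with values suited to your data.

```scala
import org.apache.sedona.stats.clustering.DBSCAN
import org.apache.spark.sql.DataFrame

object DbscanNamedArgs {
  // Hypothetical wrapper; "location" and "cluster_id" are made-up column names.
  def clusterWithoutOutliers(dataframe: DataFrame): DataFrame =
    DBSCAN.dbscan(
      dataframe,
      epsilon = 0.5,                    // example value
      minPts = 5,                       // example value
      geometry = "location",            // explicit geometry column name
      includeOutliers = false,          // omit outliers from the output
      clusterColumnName = "cluster_id") // rename the cluster id column
}
```

Parameters that are not named, here `useSpheroid` and `isCoreColumnName`, keep the defaults from the signature.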