
dbscan

Annotates a DataFrame with a cluster label for each record using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.
dbscan(
  dataframe: DataFrame,
  epsilon: float,
  min_pts: int,
  geometry: Optional[str],
  include_outliers: bool,
  use_sphere: bool,
  is_core_column_name: str,
  cluster_column_name: str
)

Parameters

dataframe
DataFrame
required
Spark dataframe containing the geometries. The input DataFrame must contain at least one GeometryType column. All rows in the DataFrame must be unique.
epsilon
float
required
Distance parameter of the DBSCAN algorithm: the maximum distance between two records for them to be considered neighbors.
min_pts
int
required
Minimum-points parameter of the DBSCAN algorithm: the minimum number of neighbors within epsilon for a record to be considered a core point.
geometry
Optional[str]
Name of the geometry column. If not provided, the algorithm automatically selects the geometry column to use based on the following rules:
  • If only one GeometryType column exists, it is used.
  • If multiple GeometryType columns exist and one is named geometry, it is used.
  • If multiple GeometryType columns exist and none are named geometry, the column name must be explicitly provided as a parameter.
include_outliers
bool
required
whether to return outlier points. If True, outliers are returned with a cluster value of -1.
use_sphere
bool
required
Whether to use a spherical rather than a Cartesian distance calculation. Default is False.
is_core_column_name
str
required
Name of the output column that indicates whether a record is a core point. Default is “isCore”.
cluster_column_name
str
required
Name of the output column that holds the cluster id. Default is “cluster”.
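The selection rules listed under the geometry parameter can be sketched in plain Python. This is an illustration only, not the library's code: the helper name pick_geometry_column, the dictionary of column types, and the ValueError raised when no column can be inferred are all assumptions made for the example.

```python
from typing import Mapping, Optional


def pick_geometry_column(
    column_types: Mapping[str, str], geometry: Optional[str] = None
) -> str:
    """Hypothetical helper mirroring the documented geometry-selection rules.

    column_types maps column name -> type name; "GeometryType" marks
    a geometry column.
    """
    if geometry is not None:
        return geometry  # an explicitly provided name is used as-is

    candidates = [
        name for name, type_name in column_types.items()
        if type_name == "GeometryType"
    ]
    if len(candidates) == 1:
        return candidates[0]  # rule 1: the only geometry column
    if "geometry" in candidates:
        return "geometry"  # rule 2: several, one named "geometry"

    # Rule 3: the name must be passed explicitly.
    # Assumption: the real error type and message are not documented here.
    raise ValueError(
        "could not infer the geometry column; pass geometry explicitly"
    )
```

Passing geometry explicitly is the safest choice when a DataFrame carries more than one geometry column, since it does not depend on how the columns happen to be named.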

Returns

A PySpark DataFrame containing the cluster label for each row.
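To make the meaning of the cluster and core-point columns concrete, here is a minimal pure-Python sketch of DBSCAN labelling on 2D points. It is not the distributed implementation behind dbscan; the function name dbscan_labels is hypothetical, and it assumes the classic convention that a point counts toward its own neighborhood when compared against min_pts.

```python
from math import dist
from typing import List, Sequence, Tuple


def dbscan_labels(
    points: Sequence[Tuple[float, float]], epsilon: float, min_pts: int
) -> List[Tuple[int, bool]]:
    """Return (cluster, is_core) for each point; cluster is -1 for outliers."""
    n = len(points)
    # Neighborhood of i: every point within epsilon, including i itself
    # (assumption: the point counts toward its own min_pts total).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    cluster = [-1] * n
    next_id = 0
    for i in range(n):
        if not is_core[i] or cluster[i] != -1:
            continue  # clusters are only seeded from unassigned core points
        cluster[i] = next_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if cluster[q] == -1:
                    cluster[q] = next_id
                    if is_core[q]:
                        stack.append(q)  # only core points expand a cluster
        next_id += 1
    return list(zip(cluster, is_core))


sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
rows = dbscan_labels(sample, epsilon=1.5, min_pts=3)
# The isolated point (50, 50) is not a core point and keeps the label -1.
```

In this sketch the two tight groups receive cluster ids 0 and 1, while the isolated point corresponds to a row that dbscan would return with a cluster value of -1 when include_outliers is True.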

Usage Examples

from dbscan import *

# Example usage of dbscan
result = dbscan(
    dataframe=EXAMPLE_NAME,
    epsilon=EXAMPLE_FLOAT_VALUE,
    min_pts=EXAMPLE_INT_VALUE
)

get_knee_locator

Creates a KneeLocator for selecting an epsilon value to pass to a DBSCAN execution. This function implements a common heuristic for epsilon selection by finding the “knee” of the k-distance plot. It operates as follows:
  1. Calculates the distance to the k-th nearest neighbor for a random sample of records (up to max_sample_size).
  2. The value k is set to the min_points value.
  3. These distances are sorted, and their values (y-axis) are plotted against their sorted index (x-axis).
  4. The resulting plot is fed into a kneed.KneeLocator object to find the point of maximum curvature (the “knee”).
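The knee_y attribute of the returned KneeLocator object (the distance value at the knee) is the suggested epsilon value. The knee attribute holds the corresponding position on the x-axis, that is, the sorted index.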
This method is a heuristic and is not foolproof. The calculated knee may not be optimal for all datasets. It is strongly recommended to visualize the knee plot (e.g., using the KneeLocator’s built-in plotting methods) to manually sanity-check the selected epsilon value before proceeding.
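The k-distance heuristic described above can be sketched in plain Python. This is a simplified stand-in, not what get_knee_locator does internally: the names k_distances and knee_by_chord are hypothetical, the knee is found with a simple "farthest below the chord" rule rather than kneed.KneeLocator, and the sketch assumes a point is not counted as its own neighbor.

```python
from math import dist
from typing import List, Sequence, Tuple


def k_distances(points: Sequence[Tuple[float, float]], k: int) -> List[float]:
    """Sorted distance from each point to its k-th nearest neighbor."""
    out = []
    for i, p in enumerate(points):
        # Assumption: a point is excluded from its own neighbor list.
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)


def knee_by_chord(ys: Sequence[float]) -> int:
    """Index where the sorted curve lies farthest below the straight line
    joining its first and last points (a stand-in for KneeLocator)."""
    n = len(ys)
    if n < 3:
        return 0
    best_i, best_gap = 0, float("-inf")
    for i, y in enumerate(ys):
        chord_y = ys[0] + (ys[-1] - ys[0]) * i / (n - 1)
        gap = chord_y - y
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


sample = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # dense group A
    (10, 10), (10, 11), (11, 10), (11, 11),  # dense group B
    (30, 30), (60, 0),                       # isolated points
]
ys = k_distances(sample, k=2)   # y-axis values of the k-distance plot
knee_index = knee_by_chord(ys)  # x-axis position of the knee
suggested_epsilon = ys[knee_index]
```

Here the dense points all have a k-distance of 1.0 and the two isolated points have much larger ones, so the knee falls at the last dense point and the suggested epsilon is 1.0. Real data rarely produces such a clean curve, which is why plotting it before trusting the value is recommended.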

get_knee_locator(
  dataframe: DataFrame,
  min_points: int,
  geometry: Optional[str],
  approximate_knn: bool,
  use_sphere: bool,
  max_sample_size: Optional[int]
) -> KneeLocator

Parameters

dataframe
DataFrame
required
Apache Sedona DataFrame containing the geometries. This should be the same DataFrame you intend to pass to the dbscan function.
min_points
int
required
The min_pts value you intend to pass to the dbscan function. This value affects the suggested epsilon, because it is used as k when computing k-th nearest neighbor distances.
geometry
Optional[str]
Name of the geometry column.
approximate_knn
bool
required
Whether to use approximate KNN. When False, an exact KNN join is used. Default is False.
use_sphere
bool
required
whether to use a cartesian or sphere distance calculation. False will use Cartesian. Default
max_sample_size
Optional[int]
the maximum number of records from dataframe to use when calculating the knee. If the

Returns

KneeLocator
A KneeLocator object derived from the input DataFrame, downsampled to approximately max_sample_size records. Retrieve the recommended epsilon value with the return value’s knee_y property.