DBSCAN Python Module

dbscan

Annotates a DataFrame with a cluster label for each record using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

dbscan(
  dataframe: DataFrame,
  epsilon: float,
  min_pts: int,
  geometry: Optional[str],
  include_outliers: bool,
  use_sphere: bool,
  is_core_column_name: str,
  cluster_column_name: str
)

Parameters

dataframe

DataFrame

required

Spark dataframe containing the geometries. The input DataFrame must contain at least one GeometryType column. All rows in the DataFrame must be unique.

epsilon

float

required

Distance parameter of the DBSCAN algorithm: the maximum distance between two records for them to be considered neighbors.

min_pts

int

required

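Minimum number of points parameter of the DBSCAN algorithm: the number of points that must lie within epsilon of a record for it to be considered a core point.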

geometry

Optional[str]

Name of the geometry column. If not provided, the algorithm automatically selects the geometry column to use based on the following rules:

If only one GeometryType column exists, it is used.
If multiple GeometryType columns exist and one is named geometry, it is used.
If multiple GeometryType columns exist and none are named geometry, the column name must be explicitly provided as a parameter.
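The selection rules above can be sketched in plain Python. This is only an illustration of the documented behavior, not the module's implementation: resolve_geometry_column is a hypothetical helper, and geometry_columns is assumed to be a list of the names of all GeometryType columns in the input DataFrame.

```python
from typing import Optional, Sequence


def resolve_geometry_column(
    geometry_columns: Sequence[str], geometry: Optional[str] = None
) -> str:
    """Pick the geometry column following the documented selection rules.

    `geometry_columns` is a hypothetical, pre-computed list of the names
    of all GeometryType columns in the input DataFrame.
    """
    if geometry is not None:
        # An explicitly named column is used, provided it exists.
        if geometry not in geometry_columns:
            raise ValueError(f"{geometry!r} is not a geometry column")
        return geometry
    if not geometry_columns:
        raise ValueError("the DataFrame has no GeometryType column")
    if len(geometry_columns) == 1:
        # Rule 1: a single geometry column is used as-is.
        return geometry_columns[0]
    if "geometry" in geometry_columns:
        # Rule 2: among several, the one named "geometry" is preferred.
        return "geometry"
    # Rule 3: ambiguous, so the caller has to name the column.
    raise ValueError("multiple geometry columns found; pass `geometry` explicitly")


print(resolve_geometry_column(["location"]))
print(resolve_geometry_column(["origin", "geometry"]))
print(resolve_geometry_column(["origin", "destination"], geometry="destination"))
```

The third rule is modeled here as a raised ValueError; how the module itself reports the ambiguity is not stated in this reference.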

include_outliers

bool

required

whether to return outlier points. If True, outliers are returned with a cluster value of -1.

use_sphere

bool

required

Whether to use a spherical distance calculation instead of a Cartesian one. False uses Cartesian distance. Default is False.
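The choice of distance calculation matters because epsilon must be expressed in the same units the distance calculation returns. The sketch below compares a planar distance on raw longitude/latitude values with a great-circle (haversine) distance. It assumes the spherical result is in meters and uses a mean Earth radius constant; the formula and units the module actually uses are not specified here, so treat both as assumptions.

```python
from math import asin, cos, dist, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumed constant)


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two lon/lat points."""
    dlon = radians(lon2 - lon1)
    dlat = radians(lat2 - lat1)
    a = (
        sin(dlat / 2) ** 2
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


a = (0.0, 60.0)  # (longitude, latitude) in degrees
b = (1.0, 60.0)

planar = dist(a, b)              # Cartesian distance, in degrees
spherical = haversine_m(*a, *b)  # great-circle distance, in meters
print(planar, round(spherical))
```

The same pair of points is one unit apart in the planar calculation and tens of kilometers apart in the spherical one, so an epsilon tuned for one setting is not meaningful for the other.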

is_core_column_name

str

required

Name of the output column that indicates whether a record is a core point. Default is “isCore”.

cluster_column_name

str

required

Name of the output column that holds the cluster id. Default is “cluster”.

Returns

A PySpark DataFrame containing the cluster label for each row.
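To make the meaning of the cluster label and the core-point flag concrete, here is a minimal pure-Python sketch of DBSCAN on a handful of 2D points. It is not the distributed implementation behind dbscan. It assumes a point counts toward its own neighborhood and numbers clusters from 0, neither of which is confirmed by this reference; dbscan_labels is a hypothetical name.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dbscan_labels(
    points: Sequence[Point], epsilon: float, min_pts: int
) -> List[Tuple[int, bool]]:
    """Return (cluster, is_core) per point; cluster -1 marks an outlier."""
    n = len(points)
    # Neighborhoods include the point itself (an assumption of this sketch).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    cluster = [-1] * n
    next_id = 0
    for seed in range(n):
        if not is_core[seed] or cluster[seed] != -1:
            continue
        # Grow a new cluster outward from an unassigned core point.
        cluster[seed] = next_id
        stack = [seed]
        while stack:
            current = stack.pop()
            for j in neighbors[current]:
                if cluster[j] == -1:
                    cluster[j] = next_id
                    # Only core points keep expanding; border points stop here.
                    if is_core[j]:
                        stack.append(j)
        next_id += 1
    return list(zip(cluster, is_core))


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
labels = dbscan_labels(points, epsilon=1.5, min_pts=3)
for point, (label, core) in zip(points, labels):
    print(point, label, core)
```

The three nearby points form one cluster of core points, while the distant point is labeled -1, which mirrors how outliers appear when include_outliers is True.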

Usage Examples

from dbscan import *

# Example usage of dbscan
result = dbscan(
    dataframe=EXAMPLE_NAME,
    epsilon=EXAMPLE_FLOAT_VALUE,
    min_pts=EXAMPLE_INT_VALUE
)

get_knee_locator

Create a KneeLocator to select an epsilon value for passing into a DBSCAN execution. This function implements a common heuristic for epsilon selection by finding the “knee” of the k-distance plot. It operates as follows:

Calculates the distance to the k-th nearest neighbor for a random sample of records (up to max_sample_size).
The value k is set to the min_points value.
These distances are sorted, and their values (y-axis) are plotted against their sorted index (x-axis).
The resulting plot is fed into a kneed.KneeLocator object to find the point of maximum curvature (the “knee”).

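The knee_y attribute of the returned KneeLocator object is the suggested epsilon value: it is the k-distance (y-axis) at the knee, whereas the knee attribute holds the corresponding sorted index (x-axis).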

This method is a heuristic and is not foolproof. The calculated knee may not be optimal for all datasets. It is strongly recommended to visualize the knee plot (e.g., using the KneeLocator’s built-in plotting methods) to manually sanity-check the selected epsilon value before proceeding.
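The heuristic can be illustrated without Spark or kneed. The sketch below computes sorted k-distances for a small in-memory point set and picks the knee as the point farthest below the straight line joining the first and last values. That chord rule is a simplified stand-in for kneed.KneeLocator, not its actual algorithm, and k_distances and knee_index are hypothetical helper names.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def k_distances(points: Sequence[Point], k: int) -> List[float]:
    """Sorted distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


def knee_index(values: Sequence[float]) -> int:
    """Index of the value farthest below the chord joining the endpoints.

    A simple stand-in for a knee detector on a convex, increasing curve;
    it needs at least two values.
    """
    n = len(values)
    first, last = values[0], values[-1]
    gaps = [first + (last - first) * i / (n - 1) - values[i] for i in range(n)]
    return max(range(n), key=gaps.__getitem__)


# A dense 3x3 grid plus two isolated points.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
points = grid + [(10.0, 10.0), (20.0, 0.0)]

distances = k_distances(points, k=3)  # k plays the role of min_points
suggested_epsilon = distances[knee_index(distances)]
print(suggested_epsilon)
```

Here the suggested value is the largest k-distance found inside the dense grid, just before the curve jumps to the isolated points, which is the behavior the knee heuristic aims for.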

Parameters

get_knee_locator(
  dataframe: DataFrame,
  min_points: int,
  geometry: Optional[str],
  approximate_knn: bool,
  use_sphere: bool,
  max_sample_size: Optional[int]
) -> KneeLocator

dataframe

DataFrame

required

Apache Sedona DataFrame containing the geometries. This should be the same DataFrame you intend to pass to the dbscan function.

min_points

int

required

The min_pts value you intend to pass to the dbscan function. It is used as k in the k-distance calculation, so it affects the suggested epsilon value.

geometry

Optional[str]

Name of the geometry column.

approximate_knn

bool

required

Whether to use approximate KNN. When False, an exact KNN join is used. Default is False.

use_sphere

bool

required

whether to use a cartesian or sphere distance calculation. False will use Cartesian. Default

max_sample_size

Optional[int]

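The maximum number of records from dataframe to use when calculating the knee. If the DataFrame contains more records than this, a random sample of approximately max_sample_size records is used.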

Returns

KneeLocator

A KneeLocator object derived from the input DataFrame, downsampled to approximately max_sample_size records. Retrieve the recommended epsilon value with the return value’s knee_y property.
