dbscan

DBSCAN is a popular clustering algorithm for spatial data.

It identifies groups of records in which enough records lie close enough to each other. This implementation uses Spark, Sedona, and GraphFrames to support large-scale datasets and heterogeneous geometric feature types.

dbscan(dataframe, epsilon, min_pts, geometry=None, include_outliers=False, use_spheroid=False)

Annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.

The dataframe should contain at least one GeometryType column. Rows must be unique. If exactly one geometry column is present, it is used automatically. If more than one is present, the one named 'geometry' is used. If more than one is present and none is named 'geometry', the column name must be provided.
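The column-selection rule above can be sketched in plain Python. The helper name `resolve_geometry_column`, the precedence given to an explicit `geometry` argument, and the error types are assumptions made for illustration; they are not taken from the library.

```python
from typing import Optional, Sequence


def resolve_geometry_column(
    geometry_columns: Sequence[str], geometry: Optional[str] = None
) -> str:
    """Pick a geometry column name following the rule described above.

    `geometry_columns` lists the names of the GeometryType columns in the
    dataframe. Hypothetical helper: it illustrates the rule only.
    """
    # Assumption: an explicitly provided name always takes precedence.
    if geometry is not None:
        if geometry not in geometry_columns:
            raise ValueError(f"column {geometry!r} is not a geometry column")
        return geometry
    if not geometry_columns:
        raise ValueError("the dataframe has no geometry column")
    # A single geometry column is used automatically.
    if len(geometry_columns) == 1:
        return geometry_columns[0]
    # With several candidates, the one named 'geometry' wins.
    if "geometry" in geometry_columns:
        return "geometry"
    raise ValueError("several geometry columns found; pass `geometry` explicitly")
```

For example, `resolve_geometry_column(["start", "geometry"])` selects `"geometry"`, while `resolve_geometry_column(["start", "end"])` raises because the choice is ambiguous.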

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataframe | DataFrame | Apache Sedona dataframe containing the geometries | required |
| epsilon | float | maximum distance between two records for them to count as neighbors (the DBSCAN epsilon parameter) | required |
| min_pts | int | minimum number of records that must lie within epsilon of a record for it to be a core point (the DBSCAN minimum points parameter) | required |
| geometry | Optional[str] | name of the geometry column | None |
| include_outliers | bool | whether to return outlier points. If True, outliers are returned with a cluster value of -1 | False |
| use_spheroid | bool | whether to use a spheroidal rather than a Cartesian distance calculation. False uses Cartesian | False |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A PySpark DataFrame containing the cluster label for each row |
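To make the roles of epsilon, min_pts and include_outliers concrete, here is a minimal single-machine sketch of the DBSCAN idea on 2D points. It is not the Spark/Sedona/GraphFrames implementation documented here. The function name `dbscan_sketch` is hypothetical, and counting a point as its own neighbor is an assumption of this sketch.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def dbscan_sketch(
    points: Sequence[Point],
    epsilon: float,
    min_pts: int,
    include_outliers: bool = False,
) -> Dict[Point, int]:
    """Map each (unique) point to a cluster label; outliers get -1 if kept."""
    # Indices of all points within epsilon of each point.
    # Assumption: a point counts as its own neighbor.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= epsilon]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]
    labels = [-1] * len(points)
    cluster = 0
    for i in range(len(points)):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from an unlabeled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            current = stack.pop()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster
                    # Border points join the cluster but do not expand it.
                    if is_core[j]:
                        stack.append(j)
        cluster += 1
    # Drop outliers unless the caller asked for them.
    return {
        p: label
        for p, label in zip(points, labels)
        if include_outliers or label != -1
    }


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan_sketch(points, epsilon=1.5, min_pts=2, include_outliers=True))
```

In this sample the first three points form one cluster, the next two form another, and `(50, 50)` is an outlier that appears with label -1 only because include_outliers is True.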

get_knee_locator(dataframe, min_points, curve='convex', geometry=None, approximate_knn=False, use_spheroid=False, max_sample_size=DEFAULT_MAX_SAMPLE_SIZE)

Create a KneeLocator for the purposes of selecting an epsilon value for passing into a DBSCAN execution.

Finding the knee of the plot of (index, distance to kth nearest neighbor), where k = min_points, is a common heuristic for selecting the epsilon parameter for DBSCAN. This function calculates the kth nearest neighbor distance for a random sample of up to max_sample_size records and feeds the distances into a KneeLocator object provided by the kneed library. While often a good start, this method is not foolproof. It is recommended to visualize the knee plot as a sanity check before moving forward with the provided epsilon value.

See https://kneed.readthedocs.io/en/stable/parameters.html

See https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd
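The heuristic described above can be sketched on a small in-memory point set. This is an illustration under stated assumptions, not the library's code: both function names are hypothetical, the kth neighbor is taken to exclude the point itself, and the knee is found with a simple "largest gap below the chord" rule standing in for the algorithm that kneed implements.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kth_neighbor_distances(points: Sequence[Point], k: int) -> List[float]:
    """Sorted distances from each point to its kth nearest other point."""
    distances = []
    for i, p in enumerate(points):
        # Assumption: the point itself is not counted as a neighbor.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(others[k - 1])
    return sorted(distances)


def knee_by_chord(ys: Sequence[float]) -> Tuple[int, float]:
    """Return (index, value) where an increasing convex curve dips farthest
    below the straight line joining its first and last points.

    Simplified stand-in for a knee detector, not the kneed algorithm.
    """
    if len(ys) < 2:
        raise ValueError("need at least two values")
    last = len(ys) - 1
    best_index, best_gap = 0, -math.inf
    for i, y in enumerate(ys):
        chord_y = ys[0] + (ys[-1] - ys[0]) * i / last
        gap = chord_y - y
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index, ys[best_index]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (40, 0)]
distances = kth_neighbor_distances(points, k=1)
index, epsilon_guess = knee_by_chord(distances)
print(distances, index, epsilon_guess)
```

Plotting `distances` against their index gives the knee plot that the text recommends inspecting before trusting the suggested value.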

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataframe | DataFrame | Apache Sedona dataframe containing the geometries. This should be the same dataframe you intend to pass to the dbscan function. | required |
| min_points | int | the min_pts value you intend to pass to the dbscan function. This affects the epsilon value that is calculated. | required |
| curve | str | should be one of "convex" or "concave". If the line has a positive derivative, use "convex". If the line has a negative derivative, use "concave". | 'convex' |
| geometry | Optional[str] | name of the geometry column | None |
| approximate_knn | bool | whether to use approximate KNN. When False, an exact KNN join is used. | False |
| use_spheroid | bool | whether to use a spheroidal rather than a Cartesian distance calculation. False uses Cartesian. | False |
| max_sample_size | Optional[int] | the maximum number of records from dataframe to use when calculating the knee. If the dataframe has more records than this, it is downsampled to approximately this size. Default is 1 million. Records are collected to the driver to visualize and calculate the knee, so this limit is important for stability. | DEFAULT_MAX_SAMPLE_SIZE |

Returns:

| Type | Description |
| --- | --- |
| KneeLocator | A KneeLocator object derived from the input DataFrame, downsampled to approximately max_sample_size records. Retrieve the recommended epsilon value with the return value's knee_y property. |
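The phrase "approximately max_sample_size records" is what one would expect from fraction-based random sampling, where each record is kept independently with a fixed probability. The sketch below illustrates that idea in plain Python. Whether the library samples exactly this way is an assumption, and both function names are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_fraction(record_count: int, max_sample_size: int) -> float:
    """Fraction of records to keep so the sample is about max_sample_size."""
    if record_count <= max_sample_size:
        return 1.0  # Small inputs are used in full.
    return max_sample_size / record_count


def bernoulli_sample(
    records: Sequence[T], max_sample_size: int, seed: int = 0
) -> List[T]:
    """Keep each record independently with the computed probability.

    The result size is only approximately max_sample_size, because each
    record is an independent coin flip.
    """
    fraction = sample_fraction(len(records), max_sample_size)
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


sample = bernoulli_sample(list(range(10_000)), max_sample_size=1_000)
print(len(sample))  # close to 1000, but usually not exactly 1000
```

Capping the sample this way keeps the amount of data collected to the driver bounded regardless of the size of the input dataframe.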