dbscan
Annotates a DataFrame with a cluster label for each record using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.
Parameters
Spark dataframe containing the geometries. The input DataFrame must contain at least one GeometryType column.
All rows in the DataFrame must be unique.
distance (epsilon) parameter of the DBSCAN algorithm: the maximum distance at which two records are considered neighbors
minimum number of points parameter of the DBSCAN algorithm
name of the geometry column. If not provided, the algorithm automatically selects the geometry column to use based on the following rules:
- If only one GeometryType column exists, it is used.
- If multiple GeometryType columns exist and one is named geometry, it is used.
- If multiple GeometryType columns exist and none are named geometry, the column name must be explicitly provided as a parameter.
whether to return outlier points. If True, outliers are returned with a cluster value of -1.
whether to use a Cartesian or sphere distance calculation. False uses Cartesian. Default is False.
name of the output column indicating whether a record is a core point. Default is “isCore”.
name of the output column containing the cluster id. Default is “cluster”.
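The geometry column selection rules described above can be sketched in plain Python. This is an illustration only, not the library's code: `select_geometry_column` and its `column_types` mapping (column name to type name) are hypothetical stand-ins for inspecting a DataFrame schema.

```python
def select_geometry_column(column_types, geometry=None):
    """Pick the geometry column following the documented rules.

    column_types is a hypothetical mapping of column name -> type name,
    standing in for a real DataFrame schema.
    """
    # An explicitly provided column name always wins.
    if geometry is not None:
        return geometry

    candidates = [
        name for name, type_name in column_types.items()
        if type_name == "GeometryType"
    ]
    if not candidates:
        raise ValueError("no GeometryType column found")
    # Rule 1: a single GeometryType column is used automatically.
    if len(candidates) == 1:
        return candidates[0]
    # Rule 2: among several, a column named "geometry" is used.
    if "geometry" in candidates:
        return "geometry"
    # Rule 3: otherwise the caller must name the column.
    raise ValueError("multiple GeometryType columns; pass the column name explicitly")


schema = {"id": "LongType", "start": "GeometryType", "geometry": "GeometryType"}
print(select_geometry_column(schema))
```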
Returns
A PySpark DataFrame containing the cluster label for each row.
Usage Examples
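To make the output columns concrete, the following is a minimal pure-Python sketch of DBSCAN labelling on a few 2D points. It is not the library's implementation and does not call it: `dbscan_labels` is a hypothetical helper, distances are plain Euclidean, a point is assumed to count toward its own neighborhood, and cluster ids are numbered from 0 for readability (the ids produced by the real function may differ).

```python
from math import dist


def dbscan_labels(points, epsilon, min_pts, include_outliers=True):
    """Toy O(n^2) DBSCAN returning one dict per point with isCore and cluster."""
    n = len(points)
    # Neighborhood of each point: all points within epsilon (assumed to include itself).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    cluster = [-1] * n  # -1 marks outliers until a cluster claims the point
    next_id = 0

    for i in range(n):
        if not is_core[i] or cluster[i] != -1:
            continue
        # Grow a new cluster outward from this unassigned core point.
        cluster[i] = next_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if cluster[q] == -1:
                    cluster[q] = next_id
                    # Only core points keep expanding; border points stop here.
                    if is_core[q]:
                        stack.append(q)
        next_id += 1

    rows = [
        {"point": points[i], "isCore": is_core[i], "cluster": cluster[i]}
        for i in range(n)
    ]
    if not include_outliers:
        rows = [row for row in rows if row["cluster"] != -1]
    return rows


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
for row in dbscan_labels(points, epsilon=1.5, min_pts=3):
    print(row)
```

With these inputs the two tight groups form two clusters and the isolated point (50, 50) is labelled -1. Passing `include_outliers=False` drops that row instead.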
get_knee_locator
Create a KneeLocator to select an epsilon value to pass into a DBSCAN execution.
This function implements a common heuristic for epsilon selection by finding the “knee” of the k-distance plot. It operates as follows:
- Calculates the distance to the k-th nearest neighbor for a random sample of records (up to max_sample_size).
- The value k is set to the min_points value.
- These distances are sorted, and their values (y-axis) are plotted against their sorted index (x-axis).
- The resulting plot is fed into a kneed.KneeLocator object to find the point of maximum curvature (the “knee”).
The knee_y attribute of the returned KneeLocator object (the y-axis value at the knee) is the suggested epsilon value.
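The heuristic above can be sketched without Spark. This is an illustrative stand-in, not the library's code: `kth_neighbor_distances` and `knee_by_chord` are hypothetical helpers, the k-th neighbor is assumed to exclude the point itself, and the knee is found with a simple farthest-below-the-chord rule rather than the Kneedle algorithm that `kneed.KneeLocator` implements.

```python
import random
from math import dist


def kth_neighbor_distances(points, k, max_sample_size=None, seed=0):
    """Sorted distances from (a sample of) points to their k-th nearest neighbor."""
    indices = list(range(len(points)))
    # Optionally downsample the query points, mirroring max_sample_size.
    if max_sample_size is not None and len(indices) > max_sample_size:
        indices = random.Random(seed).sample(indices, max_sample_size)
    result = []
    for i in indices:
        # Brute-force distances to every other point (self excluded by assumption).
        others = sorted(dist(points[i], q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


def knee_by_chord(ys):
    """Index of the point farthest below the line joining the first and last points.

    A simple stand-in for knee detection on an increasing, convex curve.
    """
    n = len(ys)
    if n < 3:
        raise ValueError("need at least three values to locate a knee")
    best_index, best_gap = 0, float("-inf")
    for i, y in enumerate(ys):
        chord_y = ys[0] + (ys[-1] - ys[0]) * i / (n - 1)
        gap = chord_y - y  # how far the curve sits below the chord at index i
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index


# A 5x5 unit grid plus two far-away points acting as noise.
points = [(x, y) for x in range(5) for y in range(5)] + [(20, 20), (30, 30)]
curve = kth_neighbor_distances(points, k=3)
knee_index = knee_by_chord(curve)
print("suggested epsilon:", curve[knee_index])
```

In this sketch `curve[knee_index]` plays the role of the y-axis value at the knee, which is the quantity used as the epsilon suggestion.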
Parameters
Apache Sedona DataFrame containing the geometries. This should be the same dataframe you intend to
the min points parameter you intend to pass to the dbscan function. This will impact the epsilon
name of the geometry column
whether to use approximate KNN. When False, an exact KNN join is used. Default is False.
whether to use a cartesian or sphere distance calculation. False will use Cartesian. Default
the maximum number of records from dataframe to use when calculating the knee. If the
Returns
KneeLocator
A KneeLocator object derived from the input DataFrame, downsampled to approximately max_sample_size records.
Retrieve the recommended epsilon value with the return value’s knee_y property.

