# DBSCAN Python Module

> DBSCAN is a density-based algorithm ideal for spatial data. It groups records that are closely packed together, marking records in low-density regions as noise. This implementation is built on Apache Spark, Apache Sedona, and GraphFrames to support large-scale datasets and heterogeneous GeometryType features.

## dbscan

Annotates a DataFrame with a cluster label for each record using
the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

```python theme={"system"}
dbscan(
  dataframe: DataFrame,
  epsilon: float,
  min_pts: int,
  geometry: Optional[str],
  include_outliers: bool,
  use_sphere: bool,
  is_core_column_name: str,
  cluster_column_name: str
)
```

### Parameters

<ParamField path="dataframe" type="DataFrame" required>
  Spark DataFrame containing the geometries. It must contain at least one `GeometryType` column.
  All rows in the DataFrame must be unique.
</ParamField>

<ParamField path="epsilon" type="float" required>
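  Distance parameter of the DBSCAN algorithm: the neighborhood radius, i.e. the maximum distance between two records for them to be considered neighbors.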
</ParamField>

<ParamField path="min_pts" type="int" required>
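  Minimum number of points parameter of the DBSCAN algorithm: the number of records that must lie within `epsilon` of a record for it to count as a core point.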
</ParamField>

<ParamField path="geometry" type="Optional[str]">
  Name of the geometry column.

  If not provided, the algorithm automatically selects the geometry column to use based on the following rules:

  * If only one `GeometryType` column exists, it is used.

  * If multiple `GeometryType` columns exist and one is named `geometry`, it is used.

  * If multiple `GeometryType` columns exist and none is named `geometry`, the column name must be provided explicitly.
</ParamField>

<ParamField path="include_outliers" type="bool" required>
  Whether to return outlier points. If `True`, outliers are returned with a cluster value of `-1`.
</ParamField>

<ParamField path="use_sphere" type="bool" required>
  Whether to use a spherical distance calculation instead of a Cartesian one. Default is `False` (Cartesian).
</ParamField>

<ParamField path="is_core_column_name" type="string" required>
  Name of the output column that indicates whether a record is a core point. Default is `"isCore"`.
</ParamField>

<ParamField path="cluster_column_name" type="string" required>
  Name of the output column that holds the cluster ID. Default is `"cluster"`.
</ParamField>
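
The selection rules for the `geometry` parameter can be sketched as a small function. This is an illustration only, not the module's code: `pick_geometry_column` is a hypothetical helper, and the error handling for a missing or unknown column is an assumption.

```python
from typing import Optional, Sequence


def pick_geometry_column(
    geometry_columns: Sequence[str], geometry: Optional[str] = None
) -> str:
    """Hypothetical helper mirroring the documented selection rules.

    geometry_columns: names of the GeometryType columns in the DataFrame.
    geometry: the value passed as the `geometry` parameter, if any.
    """
    if not geometry_columns:
        raise ValueError("The DataFrame has no GeometryType column")
    if geometry is not None:
        # An explicit name always wins (validation here is an assumption).
        if geometry not in geometry_columns:
            raise ValueError(f"{geometry!r} is not a GeometryType column")
        return geometry
    if len(geometry_columns) == 1:
        return geometry_columns[0]
    if "geometry" in geometry_columns:
        return "geometry"
    raise ValueError("Multiple GeometryType columns found; pass `geometry` explicitly")


print(pick_geometry_column(["geom"]))
print(pick_geometry_column(["centroid", "geometry"]))
print(pick_geometry_column(["origin", "destination"], geometry="destination"))
```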

### Returns

A PySpark DataFrame containing the cluster label for each row.
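
To make the `isCore` and `cluster` output columns concrete, here is a brute-force, pure-Python sketch of DBSCAN on a handful of 2D points. It is not the Spark/Sedona/GraphFrames implementation described on this page. `dbscan_sketch` is a hypothetical name, and it assumes that a point counts toward its own neighborhood, that cluster IDs are numbered from 0, and that `include_outliers` defaults to `True`. The real module may differ on all three.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def dbscan_sketch(
    points: List[Point], epsilon: float, min_pts: int, include_outliers: bool = True
) -> List[Dict]:
    """Brute-force DBSCAN over 2D points using Cartesian distance."""
    n = len(points)
    # Neighborhood of each point: indices within epsilon (the point itself included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    cluster = [-1] * n  # -1 marks noise until a cluster claims the point
    next_id = 0
    for seed in range(n):
        if not is_core[seed] or cluster[seed] != -1:
            continue
        # Grow a new cluster outward from an unassigned core point.
        cluster[seed] = next_id
        stack = [seed]
        while stack:
            current = stack.pop()
            for j in neighbors[current]:
                if cluster[j] == -1:
                    cluster[j] = next_id
                    if is_core[j]:  # only core points keep expanding the cluster
                        stack.append(j)
        next_id += 1

    rows = [
        {"point": p, "isCore": c, "cluster": k}
        for p, c, k in zip(points, is_core, cluster)
    ]
    if not include_outliers:
        rows = [r for r in rows if r["cluster"] != -1]
    return rows


rows = dbscan_sketch([(0, 0), (0, 1), (1, 0), (10, 10)], epsilon=1.5, min_pts=3)
for row in rows:
    print(row)
```

In this toy input the three points near the origin are mutual neighbors and form one cluster, while the distant point has no neighbors within `epsilon` and is labeled `-1`.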

### Usage Examples

```python theme={"system"}
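from dbscan import dbscan

# Example usage of dbscan. The EXAMPLE_* names are placeholders:
# substitute your own DataFrame and parameter values.
result = dbscan(
    dataframe=EXAMPLE_NAME,  # Spark DataFrame with a geometry column
    epsilon=EXAMPLE_FLOAT_VALUE,  # neighborhood radius
    min_pts=EXAMPLE_INT_VALUE,  # minimum points for a core point
)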
```
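
The `use_sphere` flag changes how the distance compared against `epsilon` is measured. The sketch below contrasts a planar (Cartesian) distance with a great-circle distance. This page does not state which spherical formula the module uses, so the haversine formula and the Earth radius constant here are assumptions, and both function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumed constant)


def cartesian_distance(a, b):
    """Planar distance, expressed in the coordinates' own units."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def haversine_distance(a, b):
    """Great-circle distance in meters between (lon, lat) pairs given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


# Two pairs of points, each one degree of longitude apart.
equator = ((0.0, 0.0), (1.0, 0.0))
north = ((0.0, 60.0), (1.0, 60.0))

for name, (p, q) in {"equator": equator, "60N": north}.items():
    print(name, cartesian_distance(p, q), round(haversine_distance(p, q)))
```

Both pairs are exactly 1.0 apart in Cartesian terms, but the great-circle distance at 60°N is roughly half of that at the equator. If the module's spherical mode returns meters, as this sketch does, `epsilon` would need to be chosen in meters rather than degrees. Verify this against the module's actual behavior before relying on it.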

## get\_knee\_locator

Creates a `KneeLocator` for selecting an `epsilon` value to pass to `dbscan`.

This function implements a common heuristic for `epsilon` selection by finding the "knee" of the k-distance plot. It operates as follows:

1. Sets `k` to the `min_points` value.
2. Calculates the distance to the k-th nearest neighbor for a random sample of records (up to `max_sample_size`).
3. Sorts these distances and plots their values (y-axis) against their sorted index (x-axis).
4. Feeds the resulting plot into a `kneed.KneeLocator` object to find the point of maximum curvature (the "knee").

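The `knee_y` attribute of the returned `KneeLocator` object is the suggested `epsilon` value: it is the distance (y-axis value) at the knee, whereas `knee` holds the corresponding position on the x-axis.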
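
The numbered steps above can be sketched in plain Python. This is a simplified stand-in, not the module's implementation. It uses a brute-force neighbor search instead of a KNN join, and in place of `kneed.KneeLocator` it picks the point where the normalized curve falls furthest below the diagonal. `k_distances` and `knee_of_sorted` are hypothetical names. The sketch also assumes that a record is not its own neighbor and that neighbors are drawn from the full dataset rather than from the sample.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def k_distances(
    points: Sequence[Point], k: int, max_sample_size: int, seed: int = 0
) -> List[float]:
    """Sorted distance from each sampled point to its k-th nearest neighbor."""
    n = len(points)
    indices = list(range(n))
    if n > max_sample_size:
        indices = random.Random(seed).sample(indices, max_sample_size)
    result = []
    for i in indices:
        # Distances to every other record, excluding the record itself.
        dists = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_of_sorted(distances: Sequence[float]) -> Tuple[int, float]:
    """Return (index, distance) where the normalized curve sags furthest below the diagonal."""
    n = len(distances)
    lo, hi = distances[0], distances[-1]
    if n < 3 or hi == lo:
        raise ValueError("need at least 3 distances with distinct extremes")
    gaps = [i / (n - 1) - (d - lo) / (hi - lo) for i, d in enumerate(distances)]
    knee_index = max(range(n), key=gaps.__getitem__)
    return knee_index, distances[knee_index]


# A dense 5x5 grid plus three far-away records that should read as noise.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
data = grid + [(20.0, 20.0), (30.0, -5.0), (-15.0, 12.0)]

curve = k_distances(data, k=4, max_sample_size=100)
index, suggested_epsilon = knee_of_sorted(curve)
print(index, suggested_epsilon)
```

The suggested value sits just before the k-distance curve jumps from the dense grid to the isolated records, which is the behavior the heuristic relies on.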

<Warning>This method is a heuristic and is not foolproof. The calculated knee may not be optimal for all datasets.
It is **strongly recommended** to visualize the knee plot (e.g., using the `KneeLocator`'s built-in plotting
methods) to manually sanity-check the selected `epsilon` value before proceeding.</Warning>

```python theme={"system"}
get_knee_locator(
  dataframe: DataFrame,
  min_points: int,
  geometry: Optional[str],
  approximate_knn: bool,
  use_sphere: bool,
  max_sample_size: Optional[int]
) -> KneeLocator
```

### Parameters

<ParamField path="dataframe" type="DataFrame" required>
  Apache Sedona DataFrame containing the geometries. This should be the same DataFrame you intend to pass to the `dbscan` function.
</ParamField>

<ParamField path="min_points" type="int" required>
  The `min_pts` value you intend to pass to the `dbscan` function. This affects the suggested `epsilon` value.
</ParamField>

<ParamField path="geometry" type="Optional[str]">
  Name of the geometry column.
</ParamField>

<ParamField path="approximate_knn" type="bool" required>
  Whether to use approximate KNN. When `False`, an exact KNN join is used. Default is `False`.
</ParamField>

<ParamField path="use_sphere" type="bool" required>
  Whether to use a Cartesian or spherical distance calculation. `False` uses Cartesian distance. Default
</ParamField>

<ParamField path="max_sample_size" type="Optional[int]">
  The maximum number of records from `dataframe` to use when calculating the knee. If the DataFrame contains more records than this, a random sample of approximately this size is used.
</ParamField>

### Returns

<ResponseField name="KneeLocator">
  A `KneeLocator` object derived from the input DataFrame, downsampled to approximately `max_sample_size` records.
  Retrieve the recommended `epsilon` value from the returned object's `knee_y` property.
</ResponseField>
