dbscan

DBSCAN is a popular clustering algorithm for spatial data.

It identifies groups of records in which enough records lie close enough to one another. This implementation uses Spark, Sedona, and GraphFrames to support large-scale datasets and heterogeneous geometric feature types.

`dbscan(dataframe, epsilon, min_pts, geometry=None, include_outliers=False, use_spheroid=False)`

Annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.

The dataframe must contain at least one GeometryType column, and its rows must be unique. If exactly one geometry column is present, it is used automatically. If more than one is present, the one named 'geometry' is used. If more than one is present and none is named 'geometry', the column name must be provided.
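
The column-selection rule can be sketched in plain Python. The helper `resolve_geometry_column` below is hypothetical and is not part of the library. It takes the names of the geometry columns as a list, and the choice to raise `ValueError` on ambiguous input is an assumption made for illustration.

```python
from typing import Optional, Sequence


def resolve_geometry_column(
    geometry_columns: Sequence[str], geometry: Optional[str] = None
) -> str:
    """Pick a geometry column following the documented rule.

    Hypothetical helper: `geometry_columns` holds the names of the
    GeometryType columns found in the dataframe.
    """
    if geometry is not None:
        # An explicit name always wins, provided it is a geometry column.
        if geometry not in geometry_columns:
            raise ValueError(f"{geometry!r} is not a geometry column")
        return geometry
    if not geometry_columns:
        raise ValueError("the dataframe has no geometry column")
    if len(geometry_columns) == 1:
        # A single geometry column is used automatically.
        return geometry_columns[0]
    if "geometry" in geometry_columns:
        # Several candidates: prefer the one named 'geometry'.
        return "geometry"
    # Assumption: ambiguity is reported as an error.
    raise ValueError("several geometry columns found; pass the column name")
```

For example, `resolve_geometry_column(["centroid", "geometry"])` selects `"geometry"`, while `resolve_geometry_column(["a", "b"])` fails until a name is passed.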

Parameters:

Name	Type	Description	Default
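`dataframe`	`DataFrame`	Apache Sedona dataframe containing the geometries	required
`epsilon`	`float`	distance threshold of the DBSCAN algorithm: two records are neighbors when they are at most this far apart	required
`min_pts`	`int`	minimum number of points parameter of the DBSCAN algorithm: the number of neighbors a record needs to count as a core point	required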
`geometry`	`Optional[str]`	name of the geometry column	`None`
`include_outliers`	`bool`	whether to return outlier points. If True, outliers are returned with a cluster value of -1.	`False`
`use_spheroid`	`bool`	whether to use a spheroidal distance calculation. If False, a Cartesian distance is used.	`False`

Returns:

Type	Description
`DataFrame`	A PySpark DataFrame containing the cluster label for each row
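
To make the roles of `epsilon`, `min_pts`, and `include_outliers` concrete, here is a minimal single-machine sketch of DBSCAN on 2D points. It is not the distributed Spark implementation documented here. The function name `dbscan_sketch` is hypothetical, and it assumes that a point counts itself among its neighbors and that distances are Cartesian.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def dbscan_sketch(
    points: Sequence[Point],
    epsilon: float,
    min_pts: int,
    include_outliers: bool = False,
) -> Dict[int, int]:
    """Return {point index: cluster label}; outliers get -1 when included."""
    n = len(points)
    # Neighborhood of each point: every point within epsilon (itself included).
    neighbors: List[List[int]] = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= epsilon]
        for i in range(n)
    ]
    # A core point has at least min_pts points in its neighborhood.
    core = [len(neighbors[i]) >= min_pts for i in range(n)]

    labels: Dict[int, int] = {}
    next_label = 0
    for start in range(n):
        if not core[start] or start in labels:
            continue
        # Flood-fill the component of core points reachable from `start`.
        labels[start] = next_label
        stack = [start]
        while stack:
            current = stack.pop()
            for other in neighbors[current]:
                if other not in labels:
                    labels[other] = next_label
                    if core[other]:
                        stack.append(other)  # only core points extend a cluster
        next_label += 1

    if include_outliers:
        for i in range(n):
            labels.setdefault(i, -1)  # unreached points are outliers
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan_sketch(points, epsilon=1.5, min_pts=3, include_outliers=True)
# labels == {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: -1}
```

With `include_outliers=False`, the isolated point at index 6 is simply absent from the result, mirroring the default behavior described in the table.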

`get_knee_locator(dataframe, min_points, curve='convex', geometry=None, approximate_knn=False, use_spheroid=False, max_sample_size=DEFAULT_MAX_SAMPLE_SIZE)`

Create a KneeLocator for the purposes of selecting an epsilon value for passing into a DBSCAN execution.

Finding the knee of the plot of (index, distance to kth nearest neighbor), where k = min_points, is a common heuristic for selecting the epsilon parameter for DBSCAN. This function calculates the kth nearest neighbor distance for a random sample of at most approximately max_sample_size records and feeds the distances into a KneeLocator object provided by the kneed library. While often a good start, this method is not foolproof. It is recommended to visualize the knee plot as a sanity check before moving forward with the suggested epsilon value.

See https://kneed.readthedocs.io/en/stable/parameters.html

See https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd
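
The heuristic can be illustrated without Spark or kneed. The sketch below builds the sorted k-distance curve for a small set of 2D points and then picks a knee with a simple chord-distance rule. That rule is a stand-in chosen for illustration, not kneed's algorithm. Both function names are hypothetical, and the sketch assumes a point is not its own neighbor and that there are at least k + 1 points.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kth_neighbor_distances(points: Sequence[Point], k: int) -> List[float]:
    """Sorted distance from each point to its k-th nearest other point."""
    distances = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(others[k - 1])
    return sorted(distances)


def chord_knee(values: Sequence[float]) -> Tuple[int, float]:
    """Index and value of the point farthest below the chord joining both ends.

    Simple stand-in for a knee detector on an increasing, convex curve.
    """
    n = len(values)
    first, last = values[0], values[-1]
    best_index, best_gap = 0, 0.0
    for i, y in enumerate(values):
        chord_y = first + (last - first) * i / (n - 1)
        gap = chord_y - y
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index, values[best_index]


# A 3x3 unit grid (dense cluster) plus two far-away points.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
sample = grid + [(20.0, 20.0), (40.0, 0.0)]
curve = kth_neighbor_distances(sample, k=2)
knee_index, suggested_epsilon = chord_knee(curve)
```

Here the nine grid points all have a 2nd-nearest-neighbor distance of 1.0, the two distant points have much larger values, and the knee falls on the last grid point, so the suggested epsilon is 1.0. Plotting `curve` before trusting the value is the sanity check recommended above.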

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	Apache Sedona dataframe containing the geometries. This should be the same dataframe you intend to pass to the `dbscan` function.	required
`min_points`	`int`	the min points parameter you intend to pass to the dbscan function. This will impact the epsilon value that is calculated.	required
`curve`	`str`	should be one of "convex" or "concave". If the line has a positive derivative, use "convex". If the line has a negative derivative, use "concave".	`'convex'`
`geometry`	`Optional[str]`	name of the geometry column	`None`
`approximate_knn`	`bool`	whether to use approximate KNN. If False, an exact KNN join is used.	`False`
`use_spheroid`	`bool`	whether to use a spheroidal distance calculation. If False, a Cartesian distance is used.	`False`
`max_sample_size`	`Optional[int]`	the maximum number of records from dataframe to use when calculating the knee. If the dataframe has more records than this, it will be downsampled to approximately this size. Default is 1 million. Records are collected to the driver to visualize/calculate the knee, so this is important for stability.	`DEFAULT_MAX_SAMPLE_SIZE`
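
One way to see why the sample size is only approximate: if each record is kept independently with probability `max_sample_size / count`, the result has about `max_sample_size` records but rarely exactly that many. The sketch below shows this on a plain Python list. Whether the library samples this way is an assumption, as is treating `None` as "no downsampling". The `downsample` helper is hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def downsample(
    records: Sequence[T], max_sample_size: Optional[int], seed: Optional[int] = None
) -> List[T]:
    """Keep each record with probability max_sample_size / len(records)."""
    count = len(records)
    # Assumption: None, or a dataset already small enough, means no sampling.
    if max_sample_size is None or count <= max_sample_size:
        return list(records)
    fraction = max_sample_size / count
    rng = random.Random(seed)
    # Independent keep/drop decisions give an approximate, not exact, size.
    return [record for record in records if rng.random() < fraction]


sampled = downsample(list(range(100_000)), max_sample_size=1_000, seed=42)
```

The length of `sampled` lands near 1,000 rather than exactly on it, which is the sense of "approximately" in the parameter description.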

Returns:

Type	Description
`KneeLocator`	A KneeLocator object derived from the input DataFrame, downsampled to approximately max_sample_size records. Retrieve the recommended epsilon value from the return value's `knee_y` property.