# Spatial Joins in Wherobots

<Tip>
  The following content is a read-only preview of an executable Jupyter notebook.

  To run this notebook interactively:

  1. Go to [**Wherobots Cloud**](https://cloud.wherobots.com).
  2. Start a runtime.
  3. Open the notebook.
  4. In the Jupyter Launcher:
     1. Click **File > Open Path**.
     2. Paste the following path to access this notebook: `examples/Getting_Started/Part_4_Spatial_Joins.ipynb`
     3. Click **Enter**.
</Tip>

This notebook walks through spatial joins in Wherobots using Python and the DataFrame API, giving you hands-on practice combining datasets based on their spatial relationships.

## What you will learn

This notebook will teach you to:

* Perform standard spatial joins — identifying features within other geometries
* Execute nearest neighbor joins — finding the closest feature between datasets

## Loading datasets for a spatial join

A spatial join matches rows from two tables by comparing their `geometry` columns and testing how the geometries relate in space. For example, you could join line-shaped delivery route data with point coordinates of customer locations to optimize logistics, or join building polygons with flood zone polygons to analyze risk.

In this example, we are going to combine data from two tables in the [Wherobots Open Data catalog](https://docs.wherobots.com/latest/tutorials/spatial-catalog/introduction/):

* Polygons that are [administrative boundaries of US cities and towns from the Overture Maps Foundation](https://cloud.wherobots.com/data-hub?catalogId=2rk2zjbg7pl6f8lb7xkzv\&namespace=overture_maps_foundation\&table=divisions_division_area) (OMF)
* Points that are [places of interest from Foursquare](https://cloud.wherobots.com/data-hub?catalogId=2rk2zjbg7pl6f8lb7xkzv\&namespace=foursquare\&table=places)

We will start by running two SQL queries to find the localities (cities and towns) and Foursquare points of interest in the US that are not marked as closed. We will load them into Sedona DataFrames ([docs](https://docs.wherobots.com/latest/references/wherobotsdb/vector-data/DataFrameAPI/)) which will serve as the inputs for the spatial join operations in this notebook.

```python theme={"system"}
from sedona.spark import *
from pyspark.sql.functions import col, expr

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
```

```python theme={"system"}
query = '''
SELECT 
    *
FROM
    wherobots_open_data.overture_maps_foundation.divisions_division_area
WHERE
    subtype = 'locality'
    AND country = 'US'
'''

localities_df = sedona.sql(query)
localities_df.show()
```

```python theme={"system"}
query = '''
SELECT
    fsq_place_id,
    name,
    fsq_category_labels[0] as category,
    region,
    postcode,
    geom
FROM
    wherobots_open_data.foursquare.places
WHERE
    date_closed IS NULL  -- keep only places that are not marked as closed
    AND country = 'US'
'''

pois_df = sedona.sql(query)
pois_df.show(truncate=False)
```

## Points within polygons with ST\_Intersects

With both datasets loaded, we can now join them based on their spatial relationship.
In this case, we want to find which places of interest (points) fall within each administrative boundary (polygons).

We use the `ST_Intersects` function to check if a point's geometry is inside or directly on a boundary's geometry.

The spatial join keeps only the pairs of points and polygons whose geometries intersect. Each row of the resulting DataFrame pairs one point with one administrative boundary it intersects and carries the columns from both inputs.

```python theme={"system"}
pois_in_cities_df = pois_df.alias("p") \
    .join(localities_df.alias("l"), expr("ST_Intersects(l.geometry, p.geom)"))

pois_in_cities_df.show()
```
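
To make the join condition concrete, here is a minimal plain-Python sketch of what an intersects-style join decides for each point and polygon pair. It is an illustration only, not how Wherobots executes the join: the `point_in_polygon` helper, the sample towns, and the sample places are all hypothetical, and the ray-casting test ignores points that lie exactly on a boundary (which `ST_Intersects` would count as a match).

```python
def point_in_polygon(point, ring):
    """Ray-casting test: True if point lies strictly inside the ring.

    ring is a list of (x, y) vertices; the last vertex connects back to the first.
    """
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical sample data standing in for localities and points of interest.
polygons = {
    "town_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "town_b": [(5, 0), (9, 0), (9, 4), (5, 4)],
}
points = {"cafe": (1, 1), "museum": (6, 2), "lighthouse": (20, 20)}

# Naive nested-loop join: keep every (point, polygon) pair that passes the test.
pairs = [
    (point_id, polygon_id)
    for point_id, pt in points.items()
    for polygon_id, ring in polygons.items()
    if point_in_polygon(pt, ring)
]
print(pairs)
```

The lighthouse falls outside both towns, so it produces no output row, which mirrors how unmatched points drop out of the join result above.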

## Spatial join and aggregate points within polygons

After performing a spatial join, a common analysis is to count how many points fall within each polygon. We can perform this in a single operation by combining the spatial join with a `groupBy` and the `COUNT` aggregation.

This query joins the polygons and points, groups the results by the polygon ID, and counts the matching points.

```python theme={"system"}
pois_count_df = localities_df.alias("l") \
    .join(pois_df.alias("p"), expr("ST_Intersects(l.geometry, p.geom)")) \
    .groupBy("l.id") \
    .agg(expr("COUNT(*) as point_count"))

pois_count_df.show(10)
```
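
One consequence of this pattern is worth spelling out: `join` defaults to an inner join, so a polygon that contains no points produces no rows and therefore no count. The plain-Python sketch below mimics the group-and-count step on a few hypothetical joined pairs (the IDs are made up for illustration) to show which localities go missing.

```python
from collections import Counter

# Hypothetical (locality_id, poi_id) pairs, standing in for the rows
# that a spatial join would return.
joined_pairs = [
    ("town_a", "cafe"),
    ("town_a", "bakery"),
    ("town_b", "museum"),
]
all_localities = ["town_a", "town_b", "town_c"]

# Equivalent of groupBy on the locality ID plus COUNT(*):
# one count per locality that has at least one matching point.
point_counts = Counter(locality_id for locality_id, _ in joined_pairs)

# Localities without any matching point never appear in the joined rows.
missing = [loc for loc in all_localities if loc not in point_counts]

print(dict(point_counts))
print(missing)
```

If zero counts matter for your analysis, one option is to join the counts back to the full list of polygons and fill the gaps with 0.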

## Nearest-neighbor spatial join

In some cases, you may want to find the closest feature from another dataset — such as identifying the nearest city for each point of interest.

> A **nearest neighbor join** finds the closest points or polygons based on geographic proximity. ([Docs: K-Nearest Neighbor Joins](https://docs.wherobots.com/latest/references/wherobotsdb/vector-data/NearestNeighbourSearching/))

Wherobots has two functions for finding nearest neighbors using a k-nearest-neighbors approach. `ST_KNN` returns the exact nearest neighbors, while `ST_AKNN` trades off some accuracy for speed by using approximate algorithms. In this case, we will join the point data to the polygons using `ST_KNN` to find the 4 polygons with centroids nearest to each point.

We are passing four parameters to this function:

* **R**: The table of query geometries, which are our points of interest
* **S**: The table of object geometries, the centroids of the localities
* **k**: The number of neighbors to find for each geometry in the query table
* **use\_sphere**: A boolean indicating whether to use a spherical distance model instead of a planar one
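
The choice of distance model can change which neighbor counts as nearest. The sketch below is a plain-Python illustration, not the Wherobots implementation: it compares a planar distance computed directly on longitude/latitude degrees with a haversine great-circle distance, which is used here as an assumed stand-in for a spherical model. The coordinates and function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, an assumption of this sketch


def planar_distance_deg(a, b):
    """Euclidean distance between (lon, lat) pairs, measured in degrees."""
    (lon1, lat1), (lon2, lat2) = a, b
    return math.hypot(lon2 - lon1, lat2 - lat1)


def haversine_km(a, b):
    """Great-circle distance between (lon, lat) pairs, in kilometers."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Hypothetical query point at 60 degrees north with two candidate neighbors.
query = (0.0, 60.0)
east = (1.5, 60.0)   # 1.5 degrees of longitude away
north = (0.0, 61.0)  # 1.0 degree of latitude away

# In raw degrees the northern candidate looks closer...
print(planar_distance_deg(query, north) < planar_distance_deg(query, east))
# ...but on a sphere the eastern candidate is closer, because meridians converge.
print(haversine_km(query, east) < haversine_km(query, north))
```

At high latitudes a degree of longitude covers much less ground than a degree of latitude, so planar distances on unprojected coordinates can rank neighbors differently from a spherical model.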

```python theme={"system"}
knn_df = pois_df.alias("p") \
    .join(
        localities_df.alias("l"),
        expr("ST_KNN(p.geom, l.geometry, 4, false)")
    )
```

```python theme={"system"}
knn_result_df = knn_df.select(
    expr("p.fsq_place_id"),
    expr("p.geom"),
    expr("l.id as locality_id")
)

knn_result_df.show(10, truncate=False)
```
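
For intuition about the shape of this result, the following plain-Python sketch runs a brute-force k-nearest-neighbor join on a handful of hypothetical points. It is not how `ST_KNN` is implemented; it only shows that an exact KNN join yields one row per query and neighbor pair, so each query contributes up to `k` rows. The `k_nearest` helper and the sample coordinates are made up for illustration, and distances are planar.

```python
import heapq
import math


def k_nearest(query, candidates, k):
    """Return the ids of the k candidates closest to query (planar distance)."""
    closest = heapq.nsmallest(
        k, candidates.items(), key=lambda item: math.dist(query, item[1])
    )
    return [candidate_id for candidate_id, _ in closest]


# Hypothetical query points and candidate locations.
queries = {"cafe": (0.0, 0.5), "museum": (9.0, 9.0)}
candidates = {"a": (0.0, 0.0), "b": (3.0, 0.0), "c": (10.0, 10.0), "d": (1.0, 1.0)}

# One output row per (query, neighbor) pair, as in a KNN join with k = 2.
rows = [
    (query_id, neighbor_id)
    for query_id, location in queries.items()
    for neighbor_id in k_nearest(location, candidates, k=2)
]
print(rows)
```

A brute-force scan compares every query with every candidate, which is fine for a toy example but is exactly the cost that a dedicated KNN join is meant to avoid on large tables.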
