The following content is a read-only preview of an executable Jupyter notebook. To run this notebook interactively:
  1. Go to Wherobots Cloud.
  2. Start a runtime.
  3. Open the notebook.
  4. In the Jupyter Launcher:
    1. Click File > Open Path.
    2. Paste the following path to access this notebook: examples/Getting_Started/Part_4_Spatial_Joins.ipynb
    3. Click Enter.
This notebook will guide you through performing spatial joins in Wherobots using Python and the DataFrame API — giving you a hands-on understanding of how to combine datasets based on their spatial relationships.

What you will learn

This notebook will teach you to:
  • Perform standard spatial joins — identifying features within other geometries
  • Execute nearest neighbor joins — finding the closest feature between datasets

Loading datasets for a spatial join

A spatial join combines two datasets by testing how a geometry column in one relates in space to a geometry column in the other. For example, you could join line-shaped delivery route data with point coordinates of customer locations to optimize logistics, or join building polygons with flood zone polygons to analyze risk.
In this example, we are going to combine data from two tables in the Wherobots Open Data catalog. We will start by running two SQL queries: one to find the localities (cities and towns) in the US, and one to find the Foursquare points of interest in the US that are not marked as closed. We will load the results into Sedona DataFrames, which will serve as the inputs for the spatial join operations in this notebook.
from sedona.spark import *
from pyspark.sql.functions import expr
from pyspark.sql.functions import col

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
query = '''
SELECT 
    *
FROM
    wherobots_open_data.overture_maps_foundation.divisions_division_area
WHERE
    subtype = 'locality'
    AND country = 'US'
'''

localities_df = sedona.sql(query)
localities_df.show()
query = '''
SELECT
    fsq_place_id,
    name,
    fsq_category_labels[0] as category,
    region,
    postcode,
    geom
FROM
    wherobots_open_data.foursquare.places
WHERE
    date_closed IS NULL
    AND country = 'US'
'''

pois_df = sedona.sql(query)
pois_df.show(truncate=False)

Points within polygons with ST_Intersects

With both datasets loaded, we can now join them based on their spatial relationship. In this case, we want to find which points of interest (points) fall within each administrative boundary (polygons). We use the ST_Intersects function to check whether a point's geometry is inside or directly on a boundary's geometry. The spatial join keeps only the point and polygon pairs whose geometries intersect, so each row of the result pairs one point with one boundary it intersects and carries the columns of both.
pois_in_cities_df = pois_df.alias("p") \
    .join(localities_df.alias("l"), expr("ST_Intersects(l.geometry, p.geom)"))

pois_in_cities_df.show()
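
To make the join semantics concrete, here is a minimal pure-Python sketch of what a point-in-polygon join computes. It is an illustration only, not how Sedona executes the join; the helper names, the toy coordinates, and the representation of a polygon as a simple ring of (x, y) vertices are all assumptions made for this example.

```python
def on_segment(point, a, b, eps=1e-12):
    """Return True if `point` lies on the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    if abs(cross) > eps:
        return False  # not collinear with the segment
    return (min(ax, bx) - eps <= px <= max(ax, bx) + eps
            and min(ay, by) - eps <= py <= max(ay, by) + eps)


def point_intersects_polygon(point, ring):
    """True if the point is inside the ring or exactly on its boundary."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        a, b = ring[i], ring[(i + 1) % len(ring)]
        if on_segment(point, a, b):
            return True  # boundary points count as intersecting
        (ax, ay), (bx, by) = a, b
        # Ray casting: toggle on each edge crossed by a ray going right.
        if (ay > y) != (by > y):
            x_cross = ax + (y - ay) * (bx - ax) / (by - ay)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, polygons):
    """Keep every (point id, polygon id) pair whose geometries intersect."""
    return [
        (point_id, polygon_id)
        for point_id, point in points.items()
        for polygon_id, ring in polygons.items()
        if point_intersects_polygon(point, ring)
    ]


# Hypothetical toy data: two squares that share the edge x = 2.
polygons = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
points = {"inside_a": (1, 1), "on_shared_edge": (2, 1), "outside": (5, 5)}

pairs = spatial_join(points, polygons)
print(pairs)
```

Note that the point on the shared edge matches both squares, so it appears twice in the output, while the point outside both squares is dropped. The same thing can happen in the real join: a point of interest that intersects more than one boundary produces one row per boundary.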

Spatial join and aggregate points within polygons

After performing a spatial join, a common analysis is to count how many points fall within each polygon. We can perform this in a single operation by combining the spatial join with a groupBy and the COUNT aggregation. This query joins the polygons and points, groups the results by the polygon ID, and counts the matching points.
pois_count_df = localities_df.alias("l") \
    .join(pois_df.alias("p"), expr("ST_Intersects(l.geometry, p.geom)")) \
    .groupBy("l.id") \
    .agg(expr("COUNT(*) as point_count"))

pois_count_df.show(10)
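
One consequence of counting on top of an inner join is that polygons with no matching points never reach the aggregation, so they are missing from the result rather than listed with a count of zero. The pure-Python sketch below illustrates this with hypothetical join output; the IDs and variable names are made up for the example.

```python
from collections import Counter

# Hypothetical output of a spatial join: (point id, locality id) pairs.
matched_pairs = [
    ("poi_1", "loc_a"),
    ("poi_2", "loc_a"),
    ("poi_3", "loc_b"),
]
# Hypothetical full list of localities; "loc_c" contains no points.
all_localities = ["loc_a", "loc_b", "loc_c"]

# Equivalent of grouping the joined rows by locality ID and counting.
point_counts = Counter(locality_id for _, locality_id in matched_pairs)
print(dict(point_counts))

# To report empty localities too, fill in zeros from the full list.
counts_with_zeros = {loc: point_counts.get(loc, 0) for loc in all_localities}
print(counts_with_zeros)
```

If you need zero counts in the DataFrame version, one option is to make the localities the left side of an outer join and count a column from the points side instead of counting rows.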

Nearest-neighbor spatial join

In some cases, you may want to find the closest feature from another dataset — such as identifying the nearest city for each point of interest.
A nearest neighbor join finds the closest points or polygons based on geographic proximity. (Docs: K-Nearest Neighbor Joins)
Wherobots has two functions for finding nearest neighbors using a k-nearest-neighbors approach. ST_KNN returns the exact nearest neighbors, while ST_AKNN trades off some accuracy for speed by using approximate algorithms. In this case, we will join the point data to the polygons using ST_KNN to find the 4 polygons with centroids nearest to each point. We are passing four parameters to this function:
  • R: The query-side geometry, here our points of interest (p.geom)
  • S: The object-side geometry, here the localities (l.geometry), which are compared by their centroids
  • k: The number of neighbors to find for each query geometry (4 here)
  • use_sphere: A boolean that selects a spherical distance model when true and a planar distance model when false (false here)
knn_df = pois_df.alias("p") \
    .join(
        localities_df.alias("l"),
        expr("ST_KNN(p.geom, l.geometry, 4, false)")
    )
knn_result_df = knn_df.select(
    expr("p.fsq_place_id"),
    expr("p.geom"),
    expr("l.id as locality_id")
)

knn_result_df.show(10, truncate=False)
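
The roles of k and use_sphere can be illustrated with a brute-force nearest-neighbor join in pure Python. This is a conceptual sketch under stated assumptions, not the ST_KNN implementation: the function names and coordinates are hypothetical, objects are assumed to be already reduced to (longitude, latitude) centroids, and the haversine formula stands in for whatever spherical model the library actually uses.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def planar_distance(a, b):
    """Euclidean distance on raw (lon, lat) values, in degrees."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def haversine_km(a, b):
    """Great-circle distance between two (lon, lat) points, in km."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def knn_join(queries, objects, k, use_sphere):
    """For each query point, return (query id, object id) for its k nearest objects."""
    distance = haversine_km if use_sphere else planar_distance
    pairs = []
    for query_id, query_point in queries.items():
        nearest = heapq.nsmallest(
            k, objects.items(), key=lambda item: distance(query_point, item[1])
        )
        pairs.extend((query_id, object_id) for object_id, _ in nearest)
    return pairs


# Hypothetical data at 60 degrees north, where a degree of longitude
# covers about half the ground distance of a degree of latitude.
queries = {"poi": (0.0, 60.0)}
objects = {"east": (3.0, 60.0), "north": (0.0, 62.0)}

print(knn_join(queries, objects, k=1, use_sphere=False))
print(knn_join(queries, objects, k=1, use_sphere=True))
```

With the planar model the point 2 degrees to the north looks closer than the point 3 degrees to the east, but on the sphere the eastern point is nearer in kilometers. This is why the choice of distance model can change which neighbors are returned for longitude and latitude data, especially far from the equator.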