How to Write Effective Spatial Queries

This guide covers best practices for writing spatial SQL queries in Wherobots, including proper setup, common pitfalls, and debugging techniques. The examples use Overture Maps Foundation data available through the Wherobots Open Data catalog but the principles apply to any spatial queries you write in Wherobots.

Key principles

The following principles will help ensure your spatial queries run successfully in Wherobots:

Always initialize the Sedona context

Every spatial query session in Wherobots must begin with proper Sedona context initialization:

from sedona.spark import *
from pyspark.sql.functions import expr

# Initialize Sedona Context
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Understand table schema before querying

Always verify the structure of tables you’re querying, especially for complex data types like STRUCT and ARRAY columns.

Many common query errors stem from assumptions about table schema. Always check the actual column names and data types before writing queries.

Use spatial predicates for geographic filtering

Leverage spatial functions like ST_Intersects for precise geographic filtering instead of text-based approaches.

Use `wkls` for geographic boundaries

The wkls library provides accurate WKT representations for countries, states, and regions. Use it to define geographic boundaries in spatial queries instead of hard-coding coordinates:

import wkls
usa_wkt = wkls.us.wkt()

wkls comes pre-installed in WherobotsDB, so you do not need to install or import it separately if you’re using Wherobots Cloud.

Validate category values when filtering

When filtering on criteria that have a set of predefined valid values, always verify the exact values used in the dataset. This helps ensure accurate results and prevents zero-result queries caused by mismatched category names. For example, to find all baseball stadiums in the Overture dataset, you might assume that the category is ballpark or baseball_field, but the correct category is actually baseball_stadium. The following query returns all of the baseball-related categories in the places_place dataset to help you understand the correct values to filter on:

# Find all baseball-related categories
sedona.sql("""
SELECT DISTINCT categories.primary
FROM wherobots_open_data.overture_maps_foundation.places_place
WHERE categories.primary LIKE '%baseball%' OR categories.primary LIKE '%stadium%'
""").show()

Common pitfalls and solutions

The following sections highlight common mistakes when writing spatial queries in Wherobots and how to avoid them.

Incorrect database names

Using the wrong database or table name is a common source of errors. While you might assume the database is overture, the correct database for the Overture Maps Foundation data is overture_maps_foundation:

Incorrect
Correct

-- Wrong database name
SELECT COUNT(*)
FROM wherobots_open_data.overture.places_place
WHERE categories LIKE '%ballpark%'

-- Correct database name
SELECT COUNT(*)
FROM wherobots_open_data.overture_maps_foundation.places_place
WHERE categories.primary = 'baseball_stadium'

Mishandling STRUCT columns

The categories column in many Overture datasets is a STRUCT with primary (string) and alternate (array of strings) fields. Attempting to filter on categories as if it were a simple string will lead to errors or zero results.

Incorrect
Correct

-- Treating STRUCT as string
WHERE categories LIKE '%ballpark%'

-- Properly accessing STRUCT fields
WHERE (
  categories.primary = 'baseball_stadium'
  OR ARRAY_CONTAINS(categories.alternate, 'baseball_stadium')
)

Using non-existent columns

Assuming column names without checking the schema can lead to errors: You must make sure to use the correct column names as defined in the dataset schema. For example, there is no region column in the places_place dataset, so filtering on region = 'US' will not work. Instead, use the wkls library to get the correct geometry for the U.S. and use spatial predicates to filter:

Incorrect
Correct

-- Non-existent 'region' column
WHERE region = 'US'

-- Use spatial intersection with actual geometry
WHERE ST_Intersects(ST_GeomFromWKT('{usa_wkt}'), geometry)

Unsupported functions

Some SQL functions that work in other databases may not be available in Apache Spark SQL.

Incorrect
Correct

-- UNNEST() is not supported in Spark SQL
FROM UNNEST(addresses) AS addr
WHERE addr.country = 'US'

-- Use spatial predicates instead
WHERE ST_Intersects(ST_GeomFromWKT('{usa_wkt}'), geometry)

Complete example: counting baseball stadiums in the U.S.

The following example brings the above best practices together into a complete, executable query:

from sedona.spark import *
from pyspark.sql.functions import expr
import wkls

# Initialize Sedona Context
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Get accurate U.S. boundary geometry using the wkls library
usa_wkt = wkls.us.wkt()

# Execute the spatial query
result = sedona.sql(f'''
SELECT COUNT(*) as stadium_count
FROM wherobots_open_data.overture_maps_foundation.places_place
WHERE (
  categories.primary = 'baseball_stadium'
  OR ARRAY_CONTAINS(categories.alternate, 'baseball_stadium')
)
AND ST_Intersects(ST_GeomFromWKT('{usa_wkt}'), geometry)
''')

# Extract and print the result
count = result.collect()[0][0]
print(f"Number of baseball stadiums in the U.S.: {count}")

Why this query works

Understand why this query is structured the way it is:

Proper context initialization

Sets up Sedona for spatial operations.

Accurate geometry

Uses wkls.us.wkt() for precise U.S. boundaries.

Correct STRUCT handling

Accesses categories.primary and categories.alternate with dot notation.

Spatial precision

Uses ST_Intersects instead of text-based filtering.

Proper result extraction

Uses .collect()[0][0] to get the scalar count value.

Debugging steps for query issues

If you encounter errors or unexpected results, follow these debugging steps:

Verify table schema

Make sure you’re using the correct database and table names, and check the schema to understand column types:

# Check table structure
sedona.sql("DESCRIBE wherobots_open_data.overture_maps_foundation.places_place").show()

Examine sample data

Return a small sample of records to confirm that the data contains the expected fields and values:

# Look at sample records to understand data structure
sedona.sql("""
SELECT categories, addresses
FROM wherobots_open_data.overture_maps_foundation.places_place
LIMIT 5
""").show(truncate=False)

Test categories structure

Understand how the categories STRUCT is organized to ensure you’re accessing the correct fields:

# Understand how categories are structured
sedona.sql("""
SELECT categories.primary, categories.alternate
FROM wherobots_open_data.overture_maps_foundation.places_place
WHERE categories.primary IS NOT NULL
LIMIT 10
""").show()

Validate spatial operations

Confirm that your spatial predicates are working as expected by testing them with known geometries:

import wkls

# Test spatial intersection without filtering first
usa_wkt = wkls.us.wkt()
sedona.sql(f"""
SELECT COUNT(*) as total_places_in_us
FROM wherobots_open_data.overture_maps_foundation.places_place
WHERE ST_Intersects(ST_GeomFromWKT('{usa_wkt}'), geometry)
""").show()

Best Practices Summary

Review the following best practices and common mistakes to ensure that your spatial queries run smoothly in Wherobots:

Do
Don't

The following best practices will help ensure your spatial queries run successfully in Wherobots:

Use the correct, fully-qualified database names (e.g., overture_maps_foundation).
Access STRUCT fields with dot notation (e.g., categories.primary).
Use ARRAY_CONTAINS to search within array fields.
Use spatial predicates like ST_Intersects for geographic filtering.
Always initialize the Sedona context before running spatial operations.
Use .collect()[0][0] to extract a single scalar result.
Leverage the wkls library for accurate geographic boundaries.
Test queries incrementally to isolate issues.

The following common mistakes can lead to query errors or incorrect results:

Assume column names without checking the table schema.
Treat STRUCT columns as plain strings (e.g., WHERE categories LIKE ...).
Use functions unsupported in Spark SQL, such as UNNEST().
Rely on text-based geographic filtering instead of spatial predicates.
Forget to initialize the Sedona context.
Use incorrect argument order in spatial functions.

Getting Started

Spatial Queries

WherobotsDB

Data Connections

RasterFlow

Advanced Topics

WherobotsAI

How to Write Effective Spatial Queries

Key principles

Always initialize the Sedona context

Understand table schema before querying

Use spatial predicates for geographic filtering

Use `wkls` for geographic boundaries

Validate category values when filtering

Common pitfalls and solutions

Incorrect database names

Mishandling STRUCT columns

Using non-existent columns

Unsupported functions

Complete example: counting baseball stadiums in the U.S.

Why this query works

Debugging steps for query issues

Verify table schema

Examine sample data

Test categories structure

Validate spatial operations

Best Practices Summary

Getting Started

Spatial Queries

WherobotsDB

Data Connections

RasterFlow

Advanced Topics

WherobotsAI

​Key principles

​Always initialize the Sedona context

​Understand table schema before querying

​Use spatial predicates for geographic filtering

​Use wkls for geographic boundaries

​Validate category values when filtering

​Common pitfalls and solutions

​Incorrect database names

​Mishandling STRUCT columns

​Using non-existent columns

​Unsupported functions

​Complete example: counting baseball stadiums in the U.S.

​Why this query works

​Debugging steps for query issues

​Verify table schema

​Examine sample data

​Test categories structure

​Validate spatial operations

​Best Practices Summary

Key principles

Always initialize the Sedona context

Understand table schema before querying

Use spatial predicates for geographic filtering

Use `wkls` for geographic boundaries

Validate category values when filtering

Common pitfalls and solutions

Incorrect database names

Mishandling STRUCT columns

Using non-existent columns

Unsupported functions

Complete example: counting baseball stadiums in the U.S.

Why this query works

Debugging steps for query issues

Verify table schema

Examine sample data

Test categories structure

Validate spatial operations

Best Practices Summary