Skip to main content
Private Preview
The following content is a read-only preview of an executable Jupyter notebook.To run this notebook interactively:
  1. Go to the Wherobots Model Hub.
  2. Select the specific notebook you wish to run.
  3. Click Run Model in Notebook.
This notebook demonstrates how to bring your own rasters (BYOR) into RasterFlow by querying a STAC catalog and creating a GTI (GDAL Raster Tile Index). You will learn how to query the NAIP collection, extract Cloud-Optimized GeoTIFF (COG) URLs, and build a mosaic using RasterFlow’s build_gti_mosaic function.

What you will learn

This notebook will teach you to:
  • Query a STAC catalog to discover available imagery for an area of interest
  • Create a GTI (GDAL Raster Tile Index) from STAC items
  • Build a seamless mosaic from multiple image tiles using RasterFlow’s build_gti_mosaic
  • Visualize the resulting mosaic using xarray and hvplot
  • (Optional) Run road detection using the ChesapeakeRSC model

NAIP (National Agriculture Imagery Program)

The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental United States. Key characteristics:
  • Resolution: 1 meter ground sample distance
  • Bands: 4-band imagery (Red, Green, Blue, NIR) in a single COG file
  • Coverage: Continental United States
  • Update cycle: Typically every 2-3 years per state
We will use the Element84 Earth Search STAC catalog to query NAIP imagery for Montgomery County, Maryland.

Setup and Imports

import os

import geopandas as gpd
import wkls
from leafmap import Map

Selecting an Area of Interest (AOI)

We will use Montgomery County, Maryland as our AOI. This county is well-covered by NAIP imagery and provides a good example for demonstrating the BYOR workflow.
# Generate a geometry for Montgomery County, MD using Well-Known Locations (https://github.com/wherobots/wkls)
gdf = gpd.read_file(wkls['us']['md']['montgomery county'].geojson())

# Save the geometry to a parquet file in the user's S3 path
aoi_path = os.getenv("USER_S3_PATH") + "montgomery_county_md.parquet"
gdf.to_parquet(aoi_path)

print(f"AOI saved to: {aoi_path}")
print(f"Bounding box: {gdf.total_bounds}")
print(f"CRS: {gdf.crs}")

# Display the AOI boundary on an interactive map with basemap
minx, miny, maxx, maxy = gdf.total_bounds
center_lat = (miny + maxy) / 2
center_lon = (minx + maxx) / 2

m = Map(center=(center_lat, center_lon), zoom=10)
m.add_gdf(
    gdf, 
    layer_name="Montgomery County, MD", 
    zoom_to_layer=True,
    style={"color": "red", "fillColor": "lightblue", "fillOpacity": 0.3, "weight": 2}
)
m

Loading the NAIP STAC collection

We use Wherobots’ built-in STAC reader to load the NAIP collection from Element84’s Earth Search catalog. This catalog provides direct S3 URLs to the imagery files.
from pyspark.sql import functions as F
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
# Get the actual polygon geometry as WKT for spatial filter
aoi_wkt = gdf.geometry.iloc[0].wkt

print(f"AOI geometry type: {gdf.geometry.iloc[0].geom_type}")

# Load NAIP collection from Element84 Earth Search
naip_df = (
    sedona.read.format("stac")
    .option("itemsLimitMax", "200")
    .option("itemsLimitPerRequest", "50")
    .load("https://earth-search.aws.element84.com/v1/collections/naip")
)

naip_df.createOrReplaceTempView("naip_raw")

print(f"\nLoaded NAIP STAC collection")
print(f"Schema:")
naip_df.printSchema()

Applying spatial and temporal filters

We filter the NAIP collection to only include tiles that:
  • Intersect with our AOI (Montgomery County)
  • Were captured in 2017 (the most recent year with 1-meter resolution NAIP imagery available for Maryland so we can test the ChesapeakeRSC model)
# Filter by:
# - Date range: 2017 (full year)
# - Spatial: Intersects with Montgomery County polygon boundary

filtered_df = sedona.sql(f"""
    SELECT 
        id,
        geometry,
        datetime,
        assets
    FROM naip_raw
    WHERE datetime BETWEEN '2017-01-01' AND '2017-12-31'
      AND ST_Intersects(geometry, ST_GeomFromText('{aoi_wkt}'))
""")

filtered_df.createOrReplaceTempView("naip_filtered")

tile_count = filtered_df.count()
print(f"Found {tile_count} NAIP tiles for Montgomery County, MD in 2017")
filtered_df.select("id", "datetime").show(10, truncate=False)

Exploring the STAC asset structure

Before creating the GTI, we need to understand the structure of the STAC assets. NAIP stores all 4 bands (RGBIR) in a single COG file, so we only need one row per tile in our GTI.
# Examine the assets structure
# NAIP has a single 'image' asset containing all 4 bands (RGBIR)

sample_df = sedona.sql("SELECT id, assets FROM naip_filtered LIMIT 1")
sample = sample_df.first()

if sample:
    print(f"Tile ID: {sample['id']}")
    print(f"\nAvailable asset keys:")
    for key in sorted(sample['assets'].keys()):
        print(f"  - {key}")
    
    # Check structure of the 'image' asset (4-band COG)
    print(f"\n--- Structure of 'image' asset ---")
    image_asset = sample['assets'].get('image')
    if image_asset:
        print(f"Type: {type(image_asset)}")
        print(f"Content: {image_asset}")
else:
    print("No samples found - check filters")
# Show the S3 URL for the 4-band COG

sample_df = sedona.sql("""
    SELECT 
        id,
        assets['image'].href as image_url
    FROM naip_filtered 
    LIMIT 5
""")

sample_df.show(truncate=False)

Creating the GTI (GDAL Raster Tile Index)

The GTI is a spatial index that maps tile geometries to their corresponding COG URLs. For more information on GTI, see the reference documentation. Since NAIP has a single 4-band COG per tile, we create one row per tile.
# Create the GTI
# NAIP has a single 4-band COG per tile, so we just need one row per tile

gti_df = sedona.sql("""
    SELECT 
        id as tile_id,
        ST_AsText(geometry) as geometry_wkt,
        geometry,
        datetime,
        assets['image'].href as url
    FROM naip_filtered
    WHERE assets['image'] IS NOT NULL
""")

# Filter out any rows where URL is null
gti_df = gti_df.filter(F.col("url").isNotNull())

gti_df.createOrReplaceTempView("gti")

row_count = gti_df.count()
print(f"GTI has {row_count} rows (one row per NAIP tile)")
gti_df.select("tile_id", "datetime", "url").show(10, truncate=80)

Saving the GTI as GeoParquet

We convert the Spark DataFrame to a GeoDataFrame and save it as GeoParquet for use with RasterFlow.
from shapely import wkt

# Select columns for the GTI (excluding the Sedona geometry, using WKT instead)
gti_export_df = gti_df.select(
    "tile_id",
    "geometry_wkt",
    "datetime",
    "url"
)

# Convert Spark DataFrame to Pandas
gti_pandas = gti_export_df.toPandas()

# Convert WKT geometry to Shapely geometry and create GeoDataFrame
gti_pandas['geometry'] = gti_pandas['geometry_wkt'].apply(wkt.loads)
gti_gdf = gpd.GeoDataFrame(gti_pandas, geometry='geometry', crs="EPSG:4326")

# Drop the WKT column (no longer needed)
gti_gdf = gti_gdf.drop(columns=['geometry_wkt'])

# Save to S3 as GeoParquet
gti_path = os.getenv("USER_S3_PATH") + "naip_gti.parquet"
gti_gdf.to_parquet(gti_path)

print(f"GTI saved to: {gti_path}")
print(f"\nGTI Schema:")
print(gti_gdf.dtypes)
print(f"\nSample rows:")
gti_gdf.head()

Verifying the GTI

Before building the mosaic, we verify that the GTI was saved correctly and visualize the tile footprints.
# Reload and verify the GTI
gti_verify = gpd.read_parquet(gti_path)

print(f"GTI contains {len(gti_verify)} rows")
print(f"Unique tiles: {gti_verify['tile_id'].nunique()}")
print(f"Date range: {gti_verify['datetime'].min()} to {gti_verify['datetime'].max()}")

print(f"\nSample URLs:")
for i, row in gti_verify.head(3).iterrows():
    print(f"  {row['tile_id']}: {row['url'][:70]}...")
# Visualize the GTI footprints with the AOI boundary on an interactive map

# Calculate center from the AOI
minx, miny, maxx, maxy = gdf.total_bounds
center_lat = (miny + maxy) / 2
center_lon = (minx + maxx) / 2

m = Map(center=(center_lat, center_lon), zoom=10)

# Add NAIP tile footprints
m.add_gdf(
    gti_verify, 
    layer_name="NAIP Tile Footprints", 
    style={"color": "blue", "fillColor": "lightblue", "fillOpacity": 0.3, "weight": 1}
)

# Add AOI boundary on top
m.add_gdf(
    gdf, 
    layer_name="Montgomery County, MD", 
    zoom_to_layer=True,
    style={"color": "red", "fillColor": "none", "fillOpacity": 0, "weight": 3}
)

print(f"Showing {len(gti_verify)} NAIP tile footprints over Montgomery County, MD")
m

Building a mosaic with RasterFlow

Now we use RasterFlow’s build_gti_mosaic function to create a seamless 4-band mosaic from the NAIP tiles.
Note: The naip-analytic S3 bucket is requester-pays, so we set requester_pays=True.
Expected runtime: ~5-10 minutes depending on the number of tiles.
from rasterflow_remote import RasterflowClient

client = RasterflowClient()

# Build a 4-band mosaic (Red, Green, Blue, NIR)
store = client.build_gti_mosaic(
    # Path to the GTI GeoParquet file
    gti=os.getenv("USER_S3_PATH") + "naip_gti.parquet",
    
    # Path to the AOI GeoParquet file
    aoi=os.getenv("USER_S3_PATH") + "montgomery_county_md.parquet",
    
    # Output bands (NAIP has 4 bands: r, g, b, ir)
    # Note: We use 'ir' to match the ChesapeakeRSC model's expected input bands
    bands=["r", "g", "b", "ir"],
    
    # Column in the GTI containing the COG URLs
    location_field="url",
    
    # Output CRS (UTM Zone 18N for Maryland)
    crs_epsg=32618,
    
    # NAIP bucket is requester-pays
    requester_pays=True,
    
    # NAIP uses 0 as nodata value
    nodata=0.0,
)

print(f"Mosaic store: {store}")

Visualizing the mosaic

We load the Zarr store and visualize a small RGB subset of the mosaic.
Note: We visualize a ~2km x 2km preview area near the center of the AOI. Rendering the full county at 1-meter resolution would be too large for efficient display.
import hvplot.xarray
import s3fs
import xarray as xr
import zarr
from pyproj import Transformer

# Open the Zarr store
fs = s3fs.S3FileSystem(profile="default", asynchronous=True)
zstore = zarr.storage.FsspecStore(fs, path=store[5:])
ds = xr.open_zarr(zstore)

# Create a transformer to convert from lat/lon to the mosaic CRS (UTM 18N)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32618", always_xy=True)

# Define a small preview area (~2km x 2km) near the center of the AOI
# Using the full AOI at 1m resolution would be too large to render efficiently
preview_center_lon = (minx + maxx) / 2
preview_center_lat = (miny + maxy) / 2
preview_size = 0.01  # ~1km in each direction (total ~2km x 2km)

preview_min_lon = preview_center_lon - preview_size
preview_max_lon = preview_center_lon + preview_size
preview_min_lat = preview_center_lat - preview_size
preview_max_lat = preview_center_lat + preview_size

# Transform preview bounding box coordinates from lat/lon to UTM meters
(min_x, max_x), (min_y, max_y) = transformer.transform(
    [preview_min_lon, preview_max_lon],
    [preview_min_lat, preview_max_lat]
)

# Slice the dataset to the preview bounding box
# y=slice(max_y, min_y) handles the standard "North-to-South" image orientation
ds_subset = ds.sel(
    x=slice(min_x, max_x),
    y=slice(max_y, min_y)
)

# Select RGB bands and create array for visualization
# NAIP bands are named: r, g, b, ir
rgb = ds_subset.sel(band=["r", "g", "b"])["variables"]

# Handle time dimension if present
if "time" in rgb.dims:
    rgb = rgb.isel(time=0)

# Normalize values for display (NAIP values typically 0-255)
rgb_normalized = rgb / 255.0
rgb_normalized = rgb_normalized.clip(0, 1)

# Visualize as RGB composite
rgb_normalized.hvplot.rgb(
    x="x",
    y="y",
    bands="band",
    data_aspect=1,
    xaxis=False,
    yaxis=False,
    title="NAIP Mosaic (RGB) - Preview",
    frame_width=600,
)

(Optional) Running inference with ChesapeakeRSC

Now that we have built the NAIP mosaic, we can optionally run a semantic segmentation model to detect roads. The ChesapeakeRSC model was trained on NAIP imagery from Maryland and can detect roads even when occluded by tree cover. The model outputs two classes:
  • background
  • road
Expected runtime: ~10-20 minutes depending on the mosaic size.
# # Uncomment to run road detection on the NAIP mosaic
# 
# from rasterflow_remote import InferenceActorEnum
# from rasterflow_remote.data_models import MergeModeEnum
# 
# # ChesapeakeRSC model - direct URL (no HuggingFace auth required)
# MODEL_PATH = "https://huggingface.co/wherobots/chesapeakersc-pt2/resolve/main/chesapeakersc-ep.pt2"
# 
# prediction_store = client.predict_mosaic(
#     store=store,
#     model_path=MODEL_PATH,
#     patch_size=512,
#     clip_size=64,
#     device="cuda",
#     features=["r", "g", "b", "ir"],  # Must match the bands in our mosaic
#     labels=["background", "road"],
#     actor=InferenceActorEnum.SEMANTIC_SEGMENTATION_PYTORCH,
#     max_batch_size=256,
#     merge_mode=MergeModeEnum.WEIGHTED_AVERAGE,
# )
# 
# print(f"Prediction store: {prediction_store}")

(Optional) Visualizing the road detection results

We can visualize the road detection output by loading the prediction Zarr store and displaying the road class probability.
# # Uncomment to visualize road detection results
# 
# # Open the prediction Zarr store
# pred_zstore = zarr.storage.FsspecStore(fs, path=prediction_store[5:])
# pred_ds = xr.open_zarr(pred_zstore)
# 
# # Slice to the same preview area as the RGB visualization
# pred_subset = pred_ds.sel(
#     x=slice(min_x, max_x),
#     y=slice(max_y, min_y)
# )
# 
# # Select the 'road' band
# road_probs = pred_subset.sel(band="road")["variables"]
# 
# # Handle time dimension if present
# if "time" in road_probs.dims:
#     road_probs = road_probs.isel(time=0)
# 
# # Visualize road probability
# road_probs.hvplot(
#     x="x",
#     y="y",
#     cmap="Reds",
#     data_aspect=1,
#     xaxis=False,
#     yaxis=False,
#     title="ChesapeakeRSC Road Detection - Preview",
#     frame_width=600,
#     clabel="Road Probability",
# )