Documentation Index
Fetch the complete documentation index at: https://docs.wherobots.com/llms.txt
Use this file to discover all available pages before exploring further.
Private Preview
The following content is a read-only preview of an executable Jupyter notebook.To run this notebook interactively:
- Go to Wherobots Cloud.
- Start a runtime.
- Open the notebook.
- In the Jupyter Launcher:
- Click File > Open Path.
- Paste the following path to access this notebook:
examples/Analyzing_Data/RasterFlow_Bring_Your_Own_Rasters_NAIP.ipynb
- Click Enter.
This notebook demonstrates how to bring your own rasters (BYOR) into RasterFlow by querying a STAC catalog and creating a GTI (GDAL Raster Tile Index). You will learn how to query the NAIP collection, extract Cloud-Optimized GeoTIFF (COG) URLs, and build a mosaic using RasterFlow’s build_gti_mosaic function.
What you will learn
This notebook will teach you to:
- Query a STAC catalog to discover available imagery for an area of interest
- Create a GTI (GDAL Raster Tile Index) from STAC items
- Build a seamless mosaic from multiple image tiles using RasterFlow’s
build_gti_mosaic
- Visualize the resulting mosaic and inference outputs
- (Optional) Run road detection using the ChesapeakeRSC model
NAIP (National Agriculture Imagery Program)
The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental United States.
Key characteristics:
- Resolution: 1 meter ground sample distance
- Bands: 4-band imagery (Red, Green, Blue, NIR) in a single COG file
- Coverage: Continental United States
- Update cycle: Typically every 2-3 years per state
We will use the Element84 Earth Search STAC catalog to query NAIP imagery for Montgomery County, Maryland.
Setup and Imports
import os
import geopandas as gpd
import wkls
from ipyleaflet import basemaps
from leafmap import Map
Selecting an Area of Interest (AOI)
We will use Montgomery County, Maryland as our AOI. This county is well-covered by NAIP imagery and provides a good example for demonstrating the BYOR workflow.
# Generate a geometry for Montgomery County, MD using Well-Known Locations (https://github.com/wherobots/wkls)
gdf = gpd.read_file(wkls['us']['md']['montgomery county'].geojson())
# Save the geometry to a parquet file in the user's S3 path
aoi_path = os.getenv("USER_S3_PATH") + "montgomery_county_md.parquet"
gdf.to_parquet(aoi_path)
print(f"AOI saved to: {aoi_path}")
print(f"Bounding box: {gdf.total_bounds}")
print(f"CRS: {gdf.crs}")
# Display the AOI boundary on an interactive map with basemap
minx, miny, maxx, maxy = gdf.total_bounds
center_lat = (miny + maxy) / 2
center_lon = (minx + maxx) / 2
m = Map(center=(center_lat, center_lon), zoom=10, basemap=basemaps.Esri.WorldImagery)
m.add_gdf(
gdf,
layer_name="Montgomery County, MD",
zoom_to_layer=True,
style={"color": "red", "fillColor": "lightblue", "fillOpacity": 0.3, "weight": 2}
)
m
Loading the NAIP STAC collection
We use Wherobots’ built-in STAC reader to load the NAIP collection from Element84’s Earth Search catalog. This catalog provides direct S3 URLs to the imagery files.
from pyspark.sql import functions as F
from sedona.spark import *
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
# Get the actual polygon geometry as WKT for spatial filter
aoi_wkt = gdf.geometry.iloc[0].wkt
print(f"AOI geometry type: {gdf.geometry.iloc[0].geom_type}")
# Load NAIP collection from Element84 Earth Search
naip_df = (
sedona.read.format("stac")
.option("itemsLimitMax", "200")
.option("itemsLimitPerRequest", "50")
.load("https://earth-search.aws.element84.com/v1/collections/naip")
)
naip_df.createOrReplaceTempView("naip_raw")
print(f"\nLoaded NAIP STAC collection")
print(f"Schema:")
naip_df.printSchema()
Applying spatial and temporal filters
We filter the NAIP collection to only include tiles that:
- Intersect with our AOI (Montgomery County)
- Were captured in 2017 (the most recent year with 1-meter resolution NAIP imagery available for Maryland so we can test the ChesapeakeRSC model)
# Filter by:
# - Date range: 2017 (full year)
# - Spatial: Intersects with Montgomery County polygon boundary
filtered_df = sedona.sql(f"""
SELECT
id,
geometry,
datetime,
assets
FROM naip_raw
WHERE datetime BETWEEN '2017-01-01' AND '2017-12-31'
AND ST_Intersects(geometry, ST_GeomFromText('{aoi_wkt}'))
""")
filtered_df.createOrReplaceTempView("naip_filtered")
tile_count = filtered_df.count()
print(f"Found {tile_count} NAIP tiles for Montgomery County, MD in 2017")
filtered_df.select("id", "datetime").show(10, truncate=False)
Exploring the STAC asset structure
Before creating the GTI, we need to understand the structure of the STAC assets. NAIP stores all 4 bands (RGBIR) in a single COG file, so we only need one row per tile in our GTI.
# Examine the assets structure
# NAIP has a single 'image' asset containing all 4 bands (RGBIR)
sample_df = sedona.sql("SELECT id, assets FROM naip_filtered LIMIT 1")
sample = sample_df.first()
if sample:
print(f"Tile ID: {sample['id']}")
print(f"\nAvailable asset keys:")
for key in sorted(sample['assets'].keys()):
print(f" - {key}")
# Check structure of the 'image' asset (4-band COG)
print(f"\n--- Structure of 'image' asset ---")
image_asset = sample['assets'].get('image')
if image_asset:
print(f"Type: {type(image_asset)}")
print(f"Content: {image_asset}")
else:
print("No samples found - check filters")
# Show the S3 URL for the 4-band COG
sample_df = sedona.sql("""
SELECT
id,
assets['image'].href as image_url
FROM naip_filtered
LIMIT 5
""")
sample_df.show(truncate=False)
Creating the GTI (GDAL Raster Tile Index)
The GTI is a spatial index that maps tile geometries to their corresponding COG URLs. For more information on GTI, see the reference documentation.
Since NAIP has a single 4-band COG per tile, we create one row per tile.
# Create the GTI
# NAIP has a single 4-band COG per tile, so we just need one row per tile
gti_df = sedona.sql("""
SELECT
id as tile_id,
ST_AsText(geometry) as geometry_wkt,
geometry,
datetime,
assets['image'].href as url
FROM naip_filtered
WHERE assets['image'] IS NOT NULL
""")
# Filter out any rows where URL is null
gti_df = gti_df.filter(F.col("url").isNotNull())
gti_df.createOrReplaceTempView("gti")
row_count = gti_df.count()
print(f"GTI has {row_count} rows (one row per NAIP tile)")
gti_df.select("tile_id", "datetime", "url").show(10, truncate=80)
Saving the GTI as GeoParquet
We convert the Spark DataFrame to a GeoDataFrame and save it as GeoParquet for use with RasterFlow.
from shapely import wkt
# Select columns for the GTI (excluding the Sedona geometry, using WKT instead)
gti_export_df = gti_df.select(
"tile_id",
"geometry_wkt",
"datetime",
"url"
)
# Convert Spark DataFrame to Pandas
gti_pandas = gti_export_df.toPandas()
# Convert WKT geometry to Shapely geometry and create GeoDataFrame
gti_pandas['geometry'] = gti_pandas['geometry_wkt'].apply(wkt.loads)
gti_gdf = gpd.GeoDataFrame(gti_pandas, geometry='geometry', crs="EPSG:4326")
# Drop the WKT column (no longer needed)
gti_gdf = gti_gdf.drop(columns=['geometry_wkt'])
# Save to S3 as GeoParquet
gti_path = os.getenv("USER_S3_PATH") + "naip_gti.parquet"
gti_gdf.to_parquet(gti_path)
print(f"GTI saved to: {gti_path}")
print(f"\nGTI Schema:")
print(gti_gdf.dtypes)
print(f"\nSample rows:")
gti_gdf.head()
Verifying the GTI
Before building the mosaic, we verify that the GTI was saved correctly and visualize the tile footprints.
# Reload and verify the GTI
gti_verify = gpd.read_parquet(gti_path)
print(f"GTI contains {len(gti_verify)} rows")
print(f"Unique tiles: {gti_verify['tile_id'].nunique()}")
print(f"Date range: {gti_verify['datetime'].min()} to {gti_verify['datetime'].max()}")
print(f"\nSample URLs:")
for i, row in gti_verify.head(3).iterrows():
print(f" {row['tile_id']}: {row['url'][:70]}...")
# Visualize the GTI footprints with the AOI boundary on an interactive map
# Calculate center from the AOI
minx, miny, maxx, maxy = gdf.total_bounds
center_lat = (miny + maxy) / 2
center_lon = (minx + maxx) / 2
m = Map(center=(center_lat, center_lon), zoom=10, basemap=basemaps.Esri.WorldImagery)
# Add NAIP tile footprints
m.add_gdf(
gti_verify,
layer_name="NAIP Tile Footprints",
style={"color": "blue", "fillColor": "lightblue", "fillOpacity": 0.3, "weight": 1}
)
# Add AOI boundary on top
m.add_gdf(
gdf,
layer_name="Montgomery County, MD",
zoom_to_layer=True,
style={"color": "red", "fillColor": "none", "fillOpacity": 0, "weight": 3}
)
print(f"Showing {len(gti_verify)} NAIP tile footprints over Montgomery County, MD")
m
Building a mosaic with RasterFlow
Now we use RasterFlow’s build_gti_mosaic function to create a seamless 4-band mosaic from the NAIP tiles.
Note: The naip-analytic S3 bucket is requester-pays, so we set requester_pays=True.
Expected runtime: ~5-10 minutes depending on the number of tiles.
from rasterflow_remote import RasterflowClient
client = RasterflowClient()
# Build a 4-band mosaic (Red, Green, Blue, NIR)
mosaic_index = client.build_gti_mosaic(
# Path to the GTI GeoParquet file
gti=os.getenv("USER_S3_PATH") + "naip_gti.parquet",
# Path to the AOI GeoParquet file
aoi=os.getenv("USER_S3_PATH") + "montgomery_county_md.parquet",
# Output bands (NAIP has 4 bands: r, g, b, ir)
# Note: We use 'ir' to match the ChesapeakeRSC model's expected input bands
bands=["r", "g", "b", "ir"],
# Column in the GTI containing the COG URLs
location_field="url",
# Group tiles for time dimension by any column in the gti index
time_column="datetime",
# Output CRS (UTM Zone 18N for Maryland)
crs_epsg=32618,
# NAIP bucket is requester-pays
requester_pays=True,
# NAIP uses 0 as nodata value
nodata=0.0,
)
# This example AOI has one geometry, so we select the first (only) mosaic location.
mosaic_store = mosaic_index.first_row_mosaic
print(f"Mosaic index URI: {mosaic_index.uri}")
print(f"Mosaic store URI: {mosaic_store}")
Visualizing outputs
If RasterFlow is enabled for your organization, you can visualize the mosaic and inference outputs from this notebook using cloud.wherobots.com/map. Zarr, GeoParquet, and other geospatial data formats are supported, so you can inspect either the mosaic output or the road-probability output from the preceding workflow steps.
(Optional) Running inference with ChesapeakeRSC
Now that we have built the NAIP mosaic, we can optionally run a semantic segmentation model to detect roads. The ChesapeakeRSC model was trained on NAIP imagery from Maryland and can detect roads even when occluded by tree cover.
The model outputs two classes:
Expected runtime: ~10-20 minutes depending on the mosaic size.
# # Uncomment to run road detection on the NAIP mosaic
#
# from rasterflow_remote import MosaicToMosaicActorEnum
# from rasterflow_remote.data_models import MergeModeEnum
#
# # ChesapeakeRSC model - direct URL (no HuggingFace auth required)
# MODEL_PATH = "https://huggingface.co/wherobots/chesapeakersc-pt2/resolve/main/chesapeakersc-ep.pt2"
#
# prediction_store = client.predict_mosaic(
# store=mosaic_store,
# model_path=MODEL_PATH,
# patch_size=512,
# clip_size=64,
# device="cuda",
# features=["r", "g", "b", "ir"], # Must match the bands in our mosaic
# labels=["background", "road"],
# actor=MosaicToMosaicActorEnum.SEMANTIC_SEGMENTATION_PYTORCH,
# max_batch_size=256,
# merge_mode=MergeModeEnum.WEIGHTED_AVERAGE,
# )
#
# print(f"Prediction index URI: {prediction_store.uri}")
You can also open the road-probability output from the preceding inference workflow using cloud.wherobots.com/map if RasterFlow is enabled for your organization.