- Write DataFrames as managed tables in WherobotsDB
- Apply spatial clustering to improve query performance on large datasets
- Export geospatial tables to GeoParquet for sharing or external processing
- Confirm export success and inspect output
- Use both Python and SQL workflows for data export
Working with tabular data in Wherobots
Wherobots supports loading structured tabular data directly from cloud storage. In this example, we are working with the GDELT dataset, a global event database published as daily CSV files on AWS S3. When working with your own data, you can:
- Load data into Wherobots Cloud managed storage (Docs)
- Connect directly to cloud storage like AWS S3 (Docs)
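As a rough illustration of the loading step, the sketch below reads one GDELT file into a DataFrame and registers it as a temporary view. It assumes an existing Sedona/Wherobots session object passed in as `sedona`; the file path and the view name `gdelt_raw` are assumptions for illustration, not values taken from this tutorial.

```python
# Assumed location of one daily GDELT events file in the public AWS bucket.
GDELT_PATH = "s3://gdelt-open-data/events/20190918.export.csv"

# GDELT event files are tab-delimited and ship without a header row.
GDELT_CSV_OPTIONS = {"sep": "\t", "header": "false"}


def load_gdelt(sedona, path=GDELT_PATH):
    """Read one GDELT CSV into a DataFrame and expose it as a temporary view."""
    df = sedona.read.format("csv").options(**GDELT_CSV_OPTIONS).load(path)
    df.createOrReplaceTempView("gdelt_raw")  # hypothetical view name
    return df
```

Because the files have no header row, Spark assigns positional column names (`_c0`, `_c1`, ...), so you would normally rename the columns you need before building the table.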
Creating a managed table from raw data
We can now convert the DataFrame into an Iceberg table.
- The `CREATE DATABASE` command creates the `org_catalog.gdelt` database if it doesn’t already exist.
- `CREATE OR REPLACE TABLE` makes a managed table by selecting data from the temporary view.
- We also create a geometry column using the latitude and longitude fields from the CSV.
  - `ST_Point` creates a point geometry from the latitude and longitude.
  - `ST_SetSRID` sets the spatial reference system to EPSG:4326 (WGS 84).
- `LIMIT 10000` reduces the size of the data, since this is a tutorial exercise and the typical GDELT daily dataset has hundreds of millions of rows.
  - Wherobots is built to scale to petabyte-sized, planetary spatial workloads. (Docs: Runtimes)
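The statements described above might look roughly like the following sketch. The database name `org_catalog.gdelt` comes from the text; the table name `gdelt_events`, the view name `gdelt_raw`, and the column names `latitude` and `longitude` are hypothetical placeholders, and `sedona` is assumed to be an existing session.

```python
CATALOG_DB = "org_catalog.gdelt"       # database named in the text
TABLE = f"{CATALOG_DB}.gdelt_events"   # hypothetical table name


def build_statements(view="gdelt_raw", lat_col="latitude",
                     lon_col="longitude", limit=10000):
    """Return the SQL for creating the database and the managed table."""
    create_db = f"CREATE DATABASE IF NOT EXISTS {CATALOG_DB}"
    create_table = f"""
CREATE OR REPLACE TABLE {TABLE} AS
SELECT
    *,
    -- ST_Point takes (x, y), that is (longitude, latitude)
    ST_SetSRID(
        ST_Point(CAST({lon_col} AS DOUBLE), CAST({lat_col} AS DOUBLE)),
        4326
    ) AS geometry
FROM {view}
LIMIT {limit}
""".strip()
    return [create_db, create_table]


def create_managed_table(sedona):
    """Run each statement against an existing Sedona/Wherobots session."""
    for statement in build_statements():
        sedona.sql(statement)
```

Note the argument order: `ST_Point` expects x before y, so longitude comes first. Swapping the two is a common source of points that land in the wrong hemisphere.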
Writing efficient GeoParquet with metadata
When exporting spatial data for downstream use, the GeoParquet format offers an efficient, interoperable way to store vector data with embedded spatial metadata. GeoParquet builds on the Parquet columnar format, adding metadata for geometries, coordinate reference systems (CRS), and bounding boxes. For more on GeoParquet, see the GeoParquet specification.
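To make the "embedded spatial metadata" concrete, here is a minimal sketch of the JSON document that GeoParquet stores under the `geo` key of the Parquet file metadata. The field names follow the GeoParquet 1.1.0 specification; the column names `geometry` and `bbox` and the global extent are assumptions chosen for illustration.

```python
import json


def geo_metadata(geometry_column="geometry", covering_column="bbox"):
    """Build a minimal GeoParquet 1.1.0 'geo' metadata document for a point column."""
    return {
        "version": "1.1.0",
        "primary_column": geometry_column,
        "columns": {
            geometry_column: {
                "encoding": "WKB",
                "geometry_types": ["Point"],
                # "crs" is omitted here; the spec then assumes OGC:CRS84
                # (longitude/latitude on WGS 84).
                # Dataset-wide extent as [xmin, ymin, xmax, ymax].
                "bbox": [-180.0, -90.0, 180.0, 90.0],
                # Points readers at the per-row bounding box struct column.
                "covering": {
                    "bbox": {
                        "xmin": [covering_column, "xmin"],
                        "ymin": [covering_column, "ymin"],
                        "xmax": [covering_column, "xmax"],
                        "ymax": [covering_column, "ymax"],
                    }
                },
            }
        },
    }


print(json.dumps(geo_metadata(), indent=2))
```

A GeoParquet writer generates this document for you; the sketch is only meant to show what a reader finds when it inspects the file.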
Partitioning and adding bounding boxes
Before writing the data, we optimize it for efficient storage and querying:
- GeoHash partitioning: We compute a GeoHash for each geometry and partition the data accordingly. This organizes the dataset spatially, improving query performance for spatial ranges.
- Bounding box metadata: We add a bounding box for each geometry, allowing readers to perform fast spatial filtering without loading the full dataset.
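To show what these two values represent, the sketch below computes a GeoHash and a bounding box for a single point in plain Python. This is an illustration of the concepts only, not the tutorial's implementation: in a Wherobots or Sedona job you would more likely compute them in SQL with functions such as `ST_GeoHash`, `ST_XMin`, and `ST_YMax`.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)


def geohash(lon, lat, precision=6):
    """Encode a lon/lat pair as a GeoHash string of `precision` characters."""
    lon_range = [-180.0, 180.0]
    lat_range = [-90.0, 90.0]
    bits = []
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):  # every 5 bits become one base-32 character
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)


def point_bbox(lon, lat):
    """Bounding box of a point: min and max coincide on both axes."""
    return {"xmin": lon, "ymin": lat, "xmax": lon, "ymax": lat}


print(geohash(10.40744, 57.64911, precision=5))  # → u4pru
print(point_bbox(10.40744, 57.64911))
```

Nearby points share a GeoHash prefix, which is why grouping or sorting rows by GeoHash tends to place spatially close records in the same files and row groups.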
Writing the GeoParquet file
We write the data using the GeoParquet format with key options:
- `geoparquet.version`: Specifies the format version (recommended: `1.1.0`)
- `geoparquet.covering`: Defines the spatial covering method (we use `bbox`)
- `geoparquet.crs`: Passes the PROJJSON metadata for the CRS (optional)
- `compression`: We apply `snappy` compression for efficient storage
The data is partitioned by geohash and sorted within partitions to improve downstream query performance.
This writes a partitioned, metadata-rich GeoParquet dataset ready for scalable spatial analysis.
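The write step described in this section might be sketched as follows. The option names and values are the ones listed above; the DataFrame `df`, its `geohash` and `bbox` columns, and the output path are assumptions, and the exact partitioning calls used in the original notebook may differ.

```python
def geoparquet_write_options(crs_projjson=None):
    """Collect the writer options listed above into one dictionary."""
    options = {
        "geoparquet.version": "1.1.0",
        "geoparquet.covering": "bbox",  # assumed name of the bounding box column
        "compression": "snappy",
    }
    if crs_projjson is not None:
        # Optional: a PROJJSON string describing the CRS.
        options["geoparquet.crs"] = crs_projjson
    return options


def write_geoparquet(df, output_path, crs_projjson=None):
    """Group rows by geohash, sort within partitions, and write GeoParquet."""
    (
        df.repartition("geohash")          # co-locate rows with the same geohash
        .sortWithinPartitions("geohash")   # keep neighbours adjacent in each file
        .write.format("geoparquet")
        .options(**geoparquet_write_options(crs_projjson))
        .mode("overwrite")
        .save(output_path)
    )
```

Keeping the options in a separate function makes them easy to inspect or reuse; `write_geoparquet(df, "s3://your-bucket/path/")` would then perform the export, where the bucket path is a placeholder.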

