Skip to content

Run geospatial queries

WherobotsDB provides various APIs to work with raster data, provided below are some functions.

The whole catalog of Raster functions provided by Spatial SQL can be found here

Raster Manipulation

Coordinate translation

WherobotsDB allows you to translate coordinates as per your needs. It can translate pixel locations to world coordinates and vice versa.

PixelAsPoint

Use RS_PixelAsPoint to translate pixel coordinates to world location.

SELECT RS_PixelAsPoint(rast, 450, 400) FROM rasterDf

Output:

POINT (-13063342 3992403.75)

World to Raster Coordinate

Use RS_WorldToRasterCoord to translate world location to pixel coordinates. To just get X coordinate use RS_WorldToRasterCoordX and for just Y coordinate use RS_WorldToRasterCoordY.

SELECT RS_WorldToRasterCoord(rast, -1.3063342E7, 3992403.75)

Output:

POINT (450 400)

Pixel Manipulation

Use RS_Values to fetch values for a specified array of Point Geometries. The coordinates in the point geometry are indicative of real-world location.

SELECT RS_Values(rast, Array(ST_Point(-13063342, 3992403.75), ST_Point(-13074192, 3996020)))

Output:

[132.0, 148.0]

To change values over a grid or area defined by a geometry, we will use RS_SetValues.

SELECT RS_SetValues(
        rast, 1, 250, 260, 3, 3,
        Array(10, 12, 17, 26, 28, 37, 43, 64, 66)
    )

Follow the links to get more information on how to use the functions appropriately.

Band Manipulation

WherobotsDB provides APIs to select specific bands from a raster image and create a new raster. For example, to select 2 bands from a raster, you can use the RS_Band API to retrieve the desired multi-band raster.

Let's use a multi-band raster for this example. The process of loading and converting it to raster type is the same.

SELECT RS_Band(colorRaster, Array(1, 2))

Let's say you have many one band rasters and want to add a band to the raster to perform map algebra operations. You can do so using RS_AddBand function.

SELECT RS_AddBand(raster1, raster2, 1, 2)

This will result in raster1 having raster2's specified band.

Resample raster data

WherobotsDB allows you to resample raster data using different interpolation methods like the nearest neighbor, bilinear, and bicubic to change the cell size or align raster grids, using RS_Resample.

SELECT RS_Resample(rast, 50, -50, -13063342, 3992403.75, true, "bicubic")

For more information please follow the link.

Execute map algebra operations

Map algebra is a way to perform raster calculations using mathematical expressions. The expression can be a simple arithmetic operation or a complex combination of multiple operations.

The Normalized Difference Vegetation Index (NDVI) is a simple graphical indicator that can be used to analyze remote sensing measurements from a space platform and assess whether the target being observed contains live green vegetation or not.

NDVI = (NIR - Red) / (NIR + Red)

where NIR is the near-infrared band and Red is the red band.

SELECT RS_MapAlgebra(raster, 'D', 'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);') as ndvi FROM raster_table

For more information please refer to Map Algebra API.

Interoperability between raster and vector data

Geometry As Raster

WherobotsDB allows you to rasterize a geometry by using RS_AsRaster.

SELECT RS_AsRaster(
        ST_GeomFromWKT('POLYGON((150 150, 220 260, 190 300, 300 220, 150 150))'),
        RS_MakeEmptyRaster(1, 'b', 4, 6, 1, -1, 1),
        'b', 230
    )

The image created is as below for the vector:

Rasterized vector

Note

The vector coordinates are buffed up to showcase the output, the real use case, may or may not match the example.

Spatial range query

WherobotsDB provides raster predicates to do a range query using a geometry window, for example let's use RS_Intersects.

SELECT rast FROM rasterDf WHERE RS_Intersect(rast, ST_GeomFromWKT('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'))

Spatial join query

WherobotsDB's raster predicates can also do a spatial join using the raster column and geometry column, using the same function as above.

SELECT r.rast, g.geom FROM rasterDf r, geomDf g WHERE RS_Interest(r.rast, g.geom)

Note

These range and join queries will filter rasters using the provided geometric boundary and the spatial boundary of the raster.

WherobotsDB offers more raster predicates to do spatial range queries and spatial join queries. Please refer to raster predicates docs.

Collecting raster Dataframes and working with them locally in Python

WherobotsDB allows collecting Dataframes with raster columns and working with them locally in Python. The raster objects are represented as SedonaRaster objects in Python, which can be used to perform raster operations.

df_raster = sedona.read.format("binaryFile").load("/path/to/raster.tif").selectExpr("RS_FromGeoTiff(content) as rast")
rows = df_raster.collect()
raster = rows[0].rast
raster  # <sedona.raster.sedona_raster.InDbSedonaRaster at 0x1618fb1f0>

You can retrieve the metadata of the raster by accessing the properties of the SedonaRaster object.

raster.width        # width of the raster
raster.height       # height of the raster
raster.affine_trans # affine transformation matrix
raster.crs_wkt      # coordinate reference system as WKT

You can get a numpy array containing the band data of the raster using the as_numpy or as_numpy_masked method. The band data is organized in CHW order.

raster.as_numpy()        # numpy array of the raster
raster.as_numpy_masked() # numpy array with nodata values masked as nan

If you want to work with the raster data using rasterio, you can retrieve a rasterio.DatasetReader object using the as_rasterio method.

ds = raster.as_rasterio()  # rasterio.DatasetReader object
# Work with the raster using rasterio
band1 = ds.read(1)         # read the first band

Writing Python UDF to work with raster data

You can write Python UDFs to work with raster data in Python. The UDFs can take SedonaRaster objects as input and return any Spark data type as output. This is an example of a Python UDF that calculates the mean of the raster data.

from pyspark.sql.types import DoubleType

def mean_udf(raster):
    return float(raster.as_numpy().mean())

sedona.udf.register("mean_udf", mean_udf, DoubleType())
df_raster.withColumn("mean", expr("mean_udf(rast)")).show()
+--------------------+------------------+
|                rast|              mean|
+--------------------+------------------+
|GridCoverage2D["g...|1542.8092886117788|
+--------------------+------------------+

It is much trickier to write an UDF that returns a raster object, since WherobotsDB does not support serializing Python raster objects yet. However, you can write a UDF that returns the band data as an array and then construct the raster object using RS_MakeRaster. This is an example of a Python UDF that creates a mask raster based on the first band of the input raster.

from pyspark.sql.types import ArrayType, DoubleType
import numpy as np

def mask_udf(raster):
    band1 = raster.as_numpy()[0,:,:]
    mask = (band1 < 1400).astype(np.float64)
    return mask.flatten().tolist()

sedona.udf.register("mask_udf", band_udf, ArrayType(DoubleType()))
df_raster.withColumn("mask", expr("mask_udf(rast)")).withColumn("mask_rast", expr("RS_MakeRaster(rast, 'I', mask)")).show()
+--------------------+--------------------+--------------------+
|                rast|                mask|           mask_rast|
+--------------------+--------------------+--------------------+
|GridCoverage2D["g...|[0.0, 0.0, 0.0, 0...|GridCoverage2D["g...|
+--------------------+--------------------+--------------------+

Working with out-db raster

SedonaRaster will automatically construct a rasterio Env using Hadoop S3A configurations when loading out-db rasters, so out-db rasters should be loaded without credential problems. However, if you are working with out-db rasters in a subprocess, SedonaRaster will fail to infer Hadoop S3A configurations. To make subprocesses pick up Hadoop S3A configurations and properly load out-db rasters, you have to export S3 configs to environment variables using sedona.raster.gdal_conf.export_gdal_conf_to_env:

from sedona.raster import gdal_conf
gdal_conf.export_gdal_conf_to_env()

# ... launch subprocesses and work with out-db rasters

Performance optimization

When working with large raster datasets, refer to the documentation on storing raster geometries in Parquet format for recommendations to optimize performance.