Run geospatial queries
WherobotsDB provides a range of APIs for working with raster data. This page introduces some of them.
The full catalog of raster functions is available in the Spatial SQL documentation.
Raster Manipulation¶
Coordinate translation¶
WherobotsDB allows you to translate coordinates as per your needs. It can translate pixel locations to world coordinates and vice versa.
PixelAsPoint¶
Use RS_PixelAsPoint to translate pixel coordinates to world location.
SELECT RS_PixelAsPoint(rast, 450, 400) FROM rasterDf
Output:
POINT (-13063342 3992403.75)
World to Raster Coordinate¶
Use RS_WorldToRasterCoord to translate a world location to pixel coordinates. To get only the X or Y coordinate, use RS_WorldToRasterCoordX or RS_WorldToRasterCoordY, respectively.
SELECT RS_WorldToRasterCoord(rast, -1.3063342E7, 3992403.75)
Output:
POINT (450 400)
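The two translations are inverses of each other under a raster's affine georeferencing. The following is a minimal pure-Python sketch of that arithmetic, not the WherobotsDB implementation. The `GeoTransform` class, its field names, and the sample values are hypothetical, and the sketch assumes 1-based pixel indices with each pixel mapped to its upper-left corner.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GeoTransform:
    """Hypothetical affine georeference (not a WherobotsDB type)."""
    upper_left_x: float
    upper_left_y: float
    scale_x: float
    scale_y: float  # typically negative for north-up rasters
    skew_x: float = 0.0
    skew_y: float = 0.0

    def pixel_to_world(self, col: int, row: int) -> Tuple[float, float]:
        # Assumption: 1-based indices, upper-left corner of the pixel.
        c, r = col - 1, row - 1
        x = self.upper_left_x + c * self.scale_x + r * self.skew_x
        y = self.upper_left_y + c * self.skew_y + r * self.scale_y
        return (x, y)

    def world_to_pixel(self, x: float, y: float) -> Tuple[int, int]:
        # Invert the 2x2 linear part, then floor to the containing pixel.
        det = self.scale_x * self.scale_y - self.skew_x * self.skew_y
        dx, dy = x - self.upper_left_x, y - self.upper_left_y
        c = (dx * self.scale_y - dy * self.skew_x) / det
        r = (dy * self.scale_x - dx * self.skew_y) / det
        return (math.floor(c) + 1, math.floor(r) + 1)


# Made-up georeference: 10-unit cells, origin at (1000, 2000).
gt = GeoTransform(1000.0, 2000.0, 10.0, -10.0)
print(gt.pixel_to_world(3, 5))            # (1020.0, 1960.0)
print(gt.world_to_pixel(1020.0, 1960.0))  # (3, 5)
```

Any world location inside a pixel's footprint maps back to that same pixel, which is why the inverse floors the fractional result.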
Pixel Manipulation¶
Use RS_Values to fetch values for a specified array of point geometries. The coordinates in each point geometry represent a real-world location.
SELECT RS_Values(rast, Array(ST_Point(-13063342, 3992403.75), ST_Point(-13074192, 3996020)))
Output:
[132.0, 148.0]
To change values over a grid or an area defined by a geometry, use RS_SetValues.
SELECT RS_SetValues(
rast, 1, 250, 260, 3, 3,
Array(10, 12, 17, 26, 28, 37, 43, 64, 66)
)
See the RS_Values and RS_SetValues reference pages for details on how to use these functions.
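To make the grid form of the update concrete, here is a small pure-Python sketch of overwriting a rectangular block of cells in one band. It is an illustration only: the `set_values` helper is hypothetical, and the 1-based upper-left indexing and row-by-row layout of the value array are assumptions rather than documented behavior.

```python
from typing import List, Sequence


def set_values(grid: List[List[float]], col: int, row: int,
               width: int, height: int,
               values: Sequence[float]) -> List[List[float]]:
    """Return a copy of `grid` with a width x height block overwritten.

    Assumptions: `col` and `row` are 1-based and mark the block's
    upper-left cell, and `values` is laid out row by row.
    """
    if len(values) != width * height:
        raise ValueError("values must contain width * height entries")
    out = [list(r) for r in grid]  # leave the input untouched
    for i in range(height):
        for j in range(width):
            out[row - 1 + i][col - 1 + j] = values[i * width + j]
    return out


# Overwrite a 3-wide, 2-tall block starting at column 2, row 2.
grid = [[0] * 5 for _ in range(4)]
patched = set_values(grid, 2, 2, 3, 2, [10, 12, 17, 26, 28, 37])
for r in patched:
    print(r)
```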
Band Manipulation¶
WherobotsDB provides APIs to select specific bands from a raster image and create a new raster. For example, to select 2 bands from a raster, you can use the RS_Band API to retrieve the desired multi-band raster.
Let's use a multi-band raster for this example. The process of loading and converting it to raster type is the same.
SELECT RS_Band(colorRaster, Array(1, 2))
Suppose you have several single-band rasters and want to add a band from one raster to another so that you can perform map algebra operations. You can do so using the RS_AddBand function.
SELECT RS_AddBand(raster1, raster2, 1, 2)
This will result in raster1 having raster2's specified band.
Resample raster data¶
WherobotsDB allows you to resample raster data with RS_Resample, using interpolation methods such as nearest neighbor, bilinear, and bicubic, in order to change the cell size or align raster grids.
SELECT RS_Resample(rast, 50, -50, -13063342, 3992403.75, true, "bicubic")
For more information, see the RS_Resample reference.
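As a rough illustration of what nearest-neighbor interpolation does when the cell size changes, the sketch below resamples a plain 2D grid to a new width and height. It ignores georeferencing entirely and is not how RS_Resample is implemented; `resample_nearest` is a hypothetical helper.

```python
from typing import List


def resample_nearest(grid: List[List[float]], new_width: int,
                     new_height: int) -> List[List[float]]:
    """Resample one band to new_width x new_height cells.

    Each output cell takes the value of the input cell that contains
    the output cell's center (nearest-neighbor interpolation).
    """
    old_height, old_width = len(grid), len(grid[0])
    out = []
    for i in range(new_height):
        # Center of output row i, expressed in input-row units.
        src_i = min(int((i + 0.5) * old_height / new_height), old_height - 1)
        row = []
        for j in range(new_width):
            src_j = min(int((j + 0.5) * old_width / new_width), old_width - 1)
            row.append(grid[src_i][src_j])
        out.append(row)
    return out


# Doubling the resolution repeats each input cell in a 2 x 2 block.
for r in resample_nearest([[1, 2], [3, 4]], 4, 4):
    print(r)
```

Bilinear and bicubic methods instead blend several neighboring input cells, which gives smoother results for continuous data at the cost of producing values that were not in the input.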
Execute map algebra operations¶
Map algebra is a way to perform raster calculations using mathematical expressions. The expression can be a simple arithmetic operation or a complex combination of multiple operations.
The Normalized Difference Vegetation Index (NDVI) is a simple graphical indicator that can be used to analyze remote sensing measurements from a space platform and assess whether the target being observed contains live green vegetation or not.
NDVI = (NIR - Red) / (NIR + Red)
where NIR is the near-infrared band and Red is the red band.
SELECT RS_MapAlgebra(raster, 'D', 'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);') as ndvi FROM raster_table
For more information please refer to Map Algebra API.
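The NDVI formula is applied independently to every pixel. The following pure-Python sketch shows that per-pixel arithmetic on two small bands held as nested lists. The function names are hypothetical, and returning `None` when NIR + Red is zero is a choice made for this sketch, not a statement about how RS_MapAlgebra handles that case.

```python
from typing import List, Optional


def ndvi(nir: float, red: float) -> Optional[float]:
    """Return (NIR - Red) / (NIR + Red), or None when the sum is zero."""
    total = nir + red
    if total == 0:
        return None  # sketch-only convention for an undefined ratio
    return (nir - red) / total


def ndvi_band(nir_band: List[List[float]],
              red_band: List[List[float]]) -> List[List[Optional[float]]]:
    """Apply ndvi() cell by cell to two bands of identical shape."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


nir = [[3.0, 1.0], [0.0, 2.0]]
red = [[1.0, 3.0], [0.0, 2.0]]
print(ndvi_band(nir, red))  # [[0.5, -0.5], [None, 0.0]]
```

Values close to 1 indicate strong near-infrared reflectance relative to red, which is typical of dense green vegetation; values near or below 0 indicate little or no vegetation.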
Interoperability between raster and vector data¶
Geometry As Raster¶
WherobotsDB allows you to rasterize a geometry by using RS_AsRaster.
SELECT RS_AsRaster(
ST_GeomFromWKT('POLYGON((150 150, 220 260, 190 300, 300 220, 150 150))'),
RS_MakeEmptyRaster(1, 'b', 4, 6, 1, -1, 1),
'b', 230
)
The image created is as below for the vector:
Note
The vector coordinates are scaled up to showcase the output; real-world inputs may not match this example.
Spatial range query¶
WherobotsDB provides raster predicates for running a range query with a geometry window. For example, using RS_Intersects:
SELECT rast FROM rasterDf WHERE RS_Intersects(rast, ST_GeomFromWKT('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'))
Spatial join query¶
WherobotsDB's raster predicates can also perform a spatial join between a raster column and a geometry column, using the same function as above.
SELECT r.rast, g.geom FROM rasterDf r, geomDf g WHERE RS_Intersects(r.rast, g.geom)
Note
These range and join queries will filter rasters using the provided geometric boundary and the spatial boundary of the raster.
WherobotsDB offers more raster predicates to do spatial range queries and spatial join queries. Please refer to raster predicates docs.
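As a simplified mental model of filtering by spatial boundary, the sketch below tests whether two axis-aligned bounding boxes overlap. This is only an approximation for illustration: the real predicates work on the actual query geometry and raster extent, and the `Envelope` type and `envelopes_intersect` function are hypothetical.

```python
from typing import NamedTuple


class Envelope(NamedTuple):
    """Hypothetical axis-aligned bounding box (not a WherobotsDB type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def envelopes_intersect(a: Envelope, b: Envelope) -> bool:
    # Two boxes intersect unless one lies entirely to one side of the other.
    return not (a.max_x < b.min_x or b.max_x < a.min_x
                or a.max_y < b.min_y or b.max_y < a.min_y)


raster_extent = Envelope(0, 0, 100, 100)
print(envelopes_intersect(raster_extent, Envelope(90, 90, 120, 120)))    # True
print(envelopes_intersect(raster_extent, Envelope(200, 200, 300, 300)))  # False
```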
Collecting raster Dataframes and working with them locally in Python¶
WherobotsDB allows you to collect DataFrames with raster columns and work with them locally in Python.
The raster objects are represented as SedonaRaster objects in Python, which can be used to perform raster operations.
df_raster = sedona.read.format("binaryFile").load("/path/to/raster.tif").selectExpr("RS_FromGeoTiff(content) as rast")
rows = df_raster.collect()
raster = rows[0].rast
raster # <sedona.raster.sedona_raster.InDbSedonaRaster at 0x1618fb1f0>
You can retrieve the metadata of the raster by accessing the properties of the SedonaRaster object.
raster.width # width of the raster
raster.height # height of the raster
raster.affine_trans # affine transformation matrix
raster.crs_wkt # coordinate reference system as WKT
You can get a numpy array containing the band data of the raster using the as_numpy or as_numpy_masked method. The band data is organized in CHW order.
raster.as_numpy() # numpy array of the raster
raster.as_numpy_masked() # numpy array with nodata values masked as nan
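CHW order puts the band (channel) axis first, followed by height (rows) and then width (columns), so the first index selects a band. The sketch below mimics that layout with nested lists instead of numpy so that it runs without any dependencies; the helper names are hypothetical.

```python
from typing import List, Tuple

Band = List[List[float]]


def shape_chw(data: List[Band]) -> Tuple[int, int, int]:
    """Return (bands, height, width) for a band-first nested list."""
    return (len(data), len(data[0]), len(data[0][0]))


def pixel_across_bands(data: List[Band], row: int, col: int) -> List[float]:
    """Collect the value at (row, col) from every band (0-based indices)."""
    return [band[row][col] for band in data]


# Two bands, each 2 rows by 3 columns.
data = [
    [[1, 2, 3], [4, 5, 6]],
    [[10, 20, 30], [40, 50, 60]],
]
print(shape_chw(data))                 # (2, 2, 3)
print(pixel_across_bands(data, 1, 2))  # [6, 60]
```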
If you want to work with the raster data using rasterio, you can retrieve a rasterio.DatasetReader object using the as_rasterio method.
ds = raster.as_rasterio() # rasterio.DatasetReader object
# Work with the raster using rasterio
band1 = ds.read(1) # read the first band
Writing Python UDF to work with raster data¶
You can write Python UDFs to work with raster data in Python. The UDFs can take SedonaRaster objects as input and return any Spark data type as output. This is an example of a Python UDF that calculates the mean of the raster data.
from pyspark.sql.functions import expr
from pyspark.sql.types import DoubleType

def mean_udf(raster):
    # Average over all bands and pixels of the raster.
    return float(raster.as_numpy().mean())

sedona.udf.register("mean_udf", mean_udf, DoubleType())
df_raster.withColumn("mean", expr("mean_udf(rast)")).show()
+--------------------+------------------+
| rast| mean|
+--------------------+------------------+
|GridCoverage2D["g...|1542.8092886117788|
+--------------------+------------------+
It is much trickier to write a UDF that returns a raster object, since WherobotsDB does not support serializing Python raster objects yet. However, you can write a UDF that returns the band data as an array and then construct the raster object using RS_MakeRaster. This is an example of a Python UDF that creates a mask raster based on the first band of the input raster.
import numpy as np
from pyspark.sql.functions import expr
from pyspark.sql.types import ArrayType, DoubleType

def mask_udf(raster):
    band1 = raster.as_numpy()[0, :, :]  # first band (CHW order)
    mask = (band1 < 1400).astype(np.float64)  # 1.0 where the value is below 1400
    return mask.flatten().tolist()

sedona.udf.register("mask_udf", mask_udf, ArrayType(DoubleType()))
df_raster.withColumn("mask", expr("mask_udf(rast)")).withColumn("mask_rast", expr("RS_MakeRaster(rast, 'I', mask)")).show()
+--------------------+--------------------+--------------------+
| rast| mask| mask_rast|
+--------------------+--------------------+--------------------+
|GridCoverage2D["g...|[0.0, 0.0, 0.0, 0...|GridCoverage2D["g...|
+--------------------+--------------------+--------------------+
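The mask UDF does two things: it thresholds every cell of the first band, and it flattens the 2D result row by row (the default order of numpy's `flatten`). A dependency-free sketch of that same step, using nested lists in place of a numpy array, can help when checking the logic outside Spark. The `mask_values` name and the sample band are hypothetical.

```python
from typing import List


def mask_values(band: List[List[float]], threshold: float) -> List[float]:
    """Flatten a 2D band row by row into 1.0 / 0.0 mask values.

    A cell becomes 1.0 when its value is strictly below `threshold`.
    """
    return [1.0 if value < threshold else 0.0 for row in band for value in row]


band1 = [
    [1300, 1500],
    [1400, 1200],
]
print(mask_values(band1, 1400))  # [1.0, 0.0, 0.0, 1.0]
```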
Working with out-db raster¶
SedonaRaster will automatically construct a rasterio Env using Hadoop S3A configurations when loading out-db rasters, so out-db rasters should be loaded without credential problems. However, if you are working with out-db rasters in a subprocess, SedonaRaster will fail to infer Hadoop S3A configurations. To make subprocesses pick up Hadoop S3A configurations and properly load out-db rasters, you have to export S3 configs to environment variables using sedona.raster.gdal_conf.export_gdal_conf_to_env:
from sedona.raster import gdal_conf
gdal_conf.export_gdal_conf_to_env()
# ... launch subprocesses and work with out-db rasters
Performance optimization¶
When working with large raster datasets, refer to the documentation on storing raster geometries in Parquet format for recommendations to optimize performance.