Skip to content

Load data from external storage

A raster is a regular grid of numeric values with geo-referencing metadata that describes the location and resolution of the grid. Raster data is commonly used to represent terrain, elevation, and satellite imagery. Havasu supports raster data type and allows users to store and query raster data. This tutorial describes how to use raster data type and raster functions in WherobotsDB for manipulating raster data.

In-DB raster vs. Out-DB raster

WherobotsDB supports two types of raster data: In-DB raster and Out-DB raster. In-DB raster stores the complete raster data in WherobotsDB. Out-DB rasters only holds the geo-referencing information of the raster, as well as the path to the actual raster data. The actual raster data is stored in remote storage, such as S3. Out-DB rasters are useful when the raster data is too large to be stored in parquet files, or when the raster data is already stored in some remote storage.

In-DB and Out-DB Rasters In-DB and Out-DB Rasters

In WherobotsDB, we recommend Out-DB raster for your raster data, as it is more efficient and scalable. In this tutorial, we will show you how to create an Out-DB raster table and an In-DB raster table.

Create an Out-DB Raster type column

Out-DB Rasters has the same data type with In-DB Rasters, they are all represented as raster primitive type in Havasu and can be loaded as RasterUDT objects in Sedona. If user has a table containing a raster column, it can have both In-DB and Out-DB rasters in that column.

Use RS_FromPath to create Out-DB Raster

User can create an out-db raster using RS_FromPath function:

df = sedona.sql("SELECT RS_FromPath('s3://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast")
val df = sedona.sql("SELECT RS_FromPath('s3://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast")
Dataset<Row> df = sedona.sql("SELECT RS_FromPath('s3://wherobots-examples/data/eurosat_small/AnnualCrop/AnnualCrop_10.tif') as rast");

WherobotsDB allows adding multiple out-db rasters to one object. This example ensures the out-db raster will retrieve the raster only when required for computation:

df = sedona.read.format("binaryFile")
           .option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*")
           .load("s3://wherobots-examples/data/eurosat_small/AnnualCrop/*.tif")
           .drop("content").withColumn("rast", expr("RS_FromPath(path)"))
val df = sedona.read.format("binaryFile")
           .option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*")
           .load("s3://wherobots-examples/data/eurosat_small/AnnualCrop/*.tif")
           .drop("content").withColumn("rast", expr("RS_FromPath(path)"))
Dataset<Row> df = sedona.read.format("binaryFile")
           .option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*")
           .load("s3://wherobots-examples/data/eurosat_small/AnnualCrop/*.tif")
           .drop("content").withColumn("rast", expr("RS_FromPath(path)"))

The result of RS_FromPath is an Out-db raster object, which can be stored in a raster column in a Havasu table.

Retrieve the path of Out-DB Rasters

The rasters path of out-db rasters can be retrieved using RS_BandPath function. It will give the path of the raster file in remote storage. If the argument is an in-db raster, RS_BandPath will return NULL.

df.selectExpr("RS_BandPath(rast)").show()
df.selectExpr("RS_BandPath(rast)").show()
df.selectExpr("RS_BandPath(rast)").show();
+--------------------+
|   rs_bandpath(rast)|
+--------------------+
|s3://wherobots-e...|
+--------------------+

Divide and explode Out-DB Rasters

Large Out-DB rasters can be subdivided into tiles using RS_TileExplode function:

df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show();

The rasters RS_TileExplode produces are also Out-DB rasters as long as the input raster is an Out-DB raster. These tiles reference different parts of the same raster file in remote storage, so they are very cheap to create.

df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show()
df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show()
df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)").show();
+---+---+--------------------+
|  x|  y|                rast|
+---+---+--------------------+
|  0|  0|OutDbGridCoverage...|
|  1|  0|OutDbGridCoverage...|
|  0|  1|OutDbGridCoverage...|
|  1|  1|OutDbGridCoverage...|
+---+---+--------------------+

Create Out-DB Raster from STAC GeoJSON files

This method is useful when dealing with large raster datasets stored in external storage systems like Amazon S3, as it allows the Havasu table to store only the path to the raster file along with geo-referencing metadata.

The following examples shows how to read a STAC file using the GeoJSON reader, load raster data from the STAC file and write it as an out-db raster to a Havasu table.

# Read STAC file using GeoJSON reader
df = sedona.read.format("geojson").load("stac-file-path")

# Read GeoTIFF files as out-db rasters using RS_FromPath
df_rast = df.withColumn("rast", expr("RS_FromPath(assets.analytic.href)"))

# Writing the data into Havasu table
df_rast.writeTo("wherobots.test_db.test_table").append()
// Read STAC file using GeoJSON reader
val df = sedona.read.format("geojson").load("stac-file-path")

// Read GeoTIFF files as out-db rasters using RS_FromPath
val df_rast = df.withColumn("rast", expr("RS_FromPath(assets.analytic.href)"))

// Write the out-db raster data into a Havasu table
df_rast.writeTo("wherobots.test_db.test_table").append()
// Read STAC file using GeoJSON reader
Dataset<Row> df = sedona.read.format("geojson").load("stac-file-path");

// Read GeoTIFF files as out-db rasters using RS_FromPath
Dataset<Row> df_rast = df.withColumn("rast", expr("RS_FromPath(assets.analytic.href)"));

// Write the out-db raster data into a Havasu table
df_rast.selectExpr("rast", "path AS data").writeTo("wherobots.test_db.test_table").append();

Supported Out-DB Storages

Havasu supports any storages that Hadoop supports, as long as the storage has an implementation of org.apache.hadoop.fs.FileSystem on the classpath of the Wherobots Cloud. In the above examples, we accessed out-db rasters stored on S3 using the S3A file system provided by Hadoop AWS. Following is a list of storages that are tested on Wherobots Cloud:

Storage Scheme File System Class
Local file:// org.apache.hadoop.fs.LocalFileSystem
S3 s3:// org.apache.hadoop.fs.s3a.S3AFileSystem

Create an In-DB Raster type column

All raster operations in Spatial SQL require Raster type objects. Therefore, this should be the next step after loading the data.

Load binary raster data

Assume we have a single raster data file called rasterData.tiff, at Path.

Use the following code to load the data and create a WherobotsDB Dataframe.

var rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()
Dataset<Row> rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()
rawDf = sedona.read.format("binaryFile").load(path_to_raster_data)
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()

The output will look like this:

|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|

For multiple raster data files use the following code to load the data from path and create raw DataFrame.

Note

The above code works too for loading multiple raster data files. if the raster files are in separate directories and the option also makes sure that only .tif or .tiff files are being loaded.

var rawDf = sedona.read.format("binaryFile").option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder)
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()
Dataset<Row> rawDf = sedona.read.format("binaryFile").option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder);
rawDf.createOrReplaceTempView("rawdf");
rawDf.show();
rawDf = sedona.read.format("binaryFile").option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder)
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()

The output will look like this:

|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|file:/Download/ra...|2023-09-06 16:24:...|209199|[4D 4D 00 2A 00 0...|
|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
|file:/Download/ra...|2023-09-06 16:24:...|174803|[49 49 2A 00 08 0...|
|file:/Download/ra...|2023-09-06 16:24:...|  6619|[49 49 2A 00 08 0...|

The content column in the raster table is still in the raw form, binary form.

Note

Recursive file lookup will also work similarly for raster datasets in .asc file formats.

From Geotiff

SELECT RS_FromGeoTiff(content) AS rast, modificationTime, length, path FROM rawdf

To verify this, use the following code to print the schema of the DataFrame:

rasterDf.printSchema()

The output will be like this:

root
 |-- rast: raster (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- path: string (nullable = true)

From Arc Grid

The raster data is loaded the same way as tiff file, but the raster data is stored with the extension .asc, ASCII format. The following code creates a Raster type objects from binary data:

SELECT RS_FromArcInfoAsciiGrid(content) AS rast, modificationTime, length, path FROM rawdf

Raster's metadata

WherobotsDB has a function to get the metadata for the raster, and also a function to get the world file of the raster.

Metadata

This function will return an array of metadata, it will have all the necessary information about the raster, Please refer to RS_MetaData.

SELECT RS_MetaData(rast) FROM rasterDf

Output for the following function will be:

[-1.3095817809482181E7, 4021262.7487925636, 512.0, 517.0, 72.32861272132695, -72.32861272132695, 0.0, 0.0, 3857.0, 1.0]

The first two elements of the array represent the real-world geographic coordinates (like longitude/latitude) of the raster image's top left pixel, while the next two elements represent the pixel dimensions of the raster.

World File

There are two kinds of georeferences, GDAL and ESRI seen in world files. For more information please refer to RS_GeoReference.

SELECT RS_GeoReference(rast, "ESRI") FROM rasterDf

The Output will be as follows:

72.328613
0.000000
0.000000
-72.328613
-13095781.645176
4021226.584486

World files are used to georeference and geolocate images by establishing an image-to-world coordinate transformation that assigns real-world geographic coordinates to the pixels of the image.