Raster Loaders

Note

Sedona loaders are available in Scala, Java, and Python, and have the same APIs.

The Sedona raster loader leverages Spark's built-in binary data source and works with several RS constructors to produce the Raster type. Each raster is a row in the resulting DataFrame and is stored in a Raster format.

By default, these functions use lon/lat order.

Loading raster using the raster loader

The raster loader reads raster data from binary files as out-of-database (out-db) rasters, then splits the raster data into smaller tiles.

Scala:

var rawDf = sedona.read.format("raster").load("/FILE-PATH/*.tif")
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()

Java:

Dataset<Row> rawDf = sedona.read().format("raster").load("/FILE-PATH/*.tif");
rawDf.createOrReplaceTempView("rawdf");
rawDf.show();

Python:

rawDf = sedona.read.format("raster").load("/FILE-PATH/*.tif")
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()

The output will look like this:

+--------------------+---+---+
|                rast|  x|  y|
+--------------------+---+---+
|OutDbGridCoverage...|  0|  0|
|OutDbGridCoverage...|  1|  0|
|OutDbGridCoverage...|  2|  0|
...

The output contains the following columns:

  • rast: The raster data in Raster format. This is an out-db raster tile that references the original raster data file.
  • x: The 0-based x-coordinate of the tile. This column is present only when retiling is enabled.
  • y: The 0-based y-coordinate of the tile. This column is present only when retiling is enabled.
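
To make the x and y columns concrete, here is a minimal pure-Python sketch of a 0-based tile grid. It is only an illustration, not the loader's implementation: the function name tile_indices is hypothetical, and it assumes that tiles on the right and bottom edges cover whatever remainder is left. The order of rows in a real DataFrame is not guaranteed.

```python
import math
from typing import Iterator, Tuple


def tile_indices(
    raster_width: int, raster_height: int, tile_width: int, tile_height: int
) -> Iterator[Tuple[int, int]]:
    """Yield 0-based (x, y) tile coordinates, x varying fastest."""
    # Assumption: partial tiles at the right/bottom edges still count as tiles.
    tiles_across = math.ceil(raster_width / tile_width)
    tiles_down = math.ceil(raster_height / tile_height)
    for y in range(tiles_down):
        for x in range(tiles_across):
            yield x, y


# A 600 x 300 pixel raster cut into 256 x 256 tiles gives a 3 x 2 grid.
grid = list(tile_indices(600, 300, 256, 256))
```

Under these assumptions, the first three entries of grid are (0, 0), (1, 0) and (2, 0), matching the x and y values in the sample output.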

The size of the tile is determined by the internal tiling scheme of the raster data. Using the Cloud Optimized GeoTIFF (COG) format for raster data is recommended since doing so usually organizes the pixel data as square tiles.

You can also disable automatic tiling using option("retile", "false"), or specify the tile size manually using options such as option("tileWidth", "256") and option("tileHeight", "256").

The options for the raster loader are as follows:

  • retile: Enables tiling. Default is true.
  • tileWidth: The width of the tile. If not specified, the size of internal tiles will be used.
  • tileHeight: The height of the tile. If not specified, tileWidth is used when it is explicitly set; otherwise the size of internal tiles is used.
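
The fallback rules for tileWidth and tileHeight can be modelled with a small pure-Python sketch. The function resolve_tile_size is hypothetical and only restates the defaults listed above; it does not call the raster loader.

```python
from typing import Optional, Tuple


def resolve_tile_size(
    internal_tile: Tuple[int, int],
    tile_width: Optional[int] = None,
    tile_height: Optional[int] = None,
) -> Tuple[int, int]:
    """Return the (width, height) of a tile under the documented defaults."""
    internal_width, internal_height = internal_tile
    # tileWidth: falls back to the internal tile width when not specified.
    width = tile_width if tile_width is not None else internal_width
    if tile_height is not None:
        height = tile_height
    elif tile_width is not None:
        # tileHeight: falls back to an explicitly set tileWidth.
        height = tile_width
    else:
        # Neither option is set: use the internal tile height.
        height = internal_height
    return width, height


# Only tileWidth is given, so the tiles come out square.
square = resolve_tile_size((512, 512), tile_width=256)
```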

Note

If the internal tiling scheme of the raster data is not suitable for tiling, the raster loader throws an error. To work around this, disable automatic tiling with option("retile", "false") or specify the tile size manually. A better solution is to convert the raster data to COG format using gdal_translate or other tools.

The raster loader also works with Spark's generic file source options, such as option("pathGlobFilter", "*.tif*") and option("recursiveFileLookup", "true"). For instance, you can load all the .tif files in a directory recursively using:

sedona.read.format("raster").option("recursiveFileLookup", "true").option("pathGlobFilter", "*.tif*").load(path_to_raster_data_folder)

By default, the DataFrame loaded by the raster loader is automatically repartitioned to distribute the work of processing raster tiles evenly across the cluster. The number of partitions is proportional to the number of executor CPU cores in the cluster. To disable automatic repartitioning, set the Spark session configuration spark.wherobots.raster.load.autoRepartition to false. To specify the number of partitions manually, set the Spark session configuration spark.wherobots.raster.load.numPartitions to the desired number of partitions.

Loading raster using binaryFile loader (Deprecated)

Step 1: Load raster to a binary DataFrame

You can load any type of raster data using the code below. Then use the RS constructors below to create a Raster DataFrame.

sedona.read.format("binaryFile").load("/some/path/*.asc")

Step 2: Create a raster type column

After loading the raster data files using binaryFile loader, you can either use RS_FromPath to load the raster as an out-db raster, or use one of RS_FromGeoTiff, RS_FromArcInfoAsciiGrid and RS_FromNetCDF to load the binary data of the raster file as an in-db raster.

Loading raster files as out-db raster using RS_FromPath

We can drop the content binary column to avoid reading the content of the file entirely when using RS_FromPath to load out-db rasters.

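import org.apache.spark.sql.{functions => f}

// Drop the binary content column; RS_FromPath only needs the path
var df = sedona.read.format("binaryFile").load("/some/path/*.tiff").drop("content")
df = df.withColumn("raster", f.expr("RS_FromPath(path)"))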

Loading raster content as in-db raster using RS_FromGeoTiff

We'll use the content binary column to load an in-db raster. This requires loading the entire raster file into memory.

import org.apache.spark.sql.{functions => f}

var df = sedona.read.format("binaryFile").load("/FILE-PATH/*.tiff")
df = df.withColumn("raster", f.expr("RS_FromGeoTiff(content)"))

Raster Loading Functions

RS_FromArcInfoAsciiGrid

Introduction: Returns a raster geometry from an Arc Info Ascii Grid file.

Format: RS_FromArcInfoAsciiGrid(asc: ARRAY[Byte])

SQL example:

import org.apache.spark.sql.{functions => f}

var df = sedona.read.format("binaryFile").load("/some/path/*.asc")
df = df.withColumn("raster", f.expr("RS_FromArcInfoAsciiGrid(content)"))

RS_FromGeoTiff

Introduction: Returns a raster geometry from a GeoTiff file.

Format: RS_FromGeoTiff(content: ARRAY[Byte], autoRescale: Boolean = true)

  • content is a byte array that contains the content of the GeoTiff file.
  • autoRescale is an optional parameter that specifies whether to rescale the pixel values using the scale and offset values in the GeoTiff file. The default value is true.

SQL example:

import org.apache.spark.sql.{functions => f}

var df = sedona.read.format("binaryFile").load("/some/path/*.tiff")
df = df.withColumn("raster", f.expr("RS_FromGeoTiff(content)"))

RS_FromPath

You can load rasters from paths. Rasters loaded in this way are called "out-db" rasters. Out-db rasters hold references to raster files instead of holding the actual pixel data.

Out-db rasters can be used interchangeably with ordinary rasters. The only difference is that out-db rasters load raster files in a deferred manner: pixel data is not loaded until pixel values are accessed by functions such as RS_Value or RS_BandAsArray. Large raster files are better loaded as out-db rasters.

Introduction: Returns an out-db raster from the path to an image file. Currently, it supports loading GeoTiff files (*.tiff or *.tif) and Arc Info Ascii Grid files (*.asc). Additional parameters for configuring the Hadoop file system can be passed in as a ;-delimited string, for example fs.s3a.access.key=xxx;fs.s3a.secret.key=xxx. To load GeoTiff files without automatic rescaling, add raster.reader.auto-rescale=false to the parameters.

When eagerLoadMetadata is set to true, RS_FromPath loads the metadata of the raster file immediately and reports any errors encountered while reading the file. Otherwise, it only keeps the path to the raster file and defers loading until the metadata of the raster is actually needed. The default value of eagerLoadMetadata is false.

Format: RS_FromPath(path: String)

Format: RS_FromPath(path: String, params: String)

Format: RS_FromPath(path: String, params: String, eagerLoadMetadata: Boolean)

SQL example:

var df = sedona.read.format("binaryFile").load("/some/path/*.tiff")

// Load out-db rasters from path
df = df.selectExpr("path", "RS_FromPath(path) as rast")

// Load out-db rasters with custom Hadoop file system parameters
df = df.selectExpr("path", "RS_FromPath(path, 'fs.s3a.access.key=xxx;fs.s3a.secret.key=xxx') as rast")
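
The ;-delimited params string can be assembled from a dictionary rather than written by hand. The following Python sketch is only an illustration: build_rs_frompath_params is a hypothetical helper, not part of Sedona, and because no escaping mechanism is documented it simply rejects keys or values that contain a delimiter or a single quote.

```python
from typing import Mapping


def build_rs_frompath_params(options: Mapping[str, str]) -> str:
    """Join key=value pairs with ';' as expected by RS_FromPath's params argument."""
    for key, value in options.items():
        # Assumption: no escaping is available, so reject ambiguous characters.
        if "=" in key or any(ch in key + value for ch in ";'"):
            raise ValueError(f"unsupported character in option {key!r}")
    return ";".join(f"{key}={value}" for key, value in options.items())


params = build_rs_frompath_params(
    {
        "fs.s3a.access.key": "xxx",
        "fs.s3a.secret.key": "xxx",
        "raster.reader.auto-rescale": "false",
    }
)
# SQL expression that could be passed to selectExpr, as in the example above.
sql_expr = f"RS_FromPath(path, '{params}') as rast"
```

The resulting sql_expr string has the same shape as the expression passed to selectExpr in the example above.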

RS_MakeEmptyRaster

Introduction: Returns an empty raster geometry. Every band in the raster is initialized to 0.0.

Format:

RS_MakeEmptyRaster(numBands: Integer, bandDataType: String = 'D', width: Integer, height: Integer, upperleftX: Double, upperleftY: Double, cellSize: Double)
  • NumBands: The number of bands in the raster. If not specified, the raster will have a single band.
  • BandDataType: Optional parameter specifying the data types of all the bands in the created raster. Accepts one of:
    1. "D" - 64 bits Double
    2. "F" - 32 bits Float
    3. "I" - 32 bits signed Integer
    4. "S" - 16 bits signed Short
    5. "US" - 16 bits unsigned Short
    6. "B" - 8 bits unsigned Byte
  • Width: The width of the raster in pixels.
  • Height: The height of the raster in pixels.
  • UpperleftX: The X coordinate of the upper left corner of the raster, in terms of the CRS units.
  • UpperleftY: The Y coordinate of the upper left corner of the raster, in terms of the CRS units.
  • Cell Size (pixel size): The size of the cells in the raster, in terms of the CRS units.

It uses the default Cartesian coordinate system.

Format:

RS_MakeEmptyRaster(numBands: Integer, bandDataType: String = 'D', width: Integer, height: Integer, upperleftX: Double, upperleftY: Double, scaleX: Double, scaleY: Double, skewX: Double, skewY: Double, srid: Integer)
  • NumBands: The number of bands in the raster. If not specified, the raster will have a single band.
  • BandDataType: Optional parameter specifying the data types of all the bands in the created raster. Accepts one of:
    1. "D" - 64 bits Double
    2. "F" - 32 bits Float
    3. "I" - 32 bits signed Integer
    4. "S" - 16 bits signed Short
    5. "US" - 16 bits unsigned Short
    6. "B" - 8 bits unsigned Byte
  • Width: The width of the raster in pixels.
  • Height: The height of the raster in pixels.
  • UpperleftX: The X coordinate of the upper left corner of the raster, in terms of the CRS units.
  • UpperleftY: The Y coordinate of the upper left corner of the raster, in terms of the CRS units.
  • ScaleX: The scaling factor of the cells on the X axis.
  • ScaleY: The scaling factor of the cells on the Y axis.
  • SkewX: The skew of the raster on the X axis, effectively tilting it in the horizontal direction.
  • SkewY: The skew of the raster on the Y axis, effectively tilting it in the vertical direction.
  • SRID: The SRID of the raster. Use 0 if you want to use the default Cartesian coordinate system. Use 4326 if you want to use WGS84.

For more information about ScaleX, ScaleY, SkewX, SkewY, please refer to the Affine Transformations section.
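
As a rough illustration of how these parameters interact, the Python sketch below maps a pixel position to world coordinates. It assumes the common GDAL-style affine convention (x = upperleftX + col * scaleX + row * skewX, y = upperleftY + col * skewY + row * scaleY); the GeoReference and pixel_to_world names are hypothetical, and the Affine Transformations section remains the authoritative description.

```python
from typing import NamedTuple, Tuple


class GeoReference(NamedTuple):
    upperleft_x: float
    upperleft_y: float
    scale_x: float
    scale_y: float
    skew_x: float = 0.0
    skew_y: float = 0.0


def pixel_to_world(ref: GeoReference, col: int, row: int) -> Tuple[float, float]:
    """World coordinates of the upper-left corner of pixel (col, row)."""
    # Assumption: GDAL-style affine convention, not confirmed by this page.
    x = ref.upperleft_x + col * ref.scale_x + row * ref.skew_x
    y = ref.upperleft_y + col * ref.skew_y + row * ref.scale_y
    return x, y


# Same geo-reference as RS_MakeEmptyRaster(2, 10, 10, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 4326)
ref = GeoReference(0.0, 0.0, 1.0, -1.0)
lower_right = pixel_to_world(ref, 10, 10)
```

A negative ScaleY, as in this example, makes the Y coordinate decrease as the row index grows, which is the usual arrangement for north-up rasters.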

Note

If bandDataType is given a value other than the accepted ones, RS_MakeEmptyRaster defaults to double as the data type for the raster.
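
The fallback rule can be sketched in Python together with the bit depths listed above. This is only a model for reasoning about band types: the helper names are hypothetical, exact (case-sensitive) matching of the codes is an assumption, and the byte count is the raw uncompressed pixel payload, not the serialized size of a raster.

```python
# Bit depth per bandDataType code, taken from the list of accepted values.
BAND_BITS = {"D": 64, "F": 32, "I": 32, "S": 16, "US": 16, "B": 8}


def effective_band_type(code: str) -> str:
    """Return the code that takes effect; unrecognized values become double."""
    # Assumption: codes are matched exactly as written in the documentation.
    return code if code in BAND_BITS else "D"


def pixel_payload_bytes(num_bands: int, width: int, height: int, code: str = "D") -> int:
    """Uncompressed size of the pixel values alone, in bytes."""
    bits = BAND_BITS[effective_band_type(code)]
    return num_bands * width * height * bits // 8


# Two 10 x 10 integer bands hold 2 * 10 * 10 * 4 bytes of pixel data.
int_raster_bytes = pixel_payload_bytes(2, 10, 10, "I")
```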

SQL example 1 (with 2 bands):

SELECT RS_MakeEmptyRaster(2, 10, 10, 0.0, 0.0, 1.0)

Output:

+--------------------------------------------+
|rs_makeemptyraster(2, 10, 10, 0.0, 0.0, 1.0)|
+--------------------------------------------+
|                        GridCoverage2D["g...|
+--------------------------------------------+

SQL example 2 (with 2 bands and dataType):

SELECT RS_MakeEmptyRaster(2, 'I', 10, 10, 0.0, 0.0, 1.0) -- Create a raster with integer datatype

Output:

+--------------------------------------------+
|rs_makeemptyraster(2, 10, 10, 0.0, 0.0, 1.0)|
+--------------------------------------------+
|                        GridCoverage2D["g...|
+--------------------------------------------+

SQL example 3 (with 2 bands, scale, skew, and SRID):

SELECT RS_MakeEmptyRaster(2, 10, 10, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 4326)

Output:

+------------------------------------------------------------------+
|rs_makeemptyraster(2, 10, 10, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 4326)|
+------------------------------------------------------------------+
|                                              GridCoverage2D["g...|
+------------------------------------------------------------------+

SQL example 4 (with 2 bands, dataType, scale, skew, and SRID):

SELECT RS_MakeEmptyRaster(2, 'F', 10, 10, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 4326) -- Create a raster with float datatype

Output:

+------------------------------------------------------------------+
|rs_makeemptyraster(2, 10, 10, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 4326)|
+------------------------------------------------------------------+
|                                              GridCoverage2D["g...|
+------------------------------------------------------------------+

RS_MakeRaster

Introduction: Creates a raster from the given array of pixel values. The width, height, geo-reference information, and the CRS will be taken from the given reference raster. The data type of the resulting raster will be DOUBLE and the number of bands of the resulting raster will be data.length / (refRaster.width * refRaster.height).

Format: RS_MakeRaster(refRaster: Raster, bandDataType: String, data: ARRAY[Double])

  • refRaster: The reference raster from which the width, height, geo-reference information, and the CRS will be taken.
  • bandDataType: The data type of the bands in the resulting raster. Please refer to the RS_MakeEmptyRaster function for the accepted values.
  • data: The array of pixel values. The array cannot be empty, and its size must be a multiple of width * height of the reference raster.

SQL example:

WITH r AS (SELECT RS_MakeEmptyRaster(2, 3, 2, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 4326) AS rast)
SELECT RS_AsMatrix(RS_MakeRaster(rast, 'D', ARRAY(1, 2, 3, 4, 5, 6))) FROM r

Output:

+------------------------------------------------------------+
|rs_asmatrix(rs_makeraster(rast, D, array(1, 2, 3, 4, 5, 6)))|
+------------------------------------------------------------+
||1.0  2.0  3.0|\n|4.0  5.0  6.0|\n                          |
+------------------------------------------------------------+
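
The relationship between the data array and the number of bands can be checked with a short Python sketch. The split_into_bands helper is hypothetical and does not call Sedona; it assumes the values are laid out band by band, with each band in row-major order as the matrix output above suggests for a single band.

```python
from typing import List, Sequence


def split_into_bands(data: Sequence[float], width: int, height: int) -> List[List[float]]:
    """Split a flat pixel array into bands of width * height values each."""
    pixels_per_band = width * height
    if len(data) == 0 or len(data) % pixels_per_band != 0:
        raise ValueError("data must be non-empty and a multiple of width * height")
    # Assumption: band-major layout (all of band 1, then all of band 2, ...).
    return [
        list(data[start:start + pixels_per_band])
        for start in range(0, len(data), pixels_per_band)
    ]


# The 3 x 2 reference raster from the example: six values give one band.
one_band = split_into_bands([1, 2, 3, 4, 5, 6], width=3, height=2)
# Twelve values would give two bands.
two_bands = split_into_bands(list(range(12)), width=3, height=2)
```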

RS_AsInDB

Introduction: Converts an out-of-database (out-db) raster to an in-database (in-db) raster, facilitating raster data management within the database.

This function is useful for scenarios where raster data initially stored outside the database needs to be managed within the database, enhancing data integrity and access efficiency.

Format:

RS_AsInDB(raster: Raster)

SQL example:

SELECT path, raster_outdb, RS_AsInDB(raster_outdb) As raster FROM Table

Output:

+----------------------+------------------------------------------------------------+-------------------------------------------------------+
|path                  |raster_outdb                                                |raster                                                 |
+----------------------+------------------------------------------------------------+-------------------------------------------------------+
|/Users/.../test1.tiff |OutDbGridCoverage2D["", GeneralEnvelope[(-1.3095817809482...|GridCoverage2D["", GeneralEnvelope[(-1.3095817809482...|
|/Users/.../test2.tiff |OutDbGridCoverage2D["", GeneralEnvelope[(-1.3095817809482...|GridCoverage2D["", GeneralEnvelope[(-1.3095817809482...|
|/Users/.../test3.tiff |OutDbGridCoverage2D["", GeneralEnvelope[(382240.0, 615266...|GridCoverage2D["", GeneralEnvelope[(382240.0, 615266...|
|/Users/.../test4.tiff |OutDbGridCoverage2D["", GeneralEnvelope[(-180.0, -90.0), ...|GridCoverage2D["", GeneralEnvelope[(-180.0, -90.0), ...|
|/Users/.../test5.tiff |OutDbGridCoverage2D["", GeneralEnvelope[(223586.236519645...|GridCoverage2D["", GeneralEnvelope[(223586.236519645...|
+----------------------+------------------------------------------------------------+-------------------------------------------------------+

RS_BandPath

Introduction: Retrieves the file path of an out-of-database (out-db) raster, providing a link to the external raster file it references. Primarily used with out-db rasters to access their storage location.

Useful in scenarios involving out-db rasters, where only the raster path and geo-referencing metadata are stored in the database.

Format:

RS_BandPath(raster: Raster)

SQL Example:

SELECT raster_outdb, RS_BandPath(raster_outdb) AS band_path FROM Table

Output:

+------------------------------------------------------------+----------------------+
|raster_outdb                                                |band_path             |
+------------------------------------------------------------+----------------------+
|OutDbGridCoverage2D["", GeneralEnvelope[(-1.3095817809482...|/Users/.../test1.tiff |
|OutDbGridCoverage2D["", GeneralEnvelope[(-1.3095817809482...|/Users/.../test2.tiff |
|OutDbGridCoverage2D["", GeneralEnvelope[(382240.0, 615266...|/Users/.../test3.tiff |
|OutDbGridCoverage2D["", GeneralEnvelope[(-180.0, -90.0), ...|/Users/.../test4.tiff |
|OutDbGridCoverage2D["", GeneralEnvelope[(223586.236519645...|/Users/.../test5.tiff |
+------------------------------------------------------------+----------------------+

RS_FromNetCDF

Introduction: Returns a raster geometry representing the record variable with the given short name from a NetCDF file. This API reads the array data of the record variable into memory along with all its dimensions. Since the netCDF format has many variants, the reader might not work for your case; if so, please report it using the public forums.

This API has been tested for netCDF classic (NetCDF 1, 2, 5) and netCDF4/HDF5 files.

This API requires the name of the record variable. It is assumed that a variable of the given name exists, and its last 2 dimensions are 'lat' and 'lon' dimensions respectively.

If this assumption does not hold true for your case, you can choose to pass the lonDimensionName and latDimensionName explicitly.

You can use RS_NetCDFInfo to get the details of the passed netCDF file (its variables and their dimensions).

Format 1: RS_FromNetCDF(netCDF: ARRAY[Byte], recordVariableName: String)

Format 2: RS_FromNetCDF(netCDF: ARRAY[Byte], recordVariableName: String, lonDimensionName: String, latDimensionName: String)

SQL Example:

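import org.apache.spark.sql.{functions => f}

// Format 1: rely on the default 'lat' and 'lon' dimensions
var df = sedona.read.format("binaryFile").load("/some/path/test.nc")
df = df.withColumn("raster", f.expr("RS_FromNetCDF(content, 'O3')"))

// Format 2: pass the lon and lat dimension names explicitly
df = sedona.read.format("binaryFile").load("/some/path/test.nc")
df = df.withColumn("raster", f.expr("RS_FromNetCDF(content, 'O3', 'lon', 'lat')"))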

RS_NetCDFInfo

Introduction: Returns a string containing the names of the variables in a given netCDF file along with their dimensions.

Format: RS_NetCDFInfo(netCDF: ARRAY[Byte])

SQL Example:

val df = sedona.read.format("binaryFile").load("/some/path/test.nc")
val recordInfo = df.selectExpr("RS_NetCDFInfo(content) as record_info").first().getString(0)
println(recordInfo)

Output:

O3(time=2, z=2, lat=48, lon=80)

NO2(time=2, z=2, lat=48, lon=80)
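
If the info string needs to be inspected programmatically, for example to check which dimensions come last before calling RS_FromNetCDF, it can be parsed with a few lines of Python. This sketch assumes every variable appears on its own line in the name(dim=size, ...) form shown above; parse_netcdf_info is a hypothetical helper, and other output shapes are not handled.

```python
import re
from typing import Dict

# One variable per line, e.g. "O3(time=2, z=2, lat=48, lon=80)".
_VARIABLE_PATTERN = re.compile(r"^\s*(?P<name>[^\s(]+)\((?P<dims>[^)]*)\)\s*$")


def parse_netcdf_info(record_info: str) -> Dict[str, Dict[str, int]]:
    """Map each variable name to its dimensions, in the order they are listed."""
    variables: Dict[str, Dict[str, int]] = {}
    for line in record_info.splitlines():
        match = _VARIABLE_PATTERN.match(line)
        if match is None:
            continue  # skip blank or unrecognized lines
        dims: Dict[str, int] = {}
        for item in match.group("dims").split(","):
            if not item.strip():
                continue
            dim_name, _, dim_size = item.partition("=")
            dims[dim_name.strip()] = int(dim_size)
        variables[match.group("name")] = dims
    return variables


record_info = "O3(time=2, z=2, lat=48, lon=80)\n\nNO2(time=2, z=2, lat=48, lon=80)"
variables = parse_netcdf_info(record_info)
# The last two dimensions of O3, which RS_FromNetCDF expects to be lat and lon.
last_two = list(variables["O3"])[-2:]
```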