
Out-DB Rasters Performance Guide

Performance Characteristics

Out-DB rasters keep the geo-referencing information of the raster, but not the actual band data. If the application only needs the non-band portion of the raster, Out-DB rasters are very efficient: functions such as RS_Metadata, RS_GeoReference and RS_SRID run very fast on Out-DB rasters. However, if the application needs to access the band data, the raster file has to be read from remote storage, which is much slower than reading band data from In-DB rasters. For example, functions such as RS_Value, RS_MapAlgebra and RS_Resample are slower on Out-DB rasters than on In-DB rasters.
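
For illustration, the minimal sketch below contrasts the two classes of functions. It assumes a DataFrame `df` whose `rast` column already holds Out-DB rasters; the map algebra expression is only an example.

```python
# Metadata-only queries: no band data is fetched from remote storage,
# so these are fast on Out-DB rasters.
df.selectExpr("RS_Metadata(rast)", "RS_SRID(rast)").show()

# Band-access query: the relevant parts of the remote raster file must be
# read, so this is noticeably slower on Out-DB rasters.
df.selectExpr("RS_MapAlgebra(rast, NULL, 'out = rast[0] * 2;')").show()
```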

Caching of Out-DB Rasters

Havasu supports caching of Out-DB rasters: if a raster is read from the same path multiple times, subsequent reads may hit the cache instead of fetching the raster from remote storage again. This avoids the overhead of repeatedly reading the raster from remote storage. The cache is per executor core; the caches of different cores in the same executor are not shared, because the cache is implemented as a thread-local variable. We will change the implementation in the future so that the cache is shared across all threads in the same executor, which will increase the cache hit rate.

The size of each Out-DB raster cache can be configured with spark.wherobots.raster.outdb.pool.size; the default value is 100. When the cache is full, the least recently used raster is evicted. The cache is stored on local disk; the directory for storing the cache can be configured with spark.wherobots.raster.outdb.cache.dir and defaults to /data/outdb-cache, which is the same directory the Spark block manager uses for storing shuffle data. If disk space is insufficient, you can tune spark.wherobots.raster.outdb.pool.size to a smaller value or use a larger EBS volume. If your Out-DB rasters rarely need to be read more than once, you can also set spark.wherobots.raster.outdb.pool.size to 0 to disable the cache.

The raster cache does not download the entire Out-DB raster at once; it only caches the parts of the raster file that are actually accessed by the application. For example, if the application calls RS_Value(rast, pt) to retrieve the value of one pixel at point pt, only the part of the raster file that contains the hit tile is fetched from remote storage and cached. The size of each read is controlled by spark.wherobots.raster.outdb.readahead, with a default of 64k. A larger readahead value such as 4m may improve the performance of reading the entire raster file, but may hurt performance when only a small portion of the raster file is needed.
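
As a hedged illustration of the two access patterns, assuming `df` holds Out-DB rasters in a column `rast` and the query point is expressed in the raster's CRS (the coordinates below are hypothetical, and RS_SummaryStats is used only as an example of a function that scans all band data):

```python
# Point lookup: only the raster blocks covering the hit pixel are fetched,
# so the default 64k readahead is usually enough.
df.selectExpr(
    "RS_Value(rast, ST_GeomFromWKT('POINT (-117.6 33.4)')) AS pixel_value"
).show()

# Whole-raster statistics: every block of the file is read, so a larger
# readahead such as 4m (set at session creation) may reduce the number of
# remote reads.
df.selectExpr("RS_SummaryStats(rast) AS stats").show()
```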

Improve Cache Hit Rate

Caching the DataFrame

The cache is per executor core, so if the same raster is read from different executor threads, it will be cached multiple times. To increase the cache hit rate, it is recommended to cache the DataFrame containing Out-DB rasters before running any spatial operations on it. For example:

df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)")
df_tiles.cache()
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
val df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)")
df_tiles.cache()
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
Dataset<Row> df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)");
df_tiles.cache();
df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show();

Since the DataFrame is cached, Spark is more likely to schedule subsequent spatial operations on the same executor threads, so the cached rasters have a higher probability of being reused.

Ordering Out-DB Rasters by Remote Storage Path

If Out-DB rasters referencing the same remote storage path are ordered adjacently, the cache hit rate will be higher. If you are processing a tiled Out-DB raster dataset, it is recommended to order the rasters by RS_BandPath(rast) before processing the dataset. The overhead of sorting the dataset is usually outweighed by the improvement in overall performance.

Python

```python
df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show()
```

Scala

```scala
val df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show()
```

Java

```java
Dataset<Row> df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache();
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show();
df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show();
```

Note

The size of the Out-DB object pool should be larger than the number of distinct Out-DB raster paths per partition; otherwise, the cache will thrash and won't be effective. One way to check this is sketched below.
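
The following sketch counts the distinct band paths per partition so the result can be compared with spark.wherobots.raster.outdb.pool.size; it assumes `df` holds Out-DB rasters in a column named `rast`.

```python
from pyspark.sql.functions import countDistinct, expr, spark_partition_id

# Count distinct Out-DB raster paths in each partition; if any count exceeds
# the configured pool size, the cache may thrash.
(df.withColumn("band_path", expr("RS_BandPath(rast)"))
   .groupBy(spark_partition_id().alias("partition_id"))
   .agg(countDistinct("band_path").alias("distinct_paths"))
   .orderBy("distinct_paths", ascending=False)
   .show())
```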

Configurations

There are several configurations for Out-DB raster pooling:

| Configuration | Default | Description |
| --- | --- | --- |
| spark.wherobots.raster.outdb.pool.size | 100 | Size of the per-executor-core raster cache |
| spark.wherobots.raster.outdb.cache.dir | /data/outdb-cache | Directory for storing cached raster data |
| spark.wherobots.raster.outdb.readahead | 64k | The minimum number of bytes to read on each read from remote storage. A larger readahead value may improve the performance of reading large rasters, but may hurt performance when only a small portion of the raster file is needed. |

There is another configuration to be aware of when accessing a large number of Out-DB rasters stored on S3: spark.hadoop.fs.s3a.connection.maximum. It controls the maximum number of concurrent connections to S3. Because of Out-DB raster pooling, a Wherobots cluster may hold a large number of concurrent connections, so this option is set to a large value (2000) by default.

Note

The options mentioned above can only be configured when creating the Spark session; they cannot be changed after the Spark session is created.
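
Since these options must be supplied at session creation time, the sketch below shows one way they might be set. The SedonaContext builder pattern, the application name, and the specific values are assumptions for illustration; only the configuration keys come from this guide.

```python
from sedona.spark import SedonaContext

# Hypothetical session setup; adjust the values to your workload.
config = (
    SedonaContext.builder()
    .appName("outdb-raster-job")  # hypothetical application name
    # Per-executor-core Out-DB raster cache size (default 100; 0 disables it)
    .config("spark.wherobots.raster.outdb.pool.size", "200")
    # Local directory for cached raster data
    .config("spark.wherobots.raster.outdb.cache.dir", "/data/outdb-cache")
    # Minimum bytes per remote read; larger values help whole-raster reads
    .config("spark.wherobots.raster.outdb.readahead", "4m")
    # Maximum concurrent S3 connections (defaults to 2000 on Wherobots)
    .config("spark.hadoop.fs.s3a.connection.maximum", "2000")
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```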