Skip to content

Performance Tips

This document provides some tips for improving the performance of Havasu raster operations, especially out-db raster operations.

Only Store Small Rasters as In-DB Rasters

In-DB rasters are stored in the parquet data files, so storing small in-db rasters in data files can achieve higher read throughput. It is generally a good idea to store small preprocessed chips as in-db rasters, so that they can be read by downstream applications such as machine learning tasks powered by WherobotsAI Raster Inference more efficiently. For instance, the EuroSAT dataset contains labeled chips of size 64x64. These small chips will be loaded faster if they are stored as in-db rasters.

Storing large rasters as in-db rasters will make the data processing tasks prone to out-of-memory errors, because the entire raster will be loaded into memory when the raster is read from the data files. It is recommended to store large rasters as out-db rasters so they can be loaded lazily, and it is faster to run complicated tasks requiring shuffles on out-db rasters, such as repartitioning and sorting.

Tuning Read-ahead size for Out-DB Rasters

Out-DB rasters are stored in remote storage, such as S3. When reading Out-DB rasters, Havasu will read the raster file from remote storage. The external raster file was read on-demand. However, reading larger blocks each time will make the throuput of reading out-db rasters higher. The size of each read is controlled by spark.wherobots.raster.outdb.readahead. The default value is 64k. Larger read ahead value may improve the performance of reading the entire raster file, but may hurt the performance if only a small portion of the raster file is needed. Here are some guidelines for tuning the read ahead size:

  • For reading the entire raster file such as RS_MapAlgebra, RS_Count etc. A larger read ahead size (for example, 4m) will improve the performance.
  • For reading a small portion of the raster file such as RS_Value, A smaller read ahead size will improve the performance, since the extra amount of data read by read-ahead was not reused. However, there are exceptions. If we will run lots of RS_Value calls that hits most of the tiles of a out-db raster, it is beneficial to have a large read ahead size, since the extra amount of data read by read-ahead will be reused by the subsequent RS_Value calls.

Being Aware of the Out-DB Raster Cache

WherobotsDB will cache recently read out-db rasters in a per-executor object pool, so that subsequent reads of the same raster will be significantly faster. It is recommended to sort the out-db rasters by external storage paths before processing them, so that the rasters referencing the same external storage path will be read adjacently. This will improve the cache hit rate and thus improve the overall performance. Please refer to Out-DB Rasters - Improve Cache Hit Rate for details.

Using Cloud Optimized GeoTiff (COG) for Out-DB Rasters

COG is a format for storing geospatial raster data in a cloud-friendly way. It is optimized for reading from remote storage such as S3. It is recommended to use Cloud Optimized GeoTIFF files as out-db rasters. The reasons are as follows:

  1. COG files have their IFD (Image File Directory) in a contiguous block at the beginning of the file. This allows Havasu to read the entire IFD without seeking to discrete locations in the file. This is very important for reading from remote storage, because multiple discrete small reads is much slower than a single large read.
  2. COG files organize band data in properly sized tiles, such as 256x256. Square shaped tiles are generally better than stride tiles such as 10000x1, because tiles organized in this way is more likely to achieve better tile reuse for most of the raster operations, especially for RS_Tile or RS_TileExplode. Havasu also supports reading GeoTIFF containing stride tiles, however, the performance may be worse than reading COG files.

Inspecting tile size of GeoTIFF files

You can use gdalinfo to inspect the tile size of GeoTIFF files:

gdalinfo <file_name>

Driver: GTiff/GeoTIFF
Files: ...
...
Band 1 Block=256x256 Type=UInt32, ColorInterp=Gray
  Min=0.000 Max=7014.000
  Minimum=0.000, Maximum=7014.000, Mean=103.824, StdDev=330.999
  NoData Value=4294967295
  Overviews: 180410x90000, 90205x45000, 45103x22500, 22552x11250, 11276x5625, 5638x2813, 2819x1407, 1410x704, 705x352, 353x176, 177x88
  Metadata:
    STATISTICS_APPROXIMATE=YES
    STATISTICS_MAXIMUM=7014
    STATISTICS_MEAN=103.82385085182
    STATISTICS_MINIMUM=0
    STATISTICS_STDDEV=330.99937191441
    STATISTICS_VALID_PERCENT=39.95

The tile size is 256x256 in this example. You can find this information in the Band 1 section of the output.

Converting GeoTIFF to COG

If you have GeoTIFF files that are not COG, you can convert them to COG using gdal_translate:

gdal_translate \
  -co TILED=YES \           # Use tiling
  -co BLOCKXSIZE=256 \      # Tile width
  -co BLOCKYSIZE=256 \      # Tile height
  <path/to/input/file> \    # Input file
  <path/to/output/file>     # Output file

You can also specify -co COMPRESS=DEFLATE or -co COMPRESS=LZW to enable compression. Please refer to GDAL COG documentation for more information.


Last update: August 23, 2024 07:15:28