Only Store Small Rasters as In-DB Rasters
In-DB rasters are stored in the parquet data files, so storing small in-db rasters in data files can achieve higher read throughput. It is generally a good idea to store small preprocessed chips as in-db rasters, so that they can be read by downstream applications such as machine learning tasks powered by WherobotsAI Raster Inference more efficiently. For instance, the EuroSAT dataset contains labeled chips of size 64x64. These small chips will be loaded faster if they are stored as in-db rasters. Storing large rasters as in-db rasters will make the data processing tasks prone to out-of-memory errors, because the entire raster will be loaded into memory when the raster is read from the data files. It is recommended to store large rasters as out-db rasters so they can be loaded lazily, and it is faster to run complicated tasks requiring shuffles on out-db rasters, such as repartitioning and sorting.Tuning Read-ahead size for Out-DB Rasters
Out-DB rasters are stored in remote storage, such as S3. When reading Out-DB rasters, Havasu will read the raster file from remote storage. The external raster file was read on-demand. However, reading larger blocks each time will make the throuput of reading out-db rasters higher. The size of each read is controlled byspark.wherobots.raster.outdb.readahead. The default value is 64k. Larger read ahead value may improve the performance of reading the entire raster file, but may hurt the performance if only a small portion of the raster file is needed. Here are some guidelines for tuning the read ahead size:
- For reading the entire raster file such as
RS_MapAlgebra,RS_Countetc. A larger read ahead size (for example,4m) will improve the performance. - For reading a small portion of the raster file such as
RS_Value, A smaller read ahead size will improve the performance, since the extra amount of data read by read-ahead was not reused. However, there are exceptions. If we will run lots ofRS_Valuecalls that hits most of the tiles of a out-db raster, it is beneficial to have a large read ahead size, since the extra amount of data read by read-ahead will be reused by the subsequentRS_Valuecalls.
Being Aware of the Out-DB Raster Cache
WherobotsDB will cache recently read out-db rasters in a per-executor object pool, so that subsequent reads of the same raster will be significantly faster. It is recommended to sort the out-db rasters by external storage paths before processing them, so that the rasters referencing the same external storage path will be read adjacently. This will improve the cache hit rate and thus improve the overall performance. Please refer to Out-DB Rasters - Improve Cache Hit Rate for details.Using Cloud Optimized GeoTiff (COG) for Out-DB Rasters
COG is a format for storing geospatial raster data in a cloud-friendly way. It is optimized for reading from remote storage such as S3. It is recommended to use Cloud Optimized GeoTIFF files as out-db rasters. The reasons are as follows:- COG files have their IFD (Image File Directory) in a contiguous block at the beginning of the file. This allows Havasu to read the entire IFD without seeking to discrete locations in the file. This is very important for reading from remote storage, because multiple discrete small reads is much slower than a single large read.
- COG files organize band data in properly sized tiles, such as 256x256. Square shaped tiles are generally better than
stride tiles such as 10000x1, because tiles organized in this way is more likely to achieve better tile reuse for most of the raster operations, especially for
RS_TileorRS_TileExplode. Havasu also supports reading GeoTIFF containing stride tiles, however, the performance may be worse than reading COG files.
Inspecting tile size of GeoTIFF files
You can usegdalinfo to inspect the tile size of GeoTIFF files:
256x256 in this example. You can find this information in the Band 1 section of the output.
Converting GeoTIFF to COG
If you have GeoTIFF files that are not COG, you can convert them to COG usinggdal_translate:
-co COMPRESS=DEFLATE or -co COMPRESS=LZW to enable compression. Please refer to GDAL COG documentation for more information.
