# Performance Tips
This document provides some tips for improving the performance of Havasu raster operations, especially out-db raster operations.
## Only Store Small Rasters as In-DB Rasters
In-db rasters are stored directly in the Parquet data files, so small rasters stored in-db can be read with higher throughput. It is generally a good idea to store small preprocessed chips as in-db rasters, so that downstream applications, such as machine learning tasks powered by WherobotsAI Raster Inference, can read them more efficiently. For instance, the EuroSAT dataset contains labeled chips of size 64x64; these small chips will load faster when stored as in-db rasters.
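As an illustration, small chips can be read as binary files and decoded into in-db rasters; a minimal PySpark sketch, assuming a WherobotsDB session named `sedona` and hypothetical bucket and table names:

```python
# Read small GeoTIFF chips as binary files; the "content" column holds the raw bytes.
chips = (
    sedona.read.format("binaryFile")
    .option("pathGlobFilter", "*.tif")
    .load("s3://my-bucket/eurosat-chips/")  # hypothetical path
)

# RS_FromGeoTiff decodes the bytes into an in-db raster stored inside the table.
chips.selectExpr("RS_FromGeoTiff(content) AS rast") \
    .writeTo("wherobots.test_db.eurosat_chips") \
    .create()  # hypothetical Havasu table name
```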
Storing large rasters as in-db rasters makes data processing tasks prone to out-of-memory errors, because the entire raster is loaded into memory whenever it is read from the data files. It is recommended to store large rasters as out-db rasters so they can be loaded lazily; it is also faster to run complicated tasks that require shuffles, such as repartitioning and sorting, on out-db rasters.
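A matching sketch for large scenes, assuming out-db rasters are created from file paths with `RS_FromPath` as described in the Havasu out-db raster docs (bucket and table names are hypothetical):

```python
# Each row references an external GeoTIFF by path; pixel data stays in S3.
paths = sedona.createDataFrame(
    [("s3://my-bucket/scenes/scene_001.tif",)],  # hypothetical path
    ["path"],
)

# RS_FromPath creates an out-db raster whose pixels are loaded lazily on access;
# only raster metadata is written into the data files.
paths.selectExpr("RS_FromPath(path) AS rast") \
    .writeTo("wherobots.test_db.scenes") \
    .create()  # hypothetical table name
```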
## Tuning Read-ahead Size for Out-DB Rasters
Out-db rasters are stored in remote storage, such as S3. When reading an out-db raster, Havasu fetches the external raster file on demand, and reading larger blocks per request yields higher throughput. The size of each read is controlled by `spark.wherobots.raster.outdb.readahead`; the default value is `64k`. A larger read-ahead value may improve the performance of reading an entire raster file, but may hurt performance if only a small portion of the raster file is needed. Here are some guidelines for tuning the read-ahead size, followed by a configuration sketch:
- For reading the entire raster file, as in `RS_MapAlgebra`, `RS_Count`, etc., a larger read-ahead size (for example, `4m`) will improve performance.
- For reading a small portion of the raster file, as in `RS_Value`, a smaller read-ahead size will improve performance, since the extra data fetched by read-ahead is not reused. There are exceptions: if you will run many `RS_Value` calls that hit most of the tiles of an out-db raster, a larger read-ahead size is beneficial, since the extra data fetched by read-ahead will be reused by subsequent `RS_Value` calls.
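For example, a job dominated by whole-raster operations could raise the read-ahead size when building the session; a minimal sketch, assuming the standard SedonaContext setup and a hypothetical table name:

```python
from sedona.spark import SedonaContext

# Use a 4m read-ahead for whole-raster workloads such as RS_Count; whether the
# key can also be changed on a running session may depend on your deployment.
config = (
    SedonaContext.builder()
    .config("spark.wherobots.raster.outdb.readahead", "4m")
    .getOrCreate()
)
sedona = SedonaContext.create(config)

# A whole-raster aggregate benefits from fewer, larger remote reads.
sedona.sql(
    "SELECT RS_Count(rast, 1) AS px FROM wherobots.test_db.scenes"  # hypothetical table
).show()
```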
## Being Aware of the Out-DB Raster Cache
WherobotsDB will cache recently read out-db rasters in a per-executor object pool, so that subsequent reads of the same raster will be significantly faster. It is recommended to sort the out-db rasters by external storage paths before processing them, so that the rasters referencing the same external storage path will be read adjacently. This will improve the cache hit rate and thus improve the overall performance. Please refer to Out-DB Rasters - Improve Cache Hit Rate for details.
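A small sketch of the sorting idea, assuming the table kept a `path` column for each out-db raster when it was created (table name and coordinates are placeholders; see the linked page for the full recipe):

```python
scenes = sedona.table("wherobots.test_db.scenes")  # hypothetical table

# Sort by the external storage path so rows referencing the same file are
# processed back-to-back, improving the per-executor cache hit rate.
sorted_scenes = scenes.sort("path")

# Subsequent per-pixel lookups on the same file now hit the cache.
values = sorted_scenes.selectExpr(
    "RS_Value(rast, ST_Point(10.0, 20.0)) AS value"  # placeholder coordinates
)
```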
## Using Cloud Optimized GeoTIFF (COG) for Out-DB Rasters
COG is a format for storing geospatial raster data in a cloud-friendly way. It is optimized for reading from remote storage such as S3. It is recommended to use Cloud Optimized GeoTIFF files as out-db rasters. The reasons are as follows:
- COG files have their IFD (Image File Directory) in a contiguous block at the beginning of the file. This allows Havasu to read the entire IFD without seeking to discrete locations in the file. This is very important when reading from remote storage, because multiple small discrete reads are much slower than a single large read.
- COG files organize band data in properly sized tiles, such as 256x256. Square tiles are generally better than striped layouts such as 10000x1, because square tiles are more likely to be reused across most raster operations, especially `RS_Tile` or `RS_TileExplode` (see the sketch after this list). Havasu also supports reading GeoTIFF files containing striped tiles; however, the performance may be worse than reading COG files.
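For instance, tiling a COG whose internal blocks are already 256x256 lets each output tile be assembled from a handful of block reads; a sketch using `RS_TileExplode` on a hypothetical table (the exact output columns may vary by version):

```python
# Explode each raster into 256x256 tiles; with a matching internal block size,
# each output tile maps cleanly onto whole blocks of the source file.
tiles = sedona.sql("""
    SELECT RS_TileExplode(rast, 256, 256)
    FROM wherobots.test_db.scenes  -- hypothetical table
""")
tiles.printSchema()  # expected: x, y, tile columns
```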
### Inspecting tile size of GeoTIFF files
You can use `gdalinfo` to inspect the tile size of GeoTIFF files:

```bash
gdalinfo <file_name>
```

The output looks like this:

```
Driver: GTiff/GeoTIFF
Files: ...
...
Band 1 Block=256x256 Type=UInt32, ColorInterp=Gray
  Min=0.000 Max=7014.000
  Minimum=0.000, Maximum=7014.000, Mean=103.824, StdDev=330.999
  NoData Value=4294967295
  Overviews: 180410x90000, 90205x45000, 45103x22500, 22552x11250, 11276x5625, 5638x2813, 2819x1407, 1410x704, 705x352, 353x176, 177x88
  Metadata:
    STATISTICS_APPROXIMATE=YES
    STATISTICS_MAXIMUM=7014
    STATISTICS_MEAN=103.82385085182
    STATISTICS_MINIMUM=0
    STATISTICS_STDDEV=330.99937191441
    STATISTICS_VALID_PERCENT=39.95
```
The tile size is `256x256` in this example. You can find this information in the `Band 1` section of the output.
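To check block sizes programmatically, for example across a whole directory of files, a small sketch using GDAL's Python bindings (assuming the `osgeo` package is installed; the file name is a placeholder):

```python
from osgeo import gdal

def block_size(path):
    """Return (width, height) of band 1's blocks, e.g. (256, 256) for a
    tiled GeoTIFF or something like (10000, 1) for a striped one."""
    ds = gdal.Open(path)
    band = ds.GetRasterBand(1)
    xsize, ysize = band.GetBlockSize()
    return xsize, ysize

print(block_size("example.tif"))  # placeholder file name
```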
### Converting GeoTIFF to COG
If you have GeoTIFF files that are not COG, you can convert them to COG using `gdal_translate`:
```bash
# TILED=YES enables tiling; BLOCKXSIZE/BLOCKYSIZE set the tile width and height.
gdal_translate \
  -co TILED=YES \
  -co BLOCKXSIZE=256 \
  -co BLOCKYSIZE=256 \
  <path/to/input/file> \
  <path/to/output/file>
```
You can also specify `-co COMPRESS=DEFLATE` or `-co COMPRESS=LZW` to enable compression. Please refer to the GDAL COG documentation for more information.
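If you have many files to convert, the same command can be scripted; a minimal sketch with placeholder directories:

```python
import subprocess
from pathlib import Path

src_dir = Path("input")    # placeholder input directory
dst_dir = Path("output")   # placeholder output directory
dst_dir.mkdir(exist_ok=True)

for src in src_dir.glob("*.tif"):
    # Same options as above, plus optional DEFLATE compression.
    subprocess.run(
        [
            "gdal_translate",
            "-co", "TILED=YES",
            "-co", "BLOCKXSIZE=256",
            "-co", "BLOCKYSIZE=256",
            "-co", "COMPRESS=DEFLATE",
            str(src),
            str(dst_dir / src.name),
        ],
        check=True,
    )
```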