# Out-DB Rasters performance guide

## Performance Characteristics

Out-DB rasters keeps the geo-referencing information of the raster, but not the actual band data. If the application only needs to access the non-band portion of the raster, out-db rasters will be very efficient. For instance, functions such as `RS_Metadata`, `RS_GeoReference` and `RS_SRID` are very fast to run on Out-DB rasters. However, if the application needs to access the band data of the raster, it needs to read the raster file from remote storage, which is much slower than reading the band data from In-DB rasters. For example, functions such as `RS_Value`, `RS_MapAlgebra` and `RS_Resample` are slower to run on Out-DB rasters than In-DB rasters.

## Caching of Out-DB Rasters

Havasu supports caching of out-db rasters, so that if a raster is read from the same path multiple times, it may hit the cache and does not need to fetch the raster from remote storage. This avoids the overhead of reading the raster from remote storage multiple times.
The cache is per-executor core, and the caches for different cores in the same executor is not shared. This is because the cache is implemented as thread local variables, which is per-executor core. We'll change the implementation in
the future to make the cache shared across all the threads in the same executor, so that the cache hit rate will be higher.

The size of each Out-DB raster cache can be configured by `spark.wherobots.raster.outdb.pool.size`, the default value is 100. If the cache is full, the least recently used raster will be evicted from the cache. The cache is stored in local disk, the directory for storing the cache can be configured by `spark.wherobots.raster.outdb.cache.dir`, the default value is `/data/outdb-cache`. It is the same directory as the directory used by Spark block manager for storing shuffled data. If the disk space is not enough, user can tune
`spark.wherobots.raster.outdb.pool.size` to a smaller value, or use a larger EBS volume. If your Out-DB rasters rarely needs to be read more than once, you can also set `spark.wherobots.raster.outdb.pool.size` to `0` to disable the cache.

The raster caches does not download and cache the entire Out-DB raster at once, it only caches the parts of the raster file that are accessed by the application. For example, if the application calls `RS_Value(rast, pt)` to retrieve the value of one pixel at point `pt`, only the part of the raster file that contains the hit tile will be fetched from remote storaga and thus cached. The size of each read is controlled by `spark.wherobots.raster.outdb.readahead`, the default value is `64k`. Larger read ahead value such as `4m` may improve the performance of reading the entire raster file, but may hurt the performance if only a small portion of the raster file is needed.

## Improve Cache Hit Rate

### Caching the DataFrame

The cache is per-executor core, so if the same raster is read from different executor threads, it will be cached multiple times. To make the cache hit rate higher, it is recommended to cache the dataframe containing out-db rasters before running any spatial operations on it. For example:

<Tabs>
  <Tab title="Python">
    ```python  theme={"system"}
    df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)")
    df_tiles.cache()
    df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
    ```
  </Tab>

  <Tab title="Scala">
    ```scala  theme={"system"}
    val df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)")
    df_tiles.cache()
    df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
    ```
  </Tab>

  <Tab title="Java">
    ```java  theme={"system"}
    Dataset<Row> df_tiles = df.selectExpr("RS_TileExplode(rast, 32, 32) as (x, y, rast)");
    df_tiles.cache();
    df_tiles.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show();
    ```
  </Tab>
</Tabs>

Since the dataframe was cached, Spark will be more likely to schedule the spatial operations on the same executor threads, so that the cached rasters has a higher probability to be reused.

### Ordering Out-DB Rasters by remote storage path

If the Out-DB rasters referencing the same remote storage path were ordered adjacently, the cache hit rate will be higher. If you are processing tiled Out-DB raster dataset, it is recommended to order the rasters by `RS_BandPath(rast)` before processing the dataset.
The overhead of sorting the dataset brings great benefit to the overall performance.

<Tabs>
  <Tab title="Python">
    ```python  theme={"system"}
    df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache()
    df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
    df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show()
    ```
  </Tab>

  <Tab title="Scala">
    ```scala  theme={"system"}
    val df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache()
    df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show()
    df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show()
    ```
  </Tab>

  <Tab title="Java">
    ```java  theme={"system"}
    Dataset<Row> df_sorted_cached = df.sort(expr("RS_BandPath(rast)")).cache();
    df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.5 * (rast[0] + rast[1]);')").show();
    df_sorted_cached.selectExpr("RS_MapAlgebra(rast, NULL, 'out = 0.3 * rast[0] + 0.7 * rast[1];')").show();
    ```
  </Tab>
</Tabs>

<Note>
  The size of the out-db object pool should be larger than the number of distinct out-db raster path per partition, Otherwise, the cache will be trashing and won't be effective.
</Note>

## Configurations

There are several configurations for pooling of out-db rasters:

| Configuration                            | Default             | Description                                                                                                                                                                                                                          |
| ---------------------------------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `spark.wherobots.raster.outdb.pool.size` | `100`               | Size of per-executor core raster cache                                                                                                                                                                                               |
| `spark.wherobots.raster.outdb.cache.dir` | `/data/outdb-cache` | Directory for storing cached raster data                                                                                                                                                                                             |
| `spark.wherobots.raster.outdb.readahead` | `64k`               | The minimum number of bytes to read on each read of remote storage. Larger read ahead value may improve the performance of reading large rasters, but may hurt the performance if only a small portion of the raster file is needed. |

There is another configuration to take care of when accessing a large number of out-db rasters stored on S3: `spark.hadoop.fs.s3a.connection.maximum`. This configuration controls the maximum number of concurrent connections to S3. Wherobots cluster may hold a large number of concurrent connections because of out-db raster pooling, so it is configured to a large number by default. The default configuration is `2000`.

<Note>
  The options mentioned above can only be configured when creating the Spark session, they cannot be changed after the Spark session is created.
</Note>