Describe your Model Metadata in Python

This example shows how to bring your own model to Raster Inference. We will start with a machine learning model from Satlas¹ and describe it using the Machine Learning Model STAC Extension (MLM). After we have an MLM JSON file describing our model on S3, we can pass a URI to the JSON file to RS_* inference functions to run the model on Raster Inference.

Before you start

This is a read-only preview of this notebook.

To execute the cells in this Jupyter Notebook, do the following:

Log in to Wherobots Cloud.
Start a GPU-Optimized runtime instance.
- We recommend using a Tiny GPU-Optimized runtime.
Open a notebook.
Open the examples/Analyzing_Data/Raster_Segmentation.ipynb notebook path.

For more information on starting and using notebooks, see the following Wherobots Documentation:

Access a GPU-Optimized runtime

This notebook requires a GPU-Optimized runtime. For more information on GPU-Optimized runtimes, see Runtime types.

To access this runtime category, do the following:

Sign up for a paid Wherobots Organization Edition (Professional or Enterprise).
Submit a Compute Request for a GPU-Optimized runtime.

Step 1: Creating the MLM metadata with `stac_model`

The MLM specification is what we use to define the model inference requirements. These include descriptions of model inputs (their shape, role, and preprocessing steps), the categories associated with a model, the model task, and the location of the model asset. Wherobots maintains the stac-model Python library, which can be used to create MLM metadata and validate that it complies with the MLM specification.

Below we break down the steps involved in filling out and validating the MLM fields using stac-model. But first, we save the model artifact to our S3 user path, which the MLM metadata refers to, so that we can later run inference with it.
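
Building the destination URIs by plain string concatenation works only when the user path is set and ends with a slash. The following is a minimal, library-free sketch of a more defensive join. The helper name `join_s3_uri` and the fallback bucket are hypothetical and used only for illustration; they are not part of any Wherobots API.

```python
import os


def join_s3_uri(prefix, name):
    """Join an S3 prefix and an object name with exactly one slash between them."""
    if not prefix:
        raise ValueError("S3 prefix is empty; is USER_S3_PATH set?")
    return prefix.rstrip("/") + "/" + name.lstrip("/")


# Hypothetical fallback so the sketch also runs outside Wherobots Cloud.
user_uri = os.getenv("USER_S3_PATH") or "s3://example-bucket/user/"
user_model_uri = join_s3_uri(user_uri, "model.pt2")
user_mlm_uri = join_s3_uri(user_uri, "model-metadata.json")
print(user_model_uri)
print(user_mlm_uri)
```

Failing early on a missing prefix avoids silently producing a path such as "Nonemodel.pt2" when the environment variable is not defined.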

PyTorch has many export formats. In this example we cover onboarding a high-performance solar segmentation model compiled for NVIDIA GPUs and exported using PyTorch 2.7's new export APIs. See our docs on the PyTorch model formats we support.

import os
import fsspec

user_uri = os.getenv("USER_S3_PATH")
original_model_uri = "s3://wherobots-modelhub-prod/professional/semantic-segmentation/solar-satlas-sentinel2/inductor/gpu/model.pt2"
user_model_uri = f"{user_uri}model.pt2"
user_mlm_uri = f"{user_uri}model-metadata.json"

fs = fsspec.filesystem('s3')
fs.copy(original_model_uri, user_model_uri)

The main library we will use to construct the metadata is stac-model, which validates the MLM fields (those prefixed with mlm:).

We also use pystac to combine MLM metadata with other STAC core and STAC extension metadata. For a primer on STAC and STAC extensions, check out https://stac-extensions.github.io/, which also lists the many extensions out there for describing different kinds of spatio-temporal data.

The other libraries, shapely and dateutil, are used briefly to format the geometry and time metadata in our metadata JSON.

import pystac
import shapely
from dateutil.parser import parse as parse_dt

from stac_model.base import ProcessingExpression
from stac_model.input import InputStructure, ModelInput, ValueScalingObject
from stac_model.output import MLMClassification, ModelOutput, ModelResult
from stac_model.schema import MLModelExtension, MLModelProperties

The InputStructure object describes the shape of a tensor/array input to a model's predict function. This describes the shape of the input after all data processing steps have been applied to the original data input.

We'll refer to arrays as tensors throughout this guide. For our purposes, a tensor is a raster array that is used in a machine learning model. Since PyTorch is the most popular machine learning framework, we will use PyTorch's terminology.

For our example Satlas model, the expected input structure is a tensor with a flexible batch size (-1), 9 bands multiplied by 4 time steps to form a single channel dimension of 36, and a height and width of 1024.
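
To make the shape arithmetic concrete, here is a small plain-Python sketch. It assumes the 4 time steps are stacked one after another along the channel axis (consistent with the band list being repeated 4 times later in this guide), and it treats -1 as a wildcard. The helper names `channel_index` and `matches_shape` are hypothetical.

```python
BANDS_PER_STEP = 9
TIME_STEPS = 4


def channel_index(time_step, band):
    """Flattened channel position, assuming time steps are stacked in order."""
    return time_step * BANDS_PER_STEP + band


def matches_shape(actual, expected):
    """Return True if `actual` fits `expected`, where -1 accepts any size."""
    if len(actual) != len(expected):
        return False
    return all(e == -1 or a == e for a, e in zip(actual, expected))


expected_shape = [-1, BANDS_PER_STEP * TIME_STEPS, 1024, 1024]

print(channel_index(3, 8))                                   # 35, the last channel
print(matches_shape((10, 36, 1024, 1024), expected_shape))   # True
print(matches_shape((10, 9, 1024, 1024), expected_shape))    # False
```

A batch of any size passes the check, but a tensor holding only a single 9-band time step does not.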

input_array = InputStructure(
    shape=[-1, 9 * 4, 1024, 1024], dim_order=["batch", "channel", "height", "width"], data_type="float32"
)

The ValueScalingObject describes the statistics or values used to adjust an input's range or distribution prior to running a model's prediction function. Some models may only need the input data range adjusted and the model itself will handle normalizing the inputs to a given distribution without preprocessing. Others will adjust both the range and distribution. In the ValueScalingObject, the field type indicates how inputs need to be normalized given the statistics.

Wherobots Raster Inference functions expect that the values are applied to the band dimension of an overhead imagery input. For interoperability with WherobotsAI Raster Inference, use strings for band names rather than the Model Band Object.

Wherobots Raster Inference will handle broadcasting the `ValueScalingObject` to all bands if only one `ValueScalingObject` is specified. This means if all bands need to be divided by 255, you only need to specify one `ValueScalingObject`.
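
The broadcasting behavior and the effect of a minimum/maximum pair can be sketched without any libraries. This illustration assumes min-max scaling, that is `(value - minimum) / (maximum - minimum)`, and uses plain dictionaries in place of `ValueScalingObject`. The function names are hypothetical and do not reflect how Raster Inference is implemented internally.

```python
def broadcast_scaling(scaling, n_bands):
    """Repeat a single scaling entry for every band; otherwise require one per band."""
    if len(scaling) == 1:
        return scaling * n_bands
    if len(scaling) != n_bands:
        raise ValueError(f"expected 1 or {n_bands} scaling entries, got {len(scaling)}")
    return list(scaling)


def min_max_scale(value, minimum, maximum):
    """Rescale a raw pixel value into the 0..1 range."""
    return (value - minimum) / (maximum - minimum)


per_band = broadcast_scaling([{"minimum": 0, "maximum": 255}], n_bands=36)
print(len(per_band))                  # 36
print(min_max_scale(127.5, 0, 255))   # 0.5
```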

band_names = ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B11", "B12"]*4
stats = [
    ValueScalingObject(maximum=255, minimum=0)
    for band in band_names
]

With the band names, input structure (input_array), and ValueScalingObject, we have a few other fields to fill out to fully describe our Model Input. The resize_type indicates how to convert data samples to the size expected by the model's predict function, which is specified by the InputStructure object.

For processing operations that cannot be described by any of the predefined scaling operations available within a ValueScalingObject, a ProcessingExpression with an expression format of gdal-calc can be used within a ValueScalingObject instead to specify a custom processing operation.

model_input = ModelInput(
    name="9 Band Sentinel-2 4 Time Step Series Batch",
    bands=band_names,
    input=input_array,
    resize_type="crop",
    value_scaling=stats,
    pre_processing_function=ProcessingExpression(format="documentation-link", expression="https://github.com/allenai/satlas/blob/main/CustomInference.md#sentinel-2-inference-example"),
)

Similar to the Input Structure object, we need to describe the model output with the Result Structure object. -1 denotes a flexible batch size, 1 refers to the category dimension (in this case the model predicts only solar farm/not solar farm), and 1024 refers to our height and width dimensions again. dim_order enumerates the meaning of each shape element.
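
As a quick illustration of how `dim_order` gives meaning to each shape element, the sketch below pairs the two lists and counts the values produced per input sample. It uses only the standard library, and the variable names are chosen for illustration.

```python
from math import prod

shape = [-1, 1, 1024, 1024]
dim_order = ["batch", "category", "height", "width"]

# Pair each dimension name with its size.
dims = dict(zip(dim_order, shape))

# The batch dimension is flexible, so exclude it when counting values per sample.
values_per_sample = prod(size for name, size in dims.items() if name != "batch")

print(dims)
print(values_per_sample)  # 1048576 confidence values for each 1024 x 1024 sample
```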

confidence = ModelResult(
    shape=[-1, 1, 1024, 1024],
    dim_order=["batch", "category", "height", "width"],
    data_type="float32"
)

We use the STAC Classification Extension to describe categories predicted by the model. The only required fields are an integer value to represent the category and the name of the category.

The Model Output Object ties together the classes, task, and any post processing functions that need to be applied to the Result Structure Object. Options for tasks are specified in the Tasks Enum. WherobotsAI Raster Inference currently supports scene-classification, object-detection, and semantic-segmentation.
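
The class description boils down to a list of value/name pairs. The following library-free sketch uses plain dictionaries as stand-ins for the classification objects and shows a reverse lookup from a predicted integer to its label. The `label_for` helper is hypothetical.

```python
class_map = {"Solar Farm": 1}

# One entry per category, carrying the two required fields.
classes = [{"value": value, "name": name} for name, value in class_map.items()]

# Reverse lookup from the integer the model predicts to a readable label.
value_to_name = {entry["value"]: entry["name"] for entry in classes}


def label_for(value):
    """Return the category name for a predicted value, or None if it is undefined."""
    return value_to_name.get(value)


print(classes)        # [{'value': 1, 'name': 'Solar Farm'}]
print(label_for(1))   # Solar Farm
```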

Let us know what other tasks and models you would like to see us support on the Apache Sedona Discord server's `#vendor-wherobots` channel!

class_map = {"Solar Farm": 1,}
class_objects = [
    MLMClassification(value=class_value, name=class_name)
    for class_name, class_value in class_map.items()
]
model_output = ModelOutput(
    name="confidence array",
    tasks={"semantic-segmentation"},
    result=confidence,
    classes=class_objects,
    post_processing_function=None,
)

To describe the actual model file we load and run in Raster Inference, we use the STAC Core spec for Asset Objects. The MLM spec adds additional fields and roles for different asset types. To use the MLM, you are required to specify the mlm:model asset that points to a model file and where it is hosted (href). The Artifact Enum specifies the type of the model file. Currently, PyTorch torch.jit.script and torch.compile models are supported by Raster Inference.

assets = {
    "model": pystac.Asset(
        title="AOTInductor model exported from edited Satlas model source code.",
        description=(
            "A Swin Transformer backbone with a U-net head trained on the 9-band Sentinel-2 Top of Atmosphere product."
        ),
        href=user_model_uri,
        media_type="application/zip; application=pytorch",
        roles=[
            "mlm:model",
            "data"
        ],
        extra_fields={"mlm_artifact_type": "torch.export.save",}
    ),
    "source_code": pystac.Asset(
        title="Model implementation.",
        description="Source code to export the model.",
        href="https://github.com/wherobots/modelhub/blob/main/model-forge/satlas/solar/export.py",
        media_type="text/x-python",
        roles=[
            "mlm:source_code",
            "code"
        ]
    )
}

After specifying our model input, model output, and assets, we can assemble this information in the top-level Item Properties. Note that the required fields of the spec are mlm:name, mlm:architecture, mlm:tasks, mlm:input, and mlm:output. Additionally, WherobotsAI Raster Inference requires the following fields and options:

Only one value for tasks is supported.
framework="pytorch" is currently required
framework_version="2.7.0+cu126" is recommended, as this is the default version installed in our GPU runtimes. You can still install a different version during Notebook Instance setup if needed; however, Raster Inference expects to run on features available in PyTorch 2.7.
batch_size_suggestion is recommended. If not specified, Raster Inference defaults to a batch size of 10, or to the batch size set in the Sedona configuration spark.wherobots.inference.args. For example:

    config = (
        SedonaContext.builder()
        .appName("raster-inference")
        .config("spark.wherobots.inference.args", "10")  # sets the batch size for RS_ inference functions to 10
        .getOrCreate()
    )

accelerator is recommended. If not set, we assume CUDA is available.
accelerator_constrained is recommended to indicate that a model must run on the accelerator.
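
The requirements above can be expressed as a simple pre-flight check over the item's properties dictionary. This is a minimal sketch using only the standard library; `check_properties` is a hypothetical helper, not part of stac-model or Raster Inference, and it covers only the required fields, the single-task rule, and the framework value listed in this section.

```python
REQUIRED_MLM_FIELDS = ("mlm:name", "mlm:architecture", "mlm:tasks", "mlm:input", "mlm:output")
SUPPORTED_TASKS = {"scene-classification", "object-detection", "semantic-segmentation"}


def check_properties(properties):
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems = [f"missing required field {f}" for f in REQUIRED_MLM_FIELDS if f not in properties]
    tasks = list(properties.get("mlm:tasks", []))
    if len(tasks) != 1:
        problems.append("exactly one task is expected")
    elif tasks[0] not in SUPPORTED_TASKS:
        problems.append(f"unsupported task {tasks[0]}")
    if properties.get("mlm:framework") != "pytorch":
        problems.append('mlm:framework should be "pytorch"')
    return problems


example = {
    "mlm:name": "Satlas Solar Farm Segmentation",
    "mlm:architecture": "Swin Transformer V2 with U-Net head",
    "mlm:tasks": ["semantic-segmentation"],
    "mlm:framework": "pytorch",
    "mlm:input": [],
    "mlm:output": [],
}
print(check_properties(example))  # []
```

Running a check like this before uploading the JSON can surface a missing field earlier than a failed inference job would.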

ml_model_meta = MLModelProperties(
    name="Satlas Solar Farm Segmentation",
    architecture="Swin Transformer V2 with U-Net head",
    tasks={"semantic-segmentation"},
    framework="pytorch",
    framework_version="2.7.0+cu126",
    batch_size_suggestion=10,
    accelerator="cuda",
    accelerator_constrained=True,
    accelerator_summary="It is necessary to use GPU since it was compiled for NVIDIA Ampere and newer architectures with AOTInductor and the computational demands of the model.",
    input=[model_input],
    output=[model_output],
)

A requirement of describing the model with the MLM is specifying its spatial and temporal relevance. These fields are not currently used by Raster Inference but can be useful for search and discovery. These fields must have a value to comply with STAC.

After assembling all of our model metadata we now need to create a STAC item with pystac where we will insert our MLM Extension metadata.
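
The spatial and temporal fields are ordinary GeoJSON and ISO 8601 strings. As a library-free illustration of what they contain, the sketch below builds a bounding-box polygon and a UTC timestamp with the standard library only. The helper names are hypothetical, and the vertex order may differ from the one shapely produces.

```python
from datetime import datetime


def bbox_to_polygon(bbox):
    """Convert [min_x, min_y, max_x, max_y] into a GeoJSON Polygon with a closed ring."""
    min_x, min_y, max_x, max_y = bbox
    ring = [
        [min_x, min_y],
        [max_x, min_y],
        [max_x, max_y],
        [min_x, max_y],
        [min_x, min_y],  # repeat the first vertex to close the ring
    ]
    return {"type": "Polygon", "coordinates": [ring]}


def to_utc_string(date_str):
    """Format a YYYY-MM-DD date as an ISO 8601 timestamp with a Z suffix."""
    return datetime.strptime(date_str, "%Y-%m-%d").isoformat() + "Z"


polygon = bbox_to_polygon([-7.88, 37.14, 27.91, 58.22])
print(polygon["coordinates"][0][0])   # [-7.88, 37.14]
print(to_utc_string("1900-01-01"))    # 1900-01-01T00:00:00Z
```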

start_datetime_str = "1900-01-01"
end_datetime_str = "9999-01-01"
start_datetime = parse_dt(start_datetime_str).isoformat() + "Z"
end_datetime = parse_dt(end_datetime_str).isoformat() + "Z"
bbox = [
    -7.882190080512502,
    37.13739173208318,
    27.911651652899923,
    58.21798141355221
]
geometry = shapely.geometry.Polygon.from_bounds(*bbox).__geo_interface__
item_name = "item_solar_satlas_sentinel2"

item = pystac.Item(
    id=item_name,
    geometry=geometry,
    bbox=bbox,
    datetime=None,
    properties={
        "start_datetime": start_datetime,
        "end_datetime": end_datetime,
        "description": (
            "Sourced from satlas source code released by Allen AI under Apache 2.0"
        ),
    },
    assets=assets,
)

We add a link to the source dataset that the model was trained on and should run inference on. If the model was trained on multiple datasets, multiple links can be added with the DERIVED_FROM relation type. We also add a self-referential link to the Item to aid in search and discovery.

item.add_link(
    pystac.Link(
        target="https://earth-search.aws.element84.com/v1/collections/sentinel-2-l1c",
        rel=pystac.RelType.DERIVED_FROM,
        media_type=pystac.MediaType.JSON,
    )
)
item.set_self_href(user_mlm_uri)

Finally we add our extension metadata to the item we created with pystac. Using the .ext() method we produce an item that has all of the methods from pystac as well as custom MLM methods that are needed to correctly format and validate MLM metadata with .apply().

item_mlm = MLModelExtension.ext(item, add_if_missing=True)
item_mlm.apply(ml_model_meta.model_dump(by_alias=True, exclude_unset=False, exclude_defaults=True))

This can now be saved to a JSON file and copied to the user_mlm_uri path we specified on S3. To use the model in Raster Inference, the model artifact must also be present at the href we specified in the model asset object (user_model_uri).

import json
with open("model-metadata.json", "w") as json_file:
    json.dump(item_mlm.item.to_dict(), json_file, indent=4)
fs.put("model-metadata.json", user_mlm_uri)

Step 2: Set Up The WherobotsDB Context

import warnings
warnings.filterwarnings('ignore')

from wherobots.inference.data.io import read_raster_table
from sedona.spark import *
from pyspark.sql.functions import expr

config = SedonaContext.builder().appName('segmentation-batch-inference')\
    .getOrCreate()

sedona = SedonaContext.create(config)

Step 3: Load Satellite Imagery

Next, we load the satellite imagery that we will run inference over. These GeoTIFF images are loaded as out-db rasters in WherobotsDB, where each row represents a different scene.

tif_folder_path = 's3a://wherobots-benchmark-prod/data/ml/satlas/'
files_df = read_raster_table(tif_folder_path, sedona, limit=400)
df_raster_input = files_df.withColumn(
        "outdb_raster", expr("RS_FromPath(path)")
    )

df_raster_input.cache().count()
df_raster_input.show(truncate=False)
df_raster_input.createOrReplaceTempView("df_raster_input")

Step 4: Run Predictions And Visualize Results

To run predictions we will specify the MLM model metadata file we saved to user_mlm_uri. Predictions can be run with the Raster Inference SQL function RS_Segment or the Python API.

Here we generate 400 raster predictions using RS_Segment.
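
Because the metadata URI is interpolated into a SQL string literal, it can help to build the query in one place and reject values that would break the quoting. The sketch below only constructs the query text; `build_segment_query` and the example bucket are hypothetical, and the table and column names mirror the ones used in this notebook.

```python
def build_segment_query(mlm_uri, table="df_raster_input", raster_col="outdb_raster"):
    """Build the RS_SEGMENT query text for a given MLM metadata URI."""
    if "'" in mlm_uri:
        raise ValueError("the URI must not contain single quotes")
    return (
        f"SELECT {raster_col}, segment_result.* "
        f"FROM (SELECT {raster_col}, "
        f"RS_SEGMENT('{mlm_uri}', {raster_col}) AS segment_result "
        f"FROM {table}) AS segment_fields"
    )


query = build_segment_query("s3://example-bucket/model-metadata.json")
print(query)
```

The resulting string can be passed to `sedona.sql(...)` in the same way as the inline query in this step.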

predictions_df = sedona.sql(f"""
SELECT
  outdb_raster,
  segment_result.*
FROM (
  SELECT
    outdb_raster,
    RS_SEGMENT('{user_mlm_uri}', outdb_raster) AS segment_result
  FROM
    df_raster_input
) AS segment_fields
""")

predictions_df.cache().count()
predictions_df.show()
predictions_df.createOrReplaceTempView("predictions")

Now that we've generated predictions using our model over our satellite imagery, we can use the RS_Segment_To_Geoms function to extract the geometries that the model has identified as possible solar farms. We'll specify the following:

a raster column to use for georeferencing our results
the prediction result from the previous step
our category label "1" returned by the model representing Solar Farms and the class map to use for assigning labels to the prediction
a confidence threshold between 0 and 1.
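
To show what the confidence threshold does conceptually, the sketch below turns a tiny confidence grid into a binary mask in plain Python. It assumes that pixels at or above the threshold count as detections; whether RS_Segment_To_Geoms treats the boundary value as inclusive is not stated here, and `threshold_mask` is a hypothetical helper.

```python
def threshold_mask(confidence, threshold):
    """Mark each pixel 1 if its confidence meets the threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in confidence]


confidence = [
    [0.10, 0.70],
    [0.90, 0.30],
]
mask = threshold_mask(confidence, 0.65)
print(mask)  # [[0, 1], [1, 0]]
```

Raising the threshold yields fewer, higher-confidence polygons, while lowering it yields more candidate detections to review.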

df_multipolys = sedona.sql("""
    WITH t AS (
        SELECT RS_SEGMENT_TO_GEOMS(outdb_raster, confidence_array, array(1), class_map, 0.65) result
        FROM predictions
    )
    SELECT result.* FROM t
""")

df_multipolys.cache().count()
df_multipolys.show()
df_multipolys.createOrReplaceTempView("multipolygon_predictions")

Since we ran inference across the state of Arizona, many scenes don't contain solar farms and don't have positive detections. Let's filter out scenes without segmentation detections so that we can plot the results.

df_merged_predictions = sedona.sql("""
    SELECT
        element_at(class_name, 1) AS class_name,
        cast(element_at(average_pixel_confidence_score, 1) AS double) AS average_pixel_confidence_score,
        ST_Collect(geometry) AS merged_geom
    FROM
        multipolygon_predictions
""")

This leaves us with a few predicted solar farm polygons for our 400 satellite image samples.

df_filtered_predictions = df_merged_predictions.filter("ST_IsEmpty(merged_geom) = False")
df_filtered_predictions.cache().count()

df_filtered_predictions.show()

We'll plot these with SedonaKepler. Compare the satellite basemap with the predictions and see if there's a match!

from sedona.spark import *
config = {
    'version': 'v1',
    'config': {
        'mapStyle': {
            'styleType': 'dark',
            'topLayerGroups': {},
            'visibleLayerGroups': {},
            'mapStyles': {}
        },
    }
}
map = SedonaKepler.create_map(config=config)

SedonaKepler.add_df(map, df=df_filtered_predictions, name="Solar Farm Detections")
map

wherobots.inference Python API

If you prefer Python, wherobots.inference offers a module for registering the SQL inference functions as Python functions. Below we run the same inference as before with RS_SEGMENT.

from wherobots.inference.engine.register import create_semantic_segmentation_udfs
from pyspark.sql.functions import col
rs_segment = create_semantic_segmentation_udfs(batch_size=10, sedona=sedona)
df = df_raster_input.withColumn(
    "segment_result", rs_segment(user_mlm_uri, col("outdb_raster"))
).select(
    "outdb_raster",
    col("segment_result.confidence_array").alias("confidence_array"),
    col("segment_result.class_map").alias("class_map"),
)
df.show(3)

References

Bastani, Favyen, Wolters, Piper, Gupta, Ritwik, Ferdinando, Joe, and Kembhavi, Aniruddha. "SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding." arXiv preprint arXiv:2211.15660 (2023). https://doi.org/10.48550/arXiv.2211.15660
