WherobotsRunOperator executes a Wherobots Job Run and waits for it to complete.
For a comprehensive list of the parameters associated with Job Runs,
see the Runs REST API documentation.
Benefits
You can use Airflow to streamline, automate, and manage complex ETL tasks that run on your vector or raster data. By using WherobotsRunOperator, you can incorporate Wherobots’
geospatial data processing features into your Airflow workflows.
With WherobotsRunOperator, you no longer need to manage and
configure your own cluster to complete your geospatial data processing.
Wherobots handles that resource allocation for you, while providing
analytics capabilities that are optimized for geospatial workloads.
Before you start
Before using WherobotsRunOperator, ensure that you
have the following required resources:
- An account within a Professional or Enterprise Edition Organization. For more information, see Create a Wherobots Account.
- Wherobots API key. For more information, see API keys in the Wherobots documentation.
- The Wherobots Apache Airflow Provider. For installation information, see Wherobots Apache Airflow Provider.
- An Airflow Connection. For more information, see Create a new Connection in Airflow Server.
- Python version ≥ 3.8
- Apache Airflow. For more information, see Installation of Airflow in the Apache Airflow documentation.
DAG files
Directed Acyclic Graph (DAG) files are Python scripts that define your Airflow workflows. DAG files need to be accessible to the Airflow scheduler so that your tasks can be executed. DAG files are commonly stored in $AIRFLOW_HOME/dags. For more information, see
Setting Configuration Options
in the Apache Airflow Documentation.
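For example, the DAG folder can be set in airflow.cfg; the path below is illustrative:

```ini
# airflow.cfg (the dags_folder path is illustrative; adjust for your environment)
[core]
dags_folder = /home/airflow/airflow/dags
```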
WherobotsRunOperator constructor
The following example details how to integrate WherobotsRunOperator into your DAG file.
This documentation discusses the WherobotsRunOperator-related lines of code, which are
described in the parameter table below. For more information on the general formatting of Airflow DAG files, see
Best Practices in
the Airflow Documentation.
Example DAG file
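The original example file is not reproduced here; the following is a minimal sketch of such a DAG. The import paths, task_id, and S3 URI are assumptions, while the operator parameters match the table below.

```python
# Minimal sketch of a DAG using WherobotsRunOperator.
# Import paths and the S3 URI are assumptions; adjust for your environment.
from datetime import datetime

from airflow import DAG
from airflow_providers_wherobots.operators.run import WherobotsRunOperator
from wherobots.db import Region, Runtime

with DAG(
    dag_id="wherobots_run_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    operator = WherobotsRunOperator(
        task_id="run_geospatial_job",        # illustrative task_id
        name="airflow_run_{{ ts_nodash }}",  # optional; a default name is generated if omitted
        runtime=Runtime.SMALL,               # optional; defaults to Runtime.SMALL
        region=Region.AWS_US_WEST_2,         # compute region
        run_python={
            "uri": "s3://your-bucket/path/to/your_script.py",  # hypothetical path
            "args": [],
        },
        poll_logs=True,                      # stream Job Run logs into the Airflow Task logs
        timeout_seconds=3600,                # default
    )
```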
| Parameter | Type | Description | Accepted values |
|---|---|---|---|
| operator | WherobotsRunOperator | Instantiates WherobotsRunOperator. | |
| name | str | The name of the run. If not specified, a default name is generated. | |
| runtime | enum | Specifies the Wherobots runtime. runtime is optional and defaults to Runtime.SMALL for Professional Edition. See the Runs REST API for more information. | Runtime.TINY, Runtime.SMALL, Runtime.MEDIUM, Runtime.LARGE, Runtime.X_LARGE, Runtime.XX_LARGE, Runtime.XXXX_LARGE, Runtime.MEDIUM_HIMEM, Runtime.LARGE_HIMEM, Runtime.X_LARGE_HIMEM, Runtime.XX_LARGE_HIMEM, Runtime.XXXX_LARGE_HIMEM, Runtime.TINY_A10_GPU, Runtime.SMALL_A10_GPU, or Runtime.MEDIUM_A10_GPU. You can see the runtimes available to your Organization within the Start a Notebook dropdown in Wherobots Cloud. |
| region | enum | The compute region where your workload runs. | Region.AWS_US_WEST_2, Region.AWS_EU_WEST_1, or Region.AWS_US_EAST_1. For more information, see Region enum values. |
| run_python | dict | Provides the Python file’s S3 path. Wherobots Managed Storage and Storage Integration S3 paths are supported. You can use either run_python or run_jar, not both. | Takes the following keys: uri (str) and args (list[str]). |
| poll_logs | bool | Enables log polling when set to True. | True, False |
| timeout_seconds | int | The maximum duration of the Job Run’s execution in seconds, measured from when its status changes to RUNNING. The default value is 3600. | |
| do_xcom_push | bool | When set to True (the default), the operator pushes run_id: {value} into XCom during the Airflow Task execution. | True, False |
Other parameters
The following parameters aren’t included in the example but are also compatible with WherobotsRunOperator.
| Parameter | Type | Description | Accepted values |
|---|---|---|---|
| run_jar | dict | Provides the JAR file’s S3 path. Wherobots Managed Storage and Storage Integration S3 paths are supported. You can use either run_python or run_jar, not both. | Takes the following keys: uri (str), args (list[str]), and mainClass (str). |
| polling_interval | int | The interval, in seconds, at which to poll the status of the run. The default value is 30. | |
| environment | dict | The model for runtime environment configs, including Spark cluster configs and dependencies. Defaults to {}. For more information, see Environment keys. | |
Environment keys
| Environment parameter | Type | Required or Optional | Parameter description |
|---|---|---|---|
| sparkDriverDiskGB | int | Optional | The driver disk size, in GB, of the Spark cluster. |
| sparkExecutorDiskGB | int | Optional | The executor disk size, in GB, of the Spark cluster. |
| sparkConfigs | dict{str: str} | Optional | User-specified Spark configs. |
| dependencies | list[dict] | Optional | Specifies the third-party dependencies to add to the runtime. Required if adding sparkConfigs. Must be an array (list) of objects (dictionaries). Each object must represent either a Python Package Index dependency or a File dependency. |
Dependencies
The following details the third-party dependencies that can be added to the runtime environment.
| Parameter | Type | Details | Rules for accepted values |
|---|---|---|---|
| sourceType | string | Enum for the source type of the dependency. | Valid values: PYPI or FILE. |
| filePath | string | The file path to the dependency file. Can only be used with the FILE sourceType. | Must be a valid file path accessible within the runtime environment. The file extension must be one of: .jar, .whl, .zip, or .json. JSON files can be used to place a configuration file at a predetermined file-system location on all nodes of the compute cluster (/opt/wherobots/<file name>.json). |
| libraryName | string | The Python package name. | Can only be used with the PYPI sourceType. |
| libraryVersion | string | The Python package version. | Can only be used with the PYPI sourceType. |
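For illustration, an environment dictionary combining these keys might look like the following sketch; the Spark config, package name, version, and file path are hypothetical values:

```python
# A hypothetical environment value built from the keys documented above.
environment = {
    "sparkDriverDiskGB": 50,
    "sparkExecutorDiskGB": 50,
    "sparkConfigs": {
        "spark.sql.shuffle.partitions": "200",  # example user-specified Spark config
    },
    "dependencies": [
        {
            # Python Package Index dependency
            "sourceType": "PYPI",
            "libraryName": "shapely",
            "libraryVersion": "2.0.4",
        },
        {
            # File dependency; extension must be .jar, .whl, .zip, or .json
            "sourceType": "FILE",
            "filePath": "s3://your-bucket/libs/example-udf.jar",
        },
    ],
}
```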
JAR and Python Scripts
You can use Python or JAR files to further leverage Wherobots’ geospatial features and optimizations in conjunction with an Airflow DAG.
Tile generation
This example uses Sedona to process and generate vector tiles. The example reads buildings and roads data from Wherobots open datasets, filters the data based on a specified region, and then generates vector tiles using the wherobots.vtiles library. It writes the tiles to a
PMTiles file in the user’s S3 storage and displays a sample of the tiles.
Tile generation example Python script
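The original script is not reproduced here; the following is a rough sketch of the flow just described. Treat the table names, the bounding box, the output path, and the generate_quick_pmtiles entry point as assumptions; consult the wherobots.vtiles documentation for the exact API.

```python
# Sketch only: read buildings and roads, filter to a region, write PMTiles.
from sedona.spark import SedonaContext
from wherobots import vtiles  # assumed import path

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Hypothetical open-data tables; substitute the catalog names visible in your Organization.
buildings = sedona.table("wherobots_open_data.overture.buildings_building")
roads = sedona.table("wherobots_open_data.overture.transportation_segment")

# Filter features to a bounding box of interest (illustrative coordinates).
region_filter = "ST_Intersects(geometry, ST_PolygonFromEnvelope(-122.5, 47.4, -122.2, 47.7))"
features = buildings.where(region_filter).unionByName(
    roads.where(region_filter), allowMissingColumns=True
)

# Generate vector tiles and write a PMTiles archive to your S3 storage (assumed call).
vtiles.generate_quick_pmtiles(features, "s3://your-bucket/tiles/region.pmtiles")
```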
Raster inference
This example performs batch image classification using Sedona and Wherobots. It reads raster data from an S3 bucket, applies a classification model to each image, extracts the most confident prediction for each image, and displays a few sample images along with their predicted labels.
Raster inference is not available to Community Edition Organizations. To perform raster inference, you need access to Wherobots’ GPU-Optimized runtimes. For more information on GPU-Optimized runtimes, see Runtimes.
For more information on the different Organization plans available, see Wherobots Pricing.
Raster inference example script
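As with the previous example, the original script is not reproduced here; the sketch below illustrates the described flow. The S3 path, model identifier, and the RS_CLASSIFY / RS_MAX_CONFIDENCE function names are assumptions; consult the Wherobots raster inference documentation for the exact API.

```python
# Sketch only: batch image classification over rasters stored in S3.
from pyspark.sql.functions import expr
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Build out-of-database rasters from files under an S3 prefix (hypothetical path).
files = sedona.read.format("binaryFile").load("s3://your-bucket/imagery/*.tif")
rasters = files.withColumn("raster", expr("RS_FromPath(path)"))
rasters.createOrReplaceTempView("rasters")

# Apply a classification model and keep the most confident prediction per image
# (the function names and model id are assumptions).
predictions = sedona.sql("""
    SELECT
        path,
        RS_MAX_CONFIDENCE(RS_CLASSIFY('your-model-id', raster)) AS top_prediction
    FROM rasters
""")
predictions.show(5, truncate=False)
```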
Reviewing logs
WherobotsRunOperator removes the need to write your own logic to poll the
Wherobots Job Run logs. Airflow users can access the logs by
specifying poll_logs=True in a DAG. Wherobots polls the logs
and streams them in the Airflow Task logs.
Airflow UI
When you start your Airflow DAG from the Airflow server, logs can be seen in the Airflow Server’s Logs tab. Relevant logs begin with the line INFO - === Logs for Run <run_id> Start:. The run_id
can be found in the Airflow Server’s XCom tab.
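For example, a downstream task can pull the Job Run ID from XCom; the upstream task_ids value is illustrative, and the run_id key follows the do_xcom_push behavior described above:

```python
# Pull the run_id pushed by WherobotsRunOperator when do_xcom_push=True.
from airflow.decorators import task

@task
def report_run_id(ti=None):
    # "run_geospatial_job" is an illustrative upstream task_id.
    run_id = ti.xcom_pull(task_ids="run_geospatial_job", key="run_id")
    print(f"Wherobots Job Run ID: {run_id}")
```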
Monitor Job Runs in Wherobots Cloud
The Job Runs section of Wherobots Cloud lets you monitor the progress of Job Runs, understand resource consumption, review the execution timeline, and verify configuration settings. For more information, see the Monitor Job Runs documentation.
Feature Limitations
WherobotsRunOperator has the following limitations:
- Wherobots only stores logs for 30 days.
- The number of concurrent Job Runs is determined by your Organization’s maximum concurrent Spatial Unit consumption quota. For more information on Spatial Units and quotas, see Runtimes.

