This guide walks through configuring AWS Glue Catalog as an Iceberg catalog for Wherobots. This integration allows you to manage Iceberg tables in AWS Glue and query them from Wherobots Spark workloads.

Before you start

Before starting, ensure you have the following:
  • An AWS Account with an existing Glue database
  • An S3 bucket for Glue table data

Integration workflow

This integration requires configuring both the AWS Console and Wherobots Cloud. We recommend opening both in separate browser tabs. Complete these steps in order:
  1. AWS Console: Add Glue permissions to your Storage Integration IAM role
  2. Wherobots Cloud: Configure your Spark session and test the connection

AWS Console

The Storage Integration creates an IAM role with S3 permissions. To access Glue Catalog, you must add Glue permissions to this same role.
Step 1: Locate the IAM role

  1. Open the AWS Console and navigate to IAM > Roles.
  2. Find the role created by your Storage Integration (the role name you specified during setup).
  3. Click on the role to view its details.
Step 2: Update the permissions policy

Edit the role’s permissions policy to include both S3 and Glue access. Replace the existing policy with the following, substituting your values for the placeholders:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3BucketAccess",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<YOUR_BUCKET_NAME>"
        },
        {
            "Sid": "S3ObjectAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
        },
        {
            "Sid": "GlueCatalogAccess",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetPartitions",
                "glue:BatchCreatePartition",
                "glue:BatchDeletePartition",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "arn:aws:glue:<YOUR_REGION>:<YOUR_AWS_ACCOUNT_ID>:catalog",
                "arn:aws:glue:<YOUR_REGION>:<YOUR_AWS_ACCOUNT_ID>:database/<YOUR_GLUE_DATABASE_NAME>",
                "arn:aws:glue:<YOUR_REGION>:<YOUR_AWS_ACCOUNT_ID>:table/<YOUR_GLUE_DATABASE_NAME>/*"
            ]
        }
    ]
}

Placeholder reference

  • <YOUR_AWS_ACCOUNT_ID>: Your 12-digit AWS account ID
  • <YOUR_REGION>: AWS region (e.g., us-west-2)
  • <YOUR_BUCKET_NAME>: S3 bucket containing Glue table data
  • <YOUR_GLUE_DATABASE_NAME>: Name of your Glue database
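If you maintain several environments, filling the policy template by hand is error-prone. The following sketch shows one way to substitute the placeholders programmatically in Python; the `fill_policy` helper and the abbreviated template are illustrative only (use the full policy above in practice), not part of any Wherobots or AWS API.

```python
import json

# Abbreviated copy of the policy template above, for illustration only.
POLICY_TEMPLATE = """{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3BucketAccess",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<YOUR_BUCKET_NAME>"
        },
        {
            "Sid": "GlueCatalogAccess",
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTables"],
            "Resource": ["arn:aws:glue:<YOUR_REGION>:<YOUR_AWS_ACCOUNT_ID>:catalog"]
        }
    ]
}"""

def fill_policy(template: str, values: dict) -> dict:
    # Replace each <PLACEHOLDER> and parse the result to verify valid JSON.
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return json.loads(template)

policy = fill_policy(POLICY_TEMPLATE, {
    "<YOUR_BUCKET_NAME>": "my-bucket",
    "<YOUR_REGION>": "us-west-2",
    "<YOUR_AWS_ACCOUNT_ID>": "123456789012",
})
print(policy["Statement"][0]["Resource"])  # arn:aws:s3:::my-bucket
```

Parsing the filled template with `json.loads` also catches stray placeholders or typos before you paste the policy into the IAM console.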

Wherobots Cloud

After completing the IAM role configuration in the AWS Console, continue here to configure and test the Glue Catalog connection in a Wherobots notebook.
Step 1: Configure Spark session variables

Understanding CATALOG_NAME: The CATALOG_NAME variable is a local alias for your Spark session. It does not need to match your Glue database name. For example, you can set CATALOG_NAME = "glue_catalog" even if your Glue database is named my_production_db. The catalog name is simply how you reference the connection in Spark SQL queries.
Set the following variables at the top of your notebook or script:
from sedona.spark import SedonaContext
import os

# CATALOG_NAME: Local alias for this Spark session.
# This is NOT your Glue database name - it can be any valid identifier.
# Avoid hyphens; use underscores instead.
CATALOG_NAME = "glue_catalog"

# AWS account ID that owns the Glue Catalog (12-digit number)
ACCOUNT_ID = "<YOUR_AWS_ACCOUNT_ID>"

# Wherobots Organization ID
# Found at: https://cloud.wherobots.com/organization
ORG_ID = os.environ['USER_ORG_ID']

# IAM role name created by the Storage Integration
ROLE_NAME = "<YOUR_ROLE_NAME>"

# AWS region where your Glue Catalog and S3 bucket reside
REGION = "<YOUR_REGION>"
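Because a hyphen in CATALOG_NAME later surfaces as an "Invalid table identifier" error (see Troubleshooting), it can be worth validating the alias up front. The `is_valid_catalog_name` helper below is an illustrative sketch, not part of the Sedona or Wherobots API.

```python
import re

# Illustrative check (not a Sedona/Wherobots API): a safe catalog alias
# contains only letters, digits, and underscores, and does not start
# with a digit. Hyphens in particular break table identifier parsing.
def is_valid_catalog_name(name: str) -> bool:
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) is not None

assert is_valid_catalog_name("glue_catalog")      # valid alias
assert not is_valid_catalog_name("glue-catalog")  # hyphen: rejected
```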
Step 2: Build the Spark session

# Derived values (do not modify)
iam_role_arn = f"arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}"
external_id = f"{ORG_ID}:wherobots-workloads"

# Build Spark session with Glue Catalog
config = (
    SedonaContext.builder()
        .config("spark.sql.defaultCatalog", CATALOG_NAME)
        .config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog")
        .config(f"spark.sql.catalog.{CATALOG_NAME}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
        .config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", "s3://<YOUR_BUCKET_NAME>/<PATH_TO_TABLES>/")
        .config(f"spark.sql.catalog.{CATALOG_NAME}.client.factory", "com.wherobots.iceberg.aws.WherobotsStIntCredentialsFactory")
        .config(f"spark.sql.catalog.{CATALOG_NAME}.client.assume-role.arn", iam_role_arn)
        .config(f"spark.sql.catalog.{CATALOG_NAME}.client.assume-role.region", REGION)
        .config(f"spark.sql.catalog.{CATALOG_NAME}.client.credentials-provider.external-id", external_id)
        .config(f"spark.sql.catalog.{CATALOG_NAME}.client.assume-role.external-id", external_id)
        .config(f"spark.sql.catalog.{CATALOG_NAME}.glue.account-id", ACCOUNT_ID)
        .getOrCreate()
)

sedona = SedonaContext.create(config)

Configuration parameters explained

  • catalog-impl: Specifies Glue as the catalog backend
  • client.factory: Wherobots credential factory for Storage Integration
  • client.assume-role.arn: IAM role ARN to assume for AWS access
  • credentials-provider.external-id: Security token for role assumption
  • glue.account-id: Directs queries to your AWS account’s Glue Catalog
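If you prefer to keep the catalog settings in one place, the repeated .config() calls can equivalently be expressed as a dict and applied in a loop. This sketch lists the core keys only; the assume-role and external-id entries follow the same pattern, and the bracketed values are placeholders exactly as in the builder above.

```python
# Sketch: the catalog properties from this guide collected in one dict.
# Keys mirror the .config() calls above; bracketed values are placeholders.
CATALOG_NAME = "glue_catalog"

catalog_conf = {
    f"spark.sql.catalog.{CATALOG_NAME}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{CATALOG_NAME}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    f"spark.sql.catalog.{CATALOG_NAME}.warehouse": "s3://<YOUR_BUCKET_NAME>/<PATH_TO_TABLES>/",
    f"spark.sql.catalog.{CATALOG_NAME}.client.factory": "com.wherobots.iceberg.aws.WherobotsStIntCredentialsFactory",
    f"spark.sql.catalog.{CATALOG_NAME}.glue.account-id": "<YOUR_AWS_ACCOUNT_ID>",
}

# Applying them is then a loop over builder.config(key, value):
#     for key, value in catalog_conf.items():
#         builder = builder.config(key, value)
for key, value in catalog_conf.items():
    print(key, "=", value)
```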
Step 3: Test the connection

Run the following commands in your Wherobots notebook to verify the configuration is working correctly.
List available databases:
sedona.sql("SHOW DATABASES").show()
You should see your Glue database in the output.
Select your database:
# Replace with your actual Glue database name
sedona.sql("USE <YOUR_GLUE_DATABASE_NAME>")
Step 4: Create a test table with geometry (Iceberg V3)

Iceberg V3 supports geometry and geography types natively. Create a table with a geometry column:
sedona.sql("""
    CREATE TABLE IF NOT EXISTS test_geom_table (
        id BIGINT,
        name STRING,
        location GEOMETRY
    ) USING iceberg
    TBLPROPERTIES (
        'format-version' = '3'
    )
""")
Insert test data with geometry:
sedona.sql("""
    INSERT INTO test_geom_table VALUES
        (1, 'P1', ST_GeomFromWKT('POINT(-122.4194 37.7749)')),
        (2, 'P2', ST_GeomFromWKT('POINT(-118.2437 34.0522)'))
""")
Query the data:
sedona.sql("SELECT * FROM test_geom_table").show()
Verify table creation:
sedona.sql("SHOW TABLES").show()
Step 5: Write and read back a DataFrame

You can write DataFrames directly to the Glue Catalog using writeTo():
# Load sample data from Wherobots Open Data
df = sedona.table("wherobots_open_data.overture_maps_foundation.buildings_building").limit(100)

# Write to Glue Catalog (creates or replaces the table)
df.writeTo("glue_catalog.<YOUR_GLUE_DATABASE_NAME>.buildings_sample").createOrReplace()

# Verify the schema shows geometry column
sedona.sql("DESCRIBE glue_catalog.<YOUR_GLUE_DATABASE_NAME>.buildings_sample").show()
The DESCRIBE output should show the geometry column with its proper type, confirming that spatial data is preserved when writing to Glue.
Read data back and verify geometry:
# Read the table back
df_from_glue = sedona.table("glue_catalog.<YOUR_GLUE_DATABASE_NAME>.buildings_sample")

# Check the schema
df_from_glue.printSchema()

# Verify geometry operations work
df_from_glue.selectExpr("ST_Area(geometry) as area").show(5)

Troubleshooting

Invalid table identifier

Error: IllegalArgumentException: Invalid table identifier: my-table
Cause: Hyphens in the catalog name cause parsing issues.
Solution: Use underscores instead of hyphens in CATALOG_NAME. For example, use glue_catalog instead of glue-catalog.

Database not found

Error: EntityNotFoundException: Database default not found
Cause: Attempting to use a database that doesn’t exist in Glue.
Solution: Run SHOW DATABASES to see available databases, then USE <database_name> with the correct name.

AccessDenied on glue:GetDatabases

Error: AccessDeniedException: User is not authorized to perform: glue:GetDatabases
Cause: The IAM policy is missing Glue permissions or has incorrect resource ARNs.
Solution:
  • Verify the IAM policy includes all Glue actions listed in the AWS Console section.
  • Check that resource ARNs match your account ID, region, and database name.
  • Ensure glue.account-id is set in the Spark config.

sts:AssumeRole not authorized

Error: StsException: User is not authorized to perform: sts:AssumeRole
Cause: The IAM role’s trust policy doesn’t allow Wherobots to assume it.
Solution:
  • Verify the Storage Integration is properly configured.
  • Check that the trust policy matches what Wherobots provided during Storage Integration setup.
  • Ensure the External ID format is correct: <ORG_ID>:wherobots-workloads.
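A malformed External ID is easy to catch locally before the role assumption ever reaches AWS. The `is_valid_external_id` helper below is an illustrative sketch based on the format stated in this guide, not part of any SDK.

```python
import re

# Illustrative check: per this guide, the External ID must be your
# Wherobots Org ID followed by the literal suffix ":wherobots-workloads".
def is_valid_external_id(external_id: str) -> bool:
    return re.fullmatch(r"[^:\s]+:wherobots-workloads", external_id) is not None

assert is_valid_external_id("org-abc123:wherobots-workloads")  # well-formed
assert not is_valid_external_id("org-abc123")                  # missing suffix
```

(The Org ID "org-abc123" is a made-up example; use the value from your Organization page.)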

Glue queries hitting wrong AWS account

Error: Queries fail with access denied errors referencing an unexpected AWS account ID.
Cause: Missing glue.account-id configuration.
Solution: Ensure .config(f"spark.sql.catalog.{CATALOG_NAME}.glue.account-id", ACCOUNT_ID) is included in the Spark session builder.

Quick reference

The following tables summarize key placeholders and configuration values used in this integration guide for quick reference.

Placeholders

  • <YOUR_AWS_ACCOUNT_ID>: 12-digit AWS account ID
  • <YOUR_WHEROBOTS_ORG_ID>: From the Wherobots console Organization page
  • <YOUR_ROLE_NAME>: IAM role name from Storage Integration setup
  • <YOUR_REGION>: AWS region (e.g., us-west-2)
  • <YOUR_BUCKET_NAME>: S3 bucket for Glue table data
  • <YOUR_GLUE_DATABASE_NAME>: Name of your Glue database
  • <PATH_TO_TABLES>: S3 path prefix for table storage

Key values

  • Wherobots Credentials Factory: com.wherobots.iceberg.aws.WherobotsStIntCredentialsFactory
  • External ID Format: <ORG_ID>:wherobots-workloads
  • Glue Catalog Implementation: org.apache.iceberg.aws.glue.GlueCatalog
  • Storage Integration Setup: S3 Storage Integration