# Connect to AWS Glue Catalog

> Configure AWS Glue Catalog as an Iceberg catalog for Wherobots to manage and query Iceberg tables from Wherobots Spark workloads.

This guide shows how to add Glue permissions to your Storage Integration IAM role and configure a Wherobots Spark session, so that you can create, write, and query Iceberg tables registered in AWS Glue.

## Before you start

Before starting, ensure you have the following:

<AccordionGroup cols={2}>
  <Accordion title="Wherobots Requirements" icon="cloud">
    * A **Wherobots Account** with access to your Organization ID (found at [cloud.wherobots.com/organization](https://cloud.wherobots.com/organization))
    * A **Storage Integration** configured in Wherobots, pointing to the S3 bucket that contains your Glue table data. See [S3 Storage Integration](/develop/storage-management/s3-storage-integration) for setup instructions.
  </Accordion>

  <Accordion title="AWS Requirements" icon="aws">
    * An **AWS Account** with an existing Glue database
    * An **S3 bucket** for Glue table data
  </Accordion>
</AccordionGroup>

## Integration workflow

This integration requires configuring both the **AWS Console** and **Wherobots Cloud**. We recommend opening both in separate browser tabs.

Complete these steps in order:

1. **AWS Console:** Add Glue permissions to your Storage Integration IAM role
2. **Wherobots Cloud:** Configure your Spark session and test the connection

## AWS Console

<Info>
  **What you'll need from Wherobots**

  Before starting, make sure you have:

  * Your **Wherobots Organization ID** (found at [cloud.wherobots.com/organization](https://cloud.wherobots.com/organization))
  * A **Storage Integration** already configured in Wherobots, pointing to the S3 bucket that contains your Glue table data. See [S3 Storage Integration](/develop/storage-management/s3-storage-integration) for setup instructions.
</Info>

The Storage Integration creates an IAM role with S3 permissions. To access Glue Catalog, you must add Glue permissions to this same role.

<Steps>
  <Step title="Locate the IAM role">
    1. Open the AWS Console and navigate to **IAM > Roles**.
    2. Find the role created by your Storage Integration (the role name you specified during setup).
    3. Click on the role to view its details.
  </Step>

  <Step title="Update the permissions policy">
    Edit the role's permissions policy to include both S3 and Glue access. Replace the existing policy with the following, substituting your values for the placeholders:

    ```json theme={"system"}
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "S3BucketAccess",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": "arn:aws:s3:::<YOUR_BUCKET_NAME>"
            },
            {
                "Sid": "S3ObjectAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
            },
            {
                "Sid": "GlueCatalogAccess",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                    "glue:GetPartitions",
                    "glue:BatchCreatePartition",
                    "glue:BatchDeletePartition",
                    "glue:BatchGetPartition"
                ],
                "Resource": [
                    "arn:aws:glue:<YOUR_REGION>:<YOUR_AWS_ACCOUNT_ID>:catalog",
                    "arn:aws:glue:<YOUR_REGION>:<YOUR_AWS_ACCOUNT_ID>:database/<YOUR_GLUE_DATABASE_NAME>",
                    "arn:aws:glue:<YOUR_REGION>:<YOUR_AWS_ACCOUNT_ID>:table/<YOUR_GLUE_DATABASE_NAME>/*"
                ]
            }
        ]
    }
    ```

    #### Placeholder reference

    | Placeholder                 | Description                          |
    | :-------------------------- | :----------------------------------- |
    | `<YOUR_AWS_ACCOUNT_ID>`     | Your 12-digit AWS account ID         |
    | `<YOUR_REGION>`             | AWS region (e.g., `us-west-2`)       |
    | `<YOUR_BUCKET_NAME>`        | S3 bucket containing Glue table data |
    | `<YOUR_GLUE_DATABASE_NAME>` | Name of your Glue database           |
  </Step>
</Steps>
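
If you prefer to generate the policy rather than edit the JSON by hand, the placeholder substitution above can be sketched in Python. This is a minimal sketch, not an official Wherobots or AWS tool: the function name `build_glue_policy` and the example values are hypothetical, and the output simply mirrors the policy document shown in this section.

```python
import json
import re


def build_glue_policy(account_id: str, region: str, bucket: str, database: str) -> dict:
    """Return the S3 + Glue policy from this guide with the placeholders filled in."""
    # AWS account IDs are 12 digits; reject anything else early.
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError("account_id must be a 12-digit string")

    glue_arn = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "S3BucketAccess",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "S3ObjectAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "GlueCatalogAccess",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                    "glue:GetPartitions",
                    "glue:BatchCreatePartition",
                    "glue:BatchDeletePartition",
                    "glue:BatchGetPartition",
                ],
                "Resource": [
                    f"{glue_arn}:catalog",
                    f"{glue_arn}:database/{database}",
                    f"{glue_arn}:table/{database}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Example values only (hypothetical); substitute your own.
    policy = build_glue_policy("123456789012", "us-west-2", "my-bucket", "my_production_db")
    print(json.dumps(policy, indent=4))
```

Paste the printed JSON into the role's permissions policy editor. Generating the ARNs from one set of inputs helps keep the account ID, region, and database name consistent across all three Glue resources.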

## Wherobots Cloud

After completing the IAM role configuration in the AWS Console, continue here to configure and test the Glue Catalog connection in a Wherobots notebook.

<Steps>
  <Step title="Configure Spark session variables">
    <Warning>
      **Understanding `CATALOG_NAME`**: The `CATALOG_NAME` variable is a local alias for your Spark session. It does not need to match your Glue database name. For example, you can set `CATALOG_NAME = "glue_catalog"` even if your Glue database is named `my_production_db`. The catalog name is simply how you reference the connection in Spark SQL queries.
    </Warning>

    Set the following variables at the top of your notebook or script:

    ```python theme={"system"}
    from sedona.spark import SedonaContext
    import os

    # CATALOG_NAME: Local alias for this Spark session.
    # This is NOT your Glue database name - it can be any valid identifier.
    # Avoid hyphens; use underscores instead.
    CATALOG_NAME = "glue_catalog"

    # AWS account ID that owns the Glue Catalog (12-digit number)
    ACCOUNT_ID = "<YOUR_AWS_ACCOUNT_ID>"

    # Wherobots Organization ID
    # Found at: https://cloud.wherobots.com/organization
    ORG_ID = os.environ['USER_ORG_ID']

    # IAM role name created by the Storage Integration
    ROLE_NAME = "<YOUR_ROLE_NAME>"

    # AWS region where your Glue Catalog and S3 bucket reside
    REGION = "<YOUR_REGION>"
    ```
  </Step>

  <Step title="Build the Spark session">
    ```python theme={"system"}
    # Derived values (do not modify)
    iam_role_arn = f"arn:aws:iam::{ACCOUNT_ID}:role/{ROLE_NAME}"
    external_id = f"{ORG_ID}:wherobots-workloads"

    # Build Spark session with Glue Catalog
    config = (
        SedonaContext.builder()
            .config("spark.sql.defaultCatalog", CATALOG_NAME)
            .config(f"spark.sql.catalog.{CATALOG_NAME}", "org.apache.iceberg.spark.SparkCatalog")
            .config(f"spark.sql.catalog.{CATALOG_NAME}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
            .config(f"spark.sql.catalog.{CATALOG_NAME}.warehouse", "s3://<YOUR_BUCKET_NAME>/<PATH_TO_TABLES>/")
            .config(f"spark.sql.catalog.{CATALOG_NAME}.client.factory", "com.wherobots.iceberg.aws.WherobotsStIntCredentialsFactory")
            .config(f"spark.sql.catalog.{CATALOG_NAME}.client.assume-role.arn", iam_role_arn)
            .config(f"spark.sql.catalog.{CATALOG_NAME}.client.assume-role.region", REGION)
            .config(f"spark.sql.catalog.{CATALOG_NAME}.client.credentials-provider.external-id", external_id)
            .config(f"spark.sql.catalog.{CATALOG_NAME}.client.assume-role.external-id", external_id)
            .config(f"spark.sql.catalog.{CATALOG_NAME}.glue.account-id", ACCOUNT_ID)
            .getOrCreate()
    )

    sedona = SedonaContext.create(config)
    ```

    #### Configuration parameters explained

    | Parameter                          | Purpose                                              |
    | :--------------------------------- | :--------------------------------------------------- |
    | `catalog-impl`                     | Specifies Glue as the catalog backend                |
    | `client.factory`                   | Wherobots credential factory for Storage Integration |
    | `client.assume-role.arn`           | IAM role ARN to assume for AWS access                |
    | `credentials-provider.external-id` | External ID sent with the assume-role request        |
    | `glue.account-id`                  | Directs queries to your AWS account's Glue           |
  </Step>

  <Step title="Test the connection">
    Run the following commands in your Wherobots notebook to verify the configuration is working correctly.

    **List available databases:**

    ```python theme={"system"}
    sedona.sql("SHOW DATABASES").show()
    ```

    You should see your Glue database in the output.

    **Select your database:**

    ```python theme={"system"}
    # Replace with your actual Glue database name
    sedona.sql("USE <YOUR_GLUE_DATABASE_NAME>")
    ```
  </Step>

  <Step title="Create a test table with geometry (Iceberg V3)">
    Iceberg V3 supports geometry and geography types natively. Create a table with a geometry column:

    ```python theme={"system"}
    sedona.sql("""
        CREATE TABLE IF NOT EXISTS test_geom_table (
            id BIGINT,
            name STRING,
            location GEOMETRY
        ) USING iceberg
        TBLPROPERTIES (
            'format-version' = '3'
        )
    """)
    ```

    **Insert test data with geometry:**

    ```python theme={"system"}
    sedona.sql("""
        INSERT INTO test_geom_table VALUES
            (1, 'P1', ST_GeomFromWKT('POINT(-122.4194 37.7749)')),
            (2, 'P2', ST_GeomFromWKT('POINT(-118.2437 34.0522)'))
    """)
    ```

    **Query the data:**

    ```python theme={"system"}
    sedona.sql("SELECT * FROM test_geom_table").show()
    ```

    **Verify table creation:**

    ```python theme={"system"}
    sedona.sql("SHOW TABLES").show()
    ```
  </Step>

  <Step title="Write and read back a DataFrame">
    You can write DataFrames directly to the Glue Catalog using `writeTo()`:

    ```python theme={"system"}
    # Load sample data from Wherobots Open Data
    df = sedona.table("wherobots_open_data.overture_maps_foundation.buildings_building").limit(100)

    # Write to Glue Catalog (creates or replaces the table).
    # CATALOG_NAME is the session alias set earlier ("glue_catalog" in this guide).
    df.writeTo(f"{CATALOG_NAME}.<YOUR_GLUE_DATABASE_NAME>.buildings_sample").createOrReplace()

    # Verify the schema shows geometry column
    sedona.sql(f"DESCRIBE {CATALOG_NAME}.<YOUR_GLUE_DATABASE_NAME>.buildings_sample").show()
    ```

    The `DESCRIBE` output should show the geometry column with its proper type, confirming that spatial data is preserved when writing to Glue.

    **Read data back and verify geometry:**

    ```python theme={"system"}
    # Read the table back (CATALOG_NAME is the session alias set earlier)
    df_from_glue = sedona.table(f"{CATALOG_NAME}.<YOUR_GLUE_DATABASE_NAME>.buildings_sample")

    # Check the schema
    df_from_glue.printSchema()

    # Verify geometry operations work
    df_from_glue.selectExpr("ST_Area(geometry) as area").show(5)
    ```
  </Step>
</Steps>
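
The session settings from the **Build the Spark session** step can also be collected in a plain dictionary, which makes them easy to print or review before the session is created. The following is a minimal sketch under that assumption: `glue_catalog_conf` and `apply_conf` are hypothetical helper names, not part of Sedona or Wherobots, and the keys and values are the ones already used in this guide.

```python
def glue_catalog_conf(catalog_name, account_id, org_id, role_name, region, warehouse):
    """Return the Spark settings from this guide as a dict of key -> value."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    iam_role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    external_id = f"{org_id}:wherobots-workloads"
    return {
        "spark.sql.defaultCatalog": catalog_name,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.client.factory": "com.wherobots.iceberg.aws.WherobotsStIntCredentialsFactory",
        f"{prefix}.client.assume-role.arn": iam_role_arn,
        f"{prefix}.client.assume-role.region": region,
        f"{prefix}.client.credentials-provider.external-id": external_id,
        f"{prefix}.client.assume-role.external-id": external_id,
        f"{prefix}.glue.account-id": account_id,
    }


def apply_conf(builder, conf):
    """Call builder.config(key, value) for every setting and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage in a Wherobots notebook (shown as comments because it needs a live environment):
#
#   from sedona.spark import SedonaContext
#   conf = glue_catalog_conf(CATALOG_NAME, ACCOUNT_ID, ORG_ID, ROLE_NAME, REGION,
#                            "s3://<YOUR_BUCKET_NAME>/<PATH_TO_TABLES>/")
#   spark = apply_conf(SedonaContext.builder(), conf).getOrCreate()
#   sedona = SedonaContext.create(spark)
```

Keeping the settings in one dictionary is a design choice rather than a requirement; chaining `.config(...)` calls directly, as in the step above, has the same effect.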

## Troubleshooting

### Invalid table identifier

**Error:** `IllegalArgumentException: Invalid table identifier: my-table`

**Cause:** Spark SQL cannot parse unquoted identifiers that contain hyphens. In this setup the usual culprit is a hyphenated `CATALOG_NAME`.

**Solution:** Use underscores instead of hyphens in `CATALOG_NAME`. For example, use `glue_catalog` instead of `glue-catalog`.
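
A quick pre-flight check can catch a problematic catalog name before the session is built. This is a minimal sketch: the helper name `is_safe_catalog_name` is hypothetical, and the pattern (letters, digits, and underscores, not starting with a digit) is a deliberately conservative assumption rather than Spark's exact identifier grammar.

```python
import re

# Conservative pattern: letters, digits, underscores; must not start with a digit.
_SAFE_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_safe_catalog_name(name: str) -> bool:
    """Return True if `name` can be used unquoted as a catalog alias."""
    return _SAFE_IDENTIFIER.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("glue_catalog", "glue-catalog"):
        print(candidate, is_safe_catalog_name(candidate))
```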

### Database not found

**Error:** `EntityNotFoundException: Database default not found`

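**Cause:** The session is using a database that doesn't exist in Glue. Spark starts in the `default` namespace unless you select another one, so this error often appears before any `USE` statement has run.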

**Solution:** Run `SHOW DATABASES` to see available databases, then `USE <database_name>` with the correct name.
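
The two commands in the solution can be combined into a small guard that fails with a readable message when the database is missing. This is a sketch only: `use_database` is a hypothetical helper, and it assumes the first column of the `SHOW DATABASES` result holds the database name.

```python
def use_database(session, database: str) -> None:
    """Switch to `database` only if SHOW DATABASES lists it."""
    # Each result row carries the database (namespace) name in its first column.
    available = [row[0] for row in session.sql("SHOW DATABASES").collect()]
    if database not in available:
        raise ValueError(f"Database {database!r} not found. Available: {available}")
    session.sql(f"USE {database}")


# Usage in a Wherobots notebook, where `sedona` is the session created earlier:
#
#   use_database(sedona, "<YOUR_GLUE_DATABASE_NAME>")
```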

### AccessDenied on glue:GetDatabases

**Error:** `AccessDeniedException: User is not authorized to perform: glue:GetDatabases`

**Cause:** The IAM policy is missing Glue permissions or has incorrect resource ARNs.

**Solution:**

* Verify the IAM policy includes all Glue actions listed in the [AWS Console](#aws-console) section.
* Check that resource ARNs match your account ID, region, and database name.
* Ensure `glue.account-id` is set in the Spark config.

### sts:AssumeRole not authorized

**Error:** `StsException: User is not authorized to perform: sts:AssumeRole`

**Cause:** The IAM role's trust policy doesn't allow Wherobots to assume it.

**Solution:**

* Verify the Storage Integration is properly configured.
* Check that the trust policy matches what Wherobots provided during Storage Integration setup.
* Ensure the External ID format is correct: `<ORG_ID>:wherobots-workloads`.
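
To check the External ID, you can compare the expected value against the `sts:ExternalId` condition in the role's trust policy (copied from **IAM > Roles > Trust relationships**). The sketch below assumes the trust policy stores the External ID under a `StringEquals` condition, which is the common AWS pattern; the policy Wherobots provided during Storage Integration setup may be structured differently, and the helper names are hypothetical.

```python
def expected_external_id(org_id: str) -> str:
    """Build the External ID in the format used by this guide."""
    return f"{org_id}:wherobots-workloads"


def trust_policy_external_ids(trust_policy: dict) -> set:
    """Collect every sts:ExternalId value found in StringEquals conditions."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]

    found = set()
    for statement in statements:
        condition = statement.get("Condition", {})
        value = condition.get("StringEquals", {}).get("sts:ExternalId")
        if isinstance(value, str):
            found.add(value)
        elif isinstance(value, list):
            found.update(value)
    return found


def external_id_matches(trust_policy: dict, org_id: str) -> bool:
    """Return True if the trust policy allows the expected External ID."""
    return expected_external_id(org_id) in trust_policy_external_ids(trust_policy)


# Usage: load the trust policy JSON you copied from the AWS Console, then
#
#   import json
#   trust_policy = json.loads(trust_policy_json_text)
#   external_id_matches(trust_policy, ORG_ID)
```

A `False` result only means the External ID was not found where this sketch looks for it; confirm against the trust policy from your Storage Integration setup before changing anything.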

### Glue queries hitting wrong AWS account

**Error:** Queries fail with access denied errors referencing an unexpected AWS account ID.

**Cause:** Missing `glue.account-id` configuration.

**Solution:** Ensure `.config(f"spark.sql.catalog.{CATALOG_NAME}.glue.account-id", ACCOUNT_ID)` is included in the Spark session builder.
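
To confirm which catalog settings a running session actually has, you can look them up by key. This is a minimal sketch: `missing_catalog_settings` is a hypothetical helper, the list of required keys is taken from the session builder in this guide, and it assumes the lookup function accepts `(key, default)` in the way PySpark's `spark.conf.get` does.

```python
# Setting names (relative to spark.sql.catalog.<CATALOG_NAME>.) used in this guide.
REQUIRED_SUFFIXES = (
    "catalog-impl",
    "warehouse",
    "client.factory",
    "client.assume-role.arn",
    "client.assume-role.region",
    "client.credentials-provider.external-id",
    "client.assume-role.external-id",
    "glue.account-id",
)


def missing_catalog_settings(get_conf, catalog_name: str) -> list:
    """Return the required setting suffixes that are unset or empty."""
    prefix = f"spark.sql.catalog.{catalog_name}."
    return [suffix for suffix in REQUIRED_SUFFIXES if not get_conf(prefix + suffix, None)]


# Usage in a Wherobots notebook, where `sedona` is the session created earlier:
#
#   print(missing_catalog_settings(sedona.conf.get, CATALOG_NAME))
```

An empty list means every setting from this guide is present in the session; if `glue.account-id` appears in the result, add it to the builder and restart the session.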

## Quick reference

The following tables summarize the placeholders and key configuration values used in this guide.

### Placeholders

| Placeholder                 | Description                                  |
| :-------------------------- | :------------------------------------------- |
| `<YOUR_AWS_ACCOUNT_ID>`     | 12-digit AWS account ID                      |
| `<YOUR_WHEROBOTS_ORG_ID>`   | From Wherobots console Organization page     |
| `<YOUR_ROLE_NAME>`          | IAM role name from Storage Integration setup |
| `<YOUR_REGION>`             | AWS region (e.g., `us-west-2`)               |
| `<YOUR_BUCKET_NAME>`        | S3 bucket for Glue table data                |
| `<YOUR_GLUE_DATABASE_NAME>` | Name of your Glue database                   |
| `<PATH_TO_TABLES>`          | S3 path prefix for table storage             |

### Key values

| Item                          | Value                                                                        |
| :---------------------------- | :--------------------------------------------------------------------------- |
| Wherobots Credentials Factory | `com.wherobots.iceberg.aws.WherobotsStIntCredentialsFactory`                 |
| External ID Format            | `<ORG_ID>:wherobots-workloads`                                               |
| Glue Catalog Implementation   | `org.apache.iceberg.aws.glue.GlueCatalog`                                    |
| Storage Integration Setup     | [S3 Storage Integration](/develop/storage-management/s3-storage-integration) |
