Wherobots’ integration with Amazon Simple Storage Service (S3) allows Amazon S3 customers to utilize Wherobots as the spatial engine that operates on their data while still using Amazon S3 for data storage. Accelerate your creation of spatial data products by using data directly from Amazon S3 public or private buckets, bypassing the need for time-consuming data transfers.

Benefits

  • Ease of access to data: Integrating your S3 buckets allows Wherobots Organization members to seamlessly access and work with data stored in Amazon S3 buckets without having to manually transfer or duplicate the data.
  • Self-service setup: Administrators can configure the integration themselves through a user interface.
  • Secure authentication: Wherobots’ S3 integration supports secure authentication methods, including Amazon Web Services (AWS) access keys and AWS Identity and Access Management (IAM) role-based access.
  • Control data access: Administrators can select specific buckets to be accessed through the integration, providing granular control over data access.
  • Supports Requester Pays buckets: Wherobots’ S3 integration works with Amazon S3 Requester Pays buckets.

Before you start

The following is required to integrate your Amazon S3 storage bucket with Wherobots:
  • An Admin account within a Professional or Enterprise Edition Organization.
    • This feature is not available for Community Edition Organizations. For more information on Organization tiers, see Organization Editions. To upgrade your Organization, see Upgrade Organization.
    • You need an Admin role to create storage integrations. Those with User role accounts can use existing integrations set up by their Admins. For more information on roles, see Organization Roles.
  • An AWS account.
  • An existing AWS IAM role with the permissions to modify and manage trust and role policies for that IAM role. For more information, see AWS IAM Role permissions in the Wherobots Documentation.
    • If this role does not exist in your AWS IAM Console, you must have the permissions to create such a role.
  • An existing public or private Amazon S3 bucket to integrate.
Granting an external entity write access to a public S3 bucket is strongly discouraged because it runs counter to security best practices. Creating a Managed Catalog requires giving Wherobots write access, so Wherobots advises against using a public bucket for a Managed Catalog and recommends a private S3 bucket instead.

AWS IAM Role permissions

The following table shows the IAM Actions needed to create or manage IAM roles in AWS, a requirement for integrating an Amazon S3 bucket with Wherobots. Completing these Actions typically requires AdministratorAccess in AWS.
IAM Action                Description
AttachRolePolicy          Attaches a managed policy to the role.
CreateRole                Creates a new IAM role.
DeleteRolePolicy          Removes an inline policy from the role.
DetachRolePolicy          Detaches a managed policy from the role.
PutRolePolicy             Creates a new inline policy and attaches it to the role.
UpdateAssumeRolePolicy    Modifies the existing trust policy of the role.
UpdateRole                Modifies the role’s description or maximum session duration.
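
The actions in the table can be collected into a single IAM policy document. The following is a minimal sketch, assuming the standard `iam:` action prefix. The helper name, account ID, and role name are hypothetical placeholders, and your AWS administrator may prefer to scope the policy differently.

```python
import json

# IAM actions listed in the table above, with the standard "iam:" service prefix.
ROLE_MANAGEMENT_ACTIONS = [
    "iam:AttachRolePolicy",
    "iam:CreateRole",
    "iam:DeleteRolePolicy",
    "iam:DetachRolePolicy",
    "iam:PutRolePolicy",
    "iam:UpdateAssumeRolePolicy",
    "iam:UpdateRole",
]


def build_role_management_policy(role_arn: str) -> dict:
    """Return a policy document that allows the actions above on one role.

    This helper is illustrative only; it is not part of any Wherobots or
    AWS SDK. role_arn is supplied by the caller.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ROLE_MANAGEMENT_ACTIONS,
                "Resource": role_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical account ID and role name, used only for illustration.
    policy = build_role_management_policy(
        "arn:aws:iam::123456789012:role/example-wherobots-role"
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be reviewed by whoever manages IAM in your AWS account before it is attached to a user or group.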

Bucket types

This section details the differences between Amazon S3 public and private buckets in relation to the Wherobots S3 Storage Integration. For more information on Amazon S3 buckets, see Creating a bucket in the Amazon S3 documentation.
A public bucket on Amazon S3 is a bucket for which Amazon S3’s default Block all public access option has been turned off.
A private bucket on Amazon S3 is a bucket that keeps the default Block all public access option enabled.
In Amazon S3, a Requester Pays bucket shifts the responsibility for the cost of the request and the data download from the bucket owner to the person accessing the data. With a Requester Pays bucket, those downloading the data from an Amazon S3 bucket pay the transfer fees, not the bucket owner.

Integrate a public or private bucket

This section describes how to integrate an Amazon S3 bucket with Wherobots. In Amazon S3, you can create a private or public bucket and give specific roles access to that bucket. To set up a storage integration between Wherobots and an Amazon S3 bucket, you need access to Wherobots Cloud and the AWS IAM Console.

Complete bucket integration

Integrating an Amazon S3 bucket requires going between the Wherobots Cloud and AWS Console user interfaces. We recommend having Wherobots Cloud and AWS Console opened in separate browser tabs while completing the integration process.
  1. Log in to Wherobots Cloud.
  2. Click Storage.
  3. Click Create Storage Integration.
  4. On the Add New Storage Integration page, do the following:
    1. Create a Name for your storage integration. A Name must consist of alphanumeric characters, spaces, special characters, or underscores, and it must include at least one English letter (uppercase or lowercase).
    2. In the S3 Path field, add the path to the existing bucket. The S3 path is the name of the existing Amazon S3 bucket, prefaced by s3://.
      In Wherobots Cloud, your Amazon S3 bucket’s path cannot contain periods. For example, s3://my.bucket.path is not allowed. Acceptable S3 paths can consist of alphanumeric characters, underscores, equal signs, and dashes.
    3. In the Role ARN field, add the role’s Amazon Resource Name (ARN).
      • To find the Role ARN for an existing role:
        • In the AWS IAM Console, go to Roles.
        • Select the role that you have permission to modify.
        • In the Summary section, copy the Role ARN.
      • If you haven’t created a role yet, complete the following steps to grant a role access to an S3 bucket.
        If you haven’t created an IAM role already, you must configure an IAM role to enable access to your Amazon S3 Bucket. Wherobots uses these roles to make your data accessible to you on the platform in a secure fashion.
        Before proceeding with this section, ensure that you have the ability to complete the AWS tasks outlined in Before you start - AWS IAM Role permissions.
        1. Go to the AWS Console, IAM > Roles and create a new Role.
        2. Under Service or use case, choose S3.
        3. Click Next.
        4. Under Permissions Policy, click Next.
        5. Enter a descriptive Role Name.
        6. Under “Step 3: Add tags”, click Add new tag.
        7. For Key, enter wherobotsOrgID.
        8. For Value, enter your Wherobots Organization ID. To find your organization ID, log in to Wherobots Cloud, go to Organization Settings, and copy the value next to Organization ID.
        9. Click Create role.
        10. Click your newly created Role.
        11. Under Permissions policies, click Add permissions > Create Inline policy.
        12. Click JSON, select all existing text, and paste the Sample Role Policy from Wherobots Cloud over it.
        13. Click Next, then choose a policy name and click Create Policy.
        14. Click on the Trust relationships tab.
        15. Click the Edit trust policy button.
        16. Select all within the edit area.
        17. Copy the Trust Relationship JSON from Wherobots Cloud and paste it into the editor.
        18. Click Update policy.
        19. Return to Complete bucket integration and continue with the remaining steps.
    4. In Wherobots Cloud, click Copy to copy the Trust Relationship JSON.
    5. In AWS, edit your existing IAM role’s Role and Trust policies:
      1. Within the IAM Console, click Roles in the left-side navigation.
      2. Click your existing role to get to that role’s Detail page.
      3. Within the Permissions tab, click Add permissions > Attach policies.
      4. In the Policy Name column, select the necessary checkbox to attach the Wherobots-generated Sample Role Policy to the role policy that you created in step 4.d of Complete bucket integration.
      5. Click the Add Permissions button.
      6. Click the Trust relationships tab.
      7. Click the Edit trust policy button.
      8. In the Edit Trust Policy field, paste the Wherobots-generated Trust Relationship JSON from the previous step of Complete bucket integration.
      9. Click Update policy to save the new configurations of your role.
    6. Return to the Wherobots Cloud Create Storage Integration screen.
      1. Click Submit. You will be taken to Organization Settings.
      2. Scroll to Storage.
      3. Find your bucket and click … > Verify Access to confirm a successful integration.
      4. If your integration is unsuccessful, wait a few seconds and then click Retry.
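
Two values from the steps above are easy to mistype: the Role ARN and the wherobotsOrgID tag. The following is a minimal sketch of a local check, assuming the general IAM role ARN shape (`arn:<partition>:iam::<12-digit account ID>:role/<name>`). The function names are hypothetical and are not part of Wherobots or AWS tooling.

```python
import re

# General shape of an IAM role ARN. The role part may include a path,
# so "/" is allowed alongside the characters IAM permits in role names.
ROLE_ARN_PATTERN = re.compile(
    r"arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[A-Za-z0-9_+=,.@/-]+"
)


def is_role_arn(value: str) -> bool:
    """Return True if value looks like an IAM role ARN.

    Surrounding whitespace is ignored, since it often sneaks in when
    copying from the AWS Console.
    """
    return ROLE_ARN_PATTERN.fullmatch(value.strip()) is not None


def org_tag(org_id: str) -> dict:
    """Build the role tag described in the steps above (key wherobotsOrgID)."""
    return {"Key": "wherobotsOrgID", "Value": org_id}


if __name__ == "__main__":
    # Hypothetical values, used only for illustration.
    print(is_role_arn("arn:aws:iam::123456789012:role/example-wherobots-role"))
    print(is_role_arn("arn:aws:s3:::my-bucket"))
    print(org_tag("example-org-id"))
```

A check like this only confirms the format of the value. Whether the role actually grants access is still confirmed by Verify Access in Wherobots Cloud.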

Verify your bucket storage integration

To confirm that your bucket has been integrated successfully, do the following:
  1. Log in to Wherobots Cloud.
  2. Go to Organization Settings.
  3. Scroll to Storage.
  4. Click … > Verify Access next to your desired bucket. If you have successfully integrated a bucket, a pop-up window will appear and confirm that Wherobots has access to your storage.

View storage integration

View specific integration

To review the contents of a specific storage integration, do the following:
  1. Log in to Wherobots Cloud.
  2. At the top of the screen, click the Storage Source dropdown.
  3. Select your desired bucket.

View all storage integrations

To review the storage integrations in your Organization, do the following:
  1. Go to Organization Settings.
  2. Scroll to Storage.
You can see the Name, Type, Path, and Created on date associated with each bucket.

Delete storage integration

To delete a storage integration, do the following:
  1. Go to Organization Settings.
  2. Scroll to Storage.
  3. Click … > Delete next to the storage integration that you want to remove.
  4. Click Delete to confirm that you want to delete this storage integration.

Access integrated storage in a notebook

Once you create a storage integration, you can read your data in a Wherobots Notebook. This works for both private and public Amazon S3 buckets.
To use new storage integrations or catalogs in your notebooks, you must start a new runtime. Notebooks can only access storage integrations or catalogs that were created before the runtime started.
To use a storage integration in a notebook, do the following:
  1. Log in to Wherobots Cloud.
  2. Start a Notebook. For more information on how to start and manage a notebook, see Notebook instance management.
  3. Create a Notebook with a Python Kernel. For more information, see Jupyter Notebook Management.
  4. In the Notebook, include the following Python code, replacing YOUR-S3-BUCKET-PATH and FILE-PATH with the path to your bucket and the file path to your bucket resource, respectively:
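    # Import the Sedona Spark API, including SedonaContext
    from sedona.spark import *
    # Build (or reuse) the Spark session, then create the Sedona context from it
    config = SedonaContext.builder().getOrCreate()
    sedona = SedonaContext.create(config)
    # Replace the placeholders with your bucket path and file path
    path = "s3://YOUR-S3-BUCKET-PATH/FILE-PATH"
    # Read the file with the binaryFile source and print the resulting schema
    rawDf = sedona.read.format("binaryFile").load(path)
    rawDf.printSchema()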
    
    In Wherobots Cloud, YOUR-S3-BUCKET-PATH cannot contain periods. For example, s3://my.bucket.name is not allowed. Acceptable S3 bucket paths can consist of alphanumeric characters, underscores, equal signs, and dashes.
  5. Run the cells. You should see the following output:
    root
    |-- path: string (nullable = true)
    |-- modificationTime: timestamp (nullable = true)
    |-- length: long (nullable = true)
    |-- content: binary (nullable = true)
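
Building on the notebook code above, the following is a minimal sketch of two helpers: one assembles the `s3://` path from its parts, and one lists the files under a prefix. Both function names are hypothetical. `pathGlobFilter` is a standard Spark file-source option, and the `sedona` argument is assumed to be the context created in the previous step.

```python
def build_s3_uri(bucket: str, key: str = "") -> str:
    """Join a bucket name and an object key or prefix into an s3:// URI."""
    bucket = bucket.strip()
    if bucket.startswith("s3://"):
        bucket = bucket[len("s3://"):]
    bucket = bucket.strip("/")
    key = key.strip().lstrip("/")
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


def list_files(sedona, uri: str, glob: str = "*"):
    """Return a DataFrame with the path and length of files under uri.

    sedona is the context object created in the notebook step above.
    Selecting only path and length keeps the binary content column
    out of the result.
    """
    return (
        sedona.read.format("binaryFile")
        .option("pathGlobFilter", glob)
        .load(uri)
        .select("path", "length")
    )


if __name__ == "__main__":
    # Placeholder bucket and prefix; replace with your own values.
    print(build_s3_uri("YOUR-S3-BUCKET-PATH", "/FILE-PATH"))
```

In a notebook, a call such as `list_files(sedona, build_s3_uri("YOUR-S3-BUCKET-PATH", "FILE-PATH"), "*.tif").show()` would display the matching files, assuming the runtime was started after the storage integration was created.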
    

Managed Catalog

A Managed Catalog can be created from a private bucket storage integration at any time, allowing for multiple catalogs per integration and addressing situations where a Managed Catalog wasn’t created during the initial setup. For usage information, see Managed Catalog.
Granting an external entity write access to a public S3 bucket is strongly discouraged because it runs counter to security best practices. Creating a Managed Catalog requires giving Wherobots write access, so Wherobots advises against using a public bucket for a Managed Catalog and recommends a private S3 bucket instead.

What is a Managed Catalog?

A Managed Catalog is a metadata repository that is created, owned, and controlled directly within your Wherobots Organization. When you connect a data source like an S3 bucket and register it as a managed catalog, Wherobots takes on the following responsibilities:
  • Source of Truth: Wherobots becomes the authoritative source for all metadata, including schemas, table definitions, file locations, and partition information.
  • Data Discovery: Wherobots actively scans the underlying storage (e.g., S3) to discover new data and automatically update the catalog.
  • Lifecycle Management: Wherobots handles all metadata operations, such as creating, updating, and deleting tables. Changes in the underlying data are automatically synced to the catalog.
  • Optimization: Because Wherobots has full control, it can build and manage advanced spatial indexes and perform other performance optimizations directly on the metadata.
You typically use a managed catalog when your raw spatial data files reside in an AWS S3 private bucket and you want Wherobots to handle all aspects of data management, query optimization, and spatial ETL.

Create a Managed Catalog from an S3 bucket

To create a Managed Catalog from an S3 storage integration, do the following:
  1. Log in to Wherobots Cloud.
  2. Click Data Hub.
  3. Click Add Catalog.
  4. In the Name field, enter a name for your Spatial Catalog. A Name must consist of alphanumeric characters, spaces, special characters, or underscores, and it must include at least one English letter (uppercase or lowercase).
  5. In the Storage dropdown, select a private bucket.
  6. (Optional) In the Path field, enter the sub-folder where you’d like to store this Managed Catalog within your Amazon S3 bucket.
    To use new storage integrations or catalogs in your notebooks, you must start a new runtime. Notebooks can only access storage integrations or catalogs that were created before the runtime started.

Storage integration limitations

Currently, Wherobots’ Amazon S3 integration has the following limitations:
  • In Wherobots Cloud, your Amazon S3 bucket’s path cannot contain periods. For example, s3://my.bucket.name is not allowed. Acceptable S3 bucket paths can consist of alphanumeric characters, underscores, equal signs, and dashes.
  • A bucket can only be configured with a single storage integration.
  • Granting an external entity write access to a public S3 bucket is strongly discouraged because it runs counter to security best practices. Creating a Managed Catalog requires giving Wherobots write access, so Wherobots advises against using a public bucket for a Managed Catalog and recommends a private S3 bucket instead.
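
The path rules in the first limitation can be checked locally before you submit the form. The following is a minimal sketch, assuming that "/" separates path segments and that each segment uses only the characters named above. The function name is hypothetical, and Wherobots Cloud may apply additional rules of its own.

```python
import re

# Assumption: "/" separates path segments; each segment uses only the
# characters the limitations list names (alphanumerics, "_", "=", "-").
S3_PATH_PATTERN = re.compile(r"s3://[A-Za-z0-9_=-]+(/[A-Za-z0-9_=-]+)*/?")


def check_s3_path(path: str) -> list:
    """Return a list of problems found in path; an empty list means none."""
    problems = []
    if not path.startswith("s3://"):
        problems.append("path must start with s3://")
    if "." in path:
        problems.append("path must not contain periods")
    # Only run the full pattern check when the simpler checks passed,
    # so each problem is reported once.
    if not problems and S3_PATH_PATTERN.fullmatch(path) is None:
        problems.append(
            "path must be s3:// followed by segments of letters, digits, _, =, -"
        )
    return problems


if __name__ == "__main__":
    for candidate in ("s3://my-bucket", "s3://my.bucket.name", "my-bucket"):
        print(candidate, check_s3_path(candidate))
```

Returning a list of problems rather than a single boolean makes it easier to show the user exactly which rule a path breaks.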