S3 Storage Integration
Wherobots’ integration with Amazon Simple Storage Service (S3) allows Amazon S3 customers to utilize Wherobots as the spatial engine that operates on their data while still using Amazon S3 for data storage.
Accelerate your creation of spatial data products by using data directly from Amazon S3 public or private buckets, bypassing the need for time-consuming data transfers.
Benefits¶
- Ease of access to data: Integrating your S3 buckets allows Wherobots Organization members to seamlessly access and work with data stored in Amazon S3 buckets without having to manually transfer or duplicate the data.
- Self-service setup: Administrators can configure the integration themselves through a user interface.
- Secure authentication: Wherobots’ S3 integration supports secure authentication methods, including Amazon Web Services (AWS) access keys and AWS Identity and Access Management (IAM) role-based access.
- Control data access: Administrators can select specific buckets to be accessed through the integration, providing granular control over data access.
- Supports Requester Pays buckets: Wherobots’ S3 integration supports integrations with Amazon S3 Requester Pays buckets.
- In Amazon S3, a Requester Pays bucket shifts the responsibility for the cost of the request and the data download from the bucket owner to the person accessing the data.
- For more information on Requester Pays buckets, see Using Requester Pays buckets for storage transfers and usage in the Amazon S3 documentation and Requester Pays bucket in the Wherobots documentation.
Before you start¶
The following is required to integrate your Amazon S3 storage bucket with Wherobots:
- An Admin account within a Professional or Enterprise Edition Organization.
- This feature is not available for Community Edition Organizations. For more information on Organization tiers, see Organization Editions. To upgrade your Organization, see Upgrade Organization.
- You need an Admin role to create storage integrations. Those with User role accounts can use existing integrations set up by their Admins. For more information on roles, see Organization Roles.
- An AWS account.
- An existing AWS IAM role with the permissions to modify and manage trust and role policies for that IAM role. For more information, see AWS IAM Role permissions in the Wherobots Documentation.
- If this role does not exist in your AWS IAM Console, you must have the permissions to create such a role.
- An existing public or private Amazon S3 bucket to integrate.
Info
Granting any external entity write access to a public S3 bucket is strongly discouraged as it is unaligned with security best practices. Therefore, Wherobots advises against using a public bucket to create a Spatial Catalog, since doing so would require giving Wherobots write access. In line with security best practices, we recommend using a private S3 bucket for your Spatial Catalog.
AWS IAM Role permissions¶
The following table shows the IAM Actions needed to create or manage IAM roles in AWS, a requirement for integrating an Amazon S3 bucket with Wherobots.
Completing these Actions typically requires AdministratorAccess in AWS.
Important notes regarding AWS IAM Actions and Roles
- Ensure that you have the necessary AWS permissions to complete these Actions before creating a storage integration.
- For a complete list of IAM Actions, see Actions defined by AWS Identity and Access Management (IAM) in the AWS Documentation.
- For more information about AdministratorAccess, see AdministratorAccess in the AWS documentation.
| IAM Action | Description |
|---|---|
| AttachRolePolicy | Attaches a managed policy to the role. |
| CreateRole | Creates a new IAM role. |
| DeleteRolePolicy | Removes an inline policy from the role. |
| DetachRolePolicy | Detaches a managed policy from the role. |
| PutRolePolicy | Creates a new inline policy and attaches it to the role. |
| UpdateAssumeRolePolicy | Modifies the existing trust policy of the role. |
| UpdateRole | Modifies the role's description or maximum session duration. |
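As an illustration only, the actions in the table could be collected into a single IAM policy document, as in the sketch below. The role ARN is a placeholder, and scoping the statement to one role is an assumption about how an AWS administrator might grant these permissions. This is not the Sample role policy that Wherobots generates.

```python
import json

# IAM actions listed in the table above, prefixed with the "iam:" service namespace.
ROLE_MANAGEMENT_ACTIONS = [
    "iam:AttachRolePolicy",
    "iam:CreateRole",
    "iam:DeleteRolePolicy",
    "iam:DetachRolePolicy",
    "iam:PutRolePolicy",
    "iam:UpdateAssumeRolePolicy",
    "iam:UpdateRole",
]


def build_role_management_policy(role_arn: str) -> dict:
    """Return a policy document allowing the listed actions on one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ROLE_MANAGEMENT_ACTIONS,
                "Resource": role_arn,
            }
        ],
    }


# Placeholder ARN for illustration; substitute your own account ID and role name.
policy = build_role_management_policy("arn:aws:iam::123456789012:role/example-role")
print(json.dumps(policy, indent=2))
```

An AWS administrator could attach a policy like this to the user or group that sets up the integration instead of granting full AdministratorAccess, if that fits your organization's practices.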
Bucket types¶
This section details the differences between Amazon S3 public and private buckets in relation to the Wherobots S3 Storage Integration.
For more information on Amazon S3 buckets, see Creating a bucket in the Amazon S3 documentation.
Public bucket¶
A public bucket on Amazon S3 is a bucket for which Amazon S3’s default Block all public access option has been turned off.
Private bucket¶
A private bucket on Amazon S3 is a bucket that keeps the default Block all public access option enabled.
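To see which of the two categories a bucket falls into, in the sense used on this page, you could inspect its Block Public Access settings. The sketch below assumes you have already fetched the bucket's `PublicAccessBlockConfiguration` (for example, with boto3's `get_public_access_block`). The helper name is hypothetical.

```python
# The four settings that together make up "Block all public access" in Amazon S3.
BLOCK_PUBLIC_ACCESS_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def blocks_all_public_access(config: dict) -> bool:
    """Return True only if every Block Public Access flag is enabled."""
    return all(config.get(flag, False) for flag in BLOCK_PUBLIC_ACCESS_FLAGS)


# Example configurations, shaped like the "PublicAccessBlockConfiguration"
# value returned by boto3's s3_client.get_public_access_block(Bucket=...).
private_config = {flag: True for flag in BLOCK_PUBLIC_ACCESS_FLAGS}
public_config = {flag: False for flag in BLOCK_PUBLIC_ACCESS_FLAGS}

print(blocks_all_public_access(private_config))  # True
print(blocks_all_public_access(public_config))   # False
```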
Requester Pays bucket¶
In Amazon S3, a Requester Pays bucket shifts the responsibility for the cost of the request and the data download from the bucket owner to the person accessing the data. With a Requester Pays bucket, those downloading the data from an Amazon S3 bucket pay the transfer fees, not the bucket owner.
Accessing data from Requester Pays buckets will result in additional fees
When you read from a Requester Pays bucket, the account accessing the data, not the bucket owner, is charged for the request and the data download.
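For context, this is roughly what the Requester Pays acknowledgment looks like when reading an object directly with boto3, outside of Wherobots. The helper name and the bucket and key values are hypothetical. How the Wherobots integration sets this flag internally is not covered here.

```python
def get_requester_pays_object(s3_client, bucket: str, key: str) -> bytes:
    """Download one object, agreeing to pay the request and transfer costs."""
    response = s3_client.get_object(
        Bucket=bucket,
        Key=key,
        RequestPayer="requester",  # acknowledges that the caller is billed
    )
    return response["Body"].read()


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   data = get_requester_pays_object(boto3.client("s3"), "example-bucket", "path/to/file")
```

Without the `RequestPayer` acknowledgment, Amazon S3 rejects requests to a Requester Pays bucket from anyone other than the bucket owner.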
Integrate a public or private bucket¶
This section describes how to integrate an Amazon S3 bucket with Wherobots. In Amazon S3, you can create a private or public bucket and give specific roles access to that bucket.
To set up a storage integration between Wherobots and an Amazon S3 bucket, you need access to Wherobots Cloud and the AWS IAM Console.
Complete bucket integration¶
Integrating an Amazon S3 bucket requires switching between the Wherobots Cloud and AWS Console user interfaces. We recommend keeping Wherobots Cloud and AWS Console open in separate browser tabs while completing the integration process.
- Log in to Wherobots Cloud.
- Click Storage.
- Click Create Storage Integration.
- On the Add New Storage Integration page, do the following:
- Create a Name for your storage integration. The name must start with a letter and can only contain letters, numbers, and underscores.
- In the S3 Path field, add the path to the existing bucket. The S3 path is the name of the existing Amazon S3 bucket, prefaced by s3://.
Info
In Wherobots Cloud, your Amazon S3 bucket's path cannot contain periods. For example, s3://my.bucket.path is not allowed. Acceptable S3 paths can consist of alphanumeric characters, underscores, equal signs, and dashes.
- In the Role ARN field, add the role’s Amazon Resource Name (ARN).
- To find the Role ARN for an existing role:
- In the AWS IAM Console, go to Roles.
- Select the role that you have permission to modify.
- In the Summary section, copy the Role ARN.
- If you haven’t created a role yet, complete the steps in Grant role access to a S3 bucket and then return to this step after creating a Role.
- Click Copy to copy the Sample role policy.
- Create a Role policy in AWS:
Note
Before proceeding with this section, ensure that you have the ability to complete the AWS tasks outlined in Before you start - AWS IAM Role permissions.
- Sign in to the AWS IAM console.
- In the navigation pane on the left, choose Policies.
- Click Create policy.
- In the Policy editor:
- Click JSON.
- Paste the Sample role policy generated in Wherobots.
- (Optional) Follow additional instructions in Create IAM policies (console) within the AWS documentation.
- Click Create policy in AWS to complete the creation of your IAM policy.
- In Wherobots Cloud, click Copy to copy the Trust Relationship JSON.
- In AWS, edit your existing IAM role's Role and Trust policies:
- Within the IAM Console, click Roles in the left-side navigation.
- Click your existing role to get to that role's Detail page.
- Within the Permissions tab, click Add permissions > Attach policies.
- In the Policy Name column, select the checkbox next to the policy that you created from the Wherobots-generated Sample role policy in step 4.d of Complete bucket integration.
- Click the Add Permissions button.
- Click the Trust relationships tab.
- Click the Edit trust policy button.
- In the Edit Trust Policy field, paste the Wherobots-generated Trust Relationship JSON that you copied in step 4.e of Complete bucket integration.
- Click Update policy to save the new configurations of your role.
- Return to the Wherobots Cloud Create Storage Integration screen.
- (Optional) Leave the Would you like to create a Wherobots Spatial Catalog in this location? checkbox selected to create a Spatial Catalog.
- Click Submit. You will be taken to Organization Settings.
- Scroll to Storage.
- Find your bucket and click … > Verify Access to confirm a successful integration.
- If your integration is unsuccessful, wait a few seconds and then click Retry.
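The field rules from the steps above can be checked locally before you submit the form. The following is a minimal sketch, not part of Wherobots Cloud: the function name is hypothetical, "letters" is assumed to mean ASCII letters, allowing `/` between path segments is an assumption, and the Role ARN pattern reflects the general AWS format `arn:<partition>:iam::<account-id>:role/<role-name>`.

```python
import re

# Name rule: starts with a letter; letters, numbers, and underscores only.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Path rule: "s3://" followed by alphanumerics, underscores, equal signs,
# and dashes. Allowing "/" between segments is an assumption.
S3_PATH_PATTERN = re.compile(r"s3://[A-Za-z0-9_=-]+(/[A-Za-z0-9_=-]+)*/?")

# General IAM role ARN shape with a 12-digit account ID.
ROLE_ARN_PATTERN = re.compile(r"arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+")


def check_integration_inputs(name: str, s3_path: str, role_arn: str) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must start with a letter and use only letters, numbers, and underscores")
    if not S3_PATH_PATTERN.fullmatch(s3_path):
        problems.append("S3 path must start with s3:// and must not contain periods")
    if not ROLE_ARN_PATTERN.fullmatch(role_arn):
        problems.append("role ARN does not look like arn:aws:iam::<account-id>:role/<role-name>")
    return problems


# Placeholder values for illustration.
print(check_integration_inputs(
    "my_bucket_1",
    "s3://my-bucket/data",
    "arn:aws:iam::123456789012:role/example-role",
))  # []
```

A path such as s3://my.bucket.path fails the second check because of the periods, which matches the restriction described above.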
Grant role access to a S3 bucket¶
You must configure an IAM role to enable access to your Amazon S3 Bucket. Creating a Role for Wherobots' S3 storage integration streamlines access management by granting permissions to the Role itself, rather than individual users.
Note
Before proceeding with this section, ensure that you have the ability to complete the AWS tasks outlined in Before you start - AWS IAM Role permissions.
To create and configure a role, do the following:
- Sign in to the AWS IAM Console and go to Roles.
- Click Create role:
- Select Custom trust policy.
- Click Next.
- (Optional) On the Add permissions screen, add any required permissions.
- Click Next.
- Enter a Role name and Description.
Info
Consult your internal team for any specific requirements regarding your business's Amazon S3 role configuration.
- Click Create role to save your new role.
- On the IAM Role Dashboard, click the Role you just created.
- Copy the ARN for that Role and return to Complete bucket integration.
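Both the custom trust policy entered when creating the role and the Trust Relationship JSON copied from Wherobots Cloud are IAM trust policy documents. Before pasting one into the AWS console, you could run a quick local sanity check like the sketch below. The function name is hypothetical, and the example document uses a placeholder principal, not a value supplied by Wherobots.

```python
import json


def looks_like_trust_policy(document: str) -> bool:
    """Return True if the JSON text has an Allow statement with a Principal
    and an sts:AssumeRole* action."""
    try:
        policy = json.loads(document)
    except json.JSONDecodeError:
        return False
    if not isinstance(policy, dict):
        return False
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM also accepts a single statement object
        statements = [statements]
    if not isinstance(statements, list):
        return False
    for statement in statements:
        if not isinstance(statement, dict):
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        allows_assume = any(
            isinstance(action, str) and action.startswith("sts:AssumeRole")
            for action in actions
        )
        if statement.get("Effect") == "Allow" and "Principal" in statement and allows_assume:
            return True
    return False


# Hypothetical example; the principal below is a placeholder.
example = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
      "Action": "sts:AssumeRole"
    }
  ]
}
"""
print(looks_like_trust_policy(example))      # True
print(looks_like_trust_policy("{not json"))  # False
```

This only checks the general shape of the document. AWS performs the authoritative validation when you click Update policy.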
Verify your bucket storage integration¶
To confirm that your bucket has been integrated successfully, do the following:
- Log in to Wherobots Cloud.
- Go to Organization Settings.
- Scroll to Storage.
- Click … > Verify Access next to your desired bucket.
If you have successfully integrated a bucket, a pop-up window will appear and confirm that Wherobots has access to your storage.
View storage integration¶
View specific integration¶
To review the contents of a specific storage integration, do the following:
- Log in to Wherobots Cloud.
- At the top of the screen, click the Storage Source dropdown.
- Select your desired bucket.
View all storage integrations¶
To review the storage integrations in your Organization, do the following:
- Go to Organization Settings.
- Scroll to Storage.
You can see the Name, Type, Path, and Created on date associated with each bucket.
Delete storage integration¶
To delete a storage integration, do the following:
- Go to Organization Settings.
- Scroll to Storage.
- Click … > Delete next to the storage integration that you want to remove.
- Click Delete to confirm that you want to delete this storage integration.
Access integrated storage in a notebook¶
Once you create a storage integration, you can read your data in a Wherobots Notebook. This works for both private and public Amazon S3 buckets.
Info
To use new storage integrations or catalogs in your notebooks, you must start a new runtime. Notebooks can only access storage integrations or catalogs that were created before the runtime started.
To use a storage integration in a notebook, do the following:
- Log in to Wherobots Cloud.
- Start a Notebook. For more information on how to start and manage a notebook, see Notebook instance management.
- Create a Notebook with a Python Kernel. For more information, see Jupyter Notebook Management.
- In the Notebook, include the following Python code, replacing YOUR-S3-BUCKET-PATH and FILE-PATH with the path to your bucket and the file path to your bucket resource, respectively:

    from sedona.spark import *

    # Create (or reuse) the Sedona context for this notebook
    config = SedonaContext.builder().getOrCreate()
    sedona = SedonaContext.create(config)

    # Read the files at this path as raw binary records
    path = "s3://YOUR-S3-BUCKET-PATH/FILE-PATH"
    rawDf = sedona.read.format("binaryFile").load(path)
    rawDf.printSchema()

Note
In Wherobots Cloud, YOUR-S3-BUCKET-PATH cannot contain periods. For example, s3://my.bucket.name is not allowed. Acceptable S3 bucket paths can consist of alphanumeric characters, underscores, equal signs, and dashes.
- Run the cells. You should see the following output:

    root
     |-- path: string (nullable = true)
     |-- modificationTime: timestamp (nullable = true)
     |-- length: long (nullable = true)
     |-- content: binary (nullable = true)
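If you read from several locations in the same bucket, the path construction can be wrapped in a small helper. This is a sketch under the assumption that `sedona` is the context created in the notebook code above. The helper name is hypothetical, and it only normalizes slashes before calling the same `binaryFile` reader.

```python
def load_binary_files(sedona, bucket_path: str, file_path: str):
    """Build the s3:// URI and load it with the binaryFile reader."""
    # Strip stray slashes so the two parts join with exactly one "/".
    uri = f"s3://{bucket_path.strip('/')}/{file_path.lstrip('/')}"
    return sedona.read.format("binaryFile").load(uri)


# Usage inside a Wherobots Notebook (placeholders as in the steps above):
#   rawDf = load_binary_files(sedona, "YOUR-S3-BUCKET-PATH", "FILE-PATH")
#   rawDf.select("path", "length").show(truncate=False)
```

Selecting only the path and length columns avoids printing the binary content column when you just want to confirm which files were found.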
Spatial Catalog¶
A Spatial Catalog can be created from a private bucket storage integration at any time, allowing for multiple catalogs per integration and addressing situations where a Spatial Catalog wasn't created during the initial setup. For usage information, see Spatial Catalog.
Wherobots discourages using a public S3 bucket to create a Spatial Catalog.
Granting any external entity write access to a public S3 bucket is strongly discouraged as it is unaligned with security best practices. Therefore, Wherobots advises against using a public bucket to create a Spatial Catalog, since doing so would require giving Wherobots write access. In line with security best practices, we recommend using a private S3 bucket for your Spatial Catalog.
To create a Spatial Catalog from a storage integration, do the following:
- Log in to Wherobots Cloud.
- Click Spatial Catalog.
- Click Create Catalog.
- In the Name field, enter a name for your Spatial Catalog.
- In the Storage dropdown, select a private bucket.
- (Optional) In the Path field, enter the sub-folder where you’d like to store this Spatial Catalog within your Amazon S3 bucket.
Info
To use new storage integrations or catalogs in your notebooks, you must start a new runtime. Notebooks can only access storage integrations or catalogs that were created before the runtime started.
Storage integration limitations¶
Currently, Wherobots' Amazon S3 integration has the following limitations:
- In Wherobots Cloud, your Amazon S3 bucket's path cannot contain periods. For example, s3://my.bucket.name is not allowed. Acceptable S3 bucket paths can consist of alphanumeric characters, underscores, equal signs, and dashes.
- A bucket can only be configured with a single storage integration.
- Granting any external entity write access to a public S3 bucket is strongly discouraged as it is unaligned with security best practices. Therefore, Wherobots advises against using a public bucket to create a Spatial Catalog, since doing so would require giving Wherobots write access. In line with security best practices, we recommend using a private S3 bucket for your Spatial Catalog.