Accessing your own S3 buckets

WherobotsDB can access your own S3 buckets. This chapter describes how to configure the SedonaContext to do so.

Accessing Public Buckets

To access publicly readable buckets, you need to specify the following configuration when creating the SedonaContext:

from sedona.spark import SedonaContext

config = SedonaContext \
    .builder() \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider") \
    .getOrCreate()
sedona = SedonaContext.create(config)

This will allow WherobotsDB to access the publicly readable bucket named <bucket-name> anonymously.
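The option name above follows the per-bucket pattern `spark.hadoop.fs.s3a.bucket.<bucket-name>.<option>`. As a minimal sketch of how that pattern extends to several public buckets, the following builds one key per bucket. The helper names `bucket_option` and `anonymous_access_options` and the bucket names are hypothetical and only serve as illustration.

```python
ANONYMOUS_PROVIDER = "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"


def bucket_option(bucket_name, option):
    """Build the Spark option name for a per-bucket S3A setting."""
    return f"spark.hadoop.fs.s3a.bucket.{bucket_name}.{option}"


def anonymous_access_options(bucket_names):
    """Map each bucket's credentials-provider option to the anonymous provider."""
    return {
        bucket_option(name, "aws.credentials.provider"): ANONYMOUS_PROVIDER
        for name in bucket_names
    }


# Hypothetical bucket names, used only for illustration.
options = anonymous_access_options(["example-public-a", "example-public-b"])
for key, value in options.items():
    print(key, "=", value)
```

Each resulting pair can then be passed to `.config(key, value)` on the builder, exactly like the single option shown above.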

Access Private S3 Bucket Using Access Keys

To use access keys to access your private bucket named <bucket-name>, you need to specify the following configuration when creating the SedonaContext:

from sedona.spark import SedonaContext

config = SedonaContext \
    .builder() \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.access.key", "<your-access-key-id>") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.secret.key", "<your-secret-access-key>") \
    .getOrCreate()
sedona = SedonaContext.create(config)

Replace <your-access-key-id> and <your-secret-access-key> with your own access key ID and secret access key.

We generally recommend using an S3 bucket in the same region as the Wherobots cluster. Otherwise, performance may suffer and additional costs may be incurred.

Although it is easy to set up cross-account S3 bucket access with access keys, static credentials are not recommended in a production environment. Use IAM roles instead.
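If access keys are used anyway, for example in a quick test, keeping them out of the notebook source reduces the chance of leaking them. The sketch below assumes the keys are exported in the standard AWS environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; `access_key_options` is a hypothetical helper name, not part of WherobotsDB.

```python
import os

SIMPLE_PROVIDER = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
REQUIRED_VARIABLES = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def access_key_options(bucket_name, env=None):
    """Build per-bucket access-key options from environment variables.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket_name}."
    return {
        prefix + "aws.credentials.provider": SIMPLE_PROVIDER,
        prefix + "access.key": env["AWS_ACCESS_KEY_ID"],
        prefix + "secret.key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

Calling `access_key_options("<bucket-name>")` returns the three settings from the snippet above as a dictionary, and each pair can be passed to `.config(key, value)` on the builder.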

Using IAM Roles

To use IAM roles, you need to have the following:

  1. A bucket in your AWS account that you want to use as a Havasu warehouse. We'll refer to the bucket as <bucket-name> in the following sections.
  2. An IAM role in your AWS account that has access to that bucket. We'll refer to the ARN of the IAM role as <your-iam-role-arn> in the following sections.
  3. A trust relationship allowing Wherobots user roles to assume your IAM role.

1. Create an IAM role in your AWS account

Create an IAM role in your AWS account and attach the following policy to grant permissions on your bucket. If needed, replace path with the sub-folder in your bucket that you want to grant access to. We'll refer to the ARN of this role as <your-iam-role-arn>.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::<bucket-name>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "path/*"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<bucket-name>/path/*"
        }
    ]
}

Note

If you only intend to read data from this bucket, you can limit the permissions to s3:GetObject in the second policy statement.
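To avoid hand-editing the bucket name and path in two places, the policy can also be generated. The following is a minimal sketch that reproduces the policy document above; the function name `bucket_policy`, its parameters, and the example bucket and path are hypothetical.

```python
import json


def bucket_policy(bucket_name, path="path", read_only=False):
    """Return the IAM policy shown above for one bucket and sub-folder.

    With read_only=True, the second statement is limited to s3:GetObject.
    """
    if read_only:
        object_actions = ["s3:GetObject"]
    else:
        object_actions = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is limited to keys under the given sub-folder.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
                "Condition": {"StringLike": {"s3:prefix": [f"{path}/*"]}},
            },
            {
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket_name}/{path}/*",
            },
        ],
    }


# Hypothetical bucket and path, used only for illustration.
print(json.dumps(bucket_policy("example-bucket", "warehouse", read_only=True), indent=4))
```

The printed JSON can be pasted into the IAM console as the role's permissions policy.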

2. Add a trust relationship to your IAM role

Add a trust relationship to the <your-iam-role-arn> IAM role you just created, allowing your Wherobots users' role to assume your role. The trust relationship should look like the following:

{
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "arn:aws:iam::329898491045:root"
    ]
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringLike": {
      "aws:PrincipalArn": "arn:aws:iam::329898491045:role/wherobotsusers/organization/*/wherobots-user_<wherobots-org-id>-*"
    }
  }
}

The <wherobots-org-id> can be obtained from your organization's settings page, or by running the following command in a Wherobots notebook terminal:

echo $WBC_ORG_ID
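As a sketch of how the organization ID fits into the trust relationship, the snippet below fills in the statement from the `WBC_ORG_ID` environment variable. The function name `trust_statement` is hypothetical; the account ID and role path are copied from the statement above. When the variable is not set, the placeholder is left in place.

```python
import json
import os

# Account ID taken from the trust relationship shown above.
WHEROBOTS_ACCOUNT_ID = "329898491045"


def trust_statement(org_id):
    """Return the trust relationship statement for one Wherobots organization."""
    principal_arn = (
        f"arn:aws:iam::{WHEROBOTS_ACCOUNT_ID}:role/wherobotsusers/organization/"
        f"*/wherobots-user_{org_id}-*"
    )
    return {
        "Effect": "Allow",
        "Principal": {"AWS": [f"arn:aws:iam::{WHEROBOTS_ACCOUNT_ID}:root"]},
        "Action": "sts:AssumeRole",
        "Condition": {"StringLike": {"aws:PrincipalArn": principal_arn}},
    }


# Falls back to the placeholder when run outside a Wherobots notebook.
org_id = os.environ.get("WBC_ORG_ID", "<wherobots-org-id>")
print(json.dumps(trust_statement(org_id), indent=2))
```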

3. Add permissions to the IAM roles of Wherobots users

Warning

This step cannot currently be performed by end users. Please contact us to have it set up for your account.

4. Verify the configuration

You can verify the configuration by running the following command in your Wherobots notebook's terminal:

aws sts assume-role --role-arn <your-iam-role-arn> --role-session-name cross-account-s3-access

If the command succeeds and outputs STS credentials, the configuration is correct.
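When the check is scripted, the command's JSON output can be inspected rather than read by eye. The sketch below assumes the output of the command has been captured as text (for example with `subprocess.run`) and tests for the `Credentials` object that `aws sts assume-role` returns on success. `has_sts_credentials` is a hypothetical helper name, and the sample values are placeholders, not real credentials.

```python
import json

REQUIRED_FIELDS = ("AccessKeyId", "SecretAccessKey", "SessionToken")


def has_sts_credentials(cli_output):
    """Return True if the text is JSON containing non-empty STS credentials."""
    try:
        response = json.loads(cli_output)
    except json.JSONDecodeError:
        # Error messages from the CLI are plain text, not JSON.
        return False
    if not isinstance(response, dict):
        return False
    credentials = response.get("Credentials")
    if not isinstance(credentials, dict):
        return False
    return all(credentials.get(field) for field in REQUIRED_FIELDS)


# Placeholder values for illustration only.
sample_output = json.dumps({
    "Credentials": {
        "AccessKeyId": "ASIAEXAMPLE",
        "SecretAccessKey": "example-secret",
        "SessionToken": "example-token",
        "Expiration": "2024-05-20T08:48:29+00:00",
    }
})
print(has_sts_credentials(sample_output))            # True
print(has_sts_credentials("An error occurred ..."))  # False
```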

5. Configure the SedonaContext for WherobotsDB (compute only)

You can configure WherobotsDB to access your S3 bucket through this IAM role.

from sedona.spark import SedonaContext

config = SedonaContext \
    .builder() \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.assumed.role.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.assumed.role.arn", "<your-iam-role-arn>") \
    .getOrCreate()
sedona = SedonaContext.create(config)
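A placeholder left in the role ARN only shows up as a failure once the bucket is first accessed. As a minimal sketch, the settings above can be built by a helper that checks the ARN shape first. The names `assumed_role_options` and `apply_options` are hypothetical, and the pattern assumes the standard `aws` partition used throughout this page.

```python
import re

# Assumes the standard "aws" partition and a 12-digit account ID.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/.+$")


def assumed_role_options(bucket_name, role_arn):
    """Build the per-bucket assumed-role options shown above."""
    if not ROLE_ARN_PATTERN.match(role_arn):
        raise ValueError(f"not an IAM role ARN: {role_arn!r}")
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket_name}."
    return {
        prefix + "aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
        prefix + "assumed.role.credentials.provider":
            "com.amazonaws.auth.WebIdentityTokenCredentialsProvider",
        prefix + "assumed.role.arn": role_arn,
    }


def apply_options(builder, options):
    """Call builder.config(key, value) for each option and return the builder."""
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder
```

With these helpers, `apply_options(SedonaContext.builder(), assumed_role_options("<bucket-name>", "<your-iam-role-arn>"))` raises a `ValueError` until the ARN placeholder has been replaced with a real role ARN.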

Optional: Configure the SedonaContext for WherobotsDB (compute and Havasu Iceberg storage)

Havasu can use your own S3 buckets as table storage. This requires configuration in both AWS IAM and the Wherobots SedonaContext. Refer to Access Your Own S3 Buckets for details.


Last update: May 20, 2024 07:48:29