Access Your Own S3 Buckets

You can use your own S3 buckets as storage for Havasu. This chapter describes how to configure the SedonaContext to store Havasu tables in your own S3 buckets.

Using Access Keys

Prerequisites

To use access keys, you need to have the following:

  1. A bucket in your AWS account that you want to use as the Havasu warehouse. We'll refer to this bucket as <bucket-name> in the following sections.
  2. An IAM user in your AWS account with access to that bucket.
  3. The access key ID and secret access key of the IAM user. We'll refer to these as <your-access-key-id> and <your-secret-access-key> in the following sections.

Configure SedonaContext

You can configure the access key in the SedonaContext to access your bucket with the following configuration:

config = SedonaContext \
    .builder() \
    .config("spark.sql.catalog.<your_catalog>.type", "hadoop") \
    .config("spark.sql.catalog.<your_catalog>", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.<your_catalog>.warehouse", "s3://<bucket-name>/path/to/your/warehouse") \
    .config("spark.sql.catalog.<your_catalog>.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.<your_catalog>.client.region", "<your-region>") \
    .config("spark.sql.catalog.<your_catalog>.s3.access-key-id",  "<your-access-key-id>") \
    .config("spark.sql.catalog.<your_catalog>.s3.secret-access-key", "<your-secret-access-key>") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.access.key", "<your-access-key-id>") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.secret.key", "<your-secret-access-key>") \
    .getOrCreate()
sedona = SedonaContext.create(config)

Note

The AWS credentials need to be configured for both Havasu (spark.sql.catalog.<your_catalog>.*) and the Hadoop FileSystem (spark.hadoop.fs.s3a.bucket.<bucket-name>.*).

It is generally recommended to use an S3 bucket in the same region as the Wherobots cluster; otherwise, performance may degrade and additional costs may be incurred. spark.sql.catalog.<your_catalog>.client.region can be omitted if your S3 bucket is in the same region as the Wherobots cluster. If your S3 bucket is in a different region, you must set spark.sql.catalog.<your_catalog>.client.region to that bucket's region.

Although it is easy to set up a cross-account Havasu warehouse using access keys, static credentials are not recommended in a production environment. Use IAM roles instead.
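
If the configuration above does not behave as expected, you can sanity-check the access key pair outside of Spark. The snippet below is a minimal sketch, assuming boto3 is available in your Python environment; the bucket name, region, and warehouse prefix are the same placeholders used above.

import boto3

# Sanity check (not part of the required setup): confirm the key pair can list
# and write under the warehouse prefix before troubleshooting Spark settings.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<your-access-key-id>",
    aws_secret_access_key="<your-secret-access-key>",
    region_name="<your-region>",
)

# List objects under the warehouse prefix (works even if the prefix is still empty)
resp = s3.list_objects_v2(Bucket="<bucket-name>", Prefix="path/to/your/warehouse/")
print("objects found:", resp.get("KeyCount", 0))

# Write and delete a small marker object to confirm write access
s3.put_object(Bucket="<bucket-name>", Key="path/to/your/warehouse/_access_check", Body=b"ok")
s3.delete_object(Bucket="<bucket-name>", Key="path/to/your/warehouse/_access_check")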

Using IAM Roles

Prerequisites

To use IAM roles, you need the following:

  1. A bucket in your AWS account that you want to use as the Havasu warehouse. We'll refer to this bucket as <bucket-name> in the following sections.
  2. An IAM role in your AWS account with access to that bucket. We'll refer to the ARN of this IAM role as <your-iam-role-arn> in the following sections.
  3. A trust relationship allowing Wherobots user roles to assume your IAM role.

1. Create an IAM role in your AWS account

Create an IAM role in your AWS account and attach the following policy to grant permissions on your bucket. Adjust the path prefix to the sub-folder in your bucket that you want to grant access to, if needed. We'll refer to this role's ARN as <your-iam-role-arn>.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::<bucket-name>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "path/*"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<bucket-name>/path/*"
        }
    ]
}

Note

If you only intend to read data from this bucket, you can limit the permissions to s3:GetObject in the second policy statement.

2. Add a trust relationship to your IAM role

Add a trust relationship to the <your-iam-role-arn> IAM role you just created, allowing your Wherobots users' roles to assume it. Add the following statement to the role's trust policy:

{
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "arn:aws:iam::329898491045:root"
    ]
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringLike": {
      "aws:PrincipalArn": "arn:aws:iam::329898491045:role/wherobotsusers/organization/*/wherobots-user_<wherobots-org-id>-*"
    }
  }
}

The <wherobots-org-id> can be obtained from your organization's settings page, or by running the following command in a Wherobots notebook terminal:

echo $WBC_ORG_ID
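
If you prefer to script the IAM setup instead of using the AWS console, the sketch below performs steps 1 and 2 with boto3. This is an assumption about your tooling, not a required procedure; the role name wherobots-havasu-cross-account and the policy name havasu-s3-access are hypothetical, and the caller must have IAM permissions in your AWS account.

import json
import boto3

iam = boto3.client("iam")

# Trust policy from step 2, wrapped in a full policy document.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ["arn:aws:iam::329898491045:root"]},
        "Action": "sts:AssumeRole",
        "Condition": {
            "StringLike": {
                "aws:PrincipalArn": "arn:aws:iam::329898491045:role/wherobotsusers/organization/*/wherobots-user_<wherobots-org-id>-*"
            }
        }
    }]
}

# S3 permissions policy from step 1.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::<bucket-name>",
            "Condition": {"StringLike": {"s3:prefix": ["path/*"]}}
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::<bucket-name>/path/*"
        }
    ]
}

# Create the role with the trust policy, then attach the S3 permissions inline.
role = iam.create_role(
    RoleName="wherobots-havasu-cross-account",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="wherobots-havasu-cross-account",
    PolicyName="havasu-s3-access",  # hypothetical name
    PolicyDocument=json.dumps(s3_policy),
)
print(role["Role"]["Arn"])  # this is <your-iam-role-arn>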

3. Add permissions to the IAM roles of Wherobots users

Warning

This step cannot currently be completed by end users. Please contact us to get this set up for your account.

4. Verify the configuration

You can verify the configuration by running the following command in your Wherobots notebook's terminal:

aws sts assume-role --role-arn <your-iam-role-arn> --role-session-name cross-account-s3-access

If the command succeeds and outputs temporary STS credentials, the configuration is correct.
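
Equivalently, you can run the same check from a Python cell. This is a sketch assuming boto3 is available in your notebook environment; the session name is arbitrary:

import boto3

# Attempt to assume the cross-account role; this raises an error if the trust
# relationship or the Wherobots-side permissions are not in place.
sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="<your-iam-role-arn>",
    RoleSessionName="cross-account-s3-access",
)
print(resp["Credentials"]["Expiration"])  # temporary credentials were issued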

5. Configure the SedonaContext

Configure WherobotsDB to use your S3 bucket as the warehouse path by specifying the IAM role in the SedonaContext:

config = SedonaContext \
    .builder() \
    .config("spark.sql.catalog.<your_catalog>.type", "hadoop") \
    .config("spark.sql.catalog.<your_catalog>", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.<your_catalog>.warehouse", "s3://<bucket-name>/path/to/your/warehouse") \
    .config("spark.sql.catalog.<your_catalog>.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.<your_catalog>.client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory") \
    .config("spark.sql.catalog.<your_catalog>.client.assume-role.arn", "<your-iam-role-arn>") \
    .config("spark.sql.catalog.<your_catalog>.client.assume-role.region",  "<your-region>") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.aws.credentials.provider", "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.assumed.role.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.bucket.<bucket-name>.assumed.role.arn", "<your-iam-role-arn>") \
    .getOrCreate()
sedona = SedonaContext.create(config)

Note

The IAM role ARN needs to be configured for both Havasu (spark.sql.catalog.<your_catalog>.*) and the Hadoop FileSystem (spark.hadoop.fs.s3a.bucket.<bucket-name>.*).

It is generally recommended to use an S3 bucket in the same region as the Wherobots cluster; otherwise, performance may degrade and additional costs may be incurred. spark.sql.catalog.<your_catalog>.client.assume-role.region can be omitted if your S3 bucket is in the same region as the Wherobots cluster. If your S3 bucket is in a different region, you must set spark.sql.catalog.<your_catalog>.client.assume-role.region to that bucket's region.
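
Once the context is created, you can confirm the end-to-end setup with a throwaway table; this exercises both the Iceberg S3FileIO path and the S3A filesystem path under the assumed role. The following is a hypothetical smoke test; the smoke_test namespace and demo table names are placeholders:

# Create, write, read, and drop a throwaway table in your warehouse.
# If every statement succeeds, Havasu can read and write both metadata and
# data files in your S3 bucket.
sedona.sql("CREATE NAMESPACE IF NOT EXISTS <your_catalog>.smoke_test")
sedona.sql("CREATE TABLE <your_catalog>.smoke_test.demo (id BIGINT) USING iceberg")
sedona.sql("INSERT INTO <your_catalog>.smoke_test.demo VALUES (1)")
sedona.sql("SELECT * FROM <your_catalog>.smoke_test.demo").show()
sedona.sql("DROP TABLE <your_catalog>.smoke_test.demo")
sedona.sql("DROP NAMESPACE <your_catalog>.smoke_test")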

