Geometry Support in Havasu¶
Havasu supports a geometry data type and lets users manipulate geometry data with the geometry functions in WherobotsDB. This document describes how to use the geometry data type and geometry functions in Havasu.
Geometry as a Primitive Type¶
Besides the primitive types supported by Apache Iceberg, Havasu introduces a new data type, GEOMETRY, to represent geospatial data. For instance, users can create a table with a geometry column using SQL:
sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
id bigint,
data string,
geom geometry
) PARTITIONED BY (bucket(4, id))""")
sedona.sql(
"CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (" +
"id bigint," +
"data string," +
"geom geometry" +
") PARTITIONED BY (bucket(4, id))"
);
This creates an empty table partitioned by the bucketed value of id, with a GEOMETRY column holding the geospatial data. We can inspect the table schema using the DESCRIBE command:
sedona.sql("DESCRIBE TABLE wherobots.test_db.test_table").show()
sedona.sql("DESCRIBE TABLE wherobots.test_db.test_table").show();
+--------------+-------------+-------+
| col_name| data_type|comment|
+--------------+-------------+-------+
| id| bigint| null|
| data| string| null|
| geom| geometry| null|
| | | |
|# Partitioning| | |
| Part 0|bucket(4, id)| |
+--------------+-------------+-------+
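The bucket(4, id) clause is Iceberg's hidden bucket partitioning: the partition value is a hash of id reduced modulo the bucket count, so rows with the same id always land in the same bucket. A conceptual sketch in Python (Iceberg actually uses a 32-bit Murmur3 hash of the value's little-endian encoding; hashlib is used here as a stand-in so the sketch runs with the standard library alone):

```python
import hashlib
import struct

def bucket(num_buckets: int, value: int) -> int:
    """Map a bigint value to one of num_buckets partitions (conceptual sketch)."""
    # Iceberg hashes the little-endian 8-byte encoding of a bigint.
    raw = struct.pack("<q", value)
    # Stand-in hash: Iceberg uses 32-bit Murmur3, not SHA-256.
    h = int.from_bytes(hashlib.sha256(raw).digest()[:4], "little")
    # Mask the sign bit, then take the remainder, as Iceberg does.
    return (h & 0x7FFFFFFF) % num_buckets

# Rows with the same id always land in the same of 4 buckets.
assert bucket(4, 42) == bucket(4, 42)
assert 0 <= bucket(4, 123456789) < 4
```

Because the bucket is derived from id, queries filtering on id can prune partitions without the reader knowing the bucketing scheme.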
Or using the .printSchema() method:
sedona.table("wherobots.test_db.test_table").printSchema()
sedona.table("wherobots.test_db.test_table").printSchema();
root
|-- id: long (nullable = true)
|-- data: string (nullable = true)
|-- geom: geometry (nullable = true)
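In the underlying data files, geometry values are serialized in a binary form such as WKB (well-known binary); the exact on-disk encoding is an implementation detail of Havasu. As a rough illustration of what such a serialization looks like, a 2D point in standard little-endian WKB can be built with nothing but the standard library:

```python
import struct

def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (illustration only)."""
    # Layout: byte-order marker (1 = little-endian), geometry type
    # (1 = Point), then the two coordinates as IEEE-754 doubles.
    return struct.pack("<BIdd", 1, 1, x, y)

wkb = point_to_wkb(1.0, 2.0)
assert len(wkb) == 21  # 1 + 4 + 8 + 8 bytes
```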
Creating Table with Geometry Column¶
As mentioned above, users can create a Havasu table using SQL:
sedona.sql("""
CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (
id bigint,
data string,
geom geometry
) PARTITIONED BY (bucket(4, id))""")
sedona.sql(
"CREATE TABLE IF NOT EXISTS wherobots.test_db.test_table (" +
"id bigint," +
"data string," +
"geom geometry" +
") PARTITIONED BY (bucket(4, id))"
);
If you have a DataFrame containing geometry data, you can also create a Havasu table from it:
# Create a DataFrame with geometry data
df = sedona.createDataFrame([
(1, "a", "POINT (1 2)"),
(2, "b", "POINT (2 3)")
], ["id", "data", "geom"])\
.withColumn("geom", expr("ST_GeomFromText(geom)"))
# Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create()
// Create a DataFrame with geometry data
val df = Seq(
(1, "a", "POINT (1 2)"),
(2, "b", "POINT (2 3)")
).toDF("id", "data", "geom")
.withColumn("geom", expr("ST_GeomFromText(geom)"))
// Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create()
// Create a DataFrame with geometry data
List<Row> data = Arrays.asList(
RowFactory.create(1, "a", "POINT (1 2)"),
RowFactory.create(2, "b", "POINT (2 3)")
);
List<StructField> fields = Arrays.asList(
DataTypes.createStructField("id", DataTypes.IntegerType, true),
DataTypes.createStructField("data", DataTypes.StringType, true),
DataTypes.createStructField("geom", DataTypes.StringType, true)
);
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> df = sedona.createDataFrame(data, schema)
.withColumn("geom", expr("ST_GeomFromText(geom)"));
// Create a Havasu table using the DataFrame
df.writeTo("wherobots.test_db.test_table").create();
Writing Data¶
INSERT INTO¶
Users can insert data into a Havasu table using INSERT INTO table_name VALUES:
sedona.sql("""
INSERT INTO wherobots.test_db.test_table
VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))
""")
sedona.sql(
"INSERT INTO wherobots.test_db.test_table " +
"VALUES (1, 'a', ST_GeomFromText('POINT (1 2)')), (2, 'b', ST_Point(2, 3))"
);
Or using INSERT INTO table_name SELECT ... to insert the result set of a query into the table:
sedona.sql("INSERT INTO wherobots.test_db.test_table SELECT 3, 'c', ST_Point(3, 4)")
sedona.sql("INSERT INTO wherobots.test_db.test_table SELECT 3, 'c', ST_Point(3, 4)");
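Both ST_GeomFromText and ST_Point above construct geometry values; the former parses WKT (well-known text) such as 'POINT (1 2)'. As a toy illustration of what that parsing involves for the point case only (this is not WherobotsDB's parser, just a sketch):

```python
import re

def parse_wkt_point(wkt: str) -> tuple:
    """Parse a WKT point like 'POINT (1 2)' into an (x, y) tuple."""
    m = re.fullmatch(r"\s*POINT\s*\(\s*(-?[\d.]+)\s+(-?[\d.]+)\s*\)\s*", wkt)
    if m is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return (float(m.group(1)), float(m.group(2)))

assert parse_wkt_point("POINT (1 2)") == (1.0, 2.0)
```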
Writing DataFrame to Havasu table¶
Users can write a DataFrame containing geometry data to a Havasu table:
# Create a DataFrame with geometry data
df = sedona.createDataFrame([
(1, "a", "POINT (1 2)"),
(2, "b", "POINT (2 3)")
], ["id", "data", "geom"])\
.withColumn("geom", expr("ST_GeomFromText(geom)"))
# Write the DataFrame to a Havasu table
df.writeTo("wherobots.test_db.test_table").append()
// Create a DataFrame with geometry data
val df = Seq(
(1, "a", "POINT (1 2)"),
(2, "b", "POINT (2 3)")
).toDF("id", "data", "geom")
.withColumn("geom", expr("ST_GeomFromText(geom)"))
// Write the DataFrame to a Havasu table
df.writeTo("wherobots.test_db.test_table").append()
List<Row> data = Arrays.asList(
RowFactory.create(1, "a", "POINT (1 2)"),
RowFactory.create(2, "b", "POINT (2 3)")
);
List<StructField> fields = Arrays.asList(
DataTypes.createStructField("id", DataTypes.IntegerType, true),
DataTypes.createStructField("data", DataTypes.StringType, true),
DataTypes.createStructField("geom", DataTypes.StringType, true)
);
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> df = sedona.createDataFrame(data, schema)
.withColumn("geom", expr("ST_GeomFromText(geom)"));
df.writeTo("wherobots.test_db.test_table").append();
This is semantically equivalent to the INSERT INTO table_name SELECT ...
statement.
Updating data in Havasu table¶
Havasu supports UPDATE queries that update matching rows in a table. An update query accepts a filter that selects the rows to update; spatial filters are also supported.
sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1")
# Use ST_Intersects to update rows that intersect with a polygon
sedona.sql("""
UPDATE wherobots.test_db.test_table
SET data = 'd'
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1")
// Use ST_Intersects to update rows that intersect with a polygon
sedona.sql("""
UPDATE wherobots.test_db.test_table
SET data = 'd'
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
sedona.sql("UPDATE wherobots.test_db.test_table SET data = 'd' WHERE id = 1");
// Use ST_Intersects to update rows that intersect with a polygon
sedona.sql(
"UPDATE wherobots.test_db.test_table " +
"SET data = 'd' " +
"WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);
Deleting data from Havasu table¶
Havasu supports DELETE FROM queries to remove data from tables. A delete query accepts a filter that selects the rows to delete; spatial filters are also supported.
sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1")
# Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql("""
DELETE FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1")
// Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql("""
DELETE FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
sedona.sql("DELETE FROM wherobots.test_db.test_table WHERE id = 1");
// Use ST_Intersects to delete rows that intersect with a polygon
sedona.sql(
"DELETE FROM wherobots.test_db.test_table " +
"WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);
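The ST_Intersects filters above are ordinary geometric predicates. In the simplest case, a point intersects a polygon when it lies inside the polygon's ring, which the standard even-odd ray-casting test decides; the sketch below illustrates the idea (WherobotsDB, of course, uses a full geometry engine covering all geometry types):

```python
def point_in_polygon(pt, ring):
    """Even-odd ray-casting test; ring is a closed list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(ring) - 1  # the last vertex repeats the first
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[i + 1]
        # Count edges that a rightward ray from pt crosses.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# The same polygon used in the filters above.
square = [(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)]
assert point_in_polygon((1, 1), square)
assert not point_in_polygon((3, 3), square)
```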
Merging DataFrame into Havasu table using MERGE INTO¶
Havasu supports MERGE INTO by rewriting the data files that contain rows needing updates in an overwrite commit. The syntax is identical to that of open-source Apache Iceberg; please refer to Apache Iceberg - MERGE INTO for more information.
Querying Data¶
Users can load data from a Havasu table using sedona.table(...):
df = sedona.table("wherobots.test_db.test_table")
Dataset<Row> df = sedona.table("wherobots.test_db.test_table");
You can apply some configurations for reading the table, such as the split size if you want to read the table into a DataFrame with more partitions:
df = sedona.read.option('split-size', '1000').table("wherobots.test_db.test_table")
Dataset<Row> df = sedona.read().option("split-size", "1000").table("wherobots.test_db.test_table");
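The split-size option controls how large a slice of a data file each read task gets, so a smaller split size yields more, smaller partitions. To a first approximation, the number of read splits produced for one data file is:

```python
import math

def num_splits(file_size_bytes: int, split_size_bytes: int) -> int:
    """Rough number of read splits produced for one data file."""
    return math.ceil(file_size_bytes / split_size_bytes)

# A 10,000-byte file read with split-size 1000 yields ~10 input splits,
# so the resulting DataFrame has more, smaller partitions.
assert num_splits(10_000, 1_000) == 10
assert num_splits(10_000, 4_000) == 3
```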
You can run spatial range query on the table using Spatial SQL:
df = sedona.sql("""
SELECT * FROM wherobots.test_db.test_table
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))
""")
Dataset<Row> df = sedona.sql(
"SELECT * FROM wherobots.test_db.test_table " +
"WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))"
);
Or using the DataFrame API:
df = sedona.table("wherobots.test_db.test_table")\
.where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")
Dataset<Row> df = sedona.table("wherobots.test_db.test_table")
.where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))");
Users can also load a Havasu table by specifying the data source name explicitly with .format("havasu.iceberg"). This loads an isolated table reference that will not automatically refresh tables used by queries.
df = sedona.read.format("havasu.iceberg").load("wherobots.test_db.test_table")\
.where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")
val df = sedona.read.format("havasu.iceberg").load("wherobots.test_db.test_table")
  .where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))")
Dataset<Row> df = sedona.read().format("havasu.iceberg").load("wherobots.test_db.test_table")
.where("ST_Intersects(geom, ST_GeomFromText('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'))");
Spatial range queries are very efficient in Havasu, which supports spatial-filter-based data skipping: data files that cannot contain rows satisfying the spatial filter are not read at all. Please refer to Cluster by geospatial fields for faster queries for more information.
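Conceptually, data skipping compares the bounding box of the query geometry against per-file geometry bounds recorded in table metadata, and only reads files whose bounds overlap. A minimal sketch of that pruning step (file names and bounds here are hypothetical):

```python
def bbox_overlaps(a, b):
    """Check whether two (minx, miny, maxx, maxy) boxes intersect."""
    aminx, aminy, amaxx, amaxy = a
    bminx, bminy, bmaxx, bmaxy = b
    return aminx <= bmaxx and bminx <= amaxx and aminy <= bmaxy and bminy <= amaxy

# Hypothetical file-level geometry bounds, as a planner might read
# them from table metadata.
files = {
    "file_a.parquet": (0.0, 0.0, 5.0, 5.0),
    "file_b.parquet": (10.0, 10.0, 20.0, 20.0),
}
query_bbox = (0.0, 0.0, 2.0, 2.0)  # bounds of the query polygon

to_read = [f for f, bounds in files.items() if bbox_overlaps(bounds, query_bbox)]
assert to_read == ["file_a.parquet"]  # file_b is skipped entirely
```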
Working with Geometry Data¶
Data in columns of geometry type is loaded as GeometryUDT values in Sedona, so you can use any of the ST_ functions provided by WherobotsDB to manipulate the geospatial data. For example, use ST_Buffer to create a buffer around the geometry column:
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table").show()
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table").show();
+----------------------+
|st_buffer(geom, 0.001)|
+----------------------+
| POLYGON ((3.001 4...|
| POLYGON ((1.001 2...|
| POLYGON ((2.001 3...|
| POLYGON ((1.001 2...|
| POLYGON ((2.001 3...|
+----------------------+
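ST_Buffer returns a polygon covering all points within the given distance of the input geometry. For a point input, that is (an approximation of) a circle; geometry engines build it by placing vertices around the centre, as in this toy construction with a fixed segment count (real implementations choose the segment count adaptively):

```python
import math

def buffer_point(x, y, radius, segments=8):
    """Approximate a circular buffer around a point as a closed polygon ring."""
    ring = [
        (x + radius * math.cos(2 * math.pi * i / segments),
         y + radius * math.sin(2 * math.pi * i / segments))
        for i in range(segments)
    ]
    ring.append(ring[0])  # close the ring
    return ring

ring = buffer_point(1.0, 2.0, 0.001)
assert len(ring) == 9  # 8 segments + closing vertex
# Every vertex sits radius away from the centre.
assert all(abs(math.hypot(px - 1.0, py - 2.0) - 0.001) < 1e-12 for px, py in ring)
```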
The resulting DataFrame can also be written back to a Havasu table using a CTAS statement:
sedona.sql("""
CREATE TABLE wherobots.test_db.test_table_buf AS
SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table
""")
sedona.sql(
"CREATE TABLE wherobots.test_db.test_table_buf AS " +
"SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table"
);
Or simply call the writeTo method on the resulting DataFrame of the query:
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")\
.writeTo("wherobots.test_db.test_table_buf").create()
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")
.writeTo("wherobots.test_db.test_table_buf").create()
sedona.sql("SELECT ST_Buffer(geom, 0.001) FROM wherobots.test_db.test_table")
.writeTo("wherobots.test_db.test_table_buf").create();
Further Reading¶
Havasu is based on Apache Iceberg, and all features of Apache Iceberg except MOR (merge-on-read) tables are supported. Please refer to the Apache Iceberg documentation for Spark for more information.
If you have spatial data stored in an Iceberg table or in Parquet files as WKT or WKB, you can migrate it to Havasu very efficiently, without scanning or rewriting your data files. Please refer to Convert Existing Table to Havasu Table and Migrating Parquet Files to Havasu for more information.
Havasu supports spatial filter push-down and can optimize spatial range queries. Please refer to Cluster by geospatial fields for faster queries to learn how to organize your spatial data for better performance.