Skip to content

Migrating Parquet Files to Havasu

If you have some parquet files containing geometry data serialized in well known formats such as EWKB, WKB, WKT or GeoJSON, you can load these parquet files into a Havasu table without scanning these data files.

We’ll use Google open buildings parquet files retrieved from here as an example. The file we gonna migrate looks like the this. You can see that there is a geometry column encoded in EWKB format.

df = sedona.read.format("parquet").load("/path/to/Aswan_Governorate.parquet")
df.show(5)
+-------+-----------------+--------------+----------+--------------+--------------------+
|country|          admin_1|area_in_meters|confidence|full_plus_code|            geometry|
+-------+-----------------+--------------+----------+--------------+--------------------+
|    EGY|Aswan Governorate|       71.1657|    0.6522| 7GPJXWQ6+7HF2|[01 03 00 00 00 0...|
|    EGY|Aswan Governorate|       44.5844|    0.6278| 7GPJXWV4+MG99|[01 03 00 00 00 0...|
|    EGY|Aswan Governorate|      158.4749|    0.8307| 7GPJXWV6+G756|[01 03 00 00 00 0...|
|    EGY|Aswan Governorate|       22.3858|    0.7216| 7GPM92XC+MP77|[01 03 00 00 00 0...|
|    EGY|Aswan Governorate|       46.3999|    0.7935| 7GPMC228+7FPX|[01 03 00 00 00 0...|
+-------+-----------------+--------------+----------+--------------+--------------------+
only showing top 5 rows

Steps

Step 1: Create a external table for your data

We define geometry column as BINARY since it is a binary column in the parquet file. We’ll change it in-place to a geometry column after migrating it to Havasu.

CREATE TABLE spark_catalog.default.tmp_table (
    country STRING,
    admin_1 STRING,
    area_in_meters DOUBLE,
    confidence DOUBLE,
    full_plus_code STRING,
    geometry BINARY
) USING parquet LOCATION '/path/to/Aswan_Governorate.parquet'

Note

The LOCATION clause specifies the path to the parquet files. Specify the path to a single file or a directory containing multiple files (globs are not permitted).

Step 2: Migrate the table

CALL wherobots.system.snapshot('spark_catalog.default.tmp_table', 'wherobots.test_db.open_building_aswan')

After migration, we can see that there is a Havasu table named wherobots.test_db.open_building_aswan. The Havasu table was created using the original parquet files as data files.

sedona.table("wherobots.test_db.open_building_aswan").show(5)
sedona.table("wherobots.test_db.open_building_aswan").show(5)
sedona.table("wherobots.test_db.open_building_aswan").show(5);
+-------+-----------------+--------------+----------+--------------+--------------------+
|country|          admin_1|area_in_meters|confidence|full_plus_code|            geometry|
+-------+-----------------+--------------+----------+--------------+--------------------+
|    EGY|Aswan Governorate|       71.1657|    0.6522| 7GPJXWQ6+7HF2|[01 03 00 00 00 0...|
|    EGY|Aswan Governorate|       44.5844|    0.6278| 7GPJXWV4+MG99|[01 03 00 00 00 0...|
|    EGY|Aswan Governorate|      158.4749|    0.8307| 7GPJXWV6+G756|[01 03 00 00 00 0...|
|    EGY|Aswan Governorate|       22.3858|    0.7216| 7GPM92XC+MP77|[01 03 00 00 00 0...|
|    EGY|Aswan Governorate|       46.3999|    0.7935| 7GPMC228+7FPX|[01 03 00 00 00 0...|
+-------+-----------------+--------------+----------+--------------+--------------------+
only showing top 5 rows

However, geometry field is still a binary field. We’ll convert it to a geometry field in the next step.

Step 3: Set geometry fields

We can set the field storing geometry as a geometry field. This won’t rewrite the data files and will only affect how Havasu interpret configured fields.

ALTER TABLE wherobots.test_db.open_building_aswan SET GEOMETRY FIELDS geometry AS 'ewkb'

Now the geometry field is geometry, and we can apply spatial functions provided by WherobotsDB to this column.

sedona.table("wherobots.test_db.open_building_aswan").withColumn("centroid", expr("ST_Centroid(geometry)")).show(5)
sedona.table("wherobots.test_db.open_building_aswan")
    .withColumn("centroid", expr("ST_Centroid(geometry)"))
    .show(5)
sedona.table("wherobots.test_db.open_building_aswan")
    .withColumn("centroid", expr("ST_Centroid(geometry)"))
    .show(5);
+-------+-----------------+--------------+----------+--------------+--------------------+--------------------+
|country|          admin_1|area_in_meters|confidence|full_plus_code|            geometry|            centroid|
+-------+-----------------+--------------+----------+--------------+--------------------+--------------------+
|    EGY|Aswan Governorate|       71.1657|    0.6522| 7GPJXWQ6+7HF2|POLYGON ((32.9114...|POINT (32.9114078...|
|    EGY|Aswan Governorate|       44.5844|    0.6278| 7GPJXWV4+MG99|POLYGON ((32.9064...|POINT (32.9063703...|
|    EGY|Aswan Governorate|      158.4749|    0.8307| 7GPJXWV6+G756|POLYGON ((32.9108...|POINT (32.9107265...|
|    EGY|Aswan Governorate|       22.3858|    0.7216| 7GPM92XC+MP77|POLYGON ((33.0217...|POINT (33.0217906...|
|    EGY|Aswan Governorate|       46.3999|    0.7935| 7GPMC228+7FPX|POLYGON ((33.0162...|POINT (33.0162143...|
+-------+-----------------+--------------+----------+--------------+--------------------+--------------------+
only showing top 5 rows

Step 4: (OPTIONAL) Create spatial index

We can run CREATE SPATIAL INDEX FOR statement to reorder the rows by spatial proximity to achieve better performance. We can optionally set a target file size for better query performance. Usually hundreds of megabytes per file is a good choice.

CREATE SPATIAL INDEX FOR wherobots.test_db.open_building_aswan
USING hilbert(geometry, 16)
OPTIONS map('target-file-size-bytes', '10485760')

The rewritten files will be stored in the data path of the iceberg table, and it is usually not mixed up with your original data.

SELECT file_path, record_count, file_size_in_bytes FROM wherobots.test_db.open_building_aswan.files
+---------------------------------------------------------------------------------------------+------------+------------------+
|file_path                                                                                    |record_count|file_size_in_bytes|
+---------------------------------------------------------------------------------------------+------------+------------------+
|/path/to/open_building_aswan/data/00003-82-8f72fd15-f0d4-4cc7-a57a-826f8c0b5dad-00001.parquet|142720      |9977792           |
|/path/to/open_building_aswan/data/00001-80-8f72fd15-f0d4-4cc7-a57a-826f8c0b5dad-00001.parquet|150821      |10205356          |
|/path/to/open_building_aswan/data/00002-81-8f72fd15-f0d4-4cc7-a57a-826f8c0b5dad-00001.parquet|159186      |10755403          |
|/path/to/open_building_aswan/data/00000-79-8f72fd15-f0d4-4cc7-a57a-826f8c0b5dad-00001.parquet|158553      |10793641          |
+---------------------------------------------------------------------------------------------+------------+------------------+

Step 5: (OPTIONAL) Allow expiring the original files

If everything is OK, we may want Havasu to take control over all the data files, including the original data file. If Havasu don’t need to use the original data file (for example, the entire table was rewritten), then we can allow Havasu to expire the original data file. We need to set some table properties to explicitly enable this.

ALTER TABLE wherobots.test_db.open_building_aswan SET TBLPROPERTIES ('snapshot'='false', 'gc.enabled'='true')

If we don’t do this step, the Havasu table we migrated will be a snapshot table, and won’t allow us to expire snapshots, since it may accidentally delete your original parquet files.


Last update: May 20, 2024 20:50:47