Load data from external storage

Wherobots loads data from files, databases, and in-memory Python objects like Pandas DataFrames.

Data naming conventions¶

To ensure data loads correctly, please adhere to the following rules.

File naming: Files and directories must not start with an underscore (_) or a period (.). These files will be ignored and result in an AnalysisException: [UNABLE_TO_INFER_SCHEMA] error.
- Incorrect: _my_data.parquet
- Correct: my_data.parquet
S3 storage path: When loading from S3, the bucket name cannot contain periods (.).
- Incorrect: s3://wherobots.sf.data/
- Correct: s3://wherobots-sf-data/
Formatting S3 filename paths and for Wherobots

If you're loading a file from S3, ensure that it complies with the following:
- Incorrect Path: s3://wherobots.sf.data/2025/_august_parcels.parquet This path will fail because the bucket name (wherobots.sf.data) contains periods and the filename (_august_parcels.parquet) starts with an underscore.
- Correct Path: s3://wherobots-sf-data/2025/august_parcels.parquet This path is valid because the bucket name uses dashes instead of periods and the leading underscore is removed from the filename.

Load data from text files¶

Assume we have a WKT file, namely usa-county.tsv, at Path /Download/usa-county.tsv as follows:

POLYGON (..., ...)  Cuming County
POLYGON (..., ...)  Wahkiakum County
POLYGON (..., ...)  De Baca County
POLYGON (..., ...)  Lancaster County

The file may have many other columns.

Load a raw DataFrame¶

Use the following code to load the data and create a raw DataFrame:

ScalaJavaPython

var rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()

Dataset<Row> rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()

rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()

The output will be like this:

|                 _c0|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|null| null|null|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|null| null|null|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|null| null|null|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109|  Lancaster|    Lancaster County| 06| H1|G4020| 339|30700|null|   A|2169240202|22877180|+40.7835474|-096.6886584|

Create a Geometry type column¶

All geometrical operations in Spatial SQL are on Geometry type objects. Therefore, before any kind of queries, you need to create a Geometry type column on a DataFrame.

SELECT ST_GeomFromWKT(_c0) AS countyshape, _c1, _c2

You can select many other attributes to compose this spatialdDf. The output will be something like this:

|                 countyshape|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|null| null|null|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|null| null|null|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|null| null|null|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109|  Lancaster|    Lancaster County| 06| H1|G4020| 339|30700|null|   A|2169240202|22877180|+40.7835474|-096.6886584|

Although it looks same with the input, but actually the type of column countyshape has been changed to Geometry type.

To verify this, use the following code to print the schema of the DataFrame:

spatialDf.printSchema()

The output will be like this:

root
 |-- countyshape: geometry (nullable = false)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)

Note

Spatial SQL provides lots of functions to create a Geometry column, please read Spatial SQL constructor API.

Load GeoJSON Data¶

WherobotsDB supports reading GeoJSON files using the geojson data source. It is designed to handle JSON files that use GeoJSON format for their geometries.

This includes SpatioTemporal Asset Catalog (STAC) files, GeoJSON features, GeoJSON feature collections and other variations. The key functionality lies in the way 'geometry' fields are processed: these are specifically read as Sedona's GeometryUDT type, ensuring integration with Sedona's suite of spatial functions.

Load MultiLine GeoJSON FeatureCollection¶

Suppose we have a GeoJSON FeatureCollection file as follows. This entire file is considered as a single GeoJSON FeatureCollection object. Multiline format is preferable for scenarios where files need to be human-readable or manually edited.

{ "type": "FeatureCollection",
    "features": [
      { "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
        "properties": {"prop0": "value0"}
        },
      { "type": "Feature",
        "geometry": {
          "type": "LineString",
          "coordinates": [
            [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
            ]
          },
        "properties": {
          "prop0": "value1",
          "prop1": 0.0
          }
        },
      { "type": "Feature",
         "geometry": {
           "type": "Polygon",
           "coordinates": [
             [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
               [100.0, 1.0], [100.0, 0.0] ]
             ]
         },
         "properties": {
           "prop0": "value2",
           "prop1": {"this": "that"}
           }
         }
       ]
}

Set the multiLine option to True to read multiline GeoJSON files.

PythonScalaJava

df = sedona.read.format("geojson").option("multiLine", "true").load("PATH/TO/MYFILE.json")
 .selectExpr("explode(features) as features") # Explode the envelope to get one feature per row.
 .select("features.*") # Unpack the features struct.
 .withColumn("prop0", f.expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()

val df = sedona.read.format("geojson").option("multiLine", "true").load("PATH/TO/MYFILE.json")
val parsedDf = df.selectExpr("explode(features) as features").select("features.*")
        .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

parsedDf.show()
parsedDf.printSchema()

Dataset<Row> df = sedona.read.format("geojson").option("multiLine", "true").load("PATH/TO/MYFILE.json")
 .selectExpr("explode(features) as features") // Explode the envelope to get one feature per row.
 .select("features.*") // Unpack the features struct.
 .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

df.show();
df.printSchema();

The output is as follows:

+--------------------+------+
|            geometry| prop0|
+--------------------+------+
|     POINT (102 0.5)|value0|
|LINESTRING (102 0...|value1|
|POLYGON ((100 0, ...|value2|
+--------------------+------+

root
 |-- geometry: geometry (nullable = false)
 |-- prop0: string (nullable = true)

Load Single Line GeoJSON Features¶

Suppose we have a single-line GeoJSON Features dataset as follows. Each line is a single GeoJSON Feature. This format is efficient for processing large datasets where each line is a separate, self-contained GeoJSON object.

{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}

By default, when option is not specified, WherobotsDB reads a GeoJSON file as a single line GeoJSON.

PythonScalaJava

df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
   .withColumn("prop0", f.expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()

val df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
   .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()

Dataset<Row> df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
   .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()

The output is as follows:

+--------------------+------+
|            geometry| prop0|
+--------------------+------+
|     POINT (102 0.5)|value0|
|LINESTRING (102 0...|value1|
|POLYGON ((100 0, ...|value2|
+--------------------+------+

root
 |-- geometry: geometry (nullable = false)
 |-- prop0: string (nullable = true)

Load Shapefile¶

WherobotsDB supports loading Shapefile as a DataFrame.

Scala/JavaJavaPython

val df = sedona.read.format("shapefile").load("/path/to/shapefile")

Dataset<Row> df = sedona.read().format("shapefile").load("/path/to/shapefile")

df = sedona.read.format("shapefile").load("/path/to/shapefile")

The input path can be a directory containing one or multiple shapefiles, or path to a .shp file.

When the input path is a directory, all shapefiles directly under the directory will be loaded. If you want to load all shapefiles in subdirectories, please specify .option("recursiveFileLookup", "true").
When the input path is a .shp file, that shapefile will be loaded. Sedona will look for sibling files (.dbf, .shx, etc.) with the same main file name and load them automatically.

The name of the geometry column is geometry by default. You can change the name of the geometry column using the geometry.name option. If one of the non-spatial attributes is named "geometry", geometry.name must be configured to avoid conflict.

Scala/JavaJavaPython

val df = sedona.read.format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")

Dataset<Row> df = sedona.read().format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")

df = sedona.read.format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")

Each record in shapefile has a unique record number, that record number is not loaded by default. If you want to include record number in the loaded DataFrame, you can set the key.name option to the name of the record number column:

Scala/JavaJavaPython

val df = sedona.read.format("shapefile").option("key.name", "FID").load("/path/to/shapefile")

Dataset<Row> df = sedona.read().format("shapefile").option("key.name", "FID").load("/path/to/shapefile")

df = sedona.read.format("shapefile").option("key.name", "FID").load("/path/to/shapefile")

The character encoding of string attributes are inferred from the .cpg file. If you see garbled values in string fields, you can manually specify the correct charset using the charset option. For example:

Scala/JavaJavaPython

val df = sedona.read.format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")

Dataset<Row> df = sedona.read().format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")

df = sedona.read.format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")

Load GeoParquet¶

WherobotsDB natively supports loading GeoParquet file. Wherobots will infer geometry fields using the "geo" metadata in GeoParquet files.

ScalaJavaPython

val df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
df.printSchema()

Dataset<Row> df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
df.printSchema()

df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
df.printSchema()

The output will be as follows:

root
 |-- pop_est: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- name: string (nullable = true)
 |-- iso_a3: string (nullable = true)
 |-- gdp_md_est: double (nullable = true)
 |-- geometry: geometry (nullable = true)

WherobotsDB supports spatial predicate push-down for GeoParquet files, please refer to the Spatial SQL query optimizer documentation for details.

GeoParquet file reader can also be used to read legacy Parquet files written by Apache Sedona 1.3.1-incubating or earlier. Please refer to Reading Legacy Parquet Files for details.

Inspect GeoParquet metadata¶

WherobotsDB provides a Spark SQL data source "geoparquet.metadata" for inspecting GeoParquet metadata. The resulting dataframe contains the "geo" metadata for each input file.

Scala/JavaJavaPython

val df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1)
df.printSchema()

Dataset<Row> df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1)
df.printSchema()

df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1)
df.printSchema()

The output will be as follows:

root
 |-- path: string (nullable = true)
 |-- version: string (nullable = true)
 |-- primary_column: string (nullable = true)
 |-- columns: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- encoding: string (nullable = true)
 |    |    |-- geometry_types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- bbox: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
 |    |    |-- crs: string (nullable = true)

If the input Parquet file does not have GeoParquet metadata, the values of version, primary_column and columns fields of the resulting dataframe will be null.

Note

geoparquet.metadata only supports reading GeoParquet specific metadata. Users can use G-Research/spark-extension to read comprehensive metadata of generic Parquet files.

Load from GeoPackage¶

WherobotsDB supports loading Geopackage file format as a DataFrame.

Scala/JavaJavaPython

val df = sedona.read.format("geopackage").option("tableName", "tab").load("/path/to/geopackage")

Dataset<Row> df = sedona.read().format("geopackage").option("tableName", "tab").load("/path/to/geopackage")

df = sedona.read.format("geopackage").option("tableName", "tab").load("/path/to/geopackage")

Geopackage files can contain vector data and raster data. To show the possible options from a file you can look into the metadata table by adding parameter showMetadata and set its value as true.

Scala/JavaJavaPython

val df = sedona.read.format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")

Dataset<Row> df = sedona.read().format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")

```python df = sedona.read.format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")

Then you can see the metadata of the geopackage file like below.

+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+
|          table_name|data_type|          identifier|description|         last_change|     min_x|            min_y|     max_x|     max_y|srs_id|
+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+
|gis_osm_water_a_f...| features|gis_osm_water_a_f...|           |2024-09-30 23:07:...|-9.0257084|57.96814069999999|33.4866675|80.4291867|  4326|
+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+

You can also load data from raster tables in the geopackage file. To load raster data, you can use the following code.

Scala/JavaJavaPython

val df = sedona.read.format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")

Dataset<Row> df = sedona.read().format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")

df = sedona.read.format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")

+---+----------+-----------+--------+--------------------+
| id|zoom_level|tile_column|tile_row|           tile_data|
+---+----------+-----------+--------+--------------------+
|  1|        11|        428|     778|GridCoverage2D["c...|
|  2|        11|        429|     778|GridCoverage2D["c...|
|  3|        11|        428|     779|GridCoverage2D["c...|
|  4|        11|        429|     779|GridCoverage2D["c...|
|  5|        11|        427|     777|GridCoverage2D["c...|
+---+----------+-----------+--------+--------------------+

Known limitations:

webp rasters are not supported
ewkb geometries are not supported
filtering based on geometries envelopes are not supported

Load from OSM PBF¶

WherobotsDB supports loading OSM PBF file format as a DataFrame.

Scala/JavaJavaPython

val df = sedona.read.format("osmpbf").load("/path/to/osmpbf")

Dataset<Row> df = sedona.read().format("osmpbf").load("/path/to/osmpbf")

df = sedona.read.format("osmpbf").load("/path/to/osmpbf")

OSM PBF files can contain nodes, ways, and relations. Currently WherobotsDB support Nodes, DenseNodes, Ways and Relations. When you load the data you get a DataFrame with the following schema.

root
 |- id: long (nullable = true)
 |-- kind: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)
 |-- tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- refs: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- ref_roles: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ref_types: array (nullable = true)
 |    |-- element: string (containsNull = true)

Where:

*id is the unique identifier of the object. *kind is the type of the object, it can be node, way or relation. *location is the location of the object, it contains the longitude and latitude of the object. *tags is a map of key-value pairs that represent the tags of the object. *refs is an array of the references of the object. *ref_roles is an array of the roles of the references. *ref_types is an array of the types of the references.

The dataframe for ways might look like this for nodes:

+---------+----+--------------------+--------------------+----+---------+---------+
|       id|kind|            location|                tags|refs|ref_roles|ref_types|
+---------+----+--------------------+--------------------+----+---------+---------+
|248675410|node|{21.0884952545166...|{tactile_paving -...|NULL|     NULL|     NULL|
|260821820|node|{21.0191555023193...|{created_by -> JOSM}|NULL|     NULL|     NULL|
|349189665|node|{22.1437530517578...|{source -> http:/...|NULL|     NULL|     NULL|
|353366899|node|{22.9787712097167...|{source -> http:/...|NULL|     NULL|     NULL|
|359460224|node|{22.4816703796386...|{source -> http:/...|NULL|     NULL|     NULL|
+---------+----+--------------------+--------------------+----+---------+---------+
only showing top 5 rows

and for way

+-------+----+--------+--------------------+--------------------+---------+---------+
|     id|kind|location|                tags|                refs|ref_roles|ref_types|
+-------+----+--------+--------------------+--------------------+---------+---------+
|4307329| way|    NULL|{junction -> roun...|[2448759046, 7093...|     NULL|     NULL|
|4307330| way|    NULL|{surface -> aspha...|[26063923, 260639...|     NULL|     NULL|
|4308966| way|    NULL|{sidewalk -> sepa...|[3387797238, 9252...|     NULL|     NULL|
|4308968| way|    NULL|{surface -> pavin...|[26083890, 744724...|     NULL|     NULL|
|4308969| way|    NULL|{cycleway:both ->...|[9526831176, 1218...|     NULL|     NULL|
+-------+----+--------+--------------------+--------------------+---------+---------+

and for relation

+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
|   id|    kind|location|                tags|                refs|           ref_roles|           ref_types|
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
|28124|relation|    NULL|{official_name ->...|[26382394, 26259985]|      [inner, outer]|          [WAY, WAY]|
|28488|relation|    NULL|  {type -> junction}|[26409253, 303249...|[roundabout, roun...|[WAY, WAY, WAY, WAY]|
|32939|relation|    NULL|{ref -> E 67, rou...|[140673970, 14067...|        [, , , , , ]|[WAY, WAY, RELATI...|
|34387|relation|    NULL|{note -> rząd III...|[209161000, 52154...|[main_stream, mai...|[WAY, WAY, WAY, W...|
|34392|relation|    NULL|{distance -> 1047...|[150033976, 25076...|[main_stream, mai...|[WAY, WAY, WAY, W...|
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+

Load data from JDBC data sources (PostGIS)¶

The 'query' option in Spark SQL's JDBC data source can be used to convert geometry columns to a format that WherobotsDB can interpret. This should work for most spatial JDBC data sources. For Postgis there is no need to add a query to convert geometry types since it's already using EWKB as it's wire format.

ScalaJavaPython

// For any JDBC data source, including Postgis.
val df = sedona.read.format("jdbc")
    // Other options.
    .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    .load()
    .withColumn("geom", expr("ST_GeomFromWKB(geom)"))

// This is a simplified version that works for Postgis.
val df = sedona.read.format("jdbc")
    // Other options.
    .option("dbtable", "my_table")
    .load()
    .withColumn("geom", expr("ST_GeomFromWKB(geom)"))

// For any JDBC data source, including Postgis.
Dataset<Row> df = sedona.read().format("jdbc")
    // Other options.
    .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    .load()
    .withColumn("geom", expr("ST_GeomFromWKB(geom)"))

// This is a simplified version that works for Postgis.
Dataset<Row> df = sedona.read().format("jdbc")
    // Other options.
    .option("dbtable", "my_table")
    .load()
    .withColumn("geom", expr("ST_GeomFromWKB(geom)"))

# For any JDBC data source, including Postgis.
df = (sedona.read.format("jdbc")
    # Other options.
    .option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    .load()
    .withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))

# This is a simplified version that works for Postgis.
df = (sedona.read.format("jdbc")
    # Other options.
    .option("dbtable", "my_table")
    .load()
    .withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))

Load data from GeoPandas file reader¶

WherobotsDB Python has implemented serializers and deserializers which allows to convert WherobotsDB Geometry objects into Shapely BaseGeometry objects. Based on that it is possible to load the data with geopandas from file and create Sedona DataFrame based on GeoDataFrame object.

import geopandas as gpd

gdf = gpd.read_file("<some path>.shp")
df = sedona.createDataFrame(gdf)

To leverage Arrow optimization and speed up the conversion, we recommend the create_spatial_dataframe method in Sedona.

from sedona.spark import *

gdf = gpd.read_file("<some path>.shp")
create_spatial_dataframe(sedona, gdf)

Load data from Python Shapely¶

The following Shapely geometry types are supported:

Point
MultiPoint
LineString
MultiLinestring
Polygon
MultiPolygon

Create a schema¶

To create Spark DataFrame from Shapely objects, you need to first create a schema. Please use GeometryType from sedona.sql.types module.

from pyspark.sql.types import IntegerType, StructField, StructType
from sedona.spark import *

schema = StructType(
    [
        StructField("id", IntegerType(), False),
        StructField("geom", GeometryType(), False)
    ]
)

Create Shapely geometries¶

Now you can create Shapely geometries and convert them to a Sedona DataFrame.

PointMultiPointLineStringMultiLineStringPolygonMultiPolygon

from shapely.geometry import Point

data = [
    [1, Point(21.0, 52.0)],
    [1, Point(23.0, 42.0)],
    [1, Point(26.0, 32.0)]
]

gdf = sedona.createDataFrame(
    data,
    schema
)

gdf.show()

+---+-------------+
| id|         geom|
+---+-------------+
|  1|POINT (21 52)|
|  1|POINT (23 42)|
|  1|POINT (26 32)|
+---+-------------+

from shapely.geometry import MultiPoint

data = [
    [1, MultiPoint([[19.511463, 51.765158], [19.446408, 51.779752]])]
]

gdf = sedona.createDataFrame(
    data,
    schema
).show(1, False)

+---+---------------------------------------------------------+
|id |geom                                                     |
+---+---------------------------------------------------------+
|1  |MULTIPOINT ((19.511463 51.765158), (19.446408 51.779752))|
+---+---------------------------------------------------------+

from shapely.geometry import LineString

line = [(40, 40), (30, 30), (40, 20), (30, 10)]

data = [
    [1, LineString(line)]
]

gdf = sedona.createDataFrame(
    data,
    schema
)

gdf.show(1, False)

+---+--------------------------------+
|id |geom                            |
+---+--------------------------------+
|1  |LINESTRING (10 10, 20 20, 10 40)|
+---+--------------------------------+

from shapely.geometry import MultiLineString

line1 = [(10, 10), (20, 20), (10, 40)]
line2 = [(40, 40), (30, 30), (40, 20), (30, 10)]

data = [
    [1, MultiLineString([line1, line2])]
]

gdf = sedona.createDataFrame(
    data,
    schema
)

gdf.show(1, False)

+---+---------------------------------------------------------------------+
|id |geom                                                                 |
+---+---------------------------------------------------------------------+
|1  |MULTILINESTRING ((10 10, 20 20, 10 40), (40 40, 30 30, 40 20, 30 10))|
+---+---------------------------------------------------------------------+

from shapely.geometry import Polygon

polygon = Polygon(
    [
         [19.51121, 51.76426],
         [19.51056, 51.76583],
         [19.51216, 51.76599],
         [19.51280, 51.76448],
         [19.51121, 51.76426]
    ]
)

data = [
    [1, polygon]
]

gdf = sedona.createDataFrame(
    data,
    schema
)

gdf.show(1, False)

+---+--------------------------------------------------------------------------------------------------------+
|id |geom                                                                                                    |
+---+--------------------------------------------------------------------------------------------------------+
|1  |POLYGON ((19.51121 51.76426, 19.51056 51.76583, 19.51216 51.76599, 19.5128 51.76448, 19.51121 51.76426))|
+---+--------------------------------------------------------------------------------------------------------+

from shapely.geometry import MultiPolygon

exterior_p1 = [(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)]
interior_p1 = [(1, 1), (1, 1.5), (1.5, 1.5), (1.5, 1), (1, 1)]

exterior_p2 = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

polygons = [
    Polygon(exterior_p1, [interior_p1]),
    Polygon(exterior_p2)
]

data = [
    [1, MultiPolygon(polygons)]
]

gdf = sedona.createDataFrame(
    data,
    schema
)

gdf.show(1, False)

+---+----------------------------------------------------------------------------------------------------------+
|id |geom                                                                                                      |
+---+----------------------------------------------------------------------------------------------------------+
|1  |MULTIPOLYGON (((0 0, 0 2, 2 2, 2 0, 0 0), (1 1, 1.5 1, 1.5 1.5, 1 1.5, 1 1)), ((0 0, 0 1, 1 1, 1 0, 0 0)))|
+---+----------------------------------------------------------------------------------------------------------+

Confirm the DataFrame structure¶

gdf.printSchema()

root
 |-- id: integer (nullable = false)
 |-- geom: geometry (nullable = false)

Load data from Snowflake¶

Tables created in Snowflake can be loaded on to Sedona Dataframes using Sedona. Once the data is loaded, it can be processed using the extensive catalog and efficient computation of WherobotsDB.

In order to enable bi-directional communication between Spark and Snowflake, a map of configuration parameters must be passed as options to the SedonaContext object.

The configuration parameters include connection and context options. Details on the possible values of these options can be found here.

Load entire table from Snowflake¶

The dbtable option can be used to load the contents of the entire table from Snowflake to a Sedona Dataframe, effectively running a SELECT * FROM table query.

PythonScalaJava

# snowflake_url is https://<accountIdentifier>.snowflakecomputing.com
sfOptions = {"sfUrl": snowflake_url, "sfUser": username, "sfPassword" : password, "sfDatabase": database, "sfSchema": schema}

# source table name in Snowflake
src_table_name = "<SOURCE_TABLE_NAME>"

df = sedona.read.format(SNOWFLAKE_SOURCE_NAME)\
    .options(**sfOptions)\
    .option("dbtable", src_table_name)\
    .load()
df.show()
df.printSchema()

// snowflakeUrl is https://<accountIdentifier>.snowflakecomputing.com
val sfOptions = Map("sfUrl" -> snowflakeUrl, "sfUser" -> username, "sfPassword" -> password, "sfDatabase" -> database, "sfSchema" -> schema)

// source table name in Snowflake
val src_table_name = "<SOURCE_TABLE_NAME>"

val df = sedona.read.format(SNOWFLAKE_SOURCE_NAME)
                    .options(sfOptions)
                    .option("dbtable", src_table_name)
                    .load()
df.show()
df.printSchema()

import java.util.HashMap;
HashMap<String, String> sfOptions = new HashMap<>();
sfOptions.put("sfUrl", snowflakeUrl); // snowflakeUrl is https://<accountIdentifier>.snowflakecomputing.com
sfOptions.put("sfUser", username);
sfOptions.put("sfPassword", password);
sfOptions.put("sfDatabase", database);
sfOptions.put("sfSchema", schema);

String src_table_name = "<SOURCE_TABLE_NAME>";
Dataset<Row> df = sedona.read.format(SNOWFLAKE_SOURCE_NAME)
                    .options(sfOptions)
                    .option("dbtable", src_table_name)
                    .load();
df.show();
df.printSchema();

Load query results from Snowflake¶

The query option can be used to load the results of a query. Snowflake employs query and predicate pushdown which is enabled by default.

If you wish to disable pushdown, the autopushdown option can be used with value as off.

PythonScalaJava

query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'" #custom query to run
df_query = sedona.read.format(SNOWFLAKE_SOURCE_NAME)\
            .options(**sfOptions)\
            .option("query", query)\
            .load()
df.show()
df.printSchema()

val query = s"SELECT GEOM, CITY_NAME FROM $src_table_name WHERE CITY_NAME = 'Seattle'"
val df_query = sedona.read.format(SNOWFLAKE_SOURCE_NAME)
            .options(sfOptions)
            .option("query", query)
            .load()
df.show()
df.printSchema()

import java.util.HashMap;
String query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'"
Dataset<Row> df_query = sedona.read.format(SNOWFLAKE_SOURCE_NAME)
            .options(sfOptions)
            .option("query", query)
            .load();
df.show();
df.printSchema();

Load data from AWS RDS PostGIS¶

Tables created in a AWS RDS PostGIS (PostgreSQL) instance can be imported on to Sedona Dataframes using Sedona. Once the data is loaded, it can be processed using the extensive catalog and efficient computation of WherobotsDB.

Data can be loaded on to a Sedona Dataframe using the load() function. A map of configuration and context options must be passed to establish connection with the RDS instance.

If you're unable to establish connection with the RDS instance, double check if the instance is accessible by the server running this code. For more information on intra or inter VPC connection with the RDS instance, consult here.

Load entire table from RDS¶

The dbtable option can be used to load the contents of the entire table from RDS to a Sedona Dataframe, effectively running a SELECT * FROM table query.

PythonScalaJava

url = '<URL>' #jdbc:postgresql://ENDPOINT/DATABASE_NAME
driver = 'org.postgresql.Driver'
user = '<USERNAME>'
password = '<PASSWORD>'
options = {"url": url, "driver": driver, "user": user, "password": password}
src_table_name = 'city_tbl_geom'
df = sedona.read.format("jdbc")
    .options(**options)
    .option('dbtable', src_table_name)
    .load()

val url = "<URL>"//jdbc:postgresql://ENDPOINT/DATABASE_NAME
val driver = "org.postgresql.Driver"
val user = "<USERNAME>"
val password = "<PASSWORD>"
val options = Map("url" -> url, "user" -> user, "password" -> password, "driver" -> driver)
val src_table_name = "city_tbl_geom"
val df = sedona.read.format("jdbc")
                .options(options)
                .option("dbtable", src_table_name)
                .load()

import java.util.HashMap;
HashMap<String, String> options = new HashMap<>();
options.put("url", url); // url is jdbc:postgresql://ENDPOINT/DATABASE_NAME
options.put("user", username);
options.put("password", password);
options.put("driver", "org.postgresql.Driver");
String src_table_name = "<SOURCE_TABLE_NAME>";
Dataset<Row> df = sedona.read.format("jdbc")
                            .options(options)
                            .option("dbtable", src_table_name)
                            .load();
df.show();
df.printSchema();

Load query results from RDS¶

The query option can be used to load the results of a query.

PythonScalaJava

url = '<URL>' # jdbc:postgresql://ENDPOINT/DATABASE_NAME
driver = 'org.postgresql.Driver'
user = '<USERNAME>'
password = '<PASSWORD>'
options = {"url": url, "driver": driver, "user": user, "password": password}
src_table_name = 'city_tbl_geom'
query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'" #custom query to run
df = sedona.read.format("jdbc")\
    .options(**options)\
    .option('query', query)\
    .load()

val url = "<URL>" // jdbc:postgresql://ENDPOINT/DATABASE_NAME
val driver = "org.postgresql.Driver"
val user = "<USERNAME>"
val password = "<PASSWORD>"
val options = Map("url" -> url, "user" -> user, "password" -> password, "driver" -> driver)
val src_table_name = "city_tbl_geom"
val query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'" //Custom query to run
val df = sedona.read.format("jdbc")
                .options(options)
                .option("query", query)
                .load()

import java.util.HashMap;
HashMap<String, String> options = new HashMap<>();
options.put("url", url); // jdbc:postgresql://ENDPOINT/DATABASE_NAME
options.put("user", username);
options.put("password", password);
options.put("driver", "org.postgresql.Driver");
String src_table_name = "<SOURCE_TABLE_NAME>";
String query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'"; //Custom query to run
Dataset<Row> df = sedona.read.format("jdbc")
                            .options(options)
                            .option("query", query)
                            .load();
df.show();
df.printSchema();