> ## Documentation Index
> Fetch the complete documentation index at: https://docs.wherobots.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Load Data From External Storage

Wherobots loads data from files, databases, and in-memory Python objects like Pandas DataFrames.

## Data naming conventions

To ensure data loads correctly, please adhere to the following rules.

1. **File naming:** Files and directories must not start with an underscore (`_`) or a period (`.`). These files will be ignored and result in an `AnalysisException: [UNABLE_TO_INFER_SCHEMA]` error.
   * **Incorrect:** `_my_data.parquet`
   * **Correct:** `my_data.parquet`

2. **S3 storage path:** When loading from S3, the bucket name cannot contain periods (`.`).
   * **Incorrect:** `s3://wherobots.sf.data/`
   * **Correct:** `s3://wherobots-sf-data/`

<Tip>
  **Formatting S3 filename paths for Wherobots**

  If you're loading a file from S3, ensure that it complies with the following:
</Tip>

* **Incorrect Path:**
  `s3://wherobots.sf.data/2025/_august_parcels.parquet`
  This path will fail because the bucket name (`wherobots.sf.data`) contains periods
  and the filename (`_august_parcels.parquet`) starts with an underscore.

* **Correct Path:**
  `s3://wherobots-sf-data/2025/august_parcels.parquet`
  This path is valid because the bucket name uses dashes instead of periods and the
  leading underscore is removed from the filename.

## Load data from text files

Assume we have a WKT file, namely `usa-county.tsv`, at Path `/Download/usa-county.tsv` as follows:

```
POLYGON (..., ...)	Cuming County
POLYGON (..., ...)	Wahkiakum County
POLYGON (..., ...)	De Baca County
POLYGON (..., ...)	Lancaster County
```

The file may have many other columns.

### Load a raw DataFrame

Use the following code to load the data and create a raw DataFrame:

<Tabs>
  <Tab title="Scala">
    ```scala theme={"system"}
    var rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
    rawDf.createOrReplaceTempView("rawdf")
    rawDf.show()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
    rawDf.createOrReplaceTempView("rawdf")
    rawDf.show()
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load("/Download/usa-county.tsv")
    rawDf.createOrReplaceTempView("rawdf")
    rawDf.show()
    ```
  </Tab>
</Tabs>

The output will be like this:

```
|                 _c0|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|null| null|null|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|null| null|null|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|null| null|null|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109|  Lancaster|    Lancaster County| 06| H1|G4020| 339|30700|null|   A|2169240202|22877180|+40.7835474|-096.6886584|
```

### Create a Geometry type column

All geometrical operations in Spatial SQL are on Geometry type objects. Therefore, before any kind of queries, you need to create a Geometry type column on a DataFrame.

```sql theme={"system"}
SELECT ST_GeomFromWKT(_c0) AS countyshape, _c1, _c2
```

You can select many other attributes to compose this `spatialdDf`. The output will be something like this:

```
|                 countyshape|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|null| null|null|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|null| null|null|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|null| null|null|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109|  Lancaster|    Lancaster County| 06| H1|G4020| 339|30700|null|   A|2169240202|22877180|+40.7835474|-096.6886584|
```

Although it looks same with the input, but actually the type of column countyshape has been changed to ==Geometry== type.

To verify this, use the following code to print the schema of the DataFrame:

```scala theme={"system"}
spatialDf.printSchema()
```

The output will be like this:

```
root
 |-- countyshape: geometry (nullable = false)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
```

<Note>
  Spatial SQL provides lots of functions to create a Geometry column, please read [Spatial SQL constructor API](/reference/wherobots-db/geometry-data/geometry-functions).
</Note>

## Load GeoJSON Data

WherobotsDB supports reading GeoJSON files using the `geojson` data source. It is designed to handle JSON files that use [GeoJSON format](https://datatracker.ietf.org/doc/html/rfc7946) for their geometries.

This includes SpatioTemporal Asset Catalog (STAC) files, GeoJSON features, GeoJSON feature collections and other variations.
The key functionality lies in the way 'geometry' fields are processed: these are specifically read as Sedona's `GeometryUDT` type, ensuring integration with Sedona's suite of spatial functions.

### Load MultiLine GeoJSON FeatureCollection

Suppose we have a GeoJSON FeatureCollection file as follows.
This entire file is considered as a single GeoJSON FeatureCollection object.
Multiline format is preferable for scenarios where files need to be human-readable or manually edited.

```json theme={"system"}
{ "type": "FeatureCollection",
    "features": [
      { "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
        "properties": {"prop0": "value0"}
        },
      { "type": "Feature",
        "geometry": {
          "type": "LineString",
          "coordinates": [
            [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
            ]
          },
        "properties": {
          "prop0": "value1",
          "prop1": 0.0
          }
        },
      { "type": "Feature",
         "geometry": {
           "type": "Polygon",
           "coordinates": [
             [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
               [100.0, 1.0], [100.0, 0.0] ]
             ]
         },
         "properties": {
           "prop0": "value2",
           "prop1": {"this": "that"}
           }
         }
       ]
}
```

Set the `multiLine` option to `True` to read multiline GeoJSON files.

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("geojson").option("multiLine", "true").load("PATH/TO/MYFILE.json")
     .selectExpr("explode(features) as features") # Explode the envelope to get one feature per row.
     .select("features.*") # Unpack the features struct.
     .withColumn("prop0", f.expr("properties['prop0']")).drop("properties").drop("type")

    df.show()
    df.printSchema()
    ```
  </Tab>

  <Tab title="Scala">
    ```scala theme={"system"}
    val df = sedona.read.format("geojson").option("multiLine", "true").load("PATH/TO/MYFILE.json")
    val parsedDf = df.selectExpr("explode(features) as features").select("features.*")
    				.withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

    parsedDf.show()
    parsedDf.printSchema()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read.format("geojson").option("multiLine", "true").load("PATH/TO/MYFILE.json")
     .selectExpr("explode(features) as features") // Explode the envelope to get one feature per row.
     .select("features.*") // Unpack the features struct.
     .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

    df.show();
    df.printSchema();
    ```
  </Tab>
</Tabs>

The output is as follows:

```
+--------------------+------+
|            geometry| prop0|
+--------------------+------+
|     POINT (102 0.5)|value0|
|LINESTRING (102 0...|value1|
|POLYGON ((100 0, ...|value2|
+--------------------+------+

root
 |-- geometry: geometry (nullable = false)
 |-- prop0: string (nullable = true)

```

### Load Single Line GeoJSON Features

Suppose we have a single-line GeoJSON Features dataset as follows. Each line is a single GeoJSON Feature.
This format is efficient for processing large datasets where each line is a separate, self-contained GeoJSON object.

```json theme={"system"}
{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}
```

By default, when `option` is not specified, WherobotsDB reads a GeoJSON file as a single line GeoJSON.

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
        .withColumn("prop0", f.expr("properties['prop0']")).drop("properties").drop("type")

    df.show()
    df.printSchema()
    ```
  </Tab>

  <Tab title="Scala">
    ```scala theme={"system"}
    val df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
        .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

    df.show()
    df.printSchema()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
        .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type")

    df.show();
    df.printSchema();
    ```
  </Tab>
</Tabs>

The output is as follows:

```
+--------------------+------+
|            geometry| prop0|
+--------------------+------+
|     POINT (102 0.5)|value0|
|LINESTRING (102 0...|value1|
|POLYGON ((100 0, ...|value2|
+--------------------+------+

root
 |-- geometry: geometry (nullable = false)
 |-- prop0: string (nullable = true)
```

## Load Shapefile

WherobotsDB supports loading Shapefile as a DataFrame.

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("shapefile").load("/path/to/shapefile")
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read().format("shapefile").load("/path/to/shapefile")
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("shapefile").load("/path/to/shapefile")
    ```
  </Tab>
</Tabs>

The input path can be a directory containing one or multiple shapefiles, or path to a `.shp` file.

* When the input path is a directory, all shapefiles directly under the directory will be loaded. If you want to load all shapefiles in subdirectories, please specify `.option("recursiveFileLookup", "true")`.
* When the input path is a `.shp` file, that shapefile will be loaded. Sedona will look for sibling files (`.dbf`, `.shx`, etc.) with the same main file name and load them automatically.

The name of the geometry column is `geometry` by default. You can change the name of the geometry column using the `geometry.name` option. If one of the non-spatial attributes is named "geometry", `geometry.name` must be configured to avoid conflict.

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read().format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("shapefile").option("geometry.name", "geom").load("/path/to/shapefile")
    ```
  </Tab>
</Tabs>

Each record in shapefile has a unique record number, that record number is not loaded by default. If you want to include record number in the loaded DataFrame, you can set the `key.name` option to the name of the record number column:

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("shapefile").option("key.name", "FID").load("/path/to/shapefile")
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read().format("shapefile").option("key.name", "FID").load("/path/to/shapefile")
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("shapefile").option("key.name", "FID").load("/path/to/shapefile")
    ```
  </Tab>
</Tabs>

The character encoding of string attributes are inferred from the `.cpg` file. If you see garbled values in string fields, you can manually specify the correct charset using the `charset` option. For example:

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read().format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("shapefile").option("charset", "UTF-8").load("/path/to/shapefile")
    ```
  </Tab>
</Tabs>

## Load GeoParquet

WherobotsDB natively supports loading GeoParquet file. Wherobots will infer geometry fields using the "geo" metadata in GeoParquet files.

<Tabs>
  <Tab title="Scala">
    ```scala theme={"system"}
    val df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
    df.printSchema()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
    df.printSchema()
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("geoparquet").load(geoparquetdatalocation1)
    df.printSchema()
    ```
  </Tab>
</Tabs>

The output will be as follows:

```
root
 |-- pop_est: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- name: string (nullable = true)
 |-- iso_a3: string (nullable = true)
 |-- gdp_md_est: double (nullable = true)
 |-- geometry: geometry (nullable = true)
```

WherobotsDB supports spatial predicate push-down for GeoParquet files, please refer to the [Spatial SQL query optimizer](/reference/wherobots-db/geometry-data/optimizer/) documentation for details.

GeoParquet file reader can also be used to read legacy Parquet files written by Apache Sedona 1.3.1-incubating or earlier.
Please refer to [Reading Legacy Parquet Files](/reference/wherobots-db/geometry-data/reading-legacy-parquet/) for details.

### Inspect GeoParquet metadata

WherobotsDB provides a Spark SQL data source `"geoparquet.metadata"` for inspecting GeoParquet metadata. The resulting dataframe contains
the "geo" metadata for each input file.

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1)
    df.printSchema()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1)
    df.printSchema()
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("geoparquet.metadata").load(geoparquetdatalocation1)
    df.printSchema()
    ```
  </Tab>
</Tabs>

The output will be as follows:

```
root
 |-- path: string (nullable = true)
 |-- version: string (nullable = true)
 |-- primary_column: string (nullable = true)
 |-- columns: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- encoding: string (nullable = true)
 |    |    |-- geometry_types: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- bbox: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
 |    |    |-- crs: string (nullable = true)
```

If the input Parquet file does not have GeoParquet metadata, the values of `version`, `primary_column` and `columns` fields of the resulting dataframe will be `null`.

<Note>
  `geoparquet.metadata` only supports reading GeoParquet specific metadata. Users can use [G-Research/spark-extension](https://github.com/G-Research/spark-extension/blob/13109b8e43dfba9272c85896ba5e30cfe280426f/PARQUET.md) to read comprehensive metadata of generic Parquet files.
</Note>

## Load from GeoPackage

WherobotsDB supports loading Geopackage file format as a DataFrame.

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("geopackage").option("tableName", "tab").load("/path/to/geopackage")
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read().format("geopackage").option("tableName", "tab").load("/path/to/geopackage")
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("geopackage").option("tableName", "tab").load("/path/to/geopackage")
    ```
  </Tab>
</Tabs>

Geopackage files can contain vector data and raster data. To show the possible options from a file you can look into the metadata table by adding parameter showMetadata and set its value as true.

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read().format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")
    ```
  </Tab>
</Tabs>

Then you can see the metadata of the geopackage file like below.

```
+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+
|          table_name|data_type|          identifier|description|         last_change|     min_x|            min_y|     max_x|     max_y|srs_id|
+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+
|gis_osm_water_a_f...| features|gis_osm_water_a_f...|           |2024-09-30 23:07:...|-9.0257084|57.96814069999999|33.4866675|80.4291867|  4326|
+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+
```

You can also load data from raster tables in the geopackage file. To load raster data, you can use the following code.

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read().format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")
    ```
  </Tab>
</Tabs>

```
+---+----------+-----------+--------+--------------------+
| id|zoom_level|tile_column|tile_row|           tile_data|
+---+----------+-----------+--------+--------------------+
|  1|        11|        428|     778|GridCoverage2D["c...|
|  2|        11|        429|     778|GridCoverage2D["c...|
|  3|        11|        428|     779|GridCoverage2D["c...|
|  4|        11|        429|     779|GridCoverage2D["c...|
|  5|        11|        427|     777|GridCoverage2D["c...|
+---+----------+-----------+--------+--------------------+
```

Known limitations:

* webp rasters are not supported
* ewkb geometries are not supported
* filtering based on geometries envelopes are not supported

## Load from OSM PBF

WherobotsDB supports loading OSM PBF file format as a DataFrame.

<Tabs>
  <Tab title="Scala/Java">
    ```scala theme={"system"}
    val df = sedona.read.format("osmpbf").load("/path/to/osmpbf")
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    Dataset<Row> df = sedona.read().format("osmpbf").load("/path/to/osmpbf")
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    df = sedona.read.format("osmpbf").load("/path/to/osmpbf")
    ```
  </Tab>
</Tabs>

OSM PBF files can contain nodes, ways, and relations. Currently WherobotsDB support
Nodes, DenseNodes, Ways and Relations. When you load the data you get a DataFrame with the following schema.

```
root
 |- id: long (nullable = true)
 |-- kind: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)
 |-- tags: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- refs: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- ref_roles: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ref_types: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

Where:

* `id` is the unique identifier of the object.
* `kind` is the type of the object, it can be `node`, `way` or `relation`.
* `location` is the location of the object, it contains the `longitude` and `latitude` of the object.
* `tags` is a map of key-value pairs that represent the tags of the object.
* `refs` is an array of the references of the object.
* `ref_roles` is an array of the roles of the references.
* `ref_types` is an array of the types of the references.

The dataframe for ways might look like this for nodes:

```
+---------+----+--------------------+--------------------+----+---------+---------+
|       id|kind|            location|                tags|refs|ref_roles|ref_types|
+---------+----+--------------------+--------------------+----+---------+---------+
|248675410|node|{21.0884952545166...|{tactile_paving -...|NULL|     NULL|     NULL|
|260821820|node|{21.0191555023193...|{created_by -> JOSM}|NULL|     NULL|     NULL|
|349189665|node|{22.1437530517578...|{source -> http:/...|NULL|     NULL|     NULL|
|353366899|node|{22.9787712097167...|{source -> http:/...|NULL|     NULL|     NULL|
|359460224|node|{22.4816703796386...|{source -> http:/...|NULL|     NULL|     NULL|
+---------+----+--------------------+--------------------+----+---------+---------+
only showing top 5 rows
```

and for way

```
+-------+----+--------+--------------------+--------------------+---------+---------+
|     id|kind|location|                tags|                refs|ref_roles|ref_types|
+-------+----+--------+--------------------+--------------------+---------+---------+
|4307329| way|    NULL|{junction -> roun...|[2448759046, 7093...|     NULL|     NULL|
|4307330| way|    NULL|{surface -> aspha...|[26063923, 260639...|     NULL|     NULL|
|4308966| way|    NULL|{sidewalk -> sepa...|[3387797238, 9252...|     NULL|     NULL|
|4308968| way|    NULL|{surface -> pavin...|[26083890, 744724...|     NULL|     NULL|
|4308969| way|    NULL|{cycleway:both ->...|[9526831176, 1218...|     NULL|     NULL|
+-------+----+--------+--------------------+--------------------+---------+---------+
```

and for relation

```
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
|   id|    kind|location|                tags|                refs|           ref_roles|           ref_types|
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
|28124|relation|    NULL|{official_name ->...|[26382394, 26259985]|      [inner, outer]|          [WAY, WAY]|
|28488|relation|    NULL|  {type -> junction}|[26409253, 303249...|[roundabout, roun...|[WAY, WAY, WAY, WAY]|
|32939|relation|    NULL|{ref -> E 67, rou...|[140673970, 14067...|        [, , , , , ]|[WAY, WAY, RELATI...|
|34387|relation|    NULL|{note -> rząd III...|[209161000, 52154...|[main_stream, mai...|[WAY, WAY, WAY, W...|
|34392|relation|    NULL|{distance -> 1047...|[150033976, 25076...|[main_stream, mai...|[WAY, WAY, WAY, W...|
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
```

## Load data from JDBC data sources (PostGIS)

The 'query' option in Spark SQL's JDBC data source can be used to convert geometry columns to a format that WherobotsDB can interpret.
This should work for most spatial JDBC data sources.
For Postgis there is no need to add a query to convert geometry types since it's already using EWKB as it's wire format.

<Tabs>
  <Tab title="Scala">
    ```scala theme={"system"}
    // For any JDBC data source, including Postgis.
    val df = sedona.read.format("jdbc")
    	// Other options.
    	.option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    	.load()
    	.withColumn("geom", expr("ST_GeomFromWKB(geom)"))

    // This is a simplified version that works for Postgis.
    val df = sedona.read.format("jdbc")
    	// Other options.
    	.option("dbtable", "my_table")
    	.load()
    	.withColumn("geom", expr("ST_GeomFromWKB(geom)"))
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    // For any JDBC data source, including Postgis.
    Dataset<Row> df = sedona.read().format("jdbc")
    	// Other options.
    	.option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    	.load()
    	.withColumn("geom", expr("ST_GeomFromWKB(geom)"))

    // This is a simplified version that works for Postgis.
    Dataset<Row> df = sedona.read().format("jdbc")
    	// Other options.
    	.option("dbtable", "my_table")
    	.load()
    	.withColumn("geom", expr("ST_GeomFromWKB(geom)"))
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    # For any JDBC data source, including Postgis.
    df = (sedona.read.format("jdbc")
    	# Other options.
    	.option("query", "SELECT id, ST_AsBinary(geom) as geom FROM my_table")
    	.load()
    	.withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))

    # This is a simplified version that works for Postgis.
    df = (sedona.read.format("jdbc")
    	# Other options.
    	.option("dbtable", "my_table")
    	.load()
    	.withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))
    ```
  </Tab>
</Tabs>

## Load data from GeoPandas file reader

WherobotsDB Python has implemented serializers and deserializers which allows to convert WherobotsDB Geometry objects into Shapely BaseGeometry objects. Based on that it is possible to load the data with geopandas from file and create Sedona DataFrame based on GeoDataFrame object.

```python theme={"system"}
import geopandas as gpd

gdf = gpd.read_file("<some path>.shp")
df = sedona.createDataFrame(gdf)
```

To leverage Arrow optimization and speed up the conversion, we recommend the `create_spatial_dataframe` method in Sedona.

```python theme={"system"}
from sedona.spark import *

gdf = gpd.read_file("<some path>.shp")
create_spatial_dataframe(sedona, gdf)
```

## Load data from Python Shapely

The following Shapely geometry types are supported:

* Point
* MultiPoint
* LineString
* MultiLinestring
* Polygon
* MultiPolygon

### Create a schema

To create Spark DataFrame from Shapely objects, you need to first create a schema. Please use <b> GeometryType </b> from  <b> sedona.sql.types </b> module.

```python theme={"system"}
from pyspark.sql.types import IntegerType, StructField, StructType
from sedona.spark import *

schema = StructType(
    [
        StructField("id", IntegerType(), False),
        StructField("geom", GeometryType(), False)
    ]
)
```

### Create Shapely geometries

Now you can create Shapely geometries and convert them to a Sedona DataFrame.

<Tabs>
  <Tab title="Point">
    ```python theme={"system"}
    from shapely.geometry import Point

    data = [
    	[1, Point(21.0, 52.0)],
    	[1, Point(23.0, 42.0)],
    	[1, Point(26.0, 32.0)]
    ]

    gdf = sedona.createDataFrame(
    	data,
    	schema
    )

    gdf.show()
    ```

    ```
    +---+-------------+
    | id|         geom|
    +---+-------------+
    |  1|POINT (21 52)|
    |  1|POINT (23 42)|
    |  1|POINT (26 32)|
    +---+-------------+
    ```
  </Tab>

  <Tab title="MultiPoint">
    ```python theme={"system"}
    from shapely.geometry import MultiPoint

    data = [
    	[1, MultiPoint([[19.511463, 51.765158], [19.446408, 51.779752]])]
    ]

    gdf = sedona.createDataFrame(
    	data,
    	schema
    ).show(1, False)
    ```

    ```
    +---+---------------------------------------------------------+
    |id |geom                                                     |
    +---+---------------------------------------------------------+
    |1  |MULTIPOINT ((19.511463 51.765158), (19.446408 51.779752))|
    +---+---------------------------------------------------------+
    ```
  </Tab>

  <Tab title="LineString">
    ```python theme={"system"}
    from shapely.geometry import LineString

    line = [(40, 40), (30, 30), (40, 20), (30, 10)]

    data = [
    	[1, LineString(line)]
    ]

    gdf = sedona.createDataFrame(
    	data,
    	schema
    )

    gdf.show(1, False)
    ```

    ```
    +---+--------------------------------+
    |id |geom                            |
    +---+--------------------------------+
    |1  |LINESTRING (10 10, 20 20, 10 40)|
    +---+--------------------------------+
    ```
  </Tab>

  <Tab title="MultiLineString">
    ```python theme={"system"}
    from shapely.geometry import MultiLineString

    line1 = [(10, 10), (20, 20), (10, 40)]
    line2 = [(40, 40), (30, 30), (40, 20), (30, 10)]

    data = [
    	[1, MultiLineString([line1, line2])]
    ]

    gdf = sedona.createDataFrame(
    	data,
    	schema
    )

    gdf.show(1, False)
    ```

    ```
    +---+---------------------------------------------------------------------+
    |id |geom                                                                 |
    +---+---------------------------------------------------------------------+
    |1  |MULTILINESTRING ((10 10, 20 20, 10 40), (40 40, 30 30, 40 20, 30 10))|
    +---+---------------------------------------------------------------------+
    ```
  </Tab>

  <Tab title="Polygon">
    ```python theme={"system"}
    from shapely.geometry import Polygon

    polygon = Polygon(
    	[
    		 [19.51121, 51.76426],
    		 [19.51056, 51.76583],
    		 [19.51216, 51.76599],
    		 [19.51280, 51.76448],
    		 [19.51121, 51.76426]
    	]
    )

    data = [
    	[1, polygon]
    ]

    gdf = sedona.createDataFrame(
    	data,
    	schema
    )

    gdf.show(1, False)
    ```

    ```
    +---+--------------------------------------------------------------------------------------------------------+
    |id |geom                                                                                                    |
    +---+--------------------------------------------------------------------------------------------------------+
    |1  |POLYGON ((19.51121 51.76426, 19.51056 51.76583, 19.51216 51.76599, 19.5128 51.76448, 19.51121 51.76426))|
    +---+--------------------------------------------------------------------------------------------------------+
    ```
  </Tab>

  <Tab title="MultiPolygon">
    ```python theme={"system"}
    from shapely.geometry import MultiPolygon

    exterior_p1 = [(0, 0), (0, 2), (2, 2), (2, 0), (0, 0)]
    interior_p1 = [(1, 1), (1, 1.5), (1.5, 1.5), (1.5, 1), (1, 1)]

    exterior_p2 = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

    polygons = [
    	Polygon(exterior_p1, [interior_p1]),
    	Polygon(exterior_p2)
    ]

    data = [
    	[1, MultiPolygon(polygons)]
    ]

    gdf = sedona.createDataFrame(
    	data,
    	schema
    )

    gdf.show(1, False)
    ```

    ```
    +---+----------------------------------------------------------------------------------------------------------+
    |id |geom                                                                                                      |
    +---+----------------------------------------------------------------------------------------------------------+
    |1  |MULTIPOLYGON (((0 0, 0 2, 2 2, 2 0, 0 0), (1 1, 1.5 1, 1.5 1.5, 1 1.5, 1 1)), ((0 0, 0 1, 1 1, 1 0, 0 0)))|
    +---+----------------------------------------------------------------------------------------------------------+
    ```
  </Tab>
</Tabs>

### Confirm the DataFrame structure

```python theme={"system"}
gdf.printSchema()
```

```
root
 |-- id: integer (nullable = false)
 |-- geom: geometry (nullable = false)
```

## Load data from Snowflake

Tables created in Snowflake can be loaded on to Sedona Dataframes using Sedona. Once the data is loaded, it can be processed using
the extensive catalog and efficient computation of WherobotsDB.

In order to enable bi-directional communication between Spark and Snowflake, a map of configuration parameters must be passed as `options` to the SedonaContext object.

The configuration parameters include connection and context options. Details on the possible values of these options can be found [here](https://docs.snowflake.com/en/user-guide/spark-connector-use#label-spark-options).

### Load entire table from Snowflake

The `dbtable` option can be used to load the contents of the entire table from Snowflake to a Sedona Dataframe, effectively running a `SELECT * FROM table` query.

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    # snowflake_url is https://<accountIdentifier>.snowflakecomputing.com
    sfOptions = {"sfUrl": snowflake_url, "sfUser": username, "sfPassword" : password, "sfDatabase": database, "sfSchema": schema}

    # source table name in Snowflake
    src_table_name = "<SOURCE_TABLE_NAME>"

    df = sedona.read.format(SNOWFLAKE_SOURCE_NAME)\
    		.options(**sfOptions)\
    		.option("dbtable", src_table_name)\
    		.load()
    df.show()
    df.printSchema()
    ```
  </Tab>

  <Tab title="Scala">
    ```scala theme={"system"}
    // snowflakeUrl is https://<accountIdentifier>.snowflakecomputing.com
    val sfOptions = Map("sfUrl" -> snowflakeUrl, "sfUser" -> username, "sfPassword" -> password, "sfDatabase" -> database, "sfSchema" -> schema)

    // source table name in Snowflake
    val src_table_name = "<SOURCE_TABLE_NAME>"

    val df = sedona.read.format(SNOWFLAKE_SOURCE_NAME)
    										.options(sfOptions)
    										.option("dbtable", src_table_name)
    										.load()
    df.show()
    df.printSchema()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    import java.util.HashMap;
    HashMap<String, String> sfOptions = new HashMap<>();
    sfOptions.put("sfUrl", snowflakeUrl); // snowflakeUrl is https://<accountIdentifier>.snowflakecomputing.com
    sfOptions.put("sfUser", username);
    sfOptions.put("sfPassword", password);
    sfOptions.put("sfDatabase", database);
    sfOptions.put("sfSchema", schema);

    String src_table_name = "<SOURCE_TABLE_NAME>";
    Dataset<Row> df = sedona.read.format(SNOWFLAKE_SOURCE_NAME)
    										.options(sfOptions)
    										.option("dbtable", src_table_name)
    										.load();
    df.show();
    df.printSchema();
    ```
  </Tab>
</Tabs>

### Load query results from Snowflake

The `query` option can be used to load the results of a query. Snowflake employs query and predicate pushdown which is enabled by default.

If you wish to disable pushdown, the `autopushdown` option can be used with value as `off`.

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'" #custom query to run
    df_query = sedona.read.format(SNOWFLAKE_SOURCE_NAME)\
    						.options(**sfOptions)\
    						.option("query", query)\
    						.load()
    df.show()
    df.printSchema()
    ```
  </Tab>

  <Tab title="Scala">
    ```scala theme={"system"}
    val query = s"SELECT GEOM, CITY_NAME FROM $src_table_name WHERE CITY_NAME = 'Seattle'"
    val df_query = sedona.read.format(SNOWFLAKE_SOURCE_NAME)
    						.options(sfOptions)
    						.option("query", query)
    						.load()
    df.show()
    df.printSchema()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    import java.util.HashMap;
    String query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'"
    Dataset<Row> df_query = sedona.read.format(SNOWFLAKE_SOURCE_NAME)
    						.options(sfOptions)
    						.option("query", query)
    						.load();
    df.show();
    df.printSchema();
    ```
  </Tab>
</Tabs>

## Load data from AWS RDS PostGIS

Tables created in a AWS RDS PostGIS (PostgreSQL) instance can be imported on to Sedona Dataframes using Sedona. Once the data is loaded, it can be processed using
the extensive catalog and efficient computation of WherobotsDB.

Data can be loaded on to a Sedona Dataframe using the `load()` function. A map of configuration and context options must be passed to establish connection with the RDS instance.

If you're unable to establish connection with the RDS instance, double check if the instance is accessible by the server running this code.
For more information on intra or inter VPC connection with the RDS instance, consult [here](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.Scenarios.html).

### Load entire table from RDS

The `dbtable` option can be used to load the contents of the entire table from RDS to a Sedona Dataframe, effectively running a `SELECT * FROM table` query.

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    url = '<URL>' #jdbc:postgresql://ENDPOINT/DATABASE_NAME
    driver = 'org.postgresql.Driver'
    user = '<USERNAME>'
    password = '<PASSWORD>'
    options = {"url": url, "driver": driver, "user": user, "password": password}
    src_table_name = 'city_tbl_geom'
    df = sedona.read.format("jdbc")
    		.options(**options)
    		.option('dbtable', src_table_name)
    		.load()
    ```
  </Tab>

  <Tab title="Scala">
    ```scala theme={"system"}
    val url = "<URL>"//jdbc:postgresql://ENDPOINT/DATABASE_NAME
    val driver = "org.postgresql.Driver"
    val user = "<USERNAME>"
    val password = "<PASSWORD>"
    val options = Map("url" -> url, "user" -> user, "password" -> password, "driver" -> driver)
    val src_table_name = "city_tbl_geom"
    val df = sedona.read.format("jdbc")
    								.options(options)
    								.option("dbtable", src_table_name)
    								.load()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    import java.util.HashMap;
    HashMap<String, String> options = new HashMap<>();
    options.put("url", url); // url is jdbc:postgresql://ENDPOINT/DATABASE_NAME
    options.put("user", username);
    options.put("password", password);
    options.put("driver", "org.postgresql.Driver");
    String src_table_name = "<SOURCE_TABLE_NAME>";
    Dataset<Row> df = sedona.read.format("jdbc")
    														.options(options)
    														.option("dbtable", src_table_name)
    														.load();
    df.show();
    df.printSchema();
    ```
  </Tab>
</Tabs>

### Load query results from RDS

The `query` option can be used to load the results of a query.

<Tabs>
  <Tab title="Python">
    ```python theme={"system"}
    url = '<URL>' # jdbc:postgresql://ENDPOINT/DATABASE_NAME
    driver = 'org.postgresql.Driver'
    user = '<USERNAME>'
    password = '<PASSWORD>'
    options = {"url": url, "driver": driver, "user": user, "password": password}
    src_table_name = 'city_tbl_geom'
    query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'" #custom query to run
    df = sedona.read.format("jdbc")\
    		.options(**options)\
    		.option('query', query)\
    		.load()
    ```
  </Tab>

  <Tab title="Scala">
    ```scala theme={"system"}
    val url = "<URL>" // jdbc:postgresql://ENDPOINT/DATABASE_NAME
    val driver = "org.postgresql.Driver"
    val user = "<USERNAME>"
    val password = "<PASSWORD>"
    val options = Map("url" -> url, "user" -> user, "password" -> password, "driver" -> driver)
    val src_table_name = "city_tbl_geom"
    val query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'" //Custom query to run
    val df = sedona.read.format("jdbc")
    								.options(options)
    								.option("query", query)
    								.load()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    import java.util.HashMap;
    HashMap<String, String> options = new HashMap<>();
    options.put("url", url); // jdbc:postgresql://ENDPOINT/DATABASE_NAME
    options.put("user", username);
    options.put("password", password);
    options.put("driver", "org.postgresql.Driver");
    String src_table_name = "<SOURCE_TABLE_NAME>";
    String query = "SELECT GEOM, CITY_NAME FROM " + src_table_name + " WHERE CITY_NAME = 'Seattle'"; //Custom query to run
    Dataset<Row> df = sedona.read.format("jdbc")
    														.options(options)
    														.option("query", query)
    														.load();
    df.show();
    df.printSchema();
    ```
  </Tab>
</Tabs>
