> ## Documentation Index
> Fetch the complete documentation index at: https://docs.wherobots.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Run Geospatial Queries

## Transform the Coordinate Reference System

WherobotsDB doesn't control the coordinate unit (degree-based or meter-based) of all geometries in a Geometry column. The unit of all related distances in Spatial SQL is same as the unit of all geometries in a Geometry column.

For more details, please read the `ST_Transform` section in WherobotsDB API References.

To convert Coordinate Reference System of the Geometry column created before, use the following code:

```sql theme={"system"}
SELECT ST_Transform(countyshape, "epsg:4326", "epsg:3857") AS newcountyshape, _c1, _c2, _c3, _c4, _c5, _c6, _c7
FROM spatialdf
```

The first EPSG code EPSG:4326 in `ST_Transform` is the source CRS of the geometries. It is WGS84, the most common degree-based CRS.

The second EPSG code EPSG:3857 in `ST_Transform` is the target CRS of the geometries. It is the most common meter-based CRS.

This `ST_Transform` transform the CRS of these geometries from EPSG:4326 to EPSG:3857. The details CRS information can be found on [EPSG.io](https://epsg.io/)

The coordinates of polygons have been changed. The output will be like this:

```
+--------------------+---+---+--------+-----+-----------+--------------------+---+
|      newcountyshape|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|
+--------------------+---+---+--------+-----+-----------+--------------------+---+
|POLYGON ((-108001...| 31|039|00835841|31039|     Cuming|       Cuming County| 06|
|POLYGON ((-137408...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06|
|POLYGON ((-116403...| 35|011|00933054|35011|    De Baca|      De Baca County| 06|
|POLYGON ((-107880...| 31|109|00835876|31109|  Lancaster|    Lancaster County| 06|

```

## Range query

Use `ST_Contains`, `ST_Intersects`, `ST_Within` to run a range query over a single column.

The following example finds all counties that are within the given polygon:

```sql theme={"system"}
SELECT *
FROM spatialdf
WHERE ST_Contains (ST_PolygonFromEnvelope(1.0,100.0,1000.0,1100.0), newcountyshape)
```

<Note>
  Read [Spatial SQL API](/reference/wherobots-db/geometry-data/geometry-functions) to learn how to create a Geometry type query window
</Note>

## KNN query

Use `ST_Distance` to calculate the distance and rank the distance.

The following code returns the 5 nearest neighbor of the given polygon.

```sql theme={"system"}
SELECT countyname, ST_Distance(ST_PolygonFromEnvelope(1.0,100.0,1000.0,1100.0), newcountyshape) AS distance
FROM spatialdf
ORDER BY distance DESC
LIMIT 5
```

## Range join

<Tip>
  WherobotsDB join performance is heavily affected by the number of partitions. If the join performance is not ideal, please increase the number of partitions by doing `df.repartition(XXX)` right after you create the original DataFrame.
</Tip>

A range join query is to find geometries from A and geometries from B such that each geometry pair satisfies a certain predicate. Most predicates supported by Spatial SQL can trigger a range join.

SQL example:

```sql theme={"system"}
SELECT *
FROM polygondf, pointdf
WHERE ST_Contains(polygondf.polygonshape,pointdf.pointshape)
```

```sql theme={"system"}
SELECT *
FROM polygondf, pointdf
WHERE ST_Intersects(polygondf.polygonshape,pointdf.pointshape)
```

```sql theme={"system"}
SELECT *
FROM pointdf, polygondf
WHERE ST_Within(pointdf.pointshape, polygondf.polygonshape)
```

SQL Physical plan:

```
== Physical Plan ==
RangeJoin polygonshape#20: geometry, pointshape#43: geometry, false
:- Project [st_polygonfromenvelope(cast(_c0#0 as decimal(24,20)), cast(_c1#1 as decimal(24,20)), cast(_c2#2 as decimal(24,20)), cast(_c3#3 as decimal(24,20)), mypolygonid) AS polygonshape#20]
:  +- *FileScan csv
+- Project [st_point(cast(_c0#31 as decimal(24,20)), cast(_c1#32 as decimal(24,20)), myPointId) AS pointshape#43]
   +- *FileScan csv

```

## Distance join

Introduction: Find geometries from A and geometries from B such that the distance of each geometry pair is less or equal than a certain distance. It supports the planar Euclidean distance calculators `ST_Distance`, `ST_HausdorffDistance`, `ST_FrechetDistance` and the meter-based geodesic distance calculators `ST_DistanceSpheroid` and `ST_DistanceSphere`.

Below are SQL examples for planar Euclidean distance predicates.

*Fully within a certain distance:*

```sql theme={"system"}
SELECT *
FROM pointdf1, pointdf2
WHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) < 2
```

```sql theme={"system"}
SELECT *
FROM pointDf, polygonDF
WHERE ST_HausdorffDistance(pointDf.pointshape, polygonDf.polygonshape, 0.3) < 2
```

```sql theme={"system"}
SELECT *
FROM pointDf, polygonDF
WHERE ST_FrechetDistance(pointDf.pointshape, polygonDf.polygonshape) < 2
```

*Intersects within a certain distance:*

```sql theme={"system"}
SELECT *
FROM pointdf1, pointdf2
WHERE ST_Distance(pointdf1.pointshape1,pointdf2.pointshape2) <= 2
```

```sql theme={"system"}
SELECT *
FROM pointDf, polygonDF
WHERE ST_HausdorffDistance(pointDf.pointshape, polygonDf.polygonshape) <= 2
```

```sql theme={"system"}
SELECT *
FROM pointDf, polygonDF
WHERE ST_FrechetDistance(pointDf.pointshape, polygonDf.polygonshape) <= 2
```

SQL Physical plan:

```
== Physical Plan ==
DistanceJoin pointshape1#12: geometry, pointshape2#33: geometry, 2.0, true
:- Project [st_point(cast(_c0#0 as decimal(24,20)), cast(_c1#1 as decimal(24,20)), myPointId) AS pointshape1#12]
:  +- *FileScan csv
+- Project [st_point(cast(_c0#21 as decimal(24,20)), cast(_c1#22 as decimal(24,20)), myPointId) AS pointshape2#33]
   +- *FileScan csv
```

<Warning>
  If you use planar euclidean distance functions like `ST_Distance`, `ST_HausdorffDistance` or `ST_FrechetDistance` as the predicate, WherobotsDB doesn't control the distance's unit (degree or meter). It is same with the geometry. If your coordinates are in the longitude and latitude system, the unit of `distance` should be degree instead of meter or mile. To change the geometry's unit, please either transform the coordinate reference system to a meter-based system. See [ST\_Transform](/reference/wherobots-db/geometry-data/spatial-reference/ST_Transform). If you don't want to transform your data, please consider using `ST_DistanceSpheroid` or `ST_DistanceSphere`.
</Warning>

SQL example for meter-based geodesic distance `ST_DistanceSpheroid` (works for `ST_DistanceSphere` too):

*Less than a certain distance:*

```sql theme={"system"}
SELECT *
FROM pointdf1, pointdf2
WHERE ST_DistanceSpheroid(pointdf1.pointshape1,pointdf2.pointshape2) < 2
```

*Less than or equal to a certain distance:*

```sql theme={"system"}
SELECT *
FROM pointdf1, pointdf2
WHERE ST_DistanceSpheroid(pointdf1.pointshape1,pointdf2.pointshape2) <= 2
```

<Warning>
  If you use `ST_DistanceSpheroid ` or `ST_DistanceSphere` as the predicate, the unit of the distance is meter. Currently, distance join with geodesic distance calculators work best for point data. For non-point data, it only considers their centroids. The distance join algorithm internally uses an approximate distance buffer which might lead to inaccurate results if your data is close to the poles or antimeridian.
</Warning>

## Broadcast index join

Introduction: Perform a range join or distance join but broadcast one of the sides of the join. This maintains the partitioning of the non-broadcast side and doesn't require a shuffle.

WherobotsDB will create a spatial index on the broadcasted table.

WherobotsDB uses broadcast join only if the correct side has a broadcast hint.
The supported join type - broadcast side combinations are:

* Inner - either side, preferring to broadcast left if both sides have the hint
* Left semi - broadcast right
* Left anti - broadcast right
* Left outer - broadcast right
* Right outer - broadcast left

```scala theme={"system"}
pointDf.alias("pointDf").join(broadcast(polygonDf).alias("polygonDf"), expr("ST_Contains(polygonDf.polygonshape, pointDf.pointshape)"))
```

SQL Physical plan:

```
== Physical Plan ==
BroadcastIndexJoin pointshape#52: geometry, BuildRight, BuildRight, false ST_Contains(polygonshape#30, pointshape#52)
:- Project [st_point(cast(_c0#48 as decimal(24,20)), cast(_c1#49 as decimal(24,20))) AS pointshape#52]
:  +- FileScan csv
+- SpatialIndex polygonshape#30: geometry, QUADTREE, [id=#62]
   +- Project [st_polygonfromenvelope(cast(_c0#22 as decimal(24,20)), cast(_c1#23 as decimal(24,20)), cast(_c2#24 as decimal(24,20)), cast(_c3#25 as decimal(24,20))) AS polygonshape#30]
      +- FileScan csv
```

This also works for distance joins with `ST_Distance`, `ST_DistanceSpheroid`, `ST_DistanceSphere`, `ST_HausdorffDistance` or `ST_FrechetDistance`:

```scala theme={"system"}
pointDf1.alias("pointDf1").join(broadcast(pointDf2).alias("pointDf2"), expr("ST_Distance(pointDf1.pointshape, pointDf2.pointshape) <= 2"))
```

SQL Physical plan:

```
== Physical Plan ==
BroadcastIndexJoin pointshape#52: geometry, BuildRight, BuildLeft, true, 2.0 ST_Distance(pointshape#52, pointshape#415) <= 2.0
:- Project [st_point(cast(_c0#48 as decimal(24,20)), cast(_c1#49 as decimal(24,20))) AS pointshape#52]
:  +- FileScan csv
+- SpatialIndex pointshape#415: geometry, QUADTREE, [id=#1068]
   +- Project [st_point(cast(_c0#48 as decimal(24,20)), cast(_c1#49 as decimal(24,20))) AS pointshape#415]
      +- FileScan csv
```

Note: If the distance is an expression, it is only evaluated on the first argument to ST\_Distance (`pointDf1` above).

## Automatic broadcast index join

When one table involved a spatial join query is smaller than a threshold, WherobotsDB will automatically choose broadcast index join instead of WherobotsDB optimized join. The current threshold is controlled by [sedona.join.autoBroadcastJoinThreshold](/reference/wherobots-db/parameters/) and set to the same as `spark.sql.autoBroadcastJoinThreshold`.

## K Nearest-Neighbor join / KNN join

Introduction: WherobotsDB supports nearest-neighbor searching on geospatial data through the K Nearest-Neighbor (KNN) join method. This join operation identifies the k-nearest neighbors of a point or region in one dataset relative to another dataset, based on geographic proximity. The distance metric used can be either Euclidean or great-circle distance, depending on the specific use case.

When either the queries or objects data contain non-point geometries, WherobotsDB calculates the centroid of each geometry to perform the KNN join. This function returns the k-nearest geometries from the object side table for each geometry in the queries side table.

SQL Example:

```sql theme={"system"}
SELECT
    QUERIES.ID AS QUERY_ID,
    QUERIES.GEOMETRY AS QUERIES_GEOM,
    OBJECTS.GEOMETRY AS OBJECTS_GEOM
FROM QUERIES
JOIN OBJECTS
ON ST_KNN(QUERIES.GEOMETRY, OBJECTS.GEOMETRY, 4, FALSE)
```

This query returns the 4 nearest neighbors from the OBJECTS table for each geometry in the QUERIES table based on Euclidean distance measured along sphere.

The first and second arguments are the queries and objects side tables, respectively. The third argument is the number of nearest neighbors to return (k), and the fourth argument is a boolean value that determines whether to use the spheroid distance or not.

Below are more SQL examples for KNN joins:

The knn join returns the 16 nearest neighbors from the OBJECTS table based on distance measured around spheroid.

```sql theme={"system"}
SELECT QUERIES.ID, OBJECTS.GEOMETRY FROM QUERIES JOIN OBJECTS ON ST_KNN(QUERIES.GEOMETRY, OBJECTS.GEOMETRY, 16, TRUE)
```

The knn join works with additional inequality condition.

```sql theme={"system"}
SELECT QUERIES.ID, OBJECTS.GEOMETRY FROM QUERIES JOIN OBJECTS ON ST_KNN(QUERIES.GEOMETRY, OBJECTS.GEOMETRY, 16, TRUE) AND QUERIES.ID < 10
```

The knn join works with additional equality condition.

```sql theme={"system"}
SELECT QUERIES.ID, OBJECTS.GEOMETRY FROM QUERIES JOIN OBJECTS ON ST_KNN(QUERIES.GEOMETRY, OBJECTS.GEOMETRY, 16, TRUE) AND QUERIES.ID = 10
```

The knn join works with query hint.

```sql theme={"system"}
SELECT /*+ BROADCAST(QUERIES) */ QUERIES.ID, OBJECTS.GEOMETRY FROM QUERIES JOIN OBJECTS ON ST_KNN(QUERIES.GEOMETRY, OBJECTS.GEOMETRY, 16, TRUE)

```

The ST\_AKNN function operates similarly to ST\_KNN but employs approximate algorithms to find the k-nearest neighbors. This is particularly useful for large datasets where an exact KNN search might be computationally expensive. While approximate algorithms may not always return the exact nearest neighbors, they provide a good approximation with significantly reduced computation time.

```sql theme={"system"}
SELECT
    QUERIES.ID AS QUERY_ID,
    QUERIES.GEOMETRY AS QUERIES_GEOM,
    OBJECTS.GEOMETRY AS OBJECTS_GEOM
FROM QUERIES
JOIN OBJECTS
ON ST_AKNN(QUERIES.GEOMETRY, OBJECTS.GEOMETRY, 4, FALSE)
```

This query returns the 4 nearest neighbors from the OBJECTS table for each geometry in the QUERIES table using the approximate KNN algorithms.

Performance Considerations:

Similar to range joins, KNN joins can be computationally expensive. To optimize performance, consider the following:

* Partitioning: Increase the number of partitions with df.repartition(n) to ensure better distribution of data across nodes.
* Broadcasting: For smaller datasets, WherobotsDB will use a broadcast join to avoid shuffling large amounts of data.

## Create a User-Defined Function (UDF)

User-Defined Functions (UDFs) are user-created procedures that can perform operations on a single row of information. To cover almost all use cases, we will showcase 4 types of UDFs for a better understanding of how to use geometry with UDFs. Sedona's serializer deserializes the SQL geometry type to [JTS Geometry](https://locationtech.github.io/jts/javadoc-1.18.0/org/locationtech/jts/geom/Geometry.html) (Scala/Java) or [Shapely Geometry](https://shapely.readthedocs.io/en/stable/geometry.html) (Python). You can implement any custom logic using the rich ecosystem around these two libraries.

### Geometry to primitive

This UDF example takes a geometry type input and returns a primitive type output:

<Tabs>
  <Tab title="Scala">
    ```scala theme={"system"}
    import org.locationtech.jts.geom.Geometry
    import org.apache.spark.sql.types._

    def lengthPoly(geom: Geometry): Double = {
    		geom.getLength
    }

    sedona.udf.register("udf_lengthPoly", lengthPoly _)

    df.selectExpr("udf_lengthPoly(geom)").show()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    // using lambda function to register the UDF
    sparkSession.udf().register(
    				"udf_lengthPoly",
    				(UDF1<Geometry, Double>) Geometry::getLength,
    				DataTypes.DoubleType);

    df.selectExpr("udf_lengthPoly(geom)").show()
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    from sedona.spark import *
    from pyspark.sql.types import DoubleType

    def lengthPoly(geom: GeometryType()):
    		return geom.length

    sedona.udf.register("udf_lengthPoly", lengthPoly, DoubleType())

    df.selectExpr("udf_lengthPoly(geom)").show()
    ```
  </Tab>
</Tabs>

Output:

```
+--------------------+
|udf_lengthPoly(geom)|
+--------------------+
|   3.414213562373095|
+--------------------+
```

### Geometry to Geometry

This UDF example takes a geometry type input and returns a geometry type output:

<Tabs>
  <Tab title="Scala">
    ```scala theme={"system"}
    import org.locationtech.jts.geom.Geometry
    import org.apache.spark.sql.types._

    def bufferFixed(geom: Geometry): Geometry = {
    	geom.buffer(5.5)
    }

    sedona.udf.register("udf_bufferFixed", bufferFixed _)

    df.selectExpr("udf_bufferFixed(geom)").show()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    // using lambda function to register the UDF
    sparkSession.udf().register(
    		"udf_bufferFixed",
    		(UDF1<Geometry, Geometry>) geom ->
    			geom.buffer(5.5),
    		new GeometryUDT());

    df.selectExpr("udf_bufferFixed(geom)").show()
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    from sedona.spark import *
    from pyspark.sql.types import DoubleType

    def bufferFixed(geom: GeometryType()):
    	return geom.buffer(5.5)

    sedona.udf.register("udf_bufferFixed", bufferFixed, GeometryType())

    df.selectExpr("udf_bufferFixed(geom)").show()
    ```
  </Tab>
</Tabs>

Output:

```
+--------------------------------------------------+
|                             udf_bufferFixed(geom)|
+--------------------------------------------------+
|POLYGON ((1 -4.5, -0.0729967710887076 -4.394319...|
+--------------------------------------------------+
```

### Geometry, primitive to geometry

This UDF example takes a geometry type input and a primitive type input and returns a geometry type output:

<Tabs>
  <Tab title="Scala">
    ```scala theme={"system"}
    import org.locationtech.jts.geom.Geometry
    import org.apache.spark.sql.types._

    def bufferIt(geom: Geometry, distance: Double): Geometry = {
    		geom.buffer(distance)
    }

    sedona.udf.register("udf_buffer", bufferIt _)

    df.selectExpr("udf_buffer(geom, distance)").show()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    import org.apache.spark.sql.api.java.UDF2;
    import org.apache.spark.sql.types.DataTypes;

    // using lambda function to register the UDF
    sparkSession.udf().register(
    				"udf_buffer",
    				(UDF2<Geometry, Double, Geometry>) Geometry::buffer,
    				new GeometryUDT());

    df.selectExpr("udf_buffer(geom, distance)").show()
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    from sedona.spark import *
    from pyspark.sql.types import DoubleType

    def bufferIt(geom: GeometryType(), distance: DoubleType()):
    		return geom.buffer(distance)

    sedona.udf.register("udf_buffer", bufferIt, GeometryType())

    df.selectExpr("udf_buffer(geom, distance)").show()
    ```
  </Tab>
</Tabs>

Output:

```
+--------------------------------------------------+
|                        udf_buffer(geom, distance)|
+--------------------------------------------------+
|POLYGON ((1 -9, -0.9509032201612866 -8.80785280...|
+--------------------------------------------------+
```

### Geometry, primitive to Geometry, primitive

This UDF example takes a geometry type input and a primitive type input and returns a geometry type and a primitive type output:

<Tabs>
  <Tab title="Scala">
    ```scala theme={"system"}
    import org.locationtech.jts.geom.Geometry
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.api.java.UDF2

    val schemaUDF = StructType(Array(
    	StructField("buffed", GeometryUDT),
    	StructField("length", DoubleType)
    ))

    val udf_bufferLength = udf(
    	new UDF2[Geometry, Double, (Geometry, Double)] {
    		def call(geom: Geometry, distance: Double): (Geometry, Double) = {
    			val buffed = geom.buffer(distance)
    			val length = geom.getLength
    			(buffed, length)
    		}
    	}, schemaUDF)

    sedona.udf.register("udf_bufferLength", udf_bufferLength)

    data.withColumn("bufferLength", expr("udf_bufferLengths(geom, distance)"))
    	.select("geom", "distance", "bufferLength.*")
    	.show()
    ```
  </Tab>

  <Tab title="Java">
    ```java theme={"system"}
    import org.apache.spark.sql.api.java.UDF2;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import scala.Tuple2;

    StructType schemaUDF = new StructType()
    			.add("buffedGeom", new GeometryUDT())
    			.add("length", DataTypes.DoubleType);

    // using lambda function to register the UDF
    sparkSession.udf().register("udf_bufferLength",
    			(UDF2<Geometry, Double, Tuple2<Geometry, Double>>) (geom, distance) -> {
    				Geometry buffed = geom.buffer(distance);
    				Double length = buffed.getLength();
    				return new Tuple2<>(buffed, length);
    			},
    			schemaUDF);

    df.withColumn("bufferLength", functions.expr("udf_bufferLength(geom, distance)"))
    			.select("geom", "distance", "bufferLength.*")
    			.show();
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={"system"}
    from sedona.spark import *
    from pyspark.sql.types import *

    schemaUDF = StructType([
    	StructField("buffed", GeometryType()),
    	StructField("length", DoubleType())
    ])

    def bufferAndLength(geom: GeometryType(), distance: DoubleType()):
    	buffed = geom.buffer(distance)
    	length = buffed.length
    	return [buffed, length]

    sedona.udf.register("udf_bufferLength", bufferAndLength, schemaUDF)

    df.withColumn("bufferLength", expr("udf_bufferLength(geom, buffer)")) \
    			.select("geom", "buffer", "bufferLength.*") \
    			.show()
    ```
  </Tab>
</Tabs>

Output:

```
+------------------------------+--------+--------------------------------------------------+-----------------+
|                          geom|distance|                                        buffedGeom|           length|
+------------------------------+--------+--------------------------------------------------+-----------------+
|POLYGON ((1 1, 1 2, 2 1, 1 1))|    10.0|POLYGON ((1 -9, -0.9509032201612866 -8.80785280...|66.14518337329191|
+------------------------------+--------+--------------------------------------------------+-----------------+
```

## Spatial vectorized UDFs (Python only)

By default, when you create a user defined function (UDF) in Python, that function isn't vectorized.

Processing User-Defined Functions (UDFs) one row at a time often leads to slower performance.
A faster alternative is to use vectorized UDFs, which leverage Apache Arrow to handle data in batches.

To create a vectorized UDF, use the `@sedona_vectorized_udf` decorator.
Currently, `@sedona_vectorized_udf` supports scalar UDFs.

Vectorized UDFs offer significantly improved performance, potentially running up to 2x faster than standard UDFs.

<Note>
  When you use geometry as an input type, please include the BaseGeometry type,
  like Point from shapely or geopandas GeoSeries, when you use GEO\_SERIES vectorized UDF.
  That's how Sedona infers the type and knows if the data should be cast.
</Note>

The decorator's signature is the following:

```python theme={"system"}
def sedona_vectorized_udf(udf_type: SedonaUDFType = SedonaUDFType.SHAPELY_SCALAR, return_type: DataType)
```

`udf_type` is the type of the UDF function. Supported UDF types are:

* `SHAPELY_SCALAR`
* `GEO_SERIES`

Let's examine the two examples below, which create buffers from a given geometry:

### Shapely scalar UDF

```python theme={"system"}
import shapely.geometry.base as b
from sedona.spark import *

@sedona_vectorized_udf(return_type=GeometryType())
def vectorized_buffer(geom: b.BaseGeometry) -> b.BaseGeometry:
    return geom.buffer(0.1)
```

### GeoSeries UDF

```python theme={"system"}
import geopandas as gpd
from sedona.spark import *


@sedona_vectorized_udf(udf_type=SedonaUDFType.GEO_SERIES, return_type=GeometryType())
def vectorized_geo_series_buffer(series: gpd.GeoSeries) -> gpd.GeoSeries:
    buffered = series.buffer(0.1)

    return buffered
```

To call the UDFs, you can use the following code:

```python theme={"system"}
# Shapely scalar UDF
df.withColumn("buffered", vectorized_buffer(df.geom)).show()

# GeoSeries UDF
df.withColumn("buffered", vectorized_geo_series_buffer(df.geom)).show()
```

## S2/H3 based approximate equi-join

When the performance of the WherobotsDB-optimized join doesn't suit your workflow needs, possibly because of intricate and overlapping geometries, WherobotsDB also provides Google S2 or Uber H3-based approximate equi-joins as alternatives.

The equi-joins leverage Spark's internal equi-join algorithm and might be more performant given that you can choose to skip the refinement step. However, skipping this step can sacrifice query accuracy.

Please use the following steps:

### 1. Generate cell ids for both tables

Use [ST\_S2CellIds](/reference/wherobots-db/geometry-data/spatial-indexing/ST_S2CellIDs) or [ST\_H3CellIds](/reference/wherobots-db/geometry-data/spatial-indexing/ST_H3CellIDs) to generate cell IDs. Each geometry may produce one or more IDs.

```sql theme={"system"}
SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM lefts
```

```sql theme={"system"}
SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
FROM rights
```

### 2. Perform equi-join

Join the two tables by their cellId

```sql theme={"system"}
SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId
```

### 3. Optional: Refine the result

Due to the nature of S2/H3 Cellid, the equi-join results might have a few false-positives depending on the S2/H3 level you choose. A smaller level indicates bigger cells, less exploded rows, but more false positives.

To ensure the correctness, you can use one of the [Spatial Predicates](/reference/wherobots-db/geometry-data/geometry-functions) to filter out them. Use this query instead of the query in Step 2.

```sql theme={"system"}
SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
FROM lcs, rcs
WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)
```

As you see, compared to the query in Step 2, we added one more filter, which is `ST_Contains`, to remove false positives. You can also use `ST_Intersects` and so on.

<Tip>
  You can skip this step if you don't need 100% accuracy and want faster query speed.
</Tip>

### 4. Optional: De-duplicate

Due to the explode function used when we generate S2/H3 Cell Ids, the resulting DataFrame may have several duplicate `<lcs_geom, rcs_geom>` matches. You can remove them by performing a GroupBy query.

```sql theme={"system"}
SELECT lcs_id, rcs_id , first(lcs_geom), first(lcs_name), first(rcs_geom), first(rcs_name)
FROM joinresult
GROUP BY (lcs_id, rcs_id)
```

The `first` function is to take the first value from a number of duplicate values.

If you don't have a unique id for each geometry, you can also group by geometry itself. See below:

```sql theme={"system"}
SELECT lcs_geom, rcs_geom, first(lcs_name), first(rcs_name)
FROM joinresult
GROUP BY (lcs_geom, rcs_geom)
```

<Note>
  If you are doing point-in-polygon join, this is not a problem and you can safely discard this issue. This issue only happens when you do polygon-polygon, polygon-linestring, linestring-linestring join.
</Note>

### S2/H3 for distance join

This also works for distance join. You first need to use `ST_Buffer(geometry, distance)` to wrap one of your original geometry column. If your original geometry column contains points, this `ST_Buffer` will make them become circles with a radius of `distance`.

For example. run this query first on the left table before Step 1.

```sql theme={"system"}
SELECT id, ST_Buffer(geom, DISTANCE), name
FROM lefts
```

Since the coordinates are in the longitude and latitude system, so the unit of `distance` should be degree instead of meter or mile. You will have to estimate the corresponding degrees based on your meter values. Please use [this calculator](https://lucidar.me/en/online-unit-converter-length-to-angle/convert-degrees-to-meters/#online-converter).

## Other queries

There are lots of other functions can be combined with these queries. Please read [Spatial SQL API](/reference/wherobots-db/geometry-data/geometry-functions).
