The following content is a read-only preview of an executable Jupyter notebook. To run this notebook interactively:
  1. Go to Wherobots Cloud.
  2. Start a runtime.
  3. Open the notebook.
  4. In the Jupyter Launcher:
    1. Click File > Open Path.
    2. Paste the following path to access this notebook: examples/Analyzing_Data/Getis_Ord_Gi*.ipynb
    3. Press Enter.
Getis and Ord’s Gi and Gi* statistics are popular statistical approaches for finding statistically significant hot and cold spots across space. They compare the value of a numerical variable for each spatial record with the values of its neighboring records, where the definition of the neighborhood is controlled by the user. In this example, we will use the Gi* statistic on the Overture places data to identify regions of higher and lower “density”. For this exercise we assume that more places data indicates higher density.
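
For reference, the Gi* statistic for a cell $i$, computed over $n$ cells with variable values $x_j$ and spatial weights $w_{ij}$, is commonly written as

$$
G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j \;-\; \bar{X}\sum_{j=1}^{n} w_{ij}}
{S\,\sqrt{\dfrac{n\sum_{j=1}^{n} w_{ij}^2 \;-\; \left(\sum_{j=1}^{n} w_{ij}\right)^2}{n-1}}},
\qquad
\bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j,
\qquad
S = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 - \bar{X}^2}.
$$

Because Gi* in this standardized form is itself a Z score, large positive values indicate hot spots (clusters of high values) and large negative values indicate cold spots.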

Configuration

Here we configure the size of the neighborhood, the region in which we want to generate statistics, and the resolution of the grid cells we will generate. In this notebook we perform the analysis for the region around Seattle and Bellevue, Washington. A good analysis should reveal the urban core of Seattle and the downtowns of Bellevue, Redmond, Kirkland, and Issaquah. With a larger cluster, we could set region to None and generate this data for the entire world.

For the neighbor radius, we want a value that gives each cell a substantial number of neighbors (ideally at least 10), but that does not let the density of downtown Seattle obscure a smaller downtown like Issaquah’s. It is a balance between generating an accurate, powerful statistic and keeping the results sufficiently local.

For the zoom level, we want cells that are small enough to resolve the phenomena we are searching for but large enough that each cell’s statistic is not driven by randomness in the spatial distribution of places. Imagine cells so small that they could be contained within a roadway: each cell might contain only 0 or 1 places, which would not reveal the trends we are looking for. Selecting these parameters can require some trial and error, and perhaps domain knowledge. We will show some of that selection process in this notebook.
region = "POLYGON ((-122.380829 47.870302, -122.048492 47.759637, -121.982574 47.531111, -122.408295 47.50978, -122.44812 47.668162, -122.380829 47.870302))"
neighbor_search_radius_degrees = .01
h3_zoom_level = 8
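
For intuition (not part of the original notebook), here is a minimal sketch of what the search radius in degrees works out to in meters near Seattle, assuming a simple spherical-Earth approximation:

import math

# Approximate metric size of the search radius near Seattle (latitude ~47.7 N).
# One degree of latitude is roughly 111.32 km; a degree of longitude shrinks by cos(latitude).
latitude = 47.7
radius_north_south_m = neighbor_search_radius_degrees * 111_320
radius_east_west_m = neighbor_search_radius_degrees * 111_320 * math.cos(math.radians(latitude))

print(f"~{radius_north_south_m:.0f} m north-south, ~{radius_east_west_m:.0f} m east-west")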

Spark Initialization

We will use Spark to run the Gi* algorithm. We initialize a Spark session with Sedona.
from sedona.spark import *

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

Filtering and Aggregation

In this notebook we assign an H3 cell to each record and filter down to only the region of interest. We then aggregate the places data by the cell identifier and count the number of places in each cell.
import pyspark.sql.functions as f
places_df = (
    sedona.table("wherobots_open_data.overture_maps_foundation.places_place")
        .select(f.col("geometry"), f.col("categories"))
        .withColumn("h3Cell", ST_H3CellIDs(f.col("geometry"), h3_zoom_level, False)[0])
)

if region is not None:
    places_df = places_df.filter(ST_Intersects(ST_GeomFromText(f.lit(region)), f.col("geometry"))).repartition(100)


hexes_df = (
    places_df
        .groupBy(f.col("h3Cell"))
        .agg(f.count("*").alias("num_places")) # how many places in this cell
        .withColumn("geometry", ST_H3ToGeom(f.array(f.col("h3Cell")))[0])
)

Sanity Check our Variable

We want to make sure we have a good distribution of values in the variable we will analyze. Specifically, we are checking that our cells are not too small, which would show up as uniformly low place counts. We generate deciles here to confirm there is a reasonable range of values. An extreme negative example would be if these values were all zeros and ones.
hexes_df.select(f.percentile_approx("num_places", [x / 10.0 for x in range(11)])).collect()[0][0]
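
If the deciles come back uniformly tiny, the grid is probably too fine. A hypothetical guard (not in the original notebook) could flag this, assuming we would then lower h3_zoom_level (coarser cells) and rerun the aggregation:

# Hypothetical sanity check: if even the 90th percentile of per-cell place counts
# is very small, the H3 cells are likely too fine-grained for a stable statistic.
deciles = hexes_df.select(
    f.percentile_approx("num_places", [x / 10.0 for x in range(11)]).alias("deciles")
).collect()[0]["deciles"]

if deciles[9] <= 2:
    print("Place counts are low across nearly all cells; consider a lower h3_zoom_level.")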

Generate our Gi* statistic

Finally, we generate our statistic. There are many parameters to fine-tune here; these are explained in the API documentation. We use the most typical parameters, with the exception of the search radius, which is always domain specific. The output will show us, among other things, a Z score and a P value. The Z score shows how many standard deviations a cell’s neighborhood value lies from the expected mean, and the P value tells us the probability that the result is due to random variation rather than an actual phenomenon.
from sedona.spark import *


gi_df = g_local(
    add_binary_distance_band_column(
        hexes_df,
        neighbor_search_radius_degrees,
        include_self=True,
    ),
    "num_places",
    "weights",
    star=True
).cache()

gi_df.drop("weights", "h3Cell", "geometry").orderBy(f.col("P").asc()).show()
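
As an optional follow-up (not part of the original notebook), here is a minimal sketch of isolating the significant hot and cold spots, assuming the conventional 0.05 threshold and the Z and P columns produced above:

# Hypothetical filter: significant hot spots have a low P value and a positive Z score;
# significant cold spots have a low P value and a negative Z score.
significance_level = 0.05

hot_spots_df = gi_df.filter((f.col("P") < significance_level) & (f.col("Z") > 0))
cold_spots_df = gi_df.filter((f.col("P") < significance_level) & (f.col("Z") < 0))

hot_spots_df.drop("weights", "h3Cell", "geometry").orderBy(f.col("Z").desc()).show()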

Visualize

Now we plot our statistics in Kepler. Once the map renders, you can color the cells by Z score, set the number of color bands to 10, and choose the palette that goes from blue to red. The bluest cells are the cold spots and the reddest are the hot spots.
from sedona.spark import *

kmap = SedonaKepler.create_map(places_df, "places")

SedonaKepler.add_df(
    kmap,
    gi_df.drop("weights"),
    "cells"
)

kmap