Getis-Ord Gi(*)¶
Getis and Ord's Gi and Gi* statistics are popular spatial statistics approaches for finding significant hot and cold spots across space. These approaches involve comparing a numerical value associated with a spatial record to the values of its neighboring records. The user defines the characteristics of these neighborhoods.
In this example, we'll use the Gi* statistic on the Overture places dataset to identify regions of higher and lower "densities". This example assumes that a region with more places data indicates a higher density.
Configuration¶
In the following code section, we've configured the size of the neighborhood, the region where we want to generate statistics, and the resolution of the generated grid cells. Typically, selecting these parameters in practice can require trial and error, and some domain knowledge. We'll discuss the selection process in this notebook.
The region
coordinates represent the regions around Seattle and Bellevue, Washington. An accurate analysis should reveal the urban core of Seattle as well as the downtowns of Bellevue, Redmond, Kirkland, and Issaquah as high density areas. To generate this data for the entire world, we'd set region
to None
.
For neighbor_search_radius_degrees
, we selected a value that gives each cell a substantial number of neighbors (maybe at least 10), but does not allow the degree of density in a larger downtown like Seattle to obscure a smaller downtown like Issaquah's. The goal is to find a balance between generating an accurate and powerful statistic and achieving enough local results.
For h3_zoom_level
, we chose cells that are small enough to find what we're searching for but large enough to ensure that any statistical patterns within each cell are not the result of random fluctuations in the distribution of places. Cells that are too small might contain only 1 or 0 places, which wouldn't reveal the trends that we're hoping to find.
region = "POLYGON ((-122.380829 47.870302, -122.048492 47.759637, -121.982574 47.531111, -122.408295 47.50978, -122.44812 47.668162, -122.380829 47.870302))"
neighbor_search_radius_degrees = .01
h3_zoom_level = 8
Spark Initialization¶
We are using Spark to run the Gi* algorithm and initializing a Spark session with Sedona.
from sedona.spark import *
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
Filtering and Aggregation¶
In this example, we're assigning an H3 cell to each record and filtering so that we only see our desired region. We aggregate the places data by the cell identifier and find the number of places in each cell.
import pyspark.sql.functions as f
places_df = (
sedona.table("wherobots_open_data.overture_2024_07_22.places_place")
.select(f.col("geometry"), f.col("categories"))
.withColumn("h3Cell", ST_H3CellIDs(f.col("geometry"), h3_zoom_level, False)[0])
)
if region is not None:
places_df = places_df.filter(ST_Intersects(ST_GeomFromText(f.lit(region)), f.col("geometry"))).repartition(100)
hexes_df = (
places_df
.groupBy(f.col("h3Cell"))
.agg(f.count("*").alias("num_places")) # how many places in this cell
.withColumn("geometry", ST_H3ToGeom(f.array(f.col("h3Cell")))[0])
)
Verifying the data¶
At this point, we want to verify that we have a good distribution of values in our hexes_df
dataframe. Specifically, we want to confirm that our cells aren't too small, which would be indicated by the each of the places counts being very low.
Here, we generate deciles to confirm that there's an expected range of values. An undesirable scenario would be if all of these values were either zero or one.
hexes_df.select(f.percentile_approx("num_places", [x / 10.0 for x in range(11)])).collect()[0][0]
hexes_df.show()
Generate our Gi* statistic¶
Finally, we'll generate the Gi* statistic with g_local
. The output shows the z-scores and p-values for each H3 cell.
Here, we'll use the most typical parameters. The exception is the search radius which is always domain specific.
The output shows a z-score and p-value. A z-score indicates how far, in terms of standard deviations, a value deviates from the average of its surrounding area. The p-value, on the other hand, quantifies the likelihood that this deviation is due to chance, rather than an actual underlying pattern or phenomenon.
from sedona.stats.hotspot_detection.getis_ord import g_local
from sedona.stats.weighting import add_binary_distance_band_column
gi_df = g_local(
add_binary_distance_band_column(
hexes_df,
neighbor_search_radius_degrees,
include_self=True,
),
"num_places",
"weights",
star=True
).cache()
gi_df.drop("weights", "h3Cell", "geometry").orderBy(f.col("P").asc()).show()
Visualize the data¶
Now, we'll plot our statistics in Kepler. Once Kepler is rendered, you can color the cells by z-score and set the number of bands to 10 with the color palette that goes from blue to red. The bluest are the cold spots and reddest hottest.
from sedona.maps.SedonaKepler import SedonaKepler
kmap = SedonaKepler.create_map(places_df, "places")
SedonaKepler.add_df(
kmap,
gi_df.drop("weights"),
"cells"
)
kmap