Configuration
In the following code section, we configure the size of the neighborhood, the region where we want to generate statistics, and the resolution of the generated grid cells. In practice, selecting these parameters usually requires some trial and error along with domain knowledge. We’ll discuss the selection process in this notebook. The region coordinates cover the area around Seattle and Bellevue, Washington. An accurate analysis should reveal the urban core of Seattle as well as the downtowns of Bellevue, Redmond, Kirkland, and Issaquah as high-density areas. To generate this data for the entire world, we’d set region to None.
For neighbor_search_radius_degrees, we selected a value that gives each cell a substantial number of neighbors (roughly 10 or more) but does not let the density of a larger downtown like Seattle’s obscure a smaller downtown like Issaquah’s. The goal is to balance an accurate, statistically powerful result against keeping the results local enough.
For h3_zoom_level, we chose cells that are small enough to find what we’re searching for but large enough to ensure that any statistical patterns within each cell are not the result of random fluctuations in the distribution of places. Cells that are too small might contain only 1 or 0 places, which wouldn’t reveal the trends that we’re hoping to find.
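As a rough illustration of the three settings described above, a configuration cell might look like the following sketch. The variable names region, neighbor_search_radius_degrees, and h3_zoom_level come from the text; every value is an assumption chosen for illustration, not the notebook's actual setting, and the polygon is only an approximate bounding box around the cities mentioned.

```python
# Illustrative configuration only. Every value below is an assumption and
# would normally be tuned by trial and error for the dataset at hand.

# Region of interest as a WKT polygon (longitude latitude order). This is a
# rough bounding box around Seattle, Bellevue, Redmond, Kirkland, and Issaquah.
# Setting region to None would process the entire world.
region = (
    "POLYGON((-122.45 47.45, -121.95 47.45, -121.95 47.80, "
    "-122.45 47.80, -122.45 47.45))"
)

# Radius, in degrees, within which other cells count as neighbors.
# Larger values give each cell more neighbors but blur local detail.
neighbor_search_radius_degrees = 0.01

# H3 resolution (valid resolutions run from 0, coarsest, to 15, finest).
# Finer cells resolve smaller features but hold fewer places each.
h3_zoom_level = 8
```

Keeping these values together in one cell makes it easier to rerun the whole analysis after each adjustment.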
Spark Initialization
We are using Spark to run the Gi* algorithm and initializing a Spark session with Sedona.
Filtering and Aggregation
In this example, we’re assigning an H3 cell to each record and filtering so that we only see our desired region. We aggregate the places data by the cell identifier and find the number of places in each cell.
Verifying the data
At this point, we want to verify that we have a good distribution of values in our hexes_df dataframe. Specifically, we want to confirm that our cells aren’t too small, which would be indicated by most of the place counts being very low.
Here, we generate deciles to confirm that there’s an expected range of values. An undesirable scenario would be if all of these values were either zero or one.
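The decile check can be sketched in plain Python as below. This is a minimal illustration, not the notebook's Spark code: the place_counts list is made-up data standing in for the per-cell counts in hexes_df, and the "too small" rule (even the 90th percentile is zero or one) is an assumed threshold.

```python
import statistics

# Hypothetical per-cell place counts standing in for the counts in hexes_df.
place_counts = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

# With n=10, statistics.quantiles returns the 9 cut points between deciles.
deciles = statistics.quantiles(place_counts, n=10, method="inclusive")

# Share of cells that contain at most one place.
share_sparse = sum(1 for c in place_counts if c <= 1) / len(place_counts)

# Assumed warning sign: even the top cut point is 0 or 1, meaning that
# nearly every cell is close to empty and the cells are probably too small.
cells_too_small = deciles[-1] <= 1

print("deciles:", deciles)
print("share of cells with <= 1 place:", share_sparse)
print("cells look too small:", cells_too_small)
```

If the check flags the cells as too small, a coarser h3_zoom_level is the natural thing to try next.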
Generate our Gi* statistic
Finally, we’ll generate the Gi* statistic with g_local. The output shows the z-scores and p-values for each H3 cell.
Here, we’ll use the most typical parameters. The exception is the search radius, which is always domain-specific.
The output shows a z-score and p-value. The z-score indicates how far, in terms of standard deviations, the values in a cell’s neighborhood deviate from what the overall average would predict. The p-value, on the other hand, is the probability of seeing a deviation at least that large if the places had no underlying spatial pattern, so a small p-value points to an actual pattern or phenomenon rather than chance.
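To make the z-score and p-value concrete, here is a minimal plain-Python sketch of the standard Getis-Ord Gi* formula with binary weights. It is not Sedona's g_local implementation. The gi_star function, the row of nine toy cells, and the adjacency-based neighbor lists are all hypothetical stand-ins for H3 cells and the radius search.

```python
import math


def gi_star(values, neighbors):
    """Return {cell: (z_score, p_value)} using binary weights.

    Each cell is included in its own neighborhood (the "star" variant).
    Assumes the values are not all identical and that no neighborhood
    covers every cell, since either case makes the denominator zero.
    """
    n = len(values)
    mean = sum(values.values()) / n
    std = math.sqrt(sum(v * v for v in values.values()) / n - mean * mean)
    results = {}
    for cell in values:
        hood = set(neighbors[cell]) | {cell}
        w_sum = len(hood)  # binary weights: sum(w) equals sum(w^2)
        local_sum = sum(values[j] for j in hood)
        denom = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        z = (local_sum - mean * w_sum) / denom
        # Two-sided p-value under a standard normal approximation.
        p = math.erfc(abs(z) / math.sqrt(2))
        results[cell] = (z, p)
    return results


# Toy data: nine cells in a row, with a cluster of high counts at one end.
counts = dict(enumerate([1, 2, 1, 1, 2, 1, 9, 12, 10]))
adjacent = {i: [j for j in (i - 1, i + 1) if j in counts] for i in counts}

scores = gi_star(counts, adjacent)
for cell, (z, p) in scores.items():
    print(cell, round(z, 3), round(p, 4))
```

In this toy example the cells inside the high-count cluster get positive z-scores with small p-values, while cells surrounded by low counts get negative z-scores.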

