Local Outlier Factor
Local Outlier Factor (LOF) is a common algorithm for identifying whether a data point is an inlier or an outlier relative to its neighbors. It produces an outlier score by comparing the local density around a data point with the local densities around its k nearest neighbors. The number of neighbors, k, is chosen by the user.
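Scores well below 1 indicate inliers, scores well above 1 indicate outliers, and scores near 1 indicate a point whose local density is similar to that of its neighbors, so it is neither clearly an inlier nor clearly an outlier. This demo is derived from the scikit-learn Local Outlier Detection demo.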
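To make the score concrete, the following is a minimal brute-force sketch of the textbook LOF computation in NumPy. It is not the Wherobots implementation: the function name `lof_scores` is hypothetical, ties at the k-th neighbor distance are broken arbitrarily, and it assumes a small array without k or more duplicate points (which would give a zero reachability distance).

```python
import numpy as np


def lof_scores(points: np.ndarray, k: int) -> np.ndarray:
    """Brute-force LOF for a small (n, d) array; uses O(n^2) memory."""
    # Pairwise Euclidean distances; a point is never its own neighbor.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    # Indices of the k nearest neighbors of each point, nearest first.
    knn = np.argsort(dist, axis=1)[:, :k]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = dist[np.arange(len(points)), knn[:, -1]]

    # reach_dist(p, o) = max(k_distance(o), d(p, o)) for each neighbor o of p.
    reach = np.maximum(k_dist[knn], np.take_along_axis(dist, knn, axis=1))
    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)

    # LOF: mean density of the neighbors divided by the point's own density.
    return lrd[knn].mean(axis=1) / lrd


# A 5 x 5 unit grid (indices 0..24) plus one isolated point (index 25).
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
pts = np.vstack([grid, [[20.0, 20.0]]])
scores = lof_scores(pts, k=4)
print(scores.argmax())  # 25, the isolated point
```

The isolated point sits in a much sparser region than its nearest neighbors on the grid, so the ratio of their densities to its own is large, which is what a score well above 1 expresses.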
Define Sedona Context
from sedona.spark import SedonaContext
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
Data Generation
The following code generates data with two clusters of inliers and some outliers.
import numpy as np
import pyspark.sql.functions as f
from sedona.stats.outlier_detection.local_outlier_factor import local_outlier_factor
from sedona.sql import ST_MakePoint
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers]
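As a quick check on the layout of the generated array, the sketch below repeats the generation step on its own and records which rows were generated as outliers. The `ground_truth` array is a hypothetical helper added for illustration and is not used elsewhere in this demo.

```python
import numpy as np

np.random.seed(42)
X_inliers = 0.3 * np.random.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]  # two clusters, around (2, 2) and (-2, -2)
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers]

# Hypothetical helper: 1 marks a generated inlier, -1 a generated outlier.
ground_truth = np.ones(len(X), dtype=int)
ground_truth[-len(X_outliers):] = -1

print(X.shape)                     # (220, 2)
print((ground_truth == -1).sum())  # 20
```

The first 200 rows are the two inlier clusters and the last 20 rows are the uniformly scattered outliers. Some of those may land inside a cluster by chance and will then look like inliers to LOF.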
Generating LOF Scores
The following code uses the LOF implementation in Wherobots to compute an outlier score for each point in the generated data. We set k to 20.
df = sedona.createDataFrame(X).select(ST_MakePoint(f.col("_1"), f.col("_2")).alias("geometry"))
outliers_df = local_outlier_factor(df, 20)
outliers_df.show()
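Because this demo is derived from a scikit-learn example, one way to sanity-check the scores is to run scikit-learn's `LocalOutlierFactor` on the same array with the same k. The sketch below assumes scikit-learn is installed and regenerates the data so that it stands alone. Whether its scores match the Wherobots output exactly is not verified here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Same data as in the Data Generation step.
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(100, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers]

clf = LocalOutlierFactor(n_neighbors=20)
clf.fit(X)

# scikit-learn stores the negated LOF, so flip the sign to get the usual score.
scores = -clf.negative_outlier_factor_

# Row indices of the five highest-scoring points.
print(np.argsort(scores)[-5:][::-1])
```

With this data, the rows generated as outliers (the last 20) have a higher average score than the rows generated as inliers.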
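Visualization
Finally, we visualize the results using geopandas. To make the plot easier to read, the LOF scores are multiplied by 50 and used as marker sizes: each data point is drawn as a small black dot surrounded by a red circle whose size grows with its score.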
import geopandas as gpd
pdf = (
    outliers_df
    # Scale the scores so they are usable as marker sizes.
    .withColumn("lof", f.col("lof") * 50)
    .toPandas()
)
gdf = gpd.GeoDataFrame(pdf, geometry="geometry")
# Red circles: marker size grows with the scaled LOF score.
ax = gdf.plot(
    figsize=(10, 8),
    markersize=gdf['lof'],
    edgecolor='r',
    facecolors="none",
)
# Black dots: the data points themselves, drawn on the same axes.
gdf.plot(ax=ax, color="k", markersize=1)
ax.set_title('LOF Scores')
ax.legend(['Outlier Scores', 'Data points'])