Fun with DBSCAN algorithm and GADM-geocoded points of interest.
This script reads in a data set of coordinates (latitude and longitude) along with their geocoded Global Administrative Areas (GADM) features. It then uses unsupervised learning to cluster the points into zones of interest based on geographic features, using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm with a customized distance function.
Update options in settings.py and run from the command line:
$ python dbscan_gadm_metric.py
You can use one of three modes to calculate the distance between points for DBSCAN clustering:
Custom distance metric using Vincenty's Formula.
Set MODE = 'vincenty-basic' in settings.py.
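As a rough illustration of the base distance calculation, here is a minimal, standard-library sketch of Vincenty's inverse formula on the WGS-84 ellipsoid. The function name `vincenty_km` and the sample coordinates are assumptions made for this example; the script itself may rely on a library implementation instead.

```python
import math

# WGS-84 ellipsoid parameters
WGS84_A = 6378137.0                 # semi-major axis, metres
WGS84_F = 1 / 298.257223563         # flattening
WGS84_B = (1 - WGS84_F) * WGS84_A   # semi-minor axis, metres


def vincenty_km(lat1, lng1, lat2, lng2, max_iter=200, tol=1e-12):
    """Geodesic distance in km between two points in decimal degrees.

    Hypothetical helper name; implements Vincenty's inverse formula.
    """
    if lat1 == lat2 and lng1 == lng2:
        return 0.0
    # Reduced latitudes and difference in longitude
    u1 = math.atan((1 - WGS84_F) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - WGS84_F) * math.tan(math.radians(lat2)))
    big_l = math.radians(lng2 - lng1)
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)

    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.sqrt(
            (cos_u2 * sin_lam) ** 2
            + (cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam) ** 2)
        if sin_sigma == 0:
            return 0.0  # coincident points
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos_sq_alpha = 1 - sin_alpha ** 2
        # cos_sq_alpha is zero only for two points on the equator
        cos_2sigma_m = (cos_sigma - 2 * sin_u1 * sin_u2 / cos_sq_alpha
                        if cos_sq_alpha else 0.0)
        c = WGS84_F / 16 * cos_sq_alpha * (4 + WGS84_F * (4 - 3 * cos_sq_alpha))
        lam_prev = lam
        lam = big_l + (1 - c) * WGS84_F * sin_alpha * (
            sigma + c * sin_sigma * (
                cos_2sigma_m + c * cos_sigma * (-1 + 2 * cos_2sigma_m ** 2)))
        if abs(lam - lam_prev) < tol:
            break
    else:
        raise ValueError("Vincenty formula failed to converge")

    u_sq = cos_sq_alpha * (WGS84_A ** 2 - WGS84_B ** 2) / WGS84_B ** 2
    a = 1 + u_sq / 16384 * (4096 + u_sq * (-768 + u_sq * (320 - 175 * u_sq)))
    b = u_sq / 1024 * (256 + u_sq * (-128 + u_sq * (74 - 47 * u_sq)))
    delta_sigma = b * sin_sigma * (
        cos_2sigma_m + b / 4 * (
            cos_sigma * (-1 + 2 * cos_2sigma_m ** 2)
            - b / 6 * cos_2sigma_m * (-3 + 4 * sin_sigma ** 2)
            * (-3 + 4 * cos_2sigma_m ** 2)))
    return WGS84_B * a * (sigma - delta_sigma) / 1000.0


# Approximate, illustrative coordinates for two points in central Washington, DC
print(round(vincenty_km(38.8977, -77.0365, 38.8895, -77.0353), 3))
```

The iteration can fail to converge for nearly antipodal points, which is why the sketch raises an error rather than returning a guess. For POIs a few kilometres apart this case should not arise.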
Custom distance metric that combines Vincenty's Formula with GADM features to calculate a scored distance (in km). The metric starts with a base Vincenty's Formula distance calculation, then modifies it based on whether the two points are in the same city and/or the same city neighborhood.
This is just one illustrative method of using GADM features to modify distance. It is "magic numbery" for simplicity. In practice, one would derive values for the GADM feature weights, or use the full proxy method.
Set MODE = 'vincenty-gadm' in settings.py.
Custom distance metric that uses a simple proxy ID to fetch attributes from an external data set (for illustrative simplicity, in this case we use the passed POI data set).
While this proxy approach replicates the same Vincenty-plus-GADM distance formula, it could be modified to support ANY distance formula. For example, rather than using GADM features, one could instead extract a key or GUID used to look up a whole array of features for a custom distance calculation (or even make a REST call to a route management system to get true driving times between each X and Y).
Set MODE = 'proxy' in settings.py.
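The proxy idea can be sketched as follows: the clustering step only ever sees a one-column array of IDs, and the distance function resolves each ID to a full record before computing anything. Everything named here (`POI_LOOKUP`, `make_proxy_metric`, `haversine_km`) is hypothetical, and a haversine great-circle distance stands in for Vincenty's Formula to keep the example short.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical external data set keyed by proxy ID.
# Coordinates are approximate and for illustration only.
POI_LOOKUP = {
    0: {"lat": 38.8977, "lng": -77.0365},
    1: {"lat": 38.8895, "lng": -77.0353},
    2: {"lat": 38.8048, "lng": -77.0469},
}


def haversine_km(rec_a, rec_b):
    """Great-circle distance in km between two records (stand-in metric)."""
    earth_radius_km = 6371.0088
    phi_a, phi_b = radians(rec_a["lat"]), radians(rec_b["lat"])
    d_phi = phi_b - phi_a
    d_lam = radians(rec_b["lng"] - rec_a["lng"])
    h = sin(d_phi / 2) ** 2 + cos(phi_a) * cos(phi_b) * sin(d_lam / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(h))


def make_proxy_metric(lookup, distance_fn):
    """Build metric(a, b) where a and b are 1-element vectors of proxy IDs."""
    def metric(a, b):
        # Clustering libraries often pass IDs back as floats, so cast to int
        # before using them as lookup keys.
        return distance_fn(lookup[int(a[0])], lookup[int(b[0])])
    return metric


proxy_metric = make_proxy_metric(POI_LOOKUP, haversine_km)

# The "feature matrix" handed to the clustering step is just the ID column.
ids = [[0], [1], [2]]
print(round(proxy_metric(ids[0], ids[1]), 3))
print(round(proxy_metric(ids[0], ids[2]), 3))
```

Because `distance_fn` receives whole records, swapping in a different calculation (extra features, cached driving times) would not change what the clustering step is given. Libraries that accept a callable metric, such as scikit-learn's DBSCAN via its `metric` parameter, can take a function of this shape.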
A sample data file of 10 points of interest is provided in the /sample-data folder to illustrate the results of different DBSCAN distance calculations. These points were selected to show different clustering results based on whether or not GADM features are used.
Here is a map of all 10 points:
When looking at the 10 points from a simple human "how would you cluster these?" perspective, it appears we have five clusters. This is exactly how the Vincenty Basic distance formula clusters them (when set to allow clusters of 1+ points and to treat anything within a 1.0 km radius as candidates for the same cluster):
Model Performance and Metrics
================================================================================
Estimated number of clusters: 5
Homogeneity: 0.676
Completeness: 1.000
V-measure: 0.807
Adjusted Rand Index: 0.000
Adjusted Mutual Information: 0.000
Silhouette Coefficient: 0.930
You can see the raw results here. A matplotlib view shows a Mercator projection of five separate clusters:
This appears to "make sense" from a common-sense point of view.
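When clusters of 1+ points are allowed, every point counts as a core point, so DBSCAN reduces to grouping points that are linked by chains of neighbors within the radius, and no point is labeled as noise. The pure-Python sketch below illustrates that behavior; it is a stand-in for a library DBSCAN, not the script's own code, and the function name and 1-D positions are made up for the example.

```python
def cluster_min_samples_1(n_points, distance, eps):
    """Label points 0..n_points-1 as DBSCAN would with a minimum cluster size of 1.

    `distance(i, j)` returns the distance between points i and j.
    Points within `eps` of each other, directly or through a chain of
    intermediate points, receive the same label.
    """
    labels = [-1] * n_points
    next_label = 0
    for start in range(n_points):
        if labels[start] != -1:
            continue  # already assigned to a cluster
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n_points):
                if labels[j] == -1 and distance(i, j) <= eps:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical positions along a line, in km
positions = [0.0, 0.6, 1.2, 5.0, 5.4]
labels = cluster_min_samples_1(
    len(positions),
    lambda i, j: abs(positions[i] - positions[j]),
    eps=1.0,
)
print(labels)
```

Note the chaining effect: the first and third points are 1.2 km apart, yet they share a cluster because the second point lies within 1.0 km of both.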
However, let's see what happens if we use Vincenty distance modified by GADM features to determine clusters. In this case we have applied two rules based on a very simplistic urban model:
- Urban travel is harder than non-urban travel. If two points are in the same city, and the city is large enough to have distinct GADM neighborhoods (a.k.a. localities), then apply a 100% distance penalty.
- However, assume that intracity neighborhoods are defined based on natural human clusters. As such, if two points are in the same city AND in the same neighborhood locality, then give a 20% distance bonus.
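The two rules above can be sketched as a small scoring function. This is an assumption-laden illustration: the function name is hypothetical, the two factors are treated as mutually exclusive (the bonus replaces the penalty rather than stacking with it), and points with missing city or neighborhood data are left unadjusted.

```python
LOCAL = 0.8    # same city AND same neighborhood: 20% distance bonus
X_TOWN = 2.0   # same city, different neighborhoods: 100% distance penalty


def gadm_adjusted_km(base_km, city_a, nbhd_a, city_b, nbhd_b):
    """Scale a base distance (km) using city and neighborhood membership.

    Hypothetical helper. Assumes an empty or None neighborhood means the
    city has no GADM neighborhoods recorded for that point.
    """
    if not city_a or not city_b or city_a != city_b:
        return base_km              # different (or unknown) cities: no change
    if not nbhd_a or not nbhd_b:
        return base_km              # no neighborhood data: no change
    if nbhd_a == nbhd_b:
        return base_km * LOCAL      # same neighborhood: pull points together
    return base_km * X_TOWN         # cross-town: push points apart


# Hypothetical base distance of 0.7 km between two POIs in the same city
same_nbhd = gadm_adjusted_km(0.7, "Alexandria", "Old Town", "Alexandria", "Old Town")
cross_town = gadm_adjusted_km(0.7, "Arlington", "Virginia Square",
                              "Arlington", "Waverly Hills")
print(same_nbhd, cross_town)
```

With a 1.0 km radius, a pair 0.7 km apart in different neighborhoods is scored above the radius and so no longer qualifies as neighbors, while a same-neighborhood pair is scored well inside it.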
Applying these rules yields 7 clusters (vs. 5):
Model Performance and Metrics
================================================================================
Estimated number of clusters: 7
Homogeneity: 0.819
Completeness: 1.000
V-measure: 0.901
Adjusted Rand Index: 0.000
Adjusted Mutual Information: 0.000
Silhouette Coefficient: 0.467
You can see the raw results here. Here is the matplotlib Mercator view:
This view shows two places where simple Vincenty clusters were divided: Arlington, VA and central Washington, DC:
The two POIs in Arlington are within 1 km of each other but in two different neighborhoods (Virginia Square and Waverly Hills). These neighborhoods are separated by Route I-66. Adding the modified GADM rule allowed DBSCAN to "recognize" this, using neighborhood as a human proxy input.
Adding GADM features split the three Washington POIs into two clusters (even though all were within 1 km of each other). One cluster was to the west of the White House (in Northwest Washington). The other was to the north (in Downtown). Anyone who has driven around DC will tell you this also makes sense, as the White House creates lots of detours.
As expected, using GADM features kept together points that are close to each other AND in the same intracity neighborhood. Here is Old Town, Alexandria:
Currently the program defines options in a settings.py file:
Setting | Description | Example Values |
---|---|---|
INPUT_FILE | Source data file of coordinates | See below |
OUTPUT_FILE | Output data file name | Qualified filename with path |
ZOA_SUMMARY_TO_SCREEN | Print ZOA summary to screen | True, False |
MATPLOT_ZOA_CLUSTERS | Use matplotlib to graph clusters | True, False |
MODE | Custom distance metric formula to use (see the three modes above) | 'vincenty-basic', 'vincenty-gadm', 'proxy' |
DEFAULT_RADIUS | Default ZOA radius for DBSCAN epsilon, in km | 1.0 |
DEFAULT_ROUNDING | Default rounding for GPS coordinates, in decimal places | 4 |
LOCAL * | Adjustment factor for coordinates in same neighborhood | 0.8 |
X_TOWN * | Adjustment factor for coordinates in same city | 2.0 |
* These settings are ignored in vincenty-basic mode. See the sample results above for a discussion of the LOCAL and X_TOWN settings.
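For orientation, a settings.py using the values from the table might look like the sketch below. The setting names and the numeric values come from the table; the two file paths are hypothetical placeholders, and the actual file may organize these options differently.

```python
# settings.py (illustrative sketch only)

INPUT_FILE = "sample-data/pois.csv"        # hypothetical path and file name
OUTPUT_FILE = "output/pois_with_zoa.csv"   # hypothetical path and file name

ZOA_SUMMARY_TO_SCREEN = True    # print ZOA summary to screen
MATPLOT_ZOA_CLUSTERS = False    # graph clusters with matplotlib

MODE = "vincenty-gadm"          # or "vincenty-basic", "proxy"

DEFAULT_RADIUS = 1.0            # DBSCAN epsilon, in km
DEFAULT_ROUNDING = 4            # decimal places for GPS coordinates

# Ignored in vincenty-basic mode
LOCAL = 0.8                     # factor for points in the same neighborhood
X_TOWN = 2.0                    # factor for points in the same city
```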
The script can take in any CSV file of POIs, as long as the file contains the requisite data points. A sample is provided in the /sample-data folder.
If you use your own file, simply set the column names in the dictionary KEY values in the settings.py file as follows:
Key | CSV column containing | Null allowed in CSV? |
---|---|---|
LAT_KEY | Latitude (in decimal degrees) | NO |
LNG_KEY | Longitude (in decimal degrees) | NO |
ADDR_KEY | Single-line address (e.g., "77 Massachusetts Avenue, Cambridge, MA 02139") | Yes |
NBHD_KEY | GADM neighborhood name | Yes for vincenty-basic |
CITY_KEY | GADM city name | Yes |
NAME_KEY | POI location name | NO |
ZOA_KEY * | ZOA label you wish for output | N/A |
* Not part of INPUT_FILE. Used to create OUTPUT_FILE.
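The nullability rules in the table can be checked while loading a file. The sketch below is one way to do that with the standard csv module; the column names, the `load_pois` helper, and the inline sample rows are all hypothetical, and it assumes NBHD_KEY becomes required outside vincenty-basic mode, which is an inference from the table rather than something the script is known to enforce.

```python
import csv
import io

# Hypothetical column names; set these to match your own CSV header.
LAT_KEY = "lat"
LNG_KEY = "lng"
ADDR_KEY = "address"
NBHD_KEY = "neighborhood"
CITY_KEY = "city"
NAME_KEY = "name"


def load_pois(handle, mode="vincenty-basic"):
    """Read POI rows from an open CSV handle and check required columns."""
    required = [LAT_KEY, LNG_KEY, NAME_KEY]
    if mode != "vincenty-basic":
        required.append(NBHD_KEY)  # assumption: needed by the GADM-based modes
    pois = []
    # The header is line 1, so data rows start at line 2
    # (assuming no field spans multiple lines).
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        missing = [key for key in required if not (row.get(key) or "").strip()]
        if missing:
            raise ValueError(f"line {line_no}: missing required value(s) {missing}")
        row[LAT_KEY] = float(row[LAT_KEY])
        row[LNG_KEY] = float(row[LNG_KEY])
        pois.append(row)
    return pois


# Made-up sample rows with approximate coordinates, for illustration only
SAMPLE_CSV = """lat,lng,address,neighborhood,city,name
38.8977,-77.0365,,,Washington,Example POI A
38.8895,-77.0353,,,Washington,Example POI B
"""

pois = load_pois(io.StringIO(SAMPLE_CSV))
print(len(pois), pois[0][NAME_KEY])
```

Failing fast on a missing latitude, longitude, or name keeps bad rows from silently distorting the clusters later on.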