Calculate the mean of the pairwise weighted distances between points using the great circle metric 🌐 for a very big dataset without running out of RAM 💣 and/or waiting till the end of the universe. 😂
This code takes a set of 2D data points X and calculates the mean of the pairwise weighted distances between points using the great circle metric.
It offers a substantial speedup over Python-only implementations, which makes it useful when dealing with very big data.
To call it, use:
mean_distances = c_mean_dist(X, weights)
where X are your data points, and weights are the weights or counts (depending on how you want to conceptualise them).
Weights affect the mean of the pairwise distances in the same way as including more copies of the point to which the weight corresponds.
So if a data point with value (0, 1) has a weight of 2, the average pairwise distance will be affected in the same way as if you had added another data point with value (0, 1) to X and had set both their weights to 1.
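The equivalence between a weight and a repeated point can be checked with a small Python-only sketch. This is not the repository's C implementation: the function names are hypothetical, the metric is passed in as an argument, and math.dist (plain Euclidean distance) is used only as a stand-in so the example stays short. It also assumes integer weights and that pairs of identical points count as pairs with distance zero, which is what the duplication description above implies.

```python
import math
from itertools import combinations


def weighted_mean_pairwise(points, weights, dist):
    """Mean pairwise distance where an integer weight w acts like w copies of a point.

    Hypothetical helper for illustration, not the repository's c_mean_dist.
    """
    total_weight = sum(weights)
    # Number of pairs among all (implicitly duplicated) points.
    n_pairs = total_weight * (total_weight - 1) / 2
    if n_pairs <= 0:
        raise ValueError("need a total weight of at least 2")
    weighted_sum = 0.0
    for (p, wp), (q, wq) in combinations(zip(points, weights), 2):
        # wp * wq pairs share this distance; copies of the same point add 0.
        weighted_sum += wp * wq * dist(p, q)
    return weighted_sum / n_pairs


def mean_pairwise(points, dist):
    """Plain (unweighted) mean over all pairs, duplicates included."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


# One point with weight 2 versus the same point listed twice with weight 1.
weighted = weighted_mean_pairwise([(0, 1), (3, 5)], [2, 1], math.dist)
duplicated = mean_pairwise([(0, 1), (0, 1), (3, 5)], math.dist)
print(math.isclose(weighted, duplicated))
```

Keeping the metric as a parameter separates the weighting logic from the distance calculation, so a great circle function can be swapped in without touching the rest.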
It also implements the great circle (also known as orthodromic or geodesic) distance metric in cdist, which is faster than GeoPy.
For an example of both functions, see compare.py.
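For readers unfamiliar with the metric, here is a minimal Python-only sketch of a haversine-based great circle distance and a pairwise distance matrix built from it. It is an illustration rather than the repository's C code: the function names are hypothetical, points are assumed to be (latitude, longitude) pairs in degrees, and the Earth radius of 6371 km is an assumed mean value that may differ from the constant used in the C version.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, not taken from the repo


def great_circle_km(p, q, radius=EARTH_RADIUS_KM):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1 to guard against floating point rounding before asin.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


def pairwise_great_circle(xa, xb):
    """Distance matrix: entry [i][j] is the distance from xa[i] to xb[j]."""
    return [[great_circle_km(p, q) for q in xb] for p in xa]


points = [(0.0, 0.0), (0.0, 90.0), (90.0, 0.0)]
matrix = pairwise_great_circle(points, points)
print(matrix[0][1])  # a quarter of the great circle, in km
```

A nested loop like this is where a Python-only version spends its time on large inputs, which is the cost the C implementation is meant to avoid.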
The C and Python code were written by Olivia Guest, using this tutorial by Dmitrii V Pasechnik to call C functions from Python using Cython, and using the haversine function from Rosetta Code.
Make sure you have Cython and its dependencies installed (refer to requirements.txt).
Run make.
Subsequently, run python compare.py to confirm compilation, and to see the comparison between using the C version and using a Python-only way.
See requirements.txt in case you need to install GeoPy, etc.
If you want to use this function from outside this directory, e.g., via import, I have not yet found a way of doing so without adding the path to the library to LD_LIBRARY_PATH, e.g., export LD_LIBRARY_PATH=/local/path/to/this/repo.
To add it permanently (so you do not have to do this every time), add it to your ~/.bashrc or whatever your set-up dictates.
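If you would rather not change your shell configuration, one possible alternative is to launch the importing script from Python with the variable set for that child process only. The sketch below only builds the environment; the helper name is hypothetical, and the repo path is the placeholder from above. Note that the dynamic loader generally reads LD_LIBRARY_PATH when a process starts, so setting it inside an already running interpreter is unlikely to help, which is why a child process is used here.

```python
import os


def env_with_library_path(repo_dir, base_env=None):
    """Return a copy of the environment with repo_dir prepended to LD_LIBRARY_PATH.

    Hypothetical helper for illustration; it does not modify os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH")
    if existing:
        # Keep whatever was already on the search path, after the repo.
        env["LD_LIBRARY_PATH"] = os.pathsep.join([repo_dir, existing])
    else:
        env["LD_LIBRARY_PATH"] = repo_dir
    return env


child_env = env_with_library_path(
    "/local/path/to/this/repo",
    base_env={"LD_LIBRARY_PATH": "/usr/local/lib"},
)
print(child_env["LD_LIBRARY_PATH"])
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting the script that does the import.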
There were many attempts 😵 to make this work Python-only. 🐍 Alas, none of them worked out, but feel free to play around with the various Python versions. The main stumbling block was the GIL. 😢 For very large datasets, Python-only is sometimes not the best idea. 😬
- Fill in the setup.py (template: https://github.com/uwescience/shablona/blob/master/setup.py)
- Submit to PyPI (#2 (comment))