-
-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new KDE function. #1284
Add new KDE function. #1284
Conversation
thanks @tomicapretto for your contribution. This is a big one so I made a first round of revision and I will do a another one ASAP. |
BTW I really like that in the example you posted the new kde is more wiggly than the current version :-) |
It seems you need to rebase, there are a lot of commits unrelated to this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpicks
arviz/kde_utils.py
Outdated
return bw | ||
|
||
|
||
# Misc utils | ||
# Misc utils ------------------------------------------------------------------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the dashes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the line of comment below, maybe these misc methods should go in numeric utils instead? and the comments to divide in sections can be deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This is amazing work, I actually feel bad for not being able to correctly review right now. I'll try to review again with more time in some days/week
My main big comment is in terms of api, I think we should pay a lot of attention here to define an api for kde functions so it is easy to modify the functions, their components or even add new kde calculation methods.
arviz/kde_utils.py
Outdated
found = 0 | ||
if bw <= 0: | ||
warnings.warn( | ||
"Improved Sheather-Jones did not converge to a positive value. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that whatever we choose we should use custom warnings so even in a worst case scenario they can be ignored without blocking all UserWarnings
arviz/kde_utils.py
Outdated
def circular_mean(x): | ||
sinr = np.sum(np.sin(x)) | ||
cosr = np.sum(np.cos(x)) | ||
mean = np.arctan2(sinr, cosr) | ||
return mean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still have numeric utils? Shouldn't this go there? Also, I think Using the @conditional_jit
decorator here will speed the method up further. See https://github.com/arviz-devs/arviz/blob/master/arviz/stats/stats_utils.py#L526 for an example
Also scipy version seems to do extra checks with respect to this one, not sure if nan checking or something like this could be useful for us. source https://github.com/scipy/scipy/blob/v1.5.1/scipy/stats/morestats.py#L3300-L3366 moreover, we should probably test our function gives the same result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am aware of Scipy version but I decided to keep this one because it is faster. As it is used now, things like nan checking are done in kde_circular()
which is the function that calls circular_mean()
.
I guess you mention checks and all that because you're considering using the function in other parts of arviz.
I think there are two alternatives here:
- Implement a
circular_mean()
function that is going to be used everywhere in arviz, instead of Scipy version. - Use Scipy version in other parts of arviz and keep this function for the circular kde since it is faster and safety checks have already been performed when
circular_mean()
is called.
What is your opinion on this?
BTW, I'm not aware if Scipy circmean()
is currently being used in other parts of arviz, I just guess this is the default alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we use a circmean (not sure which) in summary with circular variables, I think we should have a single function used in both places.
I am fine with keeping the fast one, given that it covers our use case and we have tests checking it gives the same result as scipy one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you agree I can add the following function to numeric_utils.py
and make use of it here.
I will also add a test to check the result matches the one obtained with scipy (as far as I've manually tested, it does).
def circular_mean(x, na_rm=False):
if len(x[np.isnan(x)]) > 0:
if na_rm:
x = x[~np.isnan(x)]
else:
return np.nan
sinr = np.sum(np.sin(x))
cosr = np.sum(np.cos(x))
mean = np.arctan2(sinr, cosr)
return mean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I would use np.any
instead of slicing and then checking length, and can you open an issue on using numba for this? There are other functions that could benefit from it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about
def circular_mean(x, na_rm=False):
if na_rm:
x = x[~np.isnan(x)]
sinr = np.sum(np.sin(x))
cosr = np.sum(np.cos(x))
mean = np.arctan2(sinr, cosr)
return mean
arviz/kde_utils.py
Outdated
assert isinstance(bw_fct, (int, float)) | ||
assert bw_fct > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment to one above, these two should probably be type error and value error respectively, right now they'd raise an assertion error without any custom message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the isinstance int, float will also give problems with numpy dtypes i.e. float32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can replace it with the following
if not isinstance(bw_fct, (int, float, np.integer, np.floating)):
raise TypeError(f"`bw_fct` must be a positive number, not an object of {type(bw_fct)}.")
if not bw_fct > 0:
raise ValueError(f"`bw_fct` must be a positive number, not {bw_fct}.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't floating work with python ints too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what you mean? It returns False
isinstance(1, np.floating)
# False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, my bad: with python floats
I think this is (almost) ready to merge, we can keep improving the code and its integration to the rest of ArviZ in future PRs. The only missing part is to add a |
Do we want users to call kde? At least for now our kde code has been only for internal use, and I think there is a good reason for that --> there is a lot of coordination between input <<-->> output For example scipy kde can calculate pdf etc, can predict any point user wants. Our is purely for plotting. |
I sometimes call _fast_kde, but I think in general we do not want user to call it. So I would like to keep the dash. |
from scipy.sparse import coo_matrix | ||
|
||
from .stats.stats_utils import histogram | ||
from .utils import _stack, _dot, _cov | ||
from .kde_utils import _kde |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a cyclic import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved things from kde_utils.py to numeric_utils.py (as suggested) but that gives us this cyclic import. It works well because I don't use a relative import in kde_utils, but I don't feel it is very elegant/proper. Any suggestions on how to improve it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that _fast_kde
is a private method, I think we can leave it in numeric_utils
raising an error and pointing to _kde
.
Moreover, we used to have _fast_kde
available as az._fast_kde
(due to it being imported in plots' module __init__
) and removed it at some point without any complaint/issue about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we should have _kde
available again as az._kde
though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should as we used to have _fast_kde
@tomicapretto I really appreciate your very detailed initial PR message with lots of graphs and explanations for code changes at a human level and plots for comparison. It helps quite a bit to help people like me understand whats going on and difference is, again very much appreciate the effort you put in! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @tomicapretto !
@aloctavodia, @OriolAbril, @ahartikainen and @canyon289 thank you all for the reviews and the work!! |
Added a new KDE function that can be used to estimate linear and circular data. Decided to put everything in a new file (instead of numeric_utils) because these functions required many utilities themselves. Also, updated all functions that used the old function. The circular option is not yet passed to higher level functions. That is something to be done.
* In the previous version the function tried to normalize the angles to [-pi, pi) but it was not correct. This commit includes a function that works appropiately.
…received. * Functions _circular_mean() and normalize_angle() now are in numeric_utils.py. * If the array has more than 1 dimension, _check_type() tries to flatten it silenty. If it is not possible, an error is raised. * Improved how bw_fct type and value are checked and reported. * All functions now start with _ because they are all internals. * _select_bw_method() no longer exists. Its functionality is now incorporated into _get_bw(). * Changed internal _bw_* functions to a common signature. * Added DeprecationWarning to _fast_kde() and now it calls _kde(). * Added tests for _normalize_angle() and _circular_mean(). The following functions and their backends have been modified to incorporate the circular argument that is passed to _kde() as well as the new meaning of the argument bw (see below): * plot_density() * plot_kde() * plot_dist() * plot_posterior() * plot_violion() * calculate_point_estimate() In addition, plot_kde() gained a new argument, adaptive, which is used to determine if the bandwidth is adaptive or not. Finally, note the meaning of the argument bw has changed. It used to be a bandwidth multiplication factor Now it can be either a string indicating the bandwidth method or the bandwidth itself. The internal function _kde() has an argument bw_fct which is the multiplicative factor.
* custom_lims in _kde() can be a tuple but then it tried to modify an inmutable object. Now it is converted to a list internally. * test_kde_scipy() had a slicing step that was not necessary an caused an error when unpacking * No error is raised when the data passed has lenght of 0 or 1. It raises a warning and returns np.nan
Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>
Codecov Report
@@ Coverage Diff @@
## master #1284 +/- ##
==========================================
- Coverage 92.89% 91.86% -1.04%
==========================================
Files 101 102 +1
Lines 10192 10471 +279
==========================================
+ Hits 9468 9619 +151
- Misses 724 852 +128
Continue to review full report at Codecov.
|
Thanks for the PR! |
Thank you for this implementation of isj in Python; I have been struggling to find anything other than scott/silverman in Python (it's really only arviz, zfit, and KDEpy that have it, I believe). I'm curious to the motivation behind the "experimental" default of averaging the silverman and isj kernel. I haven't seen that elsewhere; the usual recommendation I come a across is to use isj. I'm not familiar with the literature on density estimation research so that might be why I haven't heard of it, and I'm interested to understand the benefits of this approach. I did stumble upon your notebooks in this repo https://github.com/tomicapretto/density_estimation @tomicapretto and found them illuminating in general (thank you for explaining concepts and results with clarity and in such detail!). However, I did not find an investigation of the silverman/isj average there either. I did notice that you wrote about an issue with isj undersmoothing in sparse regions and suggested using density-adapted kernel methods as a remedy for that: With that background, I could imagine that combining silverman and isj would maybe bring some of the same benefits as an adaptive kernel (balancing under and oversmoothing; although I would guess it gives somewhat inferior results?), but at a lower cost in both complexity and computation. Is this (one of) the reason for the choice of the silverman/isj average here, or is there another motivation behind that decision? I would also be curious if you still find it a more suitable choice after having used it as the default in ArViz for a few years now since this was implemented. |
Hi @joelostblom thanks for the words of recognition! It was a very challenging and fun work. The sustain behind the experimental bandwidth was a simulation study that I performed. I wrote a report and uploaded it to my old webpage. Then I migrated it and that was lost (sorry!). But I went to the repository and found it in an old commit. So have a look at the |
Super, thank you so much for digging that up! That also cleared up a misunderstanding I had around how the adaptive kernel works. It's interesting that Scott and Silverman are used so widely although they appear to have clear oversmoothing drawbacks, especially for sample sizes <~5,000 - 10,000 observations. Thanks again! |
Description
This pull requests replaces the function
_fast_kde()
with a new function calledkde()
that can be used to estimate density functions for both linear and circular data.Major changes
kde_utils.py
instead ofnumeric_utils.py
. The new function requires so many utilities itself that it is worth a new file._fast_kde()
used to returndensity
,xmin
andxmax
, which required creating a new grid after estimating the density.The new function returns
grid
andpdf
. It can also return the estimated bandwidth by callingkde()
with the argumentbw_return
set toTrue
.kde()
supports circular data via its boolean argumentcircular
.bw
used to represent the bandwidth scaling factor in the old function_fast_kde()
. Now it represents either the bandwidth method or the bandwidth itself.It is still possible to pass a bandwidth scaling factor via the argument
bw_fct
which is 1 by default.kde()
supports adaptive bandwidth for linear data. This is enabled by settingadaptive
toTrue
.kde()
supports several bandwidth estimation methods (see below)._fast_kde()
used to perform boundary correction without extending the observed domain in all the cases.This is also the default for
kde()
but these settings can be changed with the argumentsbound_correction
,extend
,extend_fct
, andcustom_lims
.Bandwidth estimation methods supported
For linear data:
bw = 'scott'
).bw = 'silverman'
).bw = 'isj'
).bw = 'experimental'
). This is the default.For circular data:
bw = 'taylor'
). It's like the Scott's rule of the circular world.Comments:
kde_utils.py
does not contain anything related to 2D KDE yet. It would be good to put all that here too.circular
argument is not passed to higher level functions yet. This is something to do in the near future._fast_kde()
now callkde()
and have been updated to receive the new objects.Plots
Updating the KDE function impacts in many different plots. Here is an example that compares the old version versus the new one. Original is here.
Old:
![old](https://user-images.githubusercontent.com/25507629/86832905-2ef60780-c06f-11ea-882a-2e690ebe33e8.JPG)
New:
![new](https://user-images.githubusercontent.com/25507629/86832918-34535200-c06f-11ea-91a9-ce881c5dd571.JPG)
We also have the following comparison which inclues the new version, the old version and the KDE offered by scipy.
5000 samples are used in all the cases.
Checklist