-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
map
applies function to unused pool levels
#36
Comments
A couple of thoughts:
|
Yes at least we could avoid adding the |
But won't we run into weird clashes if they have a legit error thrown in their mapping function? Like, it should fail, but now it silently succeeds and they get an empty array or something? |
Well the approach I propose would assume the function is pure. So if it fails that's not an issue as we won't use its result. |
This tripped me up in a slightly different situation: I had taken a slice of a PooledArray, then I wanted to map its values through a dictionary that (due to the program's logic) would only have values for entries in the slice. It errors because it is trying to look up values that were in the original PooledArray but aren't in the slice. Perhaps a function to "trim"/ remove unused refs from a PooledArray would be generally useful? |
Yes - this function would be needed in many other situations also. Also a function to reorder the pool would be useful. Both are supported in CategoricalArrays.jl and it would be best if we had them exposed in DataAPI.jl so that we can write generic code. |
Addresses #36. This PR proposes ways to address the issue in #36, namely that unused pool levels can cause unexpected results when mapping (and possibly other operations). It introduces a new exported `dropunusedlevels!` function that will scan a `PooledArray` and remove any pooled values not being reference in the ref array any more. It also updates the `map` implementation to allow a keyword argument `dropunusedlevels=true` to call this before mapping. We could perhaps consider making that `true` by default to avoid the original issue happening, though it comes with a performance cost of checking for unused levels. This PR also updates the implementation of `map`; it was previously doing quite a bit of extra work: mapping over keys twice for some reason, in addition to checking the resulting mapped values for duplicates. As far as I can tell/understand, having duplicates in the pool shouldn't actually cause any real issues, but maybe others can comment more on that. The only place I can think is maybe in the sorting code which I admit I haven't look at very closely. Anyway, still need to add tests here, but figured I would put up the PR as-is for discussion on the approach.
map
applies the function to all values in the pool, whether they are used or not. This is notably a problem if you replace missing values:This is a difficult problem to fix, as it would require going over all values to check which ones are not used. One possibility would be to wrap the call in
try... catch
, store a special value in case of error, and check at the end whether that value was used. If so, call the function again to trigger the error.The text was updated successfully, but these errors were encountered: