Add invrefpool #33
Generic access to `invrefpool` is needed in JuliaData/DataFrames.jl#2612.

Consider a short table and a long table joined on some column. To be fast, we need to map values from the short table's key to ref values of the long table's key. This allows two things for `innerjoin`:
1. we can immediately drop values from the short table that are not present in the long table;
2. later we can do the join on integer columns, which is much faster than joining on e.g. a string column.

Also, since we only map the short table, this operation should be fast. In particular, if the short table defines `refarray` it is especially fast, as we only need to map the reference values.

For CategoricalArrays.jl and PooledArrays.jl, `invrefpool` is simply `get` on the inverted pool `Dict` with `nothing` as a sentinel. I am not sure what would have to be defined in Arrow.jl.
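A minimal sketch of the mapping described above, in plain Julia. The names `pool`, `refs`, `invpool`, and `short_keys` are hypothetical stand-ins chosen for illustration; this is neither the DataAPI.jl interface nor the DataFrames.jl join code, only the idea of looking up short-table keys in an inverted pool with `nothing` as the "not found" sentinel.

```julia
# Toy stand-ins for a pooled key column of the long table (hypothetical names):
# `pool` holds the unique values, `refs` holds one integer code per row,
# so the long column is conceptually pool[refs].
pool = ["a", "b", "c"]
refs = [1, 3, 2, 1, 3]

# Inverted pool: value => ref value.
invpool = Dict(v => i for (i, v) in enumerate(pool))

# Key column of the short table.
short_keys = ["c", "x", "a"]

# Map each short-table key to the long table's ref value;
# `nothing` marks keys that cannot match any row of the long table.
mapped = [get(invpool, k, nothing) for k in short_keys]

# For an inner join, short-table rows without a match can be dropped right away.
keep = findall(!isnothing, mapped)
short_refs = Int[mapped[i] for i in keep]

# The remaining matching is done on integers instead of strings.
matches = [(i, j) for (i, r) in zip(keep, short_refs) for j in findall(==(r), refs)]
```

Here `matches` holds pairs of (short-table row, long-table row). A real implementation would presumably avoid the repeated `findall` scan; it is used here only to keep the sketch short.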
Codecov Report
@@ Coverage Diff @@
## main #33 +/- ##
==========================================
- Coverage 95.23% 90.90% -4.33%
==========================================
Files 1 1
Lines 21 22 +1
==========================================
Hits 20 20
- Misses 1 2 +1
Also could we add a restriction that ref values have to be non-negative (I am not sure if for Arrow.jl it would be acceptable). This would simplify code for me as negative value could be used as a sentinel (just like in |
Ah - now I have realized that we even do not require "ref value" to be an integer. So this is a question again if we want to add such a restriction. If not then At least I would add a restriction that some sentinel, e.g. What do you think? |
Makes sense. Two points:
- Do we really need to require `invrefpool(A)[x]` to return `nothing` when `x` isn't found?
- Maybe we should actually require that `invrefpool(A)[A[i]]` give the same value as `refarray(A)[i]`? That would be stricter, and is probably what we expect. Maybe you even rely on that in your PR?
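The stricter requirement in the second point can be sketched as a consistency check. Everything below is an assumption made for illustration: `ToyPooled`, `toy_refarray`, and `toy_invrefpool` are hypothetical local names, not the types or functions of DataAPI.jl, PooledArrays.jl, or CategoricalArrays.jl.

```julia
# Hypothetical minimal pooled vector, defined only for this sketch.
struct ToyPooled{T} <: AbstractVector{T}
    refs::Vector{Int}        # one integer code per element
    pool::Vector{T}          # unique values
    invpool::Dict{T,Int}     # value => code
end

# Convenience constructor that builds the inverted pool from `pool`.
function ToyPooled(pool::Vector{T}, refs::Vector{Int}) where {T}
    invpool = Dict{T,Int}(v => i for (i, v) in enumerate(pool))
    return ToyPooled{T}(refs, pool, invpool)
end

Base.size(A::ToyPooled) = size(A.refs)
Base.getindex(A::ToyPooled, i::Int) = A.pool[A.refs[i]]

# Local stand-ins for the accessors discussed in this thread.
toy_refarray(A::ToyPooled) = A.refs
toy_invrefpool(A::ToyPooled) = A.invpool

A = ToyPooled(["x", "y"], [2, 1, 2])

# The proposed invariant: looking up A[i] in the inverted pool gives back
# the same ref value that is stored for position i.
consistent = all(toy_invrefpool(A)[A[i]] == toy_refarray(A)[i] for i in 1:length(A))

# A value outside the pool is reported through a sentinel.
missing_key = get(toy_invrefpool(A), "z", nothing)
```

A check like `consistent` could serve as a test that an array type's inverted pool agrees with its ref array, whichever sentinel convention is finally chosen for values that are not found.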
No. We just need some way to tell if |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
OK, then maybe requiring What do you think about my other point?
Sorry I had missed these comments. This is an orthogonal issue so we can maybe discuss it separately. Maybe you could use |
I would go for
But I worked around it and As for:
we can add also this requirement, but it is actually looser than what I required I think (as I have updated the docstring to be precise. Hopefully you are OK with it. |
OK. Actually what I do in the DataFrames grouping code is that I only use the fast path when |
I agree. We just need to remember about it.
I know, but what is the benefit of not requiring If |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
I'm not sure there's a benefit, but I'm not sure there's a benefit to requiring
julia> NaN === 0/0
false
julia> isequal(NaN, 0/0)
true
More generally, any immutable type can implement |
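To illustrate why the distinction shown in the REPL output matters for a `Dict`-based inverted pool, here is a small sketch. It uses `-NaN` instead of `0/0` as the second NaN so that the bit patterns differ regardless of platform; that substitution and the variable names are assumptions of this example.

```julia
a = NaN
b = -NaN    # also a NaN, but with the sign bit set, so the bits differ

# `===` on floats compares bit patterns, so the two NaNs are distinguishable.
egal = a === b

# `isequal` treats every NaN as equal to every other NaN.
same = isequal(a, b)

# Dict lookup is based on `isequal` and `hash`, so a key stored as one NaN
# is found when queried with the other.
invpool = Dict(a => 1)
found = get(invpool, b, nothing)
```

In this sketch `egal` is `false` while `same` is `true`, and the lookup succeeds, so a contract phrased in terms of `isequal` matches what a `Dict`-backed inverted pool does in practice.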
Ah yes - the @quinnj - is it OK to merge it as is? The next step would be to add the support in PooledArrays.jl, CategoricalArrays.jl and Arrow.jl and make their releases. |
Yes, LGTM.
I've opened apache/arrow-julia#120 since we haven't implemented any DataAPI methods for `Arrow.DictEncoded` yet and I keep forgetting about it.
OK - I will wait till tomorrow and if there are no more comments then merge this PR and tag a release. Then I would open PRs to PooledArrays.jl and CategoricalArrays.jl to provide the API. |