Add invrefpool
Generic access to `invrefpool` is needed in JuliaData/DataFrames.jl#2612.

Consider a short table and a long table joined on some column. To make the join fast, we need to map the values of the short table's key to the ref values of the long table's key. This enables two things for `innerjoin`:
1. We can immediately drop values from the short table that are not present in the long table.
2. Later we can do the join on integer columns, which is much faster than joining on e.g. a string column.

Also, since we only map the short table, this operation should be fast.

In particular, if the short table defines `refarray`, the mapping is especially fast, as we only need to map the reference values.
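
To make the idea concrete, here is a minimal sketch of the mapping step using only Base Julia. All names (`long_pool`, `long_refs`, `long_invpool`, `short_key`) and the data are hypothetical and are not taken from DataFrames.jl; the nested loop at the end only illustrates matching on integers and is not how an actual join would be implemented.

```julia
# Hypothetical long-table key, stored as integer references into a pool.
long_pool = ["a", "b", "c"]          # unique key values
long_refs = [1, 3, 3, 2, 1]          # long key column is long_pool[long_refs]

# Inverted pool: value => reference (assumed layout, for illustration only).
long_invpool = Dict(v => i for (i, v) in enumerate(long_pool))

# Hypothetical short-table key, stored as plain strings.
short_key = ["c", "x", "a"]

# Map each short-table value to the long table's ref value,
# using `nothing` as the sentinel for "not present".
short_refs = [get(long_invpool, v, nothing) for v in short_key]

# 1. Rows of the short table that cannot match are dropped immediately.
keep = findall(!isnothing, short_refs)

# 2. The remaining comparison is done on integers only.
matches = [(s, l) for s in keep for l in eachindex(long_refs)
           if long_refs[l] == short_refs[s]]
```

Each pair in `matches` is a (short table row, long table row) combination that an inner join would produce. Only the short table is passed through the `Dict` lookup, which is why the mapping cost stays small.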

For CategoricalArrays.jl and PooledArrays.jl `invrefpool` is simply `get` on the inverted pool `Dict` with `nothing` as a sentinel.
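
As a rough illustration of that statement, the sketch below defines a toy pooled container and a lookup that is just `get` on its inverted pool. `ToyPooled` and `toy_invref` are hypothetical names invented for this example; this is not the actual PooledArrays.jl or CategoricalArrays.jl code.

```julia
# Hypothetical toy container: NOT the PooledArrays.jl implementation.
struct ToyPooled{T}
    refs::Vector{Int}       # one reference per element
    pool::Vector{T}         # unique values, indexed by reference
    invpool::Dict{T,Int}    # inverted pool: value => reference
end

# Build the pool, the inverted pool, and the references from raw values.
function ToyPooled(values::AbstractVector{T}) where {T}
    pool = unique(values)
    invpool = Dict{T,Int}(v => i for (i, v) in enumerate(pool))
    refs = [invpool[v] for v in values]
    return ToyPooled{T}(refs, pool, invpool)
end

# Hypothetical helper: look up the ref value of `x`,
# returning `nothing` as the sentinel when `x` is not in the pool.
toy_invref(A::ToyPooled, x) = get(A.invpool, x, nothing)

A = ToyPooled(["a", "b", "a", "c"])
r = toy_invref(A, "c")       # reference of "c" in the pool
missing_ref = toy_invref(A, "z")   # "z" is not in the pool
```

Indexing the pool with a returned reference gives back the original value, which is the round-trip property the docstring in this commit describes.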

I am not sure what would have to be defined in Arrow.jl.
bkamins authored Jan 29, 2021
1 parent 3bff060 commit 1500f61
Showing 1 changed file with 17 additions and 0 deletions.
17 changes: 17 additions & 0 deletions src/DataAPI.jl
@@ -69,6 +69,23 @@ default definition.
function refpool end
refpool(A::AbstractArray) = nothing

"""
invrefpool(A)
Whenever available, return an indexable object `invpool` such that, given the *original*
array `A` and a "value" `x`, `refpool(A)[invpool[x]]` is equal to `x`.
Return `nothing` if such "ref value" is not available.
By default, `invrefpool(A)` returns `nothing`.
If `invrefpool(A)` is not `nothing`, then `refpool(A)` also must not be `nothing`.
This generic function is owned by DataAPI.jl itself, which is the sole provider of the
default definition.
"""
function invrefpool end
invrefpool(A::AbstractArray) = nothing

"""
describe(io::IO, x)