Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add invrefpool #33

Merged
merged 7 commits into from
Feb 1, 2021
Merged

Add invrefpool #33

merged 7 commits into from
Feb 1, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jan 29, 2021

Having a generic access to invrefpool is needed in JuliaData/DataFrames.jl#2612.

Consider a short table and a long table joined on some column. In order to be fast we need to map values from short table key to ref values of long table key. This allows two things for innerjoin:

  1. we immediately can drop values from short table not present in long table.
  2. later we can do join on integer columns which is way faster than joining on e.g. string column.

Also since we do mapping of short table this operation should be fast.

In particular if short table defines refarray it is particularly fast, as we only need to map the reference values.

For CategoricalArrays.jl and PooledArrays.jl invrefpool is simply get on the inverted pool Dict with nothing as a sentinel.

I am not sure what would have to be defined in Arrow.jl.

Having a generic access to `invrefpool` is needed in JuliaData/DataFrames.jl#2612.

Consider a short table and a long table joined on some column. In order to be fast we need to map values from short table key to ref values of long table key. This allows two things for `innerjoin`:
1. we immediately can drop values from short table not present in long table.
2. later we can do join on integer columns which is way faster than joining on e.g. string column.

Also since we do mapping of short table this operation should be fast.

In particular if short table defines `refarray` it is particularly fast, as we only need to map the reference values.

For CategoricalArrays.jl and PooledArrays.jl `invrefpool` is simply `get` on the inverted pool `Dict` with `nothing` as a sentinel.

I am not sure what would have to be defined in Arrow.jl.
@bkamins bkamins requested review from quinnj and nalimilan January 29, 2021 19:18
@codecov
Copy link

codecov bot commented Jan 29, 2021

Codecov Report

Merging #33 (0a880c3) into main (3bff060) will decrease coverage by 4.32%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #33      +/-   ##
==========================================
- Coverage   95.23%   90.90%   -4.33%     
==========================================
  Files           1        1              
  Lines          21       22       +1     
==========================================
  Hits           20       20              
- Misses          1        2       +1     
Impacted Files Coverage Δ
src/DataAPI.jl 90.90% <0.00%> (-4.33%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3bff060...4297269. Read the comment docs.

@bkamins
Copy link
Member Author

bkamins commented Jan 29, 2021

Also could we add a restriction that ref values have to be non-negative (I am not sure if for Arrow.jl it would be acceptable). This would simplify code for me as negative value could be used as a sentinel (just like in Dict in Base, when negative value is internally returned when lookup fails).

src/DataAPI.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jan 29, 2021

Ah - now I have realized that we even do not require "ref value" to be an integer. So this is a question again if we want to add such a restriction. If not then invrefpool(A) should support not only getindex but also haskey and get.

At least I would add a restriction that some sentinel, e.g. nothing is not allowed to be "ref value".

What do you think?

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Two points:

  • Do we really need to require invrefpool(A)[x] to return nothing when x isn't found?
  • Maybe we should actually require that invrefpool(A)[A[i]] give the same value as refarray(A)[i]? That would be stricter, and is probably what we expect. Maybe you even rely on that in your PR?

src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jan 30, 2021

Do we really need to require invrefpool(A)[x] to return nothing when x isn't found?

No. We just need some way to tell if x is in the pool or not. Therefore requiring that haskey or supporting get is an alternative.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan
Copy link
Member

No. We just need some way to tell if x is in the pool or not. Therefore requiring that haskey or supporting get is an alternative.

OK, then maybe requiring haskey is better. Otherwise that prevents implementing invrefpool using a standard Dict.

What do you think about my other point?

Also could we add a restriction that ref values have to be non-negative (I am not sure if for Arrow.jl it would be acceptable). This would simplify code for me as negative value could be used as a sentinel (just like in Dict in Base, when negative value is internally returned when lookup fails).

Ah - now I have realized that we even do not require "ref value" to be an integer. So this is a question again if we want to add such a restriction. If not then invrefpool(A) should support not only getindex but also haskey and get.

At least I would add a restriction that some sentinel, e.g. nothing is not allowed to be "ref value".

What do you think?

Sorry I had missed these comments. This is an orthogonal issue so we can maybe discuss it separately. Maybe you could use first(refpool(A))-1 as a sentinel?

src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jan 30, 2021

I would go for haskey then. We cannot use first(refpool(A))-1 for sentinel as:

  1. we do not have a guarantee that references are numbers now.
  2. even if we had this guarantee subtracting 1 could be problematic in theory (e.g. for unsigned values we could get -1 and type instability)

But I worked around it and haskey is enough for me.

As for:

invrefpool(A)[A[i]] give the same value as refarray(A)[i]

we can add also this requirement, but it is actually looser than what I required I think (as refpool can contain values not present in A and we still want invrefpool should also have a mapping for it).

I have updated the docstring to be precise. Hopefully you are OK with it.

@nalimilan
Copy link
Member

I would go for haskey then. We cannot use first(refpool(A))-1 for sentinel as:

1. we do not have a guarantee that references are numbers now.

2. even if we had this guarantee subtracting `1` could be problematic in theory (e.g. for unsigned values we could get `-1` and type instability)

But I worked around it and haskey is enough for me.

OK. Actually what I do in the DataFrames grouping code is that I only use the fast path when refarray(A) isa AbstractArray{Integer}. That way we're safe but we still allow other types in case that could be useful for some packages.

src/DataAPI.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jan 31, 2021

That way we're safe but we still allow other types in case that could be useful for some packages.

I agree. We just need to remember about it.

Currently we don't require === in the docstring for refpool (only isequal and equal type)

I know, but what is the benefit of not requiring ===? (we can discuss it in other PR, I accepted your suggestion in this PR).

If isequal is implemented correctly "isequal and equal type" should be equivalent to === for immutable types.
And I think we should disallow mutable types in PooledArrays.jl (we already disallow them in CategoricalArrays.jl). I am not sure about Arrow.jl (cc @quinnj). See JuliaData/PooledArrays.jl#51.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan
Copy link
Member

I know, but what is the benefit of not requiring ===? (we can discuss it in other PR, I accepted your suggestion in this PR).

If isequal is implemented correctly "isequal and equal type" should be equivalent to === for immutable types.
And I think we should disallow mutable types in PooledArrays.jl (we already disallow them in CategoricalArrays.jl). I am not sure about Arrow.jl (cc @quinnj). See JuliaData/PooledArrays.jl#51.

I'm not sure there's a benefit, but I'm not sure there's a benefit to requiring === either. :-) Anyway it's not equivalent, for example:

julia> NaN === 0/0
false

julia> isequal(NaN, 0/0)
true

More generally, any immutable type can implement isequal in a less strict way than ===, e.g. if it includes references to mutable objects like CategoricalValue. There may also be corner cases with strings.

src/DataAPI.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jan 31, 2021

Ah yes - the NaN thing is nasty.

@quinnj - is it OK to merge it as is? The next step would be to add the support in PooledArrays.jl, CategoricalArrays.jl and Arrow.jl and make their releases.

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, LGTM.

I've opened apache/arrow-julia#120 since we haven't implemented any DataAPI methods for Arrow.DictEncoded yet and I keep forgetting about it.

@bkamins
Copy link
Member Author

bkamins commented Jan 31, 2021

OK - I will wait till tomorrow and if there are no more comments then merge this PR and tag a release. Then I would open PRs to PooledArrays.jl and CategoricalArrays.jl to provide the API.

@bkamins bkamins merged commit 241a203 into main Feb 1, 2021
@bkamins bkamins deleted the bkamins-patch-2 branch February 1, 2021 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants