-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
left, right and outerjoin rewrite #2622
Conversation
I have finished the implementation. What is needed to be done is:
But before doing these let us agree that what I propose is OK. In particular, as expected earlier, I propose to change the row order in the output. Now the row-order is:
(previously we mixed |
@nalimilan - currently using CategoricalArrays.jl is prohibitive for large joins. Below I share you the timings. For the time being I will disable testing CategoricalArrays.jl in benchmarks. Do you think these problems can be resolved (they are related to the things we discussed - mainly (the timings are on this PR and include int, string, PooledArray and CategoricalArray tests to show how they differ) |
Below I am uploading a CSV with benchmark comparisons of this PR vs 0.22.5. Conclusions:
Here are benchmark results. The meaning of columns is as follows:
|
An extract showing how inefficient indexing in PooledArrays.jl kills performance (note that
CC @quinnj as I do not know if you are watching this and probably Arrow.jl will have similar considerations. |
Yeah these are really bad. But it shouldn't be too hard to fix by changing Looking at the comparisons between 0.22 and this PR, the only two serious regressions I see are these, right? Other results are usually incredibly faster.
|
Yes - and they are related to
|
Now the analysis of CategoricalArrays.jl performance. It is tested on Julia nightly:
and here are timings of
Conclusions, we are:
(I am using |
In the nl/valindex branch the following timings are prohibitive still:
The same for PooledArrays.jl is:
And for
|
test/join.jl
Outdated
|
||
@test innerjoin(DataFrame(id=[missing]), DataFrame(id=[1]), on=:id, matchmissing=:equal) == | ||
@test innerjoin(DataFrame(id=Union{Int, Missing}[]), DataFrame(id=[1]), on=:id, matchmissing=:equal) ≅ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't actually test the eltype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the intention of this test is to check if we do not error in this case (as there are errors in 0.22.5 for joins in similar cases - maybe not in this one; I wanted to ensure full coverage). Do you want me to add eltype
tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added them. Just to summarize the rules:
- inner and left join take type from left data frame
- right join takes from right data frame
- outer join joins types
The plan before the release is the following:
|
TODO: add checking of eltype returned columns in tests |
With PooledArrays.jl 1.2.1 joining on EDIT: it must be slower in the corner case described above as for two |
@nalimilan - this should be good to review. Thank you! |
lc_et <: Real && rc_et <: Real && continue | ||
lc_et <: AbstractString && rc_et <: AbstractString && continue | ||
|
||
# otherwise we require non-missing eltype of both sides to be the same and concrete | ||
lc_et === rc_et && isconcretetype(lc_et) && continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - this is a set of conditions I propose to use. It should be safe. If it is not met then we will not be fast anyway as it means that some weird join is done (most likely with Any
containers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it should be fine in most common cases, but ideally we would remove this condition once CategoricalArrays are fixed. One of DataFrames.jl's strengths is that it's efficient even for custom types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - after we merge this PR I will open an issue to track it.
src/join/core.jl
Outdated
# this is a workaround for https://github.com/JuliaData/CategoricalArrays.jl/issues/319 | ||
# this path will be triggered only in rare cases when the refpool code above | ||
# fails to convert CategoricalArray into refpool | ||
# although isless should be transitive it is safer not to rely on it if we do not have to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is ambiguous: it could mean that we don't rely on transitivity at all. It's not that we "have to", is that we decide "not to" for nonstandard cases. But we're still not safe if a custom numeric or string types uses an incorrect definition of isless
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I remove it. I mean "we rely on transitivity and we know of cases when it is not transitive (although it should be) so for safety we disable fast path"
Thank you! |
Continuation of work on #2340.
In this PR I will add left, right and outer joins. For now I have added tests that should pass. Currently they fail because we have corner-case bugs in joining code in 0.22.5.