Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

left, right and outerjoin rewrite #2622

Merged
merged 13 commits into from
Mar 4, 2021
Merged

left, right and outerjoin rewrite #2622

merged 13 commits into from
Mar 4, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Feb 15, 2021

Continuation of work on #2340.

In this PR I will add left, right and outer joins. For now I have added tests that should pass. Currently they fail because we have corner-case bugs in joining code in 0.22.5.

@bkamins
Copy link
Member Author

bkamins commented Feb 17, 2021

I have finished the implementation. What is needed to be done is:

  • review docstrings and manual if order of rows is not affected
  • write and run benchmarks

But before doing these let us agree that what I propose is OK.

In particular, as expected earlier, I propose to change the row order in the output. Now the row-order is:

  • everything that innerjoin finds matching
  • everything that is only in left table
  • everything that is only in right table

(previously we mixed innerjoin and "left table" parts and then "right table" followed, and I found it confusing and the rules were quite complex - what I propose I find easier to reason about, but maybe the old behavior had some benefits so please comment; in particular now indicator column that @oxinabox proposed to add some time ago is super easy to read as it is always sorted -first goes "both" then "left_only" then "right_only")

@bkamins
Copy link
Member Author

bkamins commented Feb 18, 2021

@nalimilan - currently using CategoricalArrays.jl is prohibitive for large joins. Below I share you the timings. For the time being I will disable testing CategoricalArrays.jl in benchmarks. Do you think these problems can be resolved (they are related to the things we discussed - mainly copyto!).

join_timing_with_cat.txt

(the timings are on this PR and include int, string, PooledArray and CategoricalArray tests to show how they differ)

@bkamins
Copy link
Member Author

bkamins commented Feb 18, 2021

Below I am uploading a CSV with benchmark comparisons of this PR vs 0.22.5. Conclusions:

  • I did not benchmark CategoricalArrays.jl as they are problematic to join on with many levels
  • the way I propose to handle left, right and outer join is faster than what was previously used
  • For PooledArrays.jl we are slower in cases when the new algorithm copies pool from the longer table (this hopefully can be fixed in PooledArrays.jl - if we want to keep generic code in DataFrames.jl; otherwise we can do special casing for PooledArrays.jl) - this case is rare
  • the only case when we are not visibly faster are innerjoin, on String with unbalanced length of left and right data frames and random order of values (here a review of this part of the algorithm would be welcome - one of the reasons is for sure the decision to use only exported API of Dict - but the fact is that that the joining algorithm based on groupby is indeed efficient in the case I described above at the cost of allocating much more)

Here are benchmark results. The meaning of columns is as follows:

  • left: nrow of left table
  • right: nrow of right table
  • type: what data type is the column that we join on
  • counts: if the data has unique values, duplicates, or many duplicates
  • order: if the data is sorted or shuffled
  • oncols: if we join on 1 or 2 columns
  • join: join type
  • timeleft_pr: time taken by the PR in left, right join order
  • allocleft_pr: allocations by the PR in left, right join order
  • timeright_pr: time taken by the PR in right, left join order
  • allocright_pr: allocations by the PR in right, left join order
  • timeleft_022: time taken by current release in left, right join order
  • allocleft_022: allocations by current release in left, right join order
  • timeright_022: time taken by current release in right, left join order
  • allocright_022: allocations by current release in right, left join order

comparison.txt

@bkamins
Copy link
Member Author

bkamins commented Feb 18, 2021

An extract showing how inefficient indexing in PooledArrays.jl kills performance (note that outerjoin does most work but I use different column creation strategy):

left right type counts order oncols join timeleft_pr timeright_pr
100000 50000000 pool dup sort 2 inner 0.340316 12.731161
100000 50000000 pool dup sort 2 left 0.359088 74.606119
100000 50000000 pool dup sort 2 right 75.194765 0.35665
100000 50000000 pool dup sort 2 outer 1.026409 1.025745

CC @quinnj as I do not know if you are watching this and probably Arrow.jl will have similar considerations.

@nalimilan
Copy link
Member

Yeah these are really bad. But it shouldn't be too hard to fix by changing getindex not to copy the pool when the length of the extracted slice is smaller than the pool (with some threshold)?

Looking at the comparisons between 0.22 and this PR, the only two serious regressions I see are these, right? Other results are usually incredibly faster.

left right type counts order oncols join timeleft_pr allocleft_pr timeright_pr allocright_pr timeleft_022 allocleft_022 timeright_022 allocright_022
5000000 10000000 pool manydup rand 1 left 20,65165 311 allocations: 2.005 GiB, 0.65% gc time 11,500368 320 allocations: 2.076 GiB, 2.30% gc time 17,766773 274 allocations: 6.095 GiB, 3.93% gc time 22,165854 287 allocations: 8.444 GiB, 2.91% gc time
5000000 10000000 pool manydup rand 2 left 40,482344 393 allocations: 3.160 GiB, 0.85% gc time 22,231336 410 allocations: 3.265 GiB, 1.78% gc time 25,02675 351 allocations: 6.481 GiB, 2.33% gc time 30,894943 371 allocations: 8.864 GiB, 2.76% gc time

@bkamins
Copy link
Member Author

bkamins commented Feb 18, 2021

Yes - and they are related to getindex with PooledArray.jl. We have a small regression in this case (as commented above), which probably can be fixed later, but it is minor:

the only case when we are not visibly faster are innerjoin, on String with unbalanced length of left and right data frames and random order of values (here a review of this part of the algorithm would be welcome - one of the reasons is for sure the decision to use only exported API of Dict - but the fact is that that the joining algorithm based on groupby is indeed efficient in the case I described above at the cost of allocating much more)

@bkamins
Copy link
Member Author

bkamins commented Feb 18, 2021

Now the analysis of CategoricalArrays.jl performance. It is tested on Julia nightly:

[ Info: ["5000000", "5000000", "cat", "dup", "rand", "1", "outer"] # nl/valindex
  9.875335 seconds (275 allocations: 1.005 GiB, 0.12% gc time)
  9.901705 seconds (275 allocations: 1.005 GiB, 0.01% gc time)

julia join_performance.jl 5000000 5000000 cat dup rand 1 outer # 0.9.2 release
[ Info: ["5000000", "5000000", "cat", "dup", "rand", "1", "outer"]
 14.798354 seconds (2.50 M allocations: 1.098 GiB, 0.08% gc time)
 16.966914 seconds (2.50 M allocations: 1.098 GiB, 0.37% gc time)

[ Info: ["5000000", "5000000", "str", "dup", "rand", "1", "outer"]
  7.026243 seconds (208 allocations: 908.015 MiB, 0.19% gc time)
  6.981867 seconds (208 allocations: 908.015 MiB, 0.19% gc time)

[ Info: ["5000000", "5000000", "pool", "dup", "rand", "1", "outer"]
 12.769761 seconds (292 allocations: 1004.706 MiB, 0.10% gc time)
 12.975913 seconds (292 allocations: 1004.706 MiB, 0.01% gc time)

[ Info: ["5000000", "5000000", "int", "dup", "rand", "1", "outer"]
  5.211807 seconds (208 allocations: 917.552 MiB, 1.85% gc time)
  5.195771 seconds (208 allocations: 917.552 MiB, 1.84% gc time)

and here are timings of data.table for the same structure of data (single threaded, as data.table uses threading here which we do not yet):

> system.time(merge(a, b, all=T)) # number
   user  system elapsed 
  3.515   0.156   3.671 

> system.time(merge(a, b, all=T)) # character
   user  system elapsed 
 26.414   0.351  26.760 

> system.time(merge(a, b, all=T)) # factor on character
   user  system elapsed 
  5.514   0.059   5.578 

Conclusions, we are:

  • a bit slower for integers
  • faster for strings
  • slower for categorical/pooled data, and nl/valindex helps

(I am using outerjoin as it is not affected by the problem of copying pools which are present in innerjoin, leftjoin and rightjoin)

@bkamins
Copy link
Member Author

bkamins commented Feb 19, 2021

In the nl/valindex branch the following timings are prohibitive still:

[ Info: ["100000", "50000000", "cat", "dup", "sort", "1", "outer"]
555.769180 seconds (200.41 k allocations: 2.851 GiB, 0.13% gc time)
 52.146842 seconds (200.46 k allocations: 5.340 GiB, 2.50% gc time)
[ Info: ["100000", "50000000", "cat", "dup", "sort", "2", "outer"]
1102.413404 seconds (612 allocations: 5.309 GiB, 0.07% gc time)
102.891563 seconds (706 allocations: 10.285 GiB, 3.82% gc time)

The same for PooledArrays.jl is:

[ Info: ["100000", "50000000", "pool", "dup", "sort", "1", "outer"]
  0.830089 seconds (243 allocations: 773.888 MiB)
  0.836661 seconds (243 allocations: 773.888 MiB)
[ Info: ["100000", "50000000", "pool", "dup", "sort", "2", "outer"]
  1.339374 seconds (282 allocations: 1.129 GiB)
  1.313691 seconds (282 allocations: 1.129 GiB)

And for Int (which is a lower bound):

[ Info: ["100000", "50000000", "int", "dup", "sort", "1", "outer"]
  0.605286 seconds (238 allocations: 773.507 MiB, 0.15% gc time)
  0.600281 seconds (238 allocations: 773.507 MiB, 0.10% gc time)
[ Info: ["100000", "50000000", "int", "dup", "sort", "2", "outer"]
  0.985609 seconds (272 allocations: 1.129 GiB, 0.10% gc time)
  0.979220 seconds (272 allocations: 1.129 GiB, 0.06% gc time)

test/join.jl Outdated

@test innerjoin(DataFrame(id=[missing]), DataFrame(id=[1]), on=:id, matchmissing=:equal) ==
@test innerjoin(DataFrame(id=Union{Int, Missing}[]), DataFrame(id=[1]), on=:id, matchmissing=:equal)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't actually test the eltype.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the intention of this test is to check if we do not error in this case (as there are errors in 0.22.5 for joins in similar cases - maybe not in this one; I wanted to ensure full coverage). Do you want me to add eltype tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added them. Just to summarize the rules:

  • inner and left join take type from left data frame
  • right join takes from right data frame
  • outer join joins types

@bkamins
Copy link
Member Author

bkamins commented Feb 20, 2021

The plan before the release is the following:

  • implement Add pool sharing with copy on write PooledArrays.jl#56 (i.e. whenever possible do not copy pool immediately; the only problematic scenario will be not-on columns that require eltype widening, which will have to copy pool); then release PooledArrays.jl
  • implemement speedups in CategoricalArrays.jl (@nalimilan is working on this)
  • re-run benchmarks
  • potentially add more tests as prompted by @nalimilan

@bkamins
Copy link
Member Author

bkamins commented Feb 21, 2021

TODO: add checking of eltype returned columns in tests

@bkamins
Copy link
Member Author

bkamins commented Mar 1, 2021

With PooledArrays.jl 1.2.1 joining on PooledVector is always faster than String vector except for sorted case when both left and right data frames are long. I will investigate this case and apply comments by @nalimilan to the tests and this PR should be then finished.

EDIT: it must be slower in the corner case described above as for two PooledArrays we do ref-mapping between both PooledArrays and when they are of equal length this is a bad corner case. But I think this is acceptable case.

@bkamins
Copy link
Member Author

bkamins commented Mar 2, 2021

@nalimilan - this should be good to review. Thank you!

src/join/core.jl Outdated Show resolved Hide resolved
Comment on lines +479 to +483
lc_et <: Real && rc_et <: Real && continue
lc_et <: AbstractString && rc_et <: AbstractString && continue

# otherwise we require non-missing eltype of both sides to be the same and concrete
lc_et === rc_et && isconcretetype(lc_et) && continue
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - this is a set of conditions I propose to use. It should be safe. If it is not met then we will not be fast anyway as it means that some weird join is done (most likely with Any containers)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it should be fine in most common cases, but ideally we would remove this condition once CategoricalArrays are fixed. One of DataFrames.jl's strengths is that it's efficient even for custom types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - after we merge this PR I will open an issue to track it.

src/join/core.jl Outdated
# this is a workaround for https://github.com/JuliaData/CategoricalArrays.jl/issues/319
# this path will be triggered only in rare cases when the refpool code above
# fails to convert CategoricalArray into refpool
# although isless should be transitive it is safer not to rely on it if we do not have to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is ambiguous: it could mean that we don't rely on transitivity at all. It's not that we "have to", is that we decide "not to" for nonstandard cases. But we're still not safe if a custom numeric or string types uses an incorrect definition of isless.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I remove it. I mean "we rely on transitivity and we know of cases when it is not transitive (although it should be) so for safety we disable fast path"

src/join/core.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Mar 4, 2021

Thank you!

@bkamins bkamins merged commit 7cf7d26 into main Mar 4, 2021
@bkamins bkamins deleted the bk/outerjoin_rewrite branch March 4, 2021 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants