left, right and outerjoin rewrite #2622

bkamins · 2021-02-15T23:23:29Z

Continuation of work on #2340.

In this PR I will add left, right and outer joins. For now I have added tests that should pass. Currently they fail because we have corner-case bugs in joining code in 0.22.5.

bkamins · 2021-02-17T09:20:30Z

I have finished the implementation. What is needed to be done is:

review docstrings and manual if order of rows is not affected
write and run benchmarks

But before doing these let us agree that what I propose is OK.

In particular, as expected earlier, I propose to change the row order in the output. Now the row-order is:

everything that innerjoin finds matching
everything that is only in left table
everything that is only in right table

(previously we mixed innerjoin and "left table" parts and then "right table" followed, and I found it confusing and the rules were quite complex - what I propose I find easier to reason about, but maybe the old behavior had some benefits so please comment; in particular now indicator column that @oxinabox proposed to add some time ago is super easy to read as it is always sorted -first goes "both" then "left_only" then "right_only")

bkamins · 2021-02-18T08:53:29Z

@nalimilan - currently using CategoricalArrays.jl is prohibitive for large joins. Below I share you the timings. For the time being I will disable testing CategoricalArrays.jl in benchmarks. Do you think these problems can be resolved (they are related to the things we discussed - mainly copyto!).

join_timing_with_cat.txt

(the timings are on this PR and include int, string, PooledArray and CategoricalArray tests to show how they differ)

bkamins · 2021-02-18T15:14:04Z

Below I am uploading a CSV with benchmark comparisons of this PR vs 0.22.5. Conclusions:

I did not benchmark CategoricalArrays.jl as they are problematic to join on with many levels
the way I propose to handle left, right and outer join is faster than what was previously used
For PooledArrays.jl we are slower in cases when the new algorithm copies pool from the longer table (this hopefully can be fixed in PooledArrays.jl - if we want to keep generic code in DataFrames.jl; otherwise we can do special casing for PooledArrays.jl) - this case is rare
the only case when we are not visibly faster are innerjoin, on String with unbalanced length of left and right data frames and random order of values (here a review of this part of the algorithm would be welcome - one of the reasons is for sure the decision to use only exported API of Dict - but the fact is that that the joining algorithm based on groupby is indeed efficient in the case I described above at the cost of allocating much more)

Here are benchmark results. The meaning of columns is as follows:

left: nrow of left table
right: nrow of right table
type: what data type is the column that we join on
counts: if the data has unique values, duplicates, or many duplicates
order: if the data is sorted or shuffled
oncols: if we join on 1 or 2 columns
join: join type
timeleft_pr: time taken by the PR in left, right join order
allocleft_pr: allocations by the PR in left, right join order
timeright_pr: time taken by the PR in right, left join order
allocright_pr: allocations by the PR in right, left join order
timeleft_022: time taken by current release in left, right join order
allocleft_022: allocations by current release in left, right join order
timeright_022: time taken by current release in right, left join order
allocright_022: allocations by current release in right, left join order

comparison.txt

bkamins · 2021-02-18T18:07:01Z

An extract showing how inefficient indexing in PooledArrays.jl kills performance (note that outerjoin does most work but I use different column creation strategy):

left	right	type	counts	order	oncols	join	timeleft_pr	timeright_pr
100000	50000000	pool	dup	sort	2	inner	0.340316	12.731161
100000	50000000	pool	dup	sort	2	left	0.359088	74.606119
100000	50000000	pool	dup	sort	2	right	75.194765	0.35665
100000	50000000	pool	dup	sort	2	outer	1.026409	1.025745

CC @quinnj as I do not know if you are watching this and probably Arrow.jl will have similar considerations.

nalimilan · 2021-02-18T18:19:09Z

Yeah these are really bad. But it shouldn't be too hard to fix by changing getindex not to copy the pool when the length of the extracted slice is smaller than the pool (with some threshold)?

Looking at the comparisons between 0.22 and this PR, the only two serious regressions I see are these, right? Other results are usually incredibly faster.

left	right	type	counts	order	oncols	join	timeleft_pr	allocleft_pr	timeright_pr	allocright_pr	timeleft_022	allocleft_022	timeright_022	allocright_022
5000000	10000000	pool	manydup	rand	1	left	20,65165	311 allocations: 2.005 GiB, 0.65% gc time	11,500368	320 allocations: 2.076 GiB, 2.30% gc time	17,766773	274 allocations: 6.095 GiB, 3.93% gc time	22,165854	287 allocations: 8.444 GiB, 2.91% gc time
5000000	10000000	pool	manydup	rand	2	left	40,482344	393 allocations: 3.160 GiB, 0.85% gc time	22,231336	410 allocations: 3.265 GiB, 1.78% gc time	25,02675	351 allocations: 6.481 GiB, 2.33% gc time	30,894943	371 allocations: 8.864 GiB, 2.76% gc time

bkamins · 2021-02-18T18:36:22Z

Yes - and they are related to getindex with PooledArray.jl. We have a small regression in this case (as commented above), which probably can be fixed later, but it is minor:

the only case when we are not visibly faster are innerjoin, on String with unbalanced length of left and right data frames and random order of values (here a review of this part of the algorithm would be welcome - one of the reasons is for sure the decision to use only exported API of Dict - but the fact is that that the joining algorithm based on groupby is indeed efficient in the case I described above at the cost of allocating much more)

bkamins · 2021-02-18T18:48:05Z

Now the analysis of CategoricalArrays.jl performance. It is tested on Julia nightly:

[ Info: ["5000000", "5000000", "cat", "dup", "rand", "1", "outer"] # nl/valindex
  9.875335 seconds (275 allocations: 1.005 GiB, 0.12% gc time)
  9.901705 seconds (275 allocations: 1.005 GiB, 0.01% gc time)

julia join_performance.jl 5000000 5000000 cat dup rand 1 outer # 0.9.2 release
[ Info: ["5000000", "5000000", "cat", "dup", "rand", "1", "outer"]
 14.798354 seconds (2.50 M allocations: 1.098 GiB, 0.08% gc time)
 16.966914 seconds (2.50 M allocations: 1.098 GiB, 0.37% gc time)

[ Info: ["5000000", "5000000", "str", "dup", "rand", "1", "outer"]
  7.026243 seconds (208 allocations: 908.015 MiB, 0.19% gc time)
  6.981867 seconds (208 allocations: 908.015 MiB, 0.19% gc time)

[ Info: ["5000000", "5000000", "pool", "dup", "rand", "1", "outer"]
 12.769761 seconds (292 allocations: 1004.706 MiB, 0.10% gc time)
 12.975913 seconds (292 allocations: 1004.706 MiB, 0.01% gc time)

[ Info: ["5000000", "5000000", "int", "dup", "rand", "1", "outer"]
  5.211807 seconds (208 allocations: 917.552 MiB, 1.85% gc time)
  5.195771 seconds (208 allocations: 917.552 MiB, 1.84% gc time)

and here are timings of data.table for the same structure of data (single threaded, as data.table uses threading here which we do not yet):

> system.time(merge(a, b, all=T)) # number
   user  system elapsed 
  3.515   0.156   3.671 

> system.time(merge(a, b, all=T)) # character
   user  system elapsed 
 26.414   0.351  26.760 

> system.time(merge(a, b, all=T)) # factor on character
   user  system elapsed 
  5.514   0.059   5.578

Conclusions, we are:

a bit slower for integers
faster for strings
slower for categorical/pooled data, and nl/valindex helps

(I am using outerjoin as it is not affected by the problem of copying pools which are present in innerjoin, leftjoin and rightjoin)

bkamins · 2021-02-19T17:06:08Z

In the nl/valindex branch the following timings are prohibitive still:

[ Info: ["100000", "50000000", "cat", "dup", "sort", "1", "outer"]
555.769180 seconds (200.41 k allocations: 2.851 GiB, 0.13% gc time)
 52.146842 seconds (200.46 k allocations: 5.340 GiB, 2.50% gc time)
[ Info: ["100000", "50000000", "cat", "dup", "sort", "2", "outer"]
1102.413404 seconds (612 allocations: 5.309 GiB, 0.07% gc time)
102.891563 seconds (706 allocations: 10.285 GiB, 3.82% gc time)

The same for PooledArrays.jl is:

[ Info: ["100000", "50000000", "pool", "dup", "sort", "1", "outer"]
  0.830089 seconds (243 allocations: 773.888 MiB)
  0.836661 seconds (243 allocations: 773.888 MiB)
[ Info: ["100000", "50000000", "pool", "dup", "sort", "2", "outer"]
  1.339374 seconds (282 allocations: 1.129 GiB)
  1.313691 seconds (282 allocations: 1.129 GiB)

And for Int (which is a lower bound):

[ Info: ["100000", "50000000", "int", "dup", "sort", "1", "outer"]
  0.605286 seconds (238 allocations: 773.507 MiB, 0.15% gc time)
  0.600281 seconds (238 allocations: 773.507 MiB, 0.10% gc time)
[ Info: ["100000", "50000000", "int", "dup", "sort", "2", "outer"]
  0.985609 seconds (272 allocations: 1.129 GiB, 0.10% gc time)
  0.979220 seconds (272 allocations: 1.129 GiB, 0.06% gc time)

src/join/composer.jl

nalimilan · 2021-02-16T08:38:36Z

test/join.jl


-    @test innerjoin(DataFrame(id=[missing]), DataFrame(id=[1]), on=:id, matchmissing=:equal) ==
+    @test innerjoin(DataFrame(id=Union{Int, Missing}[]), DataFrame(id=[1]), on=:id, matchmissing=:equal) ≅


This doesn't actually test the eltype.

Yes, the intention of this test is to check if we do not error in this case (as there are errors in 0.22.5 for joins in similar cases - maybe not in this one; I wanted to ensure full coverage). Do you want me to add eltype tests?

I have added them. Just to summarize the rules:

inner and left join take type from left data frame

right join takes from right data frame

outer join joins types

bkamins · 2021-02-20T10:23:28Z

The plan before the release is the following:

implement Add pool sharing with copy on write PooledArrays.jl#56 (i.e. whenever possible do not copy pool immediately; the only problematic scenario will be not-on columns that require eltype widening, which will have to copy pool); then release PooledArrays.jl
implemement speedups in CategoricalArrays.jl (@nalimilan is working on this)
re-run benchmarks
potentially add more tests as prompted by @nalimilan

bkamins · 2021-02-21T16:56:16Z

TODO: add checking of eltype returned columns in tests

bkamins · 2021-03-01T20:10:30Z

With PooledArrays.jl 1.2.1 joining on PooledVector is always faster than String vector except for sorted case when both left and right data frames are long. I will investigate this case and apply comments by @nalimilan to the tests and this PR should be then finished.

EDIT: it must be slower in the corner case described above as for two PooledArrays we do ref-mapping between both PooledArrays and when they are of equal length this is a bad corner case. But I think this is acceptable case.

bkamins · 2021-03-02T00:09:03Z

@nalimilan - this should be good to review. Thank you!

src/join/core.jl

bkamins · 2021-03-02T13:32:27Z

src/join/core.jl

+        lc_et <: Real && rc_et <: Real && continue
+        lc_et <: AbstractString && rc_et <: AbstractString && continue
+
+        # otherwise we require non-missing eltype of both sides to be the same and concrete
+        lc_et === rc_et && isconcretetype(lc_et) && continue


@nalimilan - this is a set of conditions I propose to use. It should be safe. If it is not met then we will not be fast anyway as it means that some weird join is done (most likely with Any containers)

I agree it should be fine in most common cases, but ideally we would remove this condition once CategoricalArrays are fixed. One of DataFrames.jl's strengths is that it's efficient even for custom types.

OK - after we merge this PR I will open an issue to track it.

nalimilan · 2021-03-02T17:45:10Z

src/join/core.jl

-    # this is a workaround for https://github.com/JuliaData/CategoricalArrays.jl/issues/319
-    # this path will be triggered only in rare cases when the refpool code above
-    # fails to convert CategoricalArray into refpool
+    # although isless should be transitive it is safer not to rely on it if we do not have to


This comment is ambiguous: it could mean that we don't rely on transitivity at all. It's not that we "have to", is that we decide "not to" for nonstandard cases. But we're still not safe if a custom numeric or string types uses an incorrect definition of isless.

OK - I remove it. I mean "we rely on transitivity and we know of cases when it is not transitive (although it should be) so for safety we disable fast path"

src/join/core.jl

bkamins · 2021-03-04T13:25:50Z

Thank you!

bkamins added 2 commits February 16, 2021 00:19

add tests before implementation

86d9e24

remove not needed code

187d45f

bkamins added bug priority performance labels Feb 15, 2021

bkamins added this to the 1.0 milestone Feb 15, 2021

bkamins added 3 commits February 16, 2021 15:24

refactor code in preparation for making it more flexible

29951a4

fix tests up to line 430

96ab94b

implement left, right and outer join functionality

6266987

bkamins added 3 commits February 17, 2021 12:03

clean up tests

a5b098c

update benchmarks

1ee9289

fix typo

3653234

remove CategoricalArrays.jl from benchmarks

2341be3

bkamins commented Feb 19, 2021

View reviewed changes

src/join/composer.jl Show resolved Hide resolved

bkamins commented Feb 19, 2021

View reviewed changes

src/join/composer.jl Show resolved Hide resolved

nalimilan reviewed Feb 19, 2021

View reviewed changes

bkamins mentioned this pull request Feb 20, 2021

Add more threading supoprt #2626

Closed

bkamins added 2 commits March 1, 2021 21:19

bump PooledArray dependency

56c37ea

add column type checks

04e6462

bkamins commented Mar 2, 2021

View reviewed changes

src/join/core.jl Outdated Show resolved Hide resolved

change the condition for allowing using sorted branch

df8bbdd

bkamins commented Mar 2, 2021

View reviewed changes

nalimilan reviewed Mar 2, 2021

View reviewed changes

bkamins commented Mar 2, 2021

View reviewed changes

src/join/core.jl Outdated Show resolved Hide resolved

Update src/join/core.jl

d063265

nalimilan approved these changes Mar 3, 2021

View reviewed changes

bkamins merged commit 7cf7d26 into main Mar 4, 2021

bkamins deleted the bk/outerjoin_rewrite branch March 4, 2021 13:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

left, right and outerjoin rewrite #2622

left, right and outerjoin rewrite #2622

bkamins commented Feb 15, 2021

bkamins commented Feb 17, 2021

bkamins commented Feb 18, 2021

bkamins commented Feb 18, 2021

bkamins commented Feb 18, 2021 •

edited

Loading

nalimilan commented Feb 18, 2021

bkamins commented Feb 18, 2021

bkamins commented Feb 18, 2021

bkamins commented Feb 19, 2021

nalimilan Feb 16, 2021

bkamins Feb 19, 2021

bkamins Mar 2, 2021

bkamins commented Feb 20, 2021

bkamins commented Feb 21, 2021

bkamins commented Mar 1, 2021 •

edited

Loading

bkamins commented Mar 2, 2021

bkamins Mar 2, 2021

nalimilan Mar 2, 2021

bkamins Mar 2, 2021

nalimilan Mar 2, 2021

bkamins Mar 2, 2021

bkamins commented Mar 4, 2021


		@test innerjoin(DataFrame(id=[missing]), DataFrame(id=[1]), on=:id, matchmissing=:equal) ==
		@test innerjoin(DataFrame(id=Union{Int, Missing}[]), DataFrame(id=[1]), on=:id, matchmissing=:equal) ≅

left, right and outerjoin rewrite #2622

left, right and outerjoin rewrite #2622

Conversation

bkamins commented Feb 15, 2021

bkamins commented Feb 17, 2021

bkamins commented Feb 18, 2021

bkamins commented Feb 18, 2021

bkamins commented Feb 18, 2021 • edited Loading

nalimilan commented Feb 18, 2021

bkamins commented Feb 18, 2021

bkamins commented Feb 18, 2021

bkamins commented Feb 19, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins commented Feb 20, 2021

bkamins commented Feb 21, 2021

bkamins commented Mar 1, 2021 • edited Loading

bkamins commented Mar 2, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins commented Mar 4, 2021

bkamins commented Feb 18, 2021 •

edited

Loading

bkamins commented Mar 1, 2021 •

edited

Loading