implement faster innerjoin #2612

Merged Feb 13, 2021 (63 commits)
Commits
1509f63
implement faster innerjoin
bkamins Jan 26, 2021
2b222b7
add handling of sorted tables
bkamins Jan 27, 2021
0eb911e
fix eltype test
bkamins Jan 27, 2021
a16b6f2
use strategy with single index pool in case of duplicates
bkamins Jan 28, 2021
14652f0
add tests for innerjoin
bkamins Jan 29, 2021
6a5b6ca
fast path for PooledArrays
bkamins Jan 29, 2021
c9385da
update handling of PooledArrays and CategoricalArrays
bkamins Jan 30, 2021
b8907ae
add more tests
bkamins Jan 31, 2021
c4d2c46
Apply suggestions from code review
bkamins Feb 2, 2021
3f8c49f
Update src/abstractdataframe/join.jl
bkamins Feb 2, 2021
e306de1
add more comments and optimistically try sorted join algorithm
bkamins Feb 2, 2021
47fe234
fix lookup
bkamins Feb 2, 2021
c24f678
Apply suggestions from code review
bkamins Feb 2, 2021
928f372
use DataAPI.invrefpool
bkamins Feb 3, 2021
3842c3d
Merge remote-tracking branch 'origin/new_faster_innerjoin' into new_f…
bkamins Feb 3, 2021
dba0f36
Merge branch 'main' into new_faster_innerjoin
bkamins Feb 3, 2021
9cf62af
Apply suggestions from code review
bkamins Feb 3, 2021
3912ae6
use nothing as sentinel
bkamins Feb 3, 2021
68a8eaa
Apply suggestions from code review
bkamins Feb 3, 2021
c05410b
remove PooledArrays.jl specific code
bkamins Feb 4, 2021
a8b2702
Apply suggestions from code review
bkamins Feb 4, 2021
5ec767f
corrections after the review
bkamins Feb 4, 2021
1db13ef
Apply suggestions from code review
bkamins Feb 4, 2021
7f2d897
add OnCol
bkamins Feb 5, 2021
9208bff
add faster processing of integer columns
bkamins Feb 5, 2021
1735712
Apply suggestions from code review
bkamins Feb 5, 2021
bb7e8f1
minor changes
bkamins Feb 6, 2021
6d46f1b
fix test coverage
bkamins Feb 6, 2021
1ef1362
fix test coverage
bkamins Feb 6, 2021
eb50756
revert change for better clarity
bkamins Feb 6, 2021
7cfb5b4
another small fix
bkamins Feb 6, 2021
d9dd15f
fix method definition
bkamins Feb 6, 2021
fd03587
Apply suggestions from code review
bkamins Feb 6, 2021
e30f51a
change hash implementation
bkamins Feb 6, 2021
6aa95e3
fix typo
bkamins Feb 6, 2021
bdcaeef
fix tests
bkamins Feb 6, 2021
1150126
consistent detection of CategoricalArrays.jl types
bkamins Feb 6, 2021
0c1e8b6
Apply suggestions from code review
bkamins Feb 6, 2021
ca02cc9
Merge remote-tracking branch 'origin/new_faster_innerjoin' into new_f…
bkamins Feb 6, 2021
558129d
add hash test
bkamins Feb 6, 2021
cbe214e
in printing we might have union
bkamins Feb 6, 2021
e01c1fe
fix typo
bkamins Feb 6, 2021
f50b9a1
simplify isless and isequal for OnCol
bkamins Feb 6, 2021
a592b09
Update test/join.jl
bkamins Feb 7, 2021
1515c07
add more tests
bkamins Feb 7, 2021
75560b1
Merge remote-tracking branch 'origin/new_faster_innerjoin' into new_f…
bkamins Feb 7, 2021
d8f1fe4
additional tests
bkamins Feb 7, 2021
bb83527
add innerjoin benchmark
bkamins Feb 7, 2021
56c4c5e
more tests to ensure full coverage
bkamins Feb 7, 2021
f697177
add linebreaks at @info
bkamins Feb 7, 2021
fed4570
simplify loop in sorted case
bkamins Feb 9, 2021
d0fb0b9
use resize!
bkamins Feb 9, 2021
020eaae
add sizehint!
bkamins Feb 10, 2021
a3de1c2
avoid using internal functions
bkamins Feb 11, 2021
659ec7c
improved benchmark design
bkamins Feb 11, 2021
07ecb0a
Revert "avoid using internal functions"
bkamins Feb 11, 2021
58bdcf3
fix dict sizehint
bkamins Feb 11, 2021
d7bb989
add benchmark runner
bkamins Feb 11, 2021
f9882f8
Revert "Revert "avoid using internal functions""
bkamins Feb 11, 2021
8a31d99
clean up script
bkamins Feb 11, 2021
91df0e4
Update test/join.jl
bkamins Feb 11, 2021
0b21972
improve tests
bkamins Feb 12, 2021
1a9e664
Update NEWS.md
bkamins Feb 13, 2021
clean up script
bkamins committed Feb 11, 2021
commit 8a31d99e7999266c22032ea9483216ea6d971e53
35 changes: 17 additions & 18 deletions benchmarks/innerjoin_performance.jl
@@ -3,7 +3,7 @@ using DataFrames
 using PooledArrays
 using Random

-fullgc() = (GC.gc(); GC.gc(); GC.gc(); GC.gc())
+fullgc() = (GC.gc(true); GC.gc(true); GC.gc(true); GC.gc(true))

 @assert length(ARGS) == 6
 @assert ARGS[3] in ["int", "pool", "cat", "str"]
@@ -75,23 +75,22 @@ else
 end

 if ARGS[6] == "1"
-    df1 = DataFrame(id1 = col1);
-    df2 = DataFrame(id1 = col2);
-    innerjoin(df1[1:1000, :], df2[1:2000, :], on=:id1);
-    innerjoin(df2[1:2000, :], df1[1:1000, :], on=:id1);
-    fullgc();
-    @time innerjoin(df1, df2, on=:id1);
-    fullgc();
-    @time innerjoin(df2, df1, on=:id1);
+    df1 = DataFrame(id1 = col1)
+    df2 = DataFrame(id1 = col2)
+    innerjoin(df1[1:1000, :], df2[1:2000, :], on=:id1)
+    innerjoin(df2[1:2000, :], df1[1:1000, :], on=:id1)
+    fullgc()
+    @time innerjoin(df1, df2, on=:id1)
+    fullgc()
+    @time innerjoin(df2, df1, on=:id1)
 else
     @assert ARGS[6] == "2"
-    df1 = DataFrame(id1 = col1, id2 = col1);
-    df2 = DataFrame(id1 = col1, id2 = col1);
-    innerjoin(df1[1:1000, :], df2[1:2000, :], on=[:id1, :id2]);
-    innerjoin(df2[1:2000, :], df1[1:1000, :], on=[:id1, :id2]);
-    fullgc();
-    @time innerjoin(df1, df2, on=[:id1, :id2]);
-    fullgc();
-    @time innerjoin(df2, df1, on=[:id1, :id2]);
-    df2 = DataFrame(id1 = col2, id2 = col2);
+    df1 = DataFrame(id1 = col1, id2 = col1)
+    df2 = DataFrame(id1 = col1, id2 = col1)
+    innerjoin(df1[1:1000, :], df2[1:2000, :], on=[:id1, :id2])
+    innerjoin(df2[1:2000, :], df1[1:1000, :], on=[:id1, :id2])
+    fullgc()
+    @time innerjoin(df1, df2, on=[:id1, :id2])
+    fullgc()
+    @time innerjoin(df2, df1, on=[:id1, :id2])
 end