Julia: use interned strings #222

Merged
merged 1 commit into from
Jun 3, 2021

Conversation

bkamins
Contributor

@bkamins bkamins commented Jun 2, 2021

This change switches the join benchmark to interned string objects instead of standard strings (standard strings are not interned).

@jangorecki - would it be possible to run the benchmarks with this change on your machine to see the impact? I changed it only for joins as it seems to be the most interesting case currently.

Thank you!
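
To illustrate the idea behind interning, here is a minimal sketch in Python rather than Julia; it is not the code from this PR. It assumes a hypothetical key column with 100 distinct values repeated across 10,000 rows (`make_keys` and `distinct_objects` are illustrative helpers, not part of any library) and uses Python's `sys.intern` to show that equal strings collapse to one shared object per distinct value, which leaves far fewer heap objects for a garbage collector to track.

```python
import sys


def make_keys(n, n_unique):
    """Build n string keys at runtime, cycling through n_unique distinct values."""
    # Concatenating at runtime yields a separate string object per row
    # (an assumption about the interpreter; it holds in CPython).
    return ["id" + str(i % n_unique) for i in range(n)]


def distinct_objects(keys):
    """Count how many distinct objects (not distinct values) the list holds."""
    return len({id(k) for k in keys})


plain = make_keys(10_000, 100)

# sys.intern returns one canonical object for each distinct string value.
interned = [sys.intern(k) for k in plain]

n_plain = distinct_objects(plain)
n_interned = distinct_objects(interned)

print("distinct objects, plain strings:   ", n_plain)
print("distinct objects, interned strings:", n_interned)
```

With interning, the number of live string objects equals the number of distinct key values, regardless of how many rows the column has.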

@bkamins
Contributor Author

bkamins commented Jun 2, 2021

cc @nalimilan @quinnj

@jangorecki jangorecki merged commit cc0daa6 into h2oai:master Jun 3, 2021
@jangorecki
Contributor

Thanks, I started juliadf now, join task only, so the results should come in sooner. From a brief look, J1_1e7_NA_5_0 has exactly the same total timing: 18s.

@bkamins bkamins deleted the patch-2 branch June 3, 2021 18:25
@bkamins
Contributor Author

bkamins commented Jun 3, 2021

Thank you. This change should only affect GC time. We have not implemented the changes in CSV.jl yet, but this is the current standard approach to reduce GC time.

@jangorecki
Contributor

taken from ./_utils/time.R

> tail.time("juliadf", "join", i=c(1L, 2L))
    in_rows               knasorted               question 20210515_66ee4fe 20210603_66ee4fe
 1:     1e7   0% NAs, unsorted data     small inner on int            0.751            0.797
 2:     1e7   0% NAs, unsorted data    medium inner on int            0.399            0.419
 3:     1e7   0% NAs, unsorted data    medium outer on int            1.965            1.890
 4:     1e7   0% NAs, unsorted data medium inner on factor            0.578            0.737
 5:     1e7   0% NAs, unsorted data       big inner on int            1.377            1.395
 6:     1e7   5% NAs, unsorted data     small inner on int            0.863            0.903
 7:     1e7   5% NAs, unsorted data    medium inner on int            0.703            0.754
 8:     1e7   5% NAs, unsorted data    medium outer on int            2.076            2.271
 9:     1e7   5% NAs, unsorted data medium inner on factor            0.741            0.746
10:     1e7   5% NAs, unsorted data       big inner on int            3.169            3.018
11:     1e7 0% NAs, pre-sorted data     small inner on int            0.662            0.643
12:     1e7 0% NAs, pre-sorted data    medium inner on int            0.412            0.410
13:     1e7 0% NAs, pre-sorted data    medium outer on int            1.878            1.834
14:     1e7 0% NAs, pre-sorted data medium inner on factor            0.459            0.600
15:     1e7 0% NAs, pre-sorted data       big inner on int            0.623            0.583
16:     1e8   0% NAs, unsorted data     small inner on int           31.547           37.773
17:     1e8   0% NAs, unsorted data    medium inner on int           27.440           20.118
18:     1e8   0% NAs, unsorted data    medium outer on int           58.285           53.417
19:     1e8   0% NAs, unsorted data medium inner on factor           37.202           31.975
20:     1e8   0% NAs, unsorted data       big inner on int           50.273           45.615
21:     1e8   5% NAs, unsorted data     small inner on int           37.778           48.157
22:     1e8   5% NAs, unsorted data    medium inner on int           28.907           28.997
23:     1e8   5% NAs, unsorted data    medium outer on int           47.130           62.510
24:     1e8   5% NAs, unsorted data medium inner on factor           37.742           47.860
25:     1e8   5% NAs, unsorted data       big inner on int           69.493           61.993
26:     1e8 0% NAs, pre-sorted data     small inner on int           29.921           31.294
27:     1e8 0% NAs, pre-sorted data    medium inner on int           25.802           21.033
28:     1e8 0% NAs, pre-sorted data    medium outer on int           63.371           39.580
29:     1e8 0% NAs, pre-sorted data medium inner on factor           35.144           28.662
30:     1e8 0% NAs, pre-sorted data       big inner on int           43.911           31.395
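
To make the before/after comparison easier to read, here is a small Python sketch that computes the relative change between the two runs for a few rows copied from the table above. The row selection and the `pct_change` helper are illustrative choices; the timing values are taken verbatim from the `20210515_66ee4fe` and `20210603_66ee4fe` columns.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent (negative means faster)."""
    return (new - old) / old * 100.0


# (label, 20210515 timing, 20210603 timing), copied from rows 4, 21, 28, 30.
rows = [
    ("1e7, 0% NAs, unsorted, medium inner on factor", 0.578, 0.737),
    ("1e8, 5% NAs, unsorted, small inner on int", 37.778, 48.157),
    ("1e8, 0% NAs, pre-sorted, medium outer on int", 63.371, 39.580),
    ("1e8, 0% NAs, pre-sorted, big inner on int", 43.911, 31.395),
]

changes = {label: pct_change(old, new) for label, old, new in rows}

for label, change in changes.items():
    print(f"{label:50s} {change:+7.1f}%")
```

A single run per configuration says little about run-to-run noise, so these percentages are best read as rough indications rather than precise speedups.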

@bkamins
Contributor Author

bkamins commented Jun 4, 2021

Thank you for this. In the 1e7 case there is no change, as expected (GC is not triggered there anyway).
We are a bit better in the 1e8 case, as we spend a bit less time in GC (but I expected that we would still be affected by GC; see JuliaLang/julia#40840 (comment) if you are interested in the reason).

@quinnj is finishing a PR to CSV.jl that should resolve the GC issues without requiring changes in Julia Base. Then I will open another PR with a new setting for reading the data in.

Thank you!
