
Optimize Chunk traverse #1952

Merged: 4 commits merged into typelevel:main on Jul 12, 2020

Conversation

@johnynek (Contributor)

I noticed that Stream.evalMapChunk, a common operation for me, goes through Chunk.traverse; however, that is implemented in a pretty expensive way: via foldRight, which wraps everything in another layer of Eval.

I instead implemented a tree-based approach (to keep stack safety) but without wrapping in another Eval layer.
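Roughly, the idea is something like the following sketch (a minimal, hypothetical traverseTree, not the exact code in this PR; it collects results in a cats Chain and assumes fs2's Chunk.seq for the final rebuild):

import cats.Applicative
import cats.data.Chain
import cats.syntax.all._
import fs2.Chunk

// Divide-and-conquer traverse: each call splits the index range in
// half, so the map2 nesting the Applicative sees is O(log n), with
// no extra Eval trampoline.
def traverseTree[F[_]: Applicative, A, B](c: Chunk[A])(f: A => F[B]): F[Chunk[B]] = {
  def go(lo: Int, hi: Int): F[Chain[B]] =          // half-open range [lo, hi)
    if (lo >= hi) Chain.empty[B].pure[F]
    else if (hi - lo == 1) f(c(lo)).map(b => Chain.one(b))
    else {
      val mid = lo + (hi - lo) / 2
      go(lo, mid).map2(go(mid, hi))(_ ++ _)
    }
  go(0, c.size).map(ch => Chunk.seq(ch.toList))
}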

@SystemFw (Collaborator) left a comment:

A couple of comments:

  1. Interestingly, strictly speaking this code is not referentially transparent, but you cannot observe that. That's pretty clever.
  2. Any idea what the perf impact of this change is?

@johnynek (Contributor, Author) commented Jul 11, 2020

Thanks for the comment, @SystemFw.

One difference here is that it won't blow the stack on very big chunks for a non-trampolining F[_]. In practice that may be rare for Stream, since people generally use an effect type that is trampolined, but a few folks may still notice.

Anyone using Chunk on its own might really care that they can now traverse chunks of over 100k elements (really unbounded with this change; before, it would have blown up) with a non-trampolined F[_].

I didn't benchmark this, but my guess is that if you have an evalMapChunk that is not doing "real" IO (e.g. no network or file access), maybe something that is just catching a possible exception, then the difference will be pretty large (maybe 2x or more?), since you don't have to use both the Eval and the IO evaluation.

@SystemFw (Collaborator)

The old traverse is also stack safe (I just tried traversing a Chunk with 1 million elements with Option), but I'm okay with merging if we're confident there is going to be a perf benefit :)

@johnynek (Contributor, Author)

I'll make a benchmark before we merge. Let's be rigorous.

@johnynek (Contributor, Author)

Option may short-circuit on None, so unless there are no Nones, I'm not sure that is a real test.

@SystemFw (Collaborator)

This is what I've used:

import cats.syntax.all._ // for .pure

val source = (0 to 1000000).toVector
val tt = Chunk.vector(source).traverse(_.pure[Option])

@johnynek (Contributor, Author)

On main:

scala> Chunk.seq(0 until 1000000).traverse { i => () => i }
val res0: () => fs2.Chunk[Int] = cats.instances.Function0Instances$$anon$4$$Lambda$6313/809375313@287dcd6

scala> res0()
java.lang.StackOverflowError
  ... 1024 elided

with this change:

scala> Chunk.seq(0 until 1000000).traverse { i => () => i }
val res0: () => fs2.Chunk[Int] = cats.instances.Function0Instances$$anon$4$$Lambda$7732/324635468@3467ff7a

scala> res0()
val res1: fs2.Chunk[Int] = Chunk(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169...

I'm not sure what explains the difference, but the Applicative[F] only sees log depth in my code, so even a highly unsafe monad like Function0 works.

@johnynek (Contributor, Author)

I was wrong. This is about the same speed (maybe modestly faster); it looks like the win is stack safety alone.

Also, this design makes N log N calls to F.map or F.map2 because it builds up a tree. The original implementation does use Eval, but it only calls F.map2 N times (in a linear chain). So this PR starts out a bit behind the 8-ball on that count.

I think it is worth merging for the added stack safety, but the performance motivation is not there.

If we don't care about stack safety (which we don't seem to guarantee in all cases currently), I bet you can go faster with a linear chain built without Eval and foldRight: just iterate through the list right to left, building up an inner List, then one final .map(Chunk.list) at the end. A sketch of that idea follows below.
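For instance, something like this hypothetical traverseLinear (a sketch that assumes Chunk.seq for the final rebuild: exactly N map2 calls and no Eval, but not stack safe for a non-trampolined F at evaluation time):

import cats.Applicative
import cats.syntax.all._
import fs2.Chunk

// Linear-chain traverse: exactly N map2 calls and no Eval wrapper,
// but evaluating the nested map2 closures needs O(n) stack for a
// non-trampolined F like Function0.
def traverseLinear[F[_]: Applicative, A, B](c: Chunk[A])(f: A => F[B]): F[Chunk[B]] = {
  var acc: F[List[B]] = List.empty[B].pure[F]
  var i = c.size - 1
  while (i >= 0) {                     // right to left, consing onto the front
    acc = f(c(i)).map2(acc)(_ :: _)
    i -= 1
  }
  acc.map(bs => Chunk.seq(bs))         // one final map at the end
}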

Here are benchmark results:

on main:

[info] Benchmark                (chunkCount)  (chunkSize)   Mode  Cnt       Score        Error  Units
[info] ChunkBenchmark.traverse             1           16  thrpt    6  881546.345 ± 140421.631  ops/s
[info] ChunkBenchmark.traverse             1          256  thrpt    6   44430.572 ±   2828.650  ops/s
[info] ChunkBenchmark.traverse            20           16  thrpt    6   36710.054 ±    217.760  ops/s
[info] ChunkBenchmark.traverse            50           16  thrpt    6   15346.930 ±     38.888  ops/s
[info] ChunkBenchmark.traverse           100           16  thrpt    6    7039.792 ±    153.096  ops/s
// the ones omitted had stack overflows

on this PR:

[info] Benchmark                (chunkCount)  (chunkSize)   Mode  Cnt       Score      Error  Units
[info] ChunkBenchmark.traverse             1           16  thrpt    6  852749.128 ± 5820.295  ops/s
[info] ChunkBenchmark.traverse             1          256  thrpt    6   48765.634 ±  403.656  ops/s
[info] ChunkBenchmark.traverse            20           16  thrpt    6   39116.667 ±  203.117  ops/s
[info] ChunkBenchmark.traverse            50           16  thrpt    6   14817.411 ±  385.322  ops/s
[info] ChunkBenchmark.traverse           100           16  thrpt    6    7027.628 ±  944.514  ops/s

// these complete on this PR, but not on main:

[info] ChunkBenchmark.traverse             1         4096  thrpt    6    2944.750 ±   10.933  ops/s
[info] ChunkBenchmark.traverse            20          256  thrpt    6    2322.770 ±   65.179  ops/s
[info] ChunkBenchmark.traverse            20         4096  thrpt    6      89.735 ±   17.816  ops/s
[info] ChunkBenchmark.traverse            50          256  thrpt    6     816.425 ±   91.386  ops/s
[info] ChunkBenchmark.traverse            50         4096  thrpt    6      30.275 ±    1.905  ops/s
[info] ChunkBenchmark.traverse           100          256  thrpt    6     228.106 ±   14.465  ops/s
[info] ChunkBenchmark.traverse           100         4096  thrpt    6      12.225 ±    1.451  ops/s

@mpilquist merged commit b6ef728 into typelevel:main on Jul 12, 2020
@johnynek (Contributor, Author)

It occurred to me in a comment here:

typelevel/cats#3517 (comment)

The magic here is the tree structure vs. the linear structure, not the fact that the tree is binary. It is a bit more complex to implement, but a fan-out of, say, 128 or even 256 children per node preserves the logarithmic growth in stack size while reducing the overhead of building the tree, since we build up linear chains of 128 or 256 at each node.

This may allow us to see an actual performance boost in addition to the stack safety we already achieved.
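A rough sketch of that fan-out idea (a hypothetical traverseWide; the fan-out constant and the Chain-based bookkeeping are illustrative, not what this PR ships):

import cats.Applicative
import cats.data.Chain
import cats.syntax.all._
import fs2.Chunk

// Wide-tree traverse: each node has up to fanOut children, so the
// evaluation stack depth is roughly fanOut * log_fanOut(n), while
// most of the work happens in cheap linear chains at each node.
def traverseWide[F[_]: Applicative, A, B](c: Chunk[A], fanOut: Int = 128)(f: A => F[B]): F[Chunk[B]] = {
  def go(lo: Int, hi: Int): F[Chain[B]] =          // half-open range [lo, hi)
    if (hi - lo <= fanOut) {
      // leaf: a linear chain of at most fanOut map2 calls
      var acc = Chain.empty[B].pure[F]
      var i = lo
      while (i < hi) { acc = acc.map2(f(c(i)))(_ :+ _); i += 1 }
      acc
    } else {
      // interior node: split into at most fanOut sub-ranges
      val step = ((hi - lo) + fanOut - 1) / fanOut
      var acc = Chain.empty[B].pure[F]
      var i = lo
      while (i < hi) {
        acc = acc.map2(go(i, math.min(i + step, hi)))(_ ++ _)
        i += step
      }
      acc
    }
  go(0, c.size).map(ch => Chunk.seq(ch.toList))
}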

@mpilquist added this to the 2.4.3 milestone on Aug 18, 2020