
Optimize Chunk traverse #1952

Merged: 4 commits merged into typelevel:main on Jul 12, 2020

Conversation

@johnynek (Contributor)

I noticed that Stream.evalMapChunk, a common operation for me, goes through Chunk.traverse; however, that is implemented in a pretty expensive way: via foldRight, which wraps everything in another layer of Eval.

I instead implemented a tree-based approach (to keep stack safety) but without wrapping in another Eval layer.
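Roughly, the idea is something like the following sketch (a minimal, hypothetical traverseTree, not the exact code in this PR; it collects results in a cats Chain and assumes fs2's Chunk.seq for the final rebuild):

import cats.Applicative
import cats.data.Chain
import cats.syntax.all._
import fs2.Chunk

// Divide-and-conquer traverse: each call splits the index range in
// half, so the map2 nesting the Applicative sees is O(log n), with
// no extra Eval trampoline.
def traverseTree[F[_]: Applicative, A, B](c: Chunk[A])(f: A => F[B]): F[Chunk[B]] = {
  def go(lo: Int, hi: Int): F[Chain[B]] =          // half-open range [lo, hi)
    if (lo >= hi) Chain.empty[B].pure[F]
    else if (hi - lo == 1) f(c(lo)).map(b => Chain.one(b))
    else {
      val mid = lo + (hi - lo) / 2
      go(lo, mid).map2(go(mid, hi))(_ ++ _)
    }
  go(0, c.size).map(ch => Chunk.seq(ch.toList))
}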

@SystemFw (Collaborator) left a comment:

A couple of comments:

  1. Interestingly, strictly speaking this code is not referentially transparent, but you cannot observe that. That's pretty clever.
  2. Any idea what the perf impact of this change is?

@johnynek (Contributor, Author) commented Jul 11, 2020

Thanks for the comment, @SystemFw.

One difference here is that it won't blow the stack on very big chunks for a non-trampolining F[_]. In practice that may be rare for Stream, since people generally use an effect type that is trampolined, but a few folks may still notice.

Anyone using Chunk on its own might really care that they can now traverse chunks of over 100k elements (really unbounded with this change; before, it would have blown up) with a non-trampolined F[_].

I didn't benchmark this, but my guess is that if you have an evalMapChunk that is not doing "real" IO (e.g. no network or file access), maybe something that is just catching a possible exception, then the difference will be pretty large (maybe 2x or more?), since you don't have to use both the Eval and the IO evaluation.

@SystemFw (Collaborator)

The old traverse is also stack safe (I just tried traversing a Chunk with 1 million elements with Option), but I'm okay with merging if we're confident there is going to be a perf benefit :)

@johnynek (Contributor, Author)

I'll make a benchmark before we merge. Let's be rigorous.

@johnynek (Contributor, Author)

Option may short-circuit on None, so unless there are no Nones, I'm not sure that is a real test.

@SystemFw (Collaborator)

This is what I've used:

import cats.syntax.all._ // for .pure

val source = (0 to 1000000).toVector
val tt = Chunk.vector(source).traverse(_.pure[Option])

@johnynek (Contributor, Author)

On main:

scala> Chunk.seq(0 until 1000000).traverse { i => () => i }
val res0: () => fs2.Chunk[Int] = cats.instances.Function0Instances$$anon$4$$Lambda$6313/809375313@287dcd6

scala> res0()
java.lang.StackOverflowError
  ... 1024 elided

with this change:

scala> Chunk.seq(0 until 1000000).traverse { i => () => i }
val res0: () => fs2.Chunk[Int] = cats.instances.Function0Instances$$anon$4$$Lambda$7732/324635468@3467ff7a

scala> res0()
val res1: fs2.Chunk[Int] = Chunk(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169...

I'm not sure what explains the difference, but the Applicative[F] only sees log depth in my code, so even a highly unsafe monad like Function0 works.

@johnynek (Contributor, Author)

I was wrong. This is about the same speed (maybe modestly faster); it looks like the win is stack safety alone.

Also, this design makes N log N calls to F.map or F.map2 because it builds up a tree. The original implementation does use Eval, but it only calls F.map2 N times (in a linear chain). So this PR starts out a bit behind the 8-ball on that count.

I think it is worth merging for the added stack safety, but the performance motivation is not there.

If we don't care about stack safety (which we don't seem to guarantee in all cases currently), I bet you can go faster with a linear chain built without Eval and foldRight: just iterate through the list right to left, building up an inner List, then one final .map(Chunk.list) at the end. A sketch of that idea follows below.
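For instance, something like this hypothetical traverseLinear (a sketch that assumes Chunk.seq for the final rebuild: exactly N map2 calls and no Eval, but not stack safe for a non-trampolined F at evaluation time):

import cats.Applicative
import cats.syntax.all._
import fs2.Chunk

// Linear-chain traverse: exactly N map2 calls and no Eval wrapper,
// but evaluating the nested map2 closures needs O(n) stack for a
// non-trampolined F like Function0.
def traverseLinear[F[_]: Applicative, A, B](c: Chunk[A])(f: A => F[B]): F[Chunk[B]] = {
  var acc: F[List[B]] = List.empty[B].pure[F]
  var i = c.size - 1
  while (i >= 0) {                     // right to left, consing onto the front
    acc = f(c(i)).map2(acc)(_ :: _)
    i -= 1
  }
  acc.map(bs => Chunk.seq(bs))         // one final map at the end
}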

Here are benchmark results:

on main:

[info] Benchmark                (chunkCount)  (chunkSize)   Mode  Cnt       Score        Error  Units
[info] ChunkBenchmark.traverse             1           16  thrpt    6  881546.345 ± 140421.631  ops/s
[info] ChunkBenchmark.traverse             1          256  thrpt    6   44430.572 ±   2828.650  ops/s
[info] ChunkBenchmark.traverse            20           16  thrpt    6   36710.054 ±    217.760  ops/s
[info] ChunkBenchmark.traverse            50           16  thrpt    6   15346.930 ±     38.888  ops/s
[info] ChunkBenchmark.traverse           100           16  thrpt    6    7039.792 ±    153.096  ops/s
// the ones omitted had stack overflows

on this PR:

[info] Benchmark                (chunkCount)  (chunkSize)   Mode  Cnt       Score      Error  Units
[info] ChunkBenchmark.traverse             1           16  thrpt    6  852749.128 ± 5820.295  ops/s
[info] ChunkBenchmark.traverse             1          256  thrpt    6   48765.634 ±  403.656  ops/s
[info] ChunkBenchmark.traverse            20           16  thrpt    6   39116.667 ±  203.117  ops/s
[info] ChunkBenchmark.traverse            50           16  thrpt    6   14817.411 ±  385.322  ops/s
[info] ChunkBenchmark.traverse           100           16  thrpt    6    7027.628 ±  944.514  ops/s

// these complete on this PR, but not on main:

[info] ChunkBenchmark.traverse             1         4096  thrpt    6    2944.750 ±   10.933  ops/s
[info] ChunkBenchmark.traverse            20          256  thrpt    6    2322.770 ±   65.179  ops/s
[info] ChunkBenchmark.traverse            20         4096  thrpt    6      89.735 ±   17.816  ops/s
[info] ChunkBenchmark.traverse            50          256  thrpt    6     816.425 ±   91.386  ops/s
[info] ChunkBenchmark.traverse            50         4096  thrpt    6      30.275 ±    1.905  ops/s
[info] ChunkBenchmark.traverse           100          256  thrpt    6     228.106 ±   14.465  ops/s
[info] ChunkBenchmark.traverse           100         4096  thrpt    6      12.225 ±    1.451  ops/s

@mpilquist merged commit b6ef728 into typelevel:main on Jul 12, 2020
@johnynek (Contributor, Author)

It occurred to me in a comment here:

typelevel/cats#3517 (comment)

The magic here is the tree structure vs. the linear structure, not the fact that the tree is binary. It is a bit more complex to implement, but a fan-out of, say, 128 or even 256 children per node preserves the logarithmic growth in stack size while reducing the overhead of building the tree, since we build up linear chains of 128 or 256 at each node.

This may allow us to see an actual performance boost in addition to the stack safety we already achieved.
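A rough sketch of that fan-out idea (a hypothetical traverseWide; the fan-out constant and the Chain-based bookkeeping are illustrative, not what this PR ships):

import cats.Applicative
import cats.data.Chain
import cats.syntax.all._
import fs2.Chunk

// Wide-tree traverse: each node has up to fanOut children, so the
// evaluation stack depth is roughly fanOut * log_fanOut(n), while
// most of the work happens in cheap linear chains at each node.
def traverseWide[F[_]: Applicative, A, B](c: Chunk[A], fanOut: Int = 128)(f: A => F[B]): F[Chunk[B]] = {
  def go(lo: Int, hi: Int): F[Chain[B]] =          // half-open range [lo, hi)
    if (hi - lo <= fanOut) {
      // leaf: a linear chain of at most fanOut map2 calls
      var acc = Chain.empty[B].pure[F]
      var i = lo
      while (i < hi) { acc = acc.map2(f(c(i)))(_ :+ _); i += 1 }
      acc
    } else {
      // interior node: split into at most fanOut sub-ranges
      val step = ((hi - lo) + fanOut - 1) / fanOut
      var acc = Chain.empty[B].pure[F]
      var i = lo
      while (i < hi) {
        acc = acc.map2(go(i, math.min(i + step, hi)))(_ ++ _)
        i += step
      }
      acc
    }
  go(0, c.size).map(ch => Chunk.seq(ch.toList))
}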

@mpilquist added this to the 2.4.3 milestone on Aug 18, 2020