
faster speed for decompressSequencesLong #2614

Merged (1 commit, May 5, 2021)
Conversation

Cyan4973 (Contributor) commented May 5, 2021

By using a deeper prefetching pipeline, increased from 4 to 8 slots.

This change substantially improves decompression speed when there are long-distance offsets.
Example with enwik9 compressed at level 22:
gcc-9   : 947 -> 1039 MB/s
clang-10: 884 -> 946 MB/s

I also checked the "cold dictionary" scenario, with largeNbDicts,
and found a smaller benefit, around 2%
(measurements are noisier for this scenario).

This is a follow-up to #2547,
though it's kept separate because in this case the benefits are much more clear-cut.

senhuang42 (Contributor) commented:
What are the scenarios in which this prefetching might be left unused/wasted?

Cyan4973 (Contributor, Author) commented May 5, 2021

> What are the scenarios in which this prefetching might be left unused/wasted?

The decoder only prefetches memory regions
that are effectively going to be copied later by a match operation.

That being said, in some cases, when a match's offset is really small,
the source region may not yet be filled (at the time prefetching is issued).

That should not matter much because it means this memory region is very fresh,
hence likely already in L1, and would have been in L1 anyway, even without prefetching.

So we could say that, in this case, the unconditional prefetching was "useless".
However, branching on the speculative presence of a memory region within L1
is way more expensive than merely prefetching unconditionally inside the hot loop.
So it's preferable to always prefetch, even when a memory region is likely already present in L1.

@Cyan4973 Cyan4973 merged commit fed8589 into dev May 5, 2021
@Cyan4973 Cyan4973 deleted the dlong8 branch December 9, 2021 00:13