
faster speed for decompressSequencesLong #2614

Merged (1 commit, May 5, 2021)
Conversation

Cyan4973 (Contributor) commented May 5, 2021

By using a deeper prefetching pipeline, increased from 4 to 8 slots.

This change substantially improves decompression speed when there are long-distance offsets.
Example with enwik9 compressed at level 22:
gcc-9   : 947 -> 1039 MB/s
clang-10: 884 -> 946 MB/s

I also checked the "cold dictionary" scenario, with largeNbDicts,
and found a smaller benefit, around 2%
(measurements are noisier for this scenario).

This is a follow-up to #2547,
though it's kept separate because in this case the benefits are much more clear-cut.

senhuang42 (Contributor) commented:
What are the scenarios in which this prefetching might be left unused/wasted?

Cyan4973 (Contributor, Author) commented May 5, 2021

> What are the scenarios in which this prefetching might be left unused/wasted?

The decoder only prefetches memory regions
that are effectively going to be copied later by a match operation.

That being said, in some cases, when a match's offset is really small,
the source region may not yet be filled (at the time prefetching is issued).

That should not matter much because it means this memory region is very fresh,
hence likely already in L1, and would have been in L1 anyway, even without prefetching.

So we could say that, in this case, the unconditional prefetching was "useless".
However, branching on the speculative presence of a memory region within L1
is way more expensive than merely prefetching unconditionally inside the hot loop.
So it's preferable to always prefetch, even when a memory region is likely already present in L1.

@Cyan4973 Cyan4973 merged commit fed8589 into dev May 5, 2021
@Cyan4973 Cyan4973 deleted the dlong8 branch December 9, 2021 00:13