[WIP] fix #2966 part 2 : do not initialize tag space #2971
Conversation
(note : this might break due to the need to also track the starting candidate number per row)
but there is still a remaining issue within the `msan + fuzzer` test. This, btw, is a good demo of what could happen if any other application linking `libzstd` were checked with MSAN.
That is not supported by MSAN. I'm surprised it was working as-is. So we should fix that test to compile libzstd with MSAN. MSAN needs all code to be compiled with MSAN in order to work correctly. Unlike ASAN, which can work with linked code that isn't compiled with ASAN; it will just miss bugs in the non-ASAN code.
This should be totally fine for determinism. But we should make sure that we benchmark with context reuse, to ensure that eliding the memset doesn't slow down large-file compression when the same context is reused.
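For reference, a minimal sketch of such a context-reuse benchmark, using the public `ZSTD_compressCCtx()` API; the input contents, size, iteration count and compression level below are placeholders, not values anyone suggested:

```c
/* Sketch: time repeated compressions that reuse a single ZSTD_CCtx,
 * which is the scenario where eliding the tagTable memset could matter. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <zstd.h>

int main(void)
{
    size_t const srcSize = (size_t)100 << 20;     /* 100 MB sample (placeholder) */
    void* const src = malloc(srcSize);
    size_t const dstCap = ZSTD_compressBound(srcSize);
    void* const dst = malloc(dstCap);
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    if (!src || !dst || !cctx) return 1;
    memset(src, 'A', srcSize);                    /* replace with real data to be meaningful */

    clock_t const start = clock();
    for (int i = 0; i < 10; i++) {
        /* level 7 assumed to select a row-based match finder on large inputs */
        size_t const r = ZSTD_compressCCtx(cctx, dst, dstCap, src, srcSize, 7);
        if (ZSTD_isError(r)) { printf("error: %s\n", ZSTD_getErrorName(r)); return 1; }
    }
    printf("10 reused-cctx compressions: %.2f s\n",
           (double)(clock() - start) / CLOCKS_PER_SEC);

    ZSTD_freeCCtx(cctx);
    free(dst); free(src);
    return 0;
}
```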
Oh, that's a valgrind check, not MSAN.
I don't see a way around the problems with valgrind. We could add a suppression file, but then it would break for other people running valgrind on zstd who don't have our suppressions.
And I worry that we may get spurious bug reports, and concern from people who run valgrind and think that zstd is buggy because of that report.
Would it make sense to do the same thing for the tag table as we do for the tables, and keep track of what's been initialized?
I thought about this a little bit and probably the simplest/best way to do this is to keep a known-initialized range. When the tag table is in that range, we can do nothing. When it's outside that range, we memset the new tag table area and everything between it and the existing known-good range, and then record that expansion of the range. In the common case where the table ends up in the same position over and over, we don't have to do any work to reset the cctx, and even if things shift around, we only do incremental clearing as needed.
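A minimal sketch of that bookkeeping, assuming a single known-initialized range; the struct and function names below are hypothetical, not zstd's cwksp API:

```c
#include <string.h>

/* Hypothetical tracking of one known-initialized byte range.
 * When a new tag-table placement falls outside it, only the new area
 * plus the gap between it and the known range is cleared. */
typedef struct {
    char* initBegin;   /* start of the known-initialized region (NULL if none yet) */
    char* initEnd;     /* one past its end */
} InitRange;

static void ensureInitialized(InitRange* r, char* ptr, size_t size)
{
    char* const begin = ptr;
    char* const end   = ptr + size;
    if (r->initBegin == NULL) {                     /* first use: clear everything */
        memset(begin, 0, size);
        r->initBegin = begin; r->initEnd = end;
        return;
    }
    if (begin >= r->initBegin && end <= r->initEnd)
        return;                                     /* fully inside: common case, no work */
    if (begin < r->initBegin) {                     /* extend downward, covering the gap */
        memset(begin, 0, (size_t)(r->initBegin - begin));
        r->initBegin = begin;
    }
    if (end > r->initEnd) {                         /* extend upward, covering the gap */
        memset(r->initEnd, 0, (size_t)(end - r->initEnd));
        r->initEnd = end;
    }
}
```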
The big question is whether we also want to get this in for 1.5.2.
This solution looks good to me.
Yes, we do.
Although it's not guaranteed that these movements of the tag space generate a single expanding memory region: they may end up defining multiple distant regions, in which case tracking becomes impractical, so we'll have to accept some inefficiency (i.e. initialization will happen more often than it could if tracking of initialized bytes were perfect). I wonder now how competitive such an initialization strategy could be.
edit : on first look, it seems it could be made speed-equivalent.
I think a sane approach would be to change the order in which the cwksp stores elements. Instead of:
Make it:
Now we just need to keep track of where the boundary between aligned uninitialized & aligned initialized is, and if the initialized section grows, do that initialization. |
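As an illustration of that boundary idea, here is a sketch with made-up names (the real cwksp code is organized differently); the key point is that, with the init-once area kept at one end of the workspace, a single pointer records how far it has ever been initialized:

```c
#include <string.h>

/* Hypothetical "init once" arena at the top of the workspace.
 * Everything in [initOnceBoundary, end) has been written at least once
 * (possibly with stale tags from a previous compression, which is fine);
 * only bytes below that boundary ever need a memset. */
typedef struct {
    char* base;              /* lowest address the section may use */
    char* end;               /* one past the highest address */
    char* initOnceBoundary;  /* lowest address initialized so far (starts at end) */
} InitOnceArena;

/* Reserve `size` bytes at the top of the arena (where the tag table lives),
 * clearing only the part that has never been handed out before. */
static void* reserveInitOnce(InitOnceArena* a, size_t size)
{
    char* const ptr = a->end - size;
    if (ptr < a->base) return NULL;              /* does not fit */
    if (ptr < a->initOnceBoundary) {
        memset(ptr, 0, (size_t)(a->initOnceBoundary - ptr));
        a->initOnceBoundary = ptr;               /* boundary only ever moves down */
    }
    return ptr;
}
```

In the common case where the tag table keeps the same size and position across compressions, the boundary never moves and no clearing happens at all.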
I have a branch that I modified to take that approach. I do have a problem though... With context reuse, compression is slower when I don't memset the tag table. I think what's happening is that we're getting tags that match from the previous compression, but are out of our window. This causes extra work in cases where there otherwise wouldn't be any matches. Basically, I think that this branch (Line 1208 in 26d88c0) rarely happens when the tables are sized to fit the input, since we filter out at the tag step instead. But when we remove the memset and are compressing similar data, we may find matches that were from the previous file, and the tag won't filter out those matches, so we will hit this branch instead. I'm going to look into it.
Running the test, the branch at Line 1208 in 26d88c0 is taken 50751 times when the tagTable is memset, and 7566147 times when it isn't. That's an increase of about 149x.
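To make the mechanism concrete, here is an illustrative loop (hypothetical names, not the actual zstd_lazy.c code) showing where that branch sits: candidates first pass a cheap tag filter, and only then are checked against the valid window, so stale tags from a previous compression push far more candidates into the index-validity branch:

```c
#include <stddef.h>

typedef unsigned int U32;

/* With a zeroed tag table, leftover row entries almost never pass the tag
 * filter; with stale tags from a previous compression of similar data,
 * many do, and each one then hits the index-validity branch below
 * before being discarded. */
static size_t countRowCandidates(const U32* rowIndices, const unsigned char* rowTags,
                                 size_t rowSize, unsigned char hashTag, U32 lowLimit)
{
    size_t candidates = 0;
    for (size_t i = 0; i < rowSize; i++) {
        if (rowTags[i] != hashTag)
            continue;                    /* cheap per-entry tag filter */
        U32 const matchIndex = rowIndices[i];
        if (matchIndex < lowLimit)       /* out-of-window / stale candidate */
            continue;                    /* the branch that got ~149x hotter */
        candidates++;                    /* genuine candidate: full match check follows */
    }
    return candidates;
}
```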
Testing […] : to be fair, I was expecting worse, given the ~150x increase in branch count. When removing the […], which makes […].
Yeah, I guess so. The reason it was laid out this way was so that, even if the buffers require a non-round number of bytes, all of the allocations could fit in a workspace with no extra padding bytes while respecting the alignment requirements of everything. I guess we've already backed away from this though with aligning the tables to 64 bytes. You will probably need to add 3 more padding bytes to the workspace budget, though.
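Those extra bytes come from the usual align-up arithmetic: rounding an arbitrary address up to a 4-byte boundary can consume up to 3 bytes, so the budget has to reserve `alignment - 1` extra bytes per newly aligned section. A generic sketch (not zstd's cwksp code):

```c
#include <stddef.h>
#include <stdint.h>

/* Round a pointer up to the next multiple of `alignment` (a power of 2).
 * In the worst case this skips alignment - 1 bytes, which is why the
 * workspace budget needs that many extra padding bytes. */
static void* alignUp(void* ptr, size_t alignment)
{
    uintptr_t const mask = (uintptr_t)alignment - 1;
    return (void*)(((uintptr_t)ptr + mask) & ~mask);
}
```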
This PR started as a […]. Consequently, we may need more time to analyze the trade-offs and consider alternatives.
What about : […]
I presume the issue is that creating a mask from the index table is a complex (& costly) operation.
It involves looping over the table. I think that this idea could work, and even reduce branches in all cases. It is definitely possible this could be a speed gain in general. But it remains to be tested.
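As a plain-C sketch of the idea (the real row matcher is vectorized, and all names here are hypothetical): the validity of each row entry's index can be folded into a bitmask once per row and ANDed with the tag-match mask, removing the per-candidate branch:

```c
#include <stddef.h>

typedef unsigned int U32;

/* Build, once per row, a bitmask of entries whose index is still inside
 * the valid window (assumes rowSize <= 32 so the mask fits in a U32).
 * Combined with the tag-match mask, out-of-window candidates are dropped
 * without a per-candidate branch; the cost is one pass over the row. */
static U32 rowValidityMask(const U32* rowIndices, size_t rowSize, U32 lowLimit)
{
    U32 mask = 0;
    for (size_t i = 0; i < rowSize; i++) {
        mask |= (U32)(rowIndices[i] >= lowLimit) << i;
    }
    return mask;
}

/* usage sketch:
 *   U32 const candidates = tagMatchMask & rowValidityMask(rowIndices, rowSize, lowLimit);
 */
```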
More complete solution merged, see #3528 |
This diff disables `tag` initialization when using the `rowHash` mode. This initialization was unconditional, but it becomes the dominating operation when compressing small data in streaming mode (see issue #2966).

I could not find a good reason to initialize `tags`. It just makes `tag` values start at `0`, but `0` is just a regular `tag` value, it's not more significant than any other `tag` value. Worst case, there will be a wrong hint of match presence, but even that should be filtered out by distance analysis, which remains active through indices validation. So this won't impact the compression result.

Now, initially, I was suspicious that it would work, because the `tag` space is 2x larger than it should be, suggesting additional space is used for something else than `tag` values, like determining the starting position in the row (which would be an overkill memory budget, but that's a different topic). But to my surprise, this change passes all tests successfully, suggesting `rowHash` is even resilient to a random start position.

Edit : it seems to finally break on the `msan + fuzzer` test below, so that's something worth looking into.

The end result is significant. When combined with #2969, the compression speed of `rowHash` on small data increases dramatically, as can be seen below (#2969 is required, as otherwise the impact of `tag` initialization is just lost as part of a larger initialization issue).

The following measurement is taken on a Core i7-9700K (turbo disabled) with `fullbench -b41`, using `geldings.txt` as sample (a small text file). The test corresponds to a scenario using `ZSTD_compressStream()` without the benefit of knowing the small sample size beforehand.
(benchmark table: `rowHash` compression speed with vs. without `tag` initialization; the figures did not survive extraction)
cc @terrelln : there might be some side effects which are not properly captured by the tests, such as potential reproducibility issues, but with a low enough probability that they are too difficult to reproduce during CI tests. And maybe other side effects worth looking into.
Note : WIP, not for merge