Assert no division by 0 in ZSTD_entropyCost(), fix superblocks no sequences case #2592
Aha, I guess not, for superblocks.
Well, even for superblocks. Regarding issue #2591: what matters specifically is to ensure there is no division by zero.
Yeah, looks like there is a bug in superblock mode. When it calls
zstd/lib/compress/zstd_compress_superblock.c Lines 555 to 560 in 10e5513
zstd/lib/compress/zstd_compress.c Lines 3159 to 3164 in 10e5513
zstd/lib/compress/zstd_compress.c Lines 3125 to 3128 in 10e5513
zstd/lib/compress/zstd_compress.c Line 2405 in 10e5513
I presume the suspected bug in superblock mode will be a separate issue / PR?
Updated to fix superblocks with |
lib/compress/zstd_compress.c
@@ -3103,6 +3104,12 @@ static size_t ZSTD_buildBlockEntropyStats_sequences(seqStore_t* seqStorePtr,
    size_t entropyWorkspaceSize = wkspSize - (MaxSeq + 1) * sizeof(*countWorkspace);
    ZSTD_symbolEncodingTypeStats_t stats;

    if (nbSeq == 0) {
        /* Do not build sequences statistics if nbSeq is 0, just return a default */
        fseMetadata->llType = fseMetadata->mlType = fseMetadata->ofType = set_compressed;
What are the consequences of setting these fseMetadata fields to set_compressed?
We know that there is no sequence, so it's not going to be used in this block.
It also will not be written into the compressed frame.
But what about the next block?
Will it consider the previous block as featuring "compressed sequences", and therefore entropy headers?
I'm asking because it might impact the algorithm that decides between "repeating previous stats" and "issuing new stats".
Alternatively, this information might be discarded later on and never read (in which case, why set it?).
So basically regressiontest only passes with the default set to set_compressed or set_rle.
There are three places where the encodingType is read:
zstd/lib/compress/zstd_compress_superblock.c
Lines 193 to 194 in e7e4b74
    if (writeEntropy) {
        const U32 LLtype = fseMetadata->llType;
zstd/lib/compress/zstd_compress_superblock.c
Lines 368 to 371 in e7e4b74
    size_t sequencesSectionHeaderSize = 3; /* Use hard coded size of 3 bytes */
    size_t cSeqSizeEstimate = 0;
    cSeqSizeEstimate += ZSTD_estimateSubBlockSize_symbolType(fseMetadata->ofType, ofCodeTable, MaxOff,
                                                             nbSeq, fseTables->offcodeCTable, NULL,
zstd/lib/compress/zstd_compress_superblock.c
Lines 405 to 407 in e7e4b74
static int ZSTD_needSequenceEntropyTables(ZSTD_fseCTablesMetadata_t const* fseMetadata)
{
    if (fseMetadata->llType == set_compressed || fseMetadata->llType == set_rle)
The first two cases never get read when nbSeq == 0. The last case still does, so whatever we set fseMetadata->llType to only affects what this function returns. It seems like even with nbSeq == 0, we'd want ZSTD_needSequenceEntropyTables() to evaluate to true, right? At least the fuzzer tests seem to indicate so.
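To make the point above concrete, here is a minimal sketch of how the predicate reacts to whatever default is chosen for nbSeq == 0. The names fse_metadata_sketch and need_sequence_entropy_tables_sketch are hypothetical stand-ins, not the zstd definitions. Only the llType check is visible in the quoted snippet; that mlType and ofType are tested the same way is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Symbol encoding types, as named in the discussion. */
typedef enum { set_basic, set_rle, set_compressed, set_repeat } symbolEncodingType_e;

/* Hypothetical, reduced stand-in for ZSTD_fseCTablesMetadata_t. */
typedef struct {
    symbolEncodingType_e llType;
    symbolEncodingType_e ofType;
    symbolEncodingType_e mlType;
} fse_metadata_sketch;

/* Returns 1 when at least one symbol type would need an entropy table. */
static int need_sequence_entropy_tables_sketch(const fse_metadata_sketch* m)
{
    assert(m != NULL);
    if (m->llType == set_compressed || m->llType == set_rle) return 1;
    /* Assumption: the other two types are checked the same way as llType. */
    if (m->mlType == set_compressed || m->mlType == set_rle) return 1;
    if (m->ofType == set_compressed || m->ofType == set_rle) return 1;
    return 0;
}
```

With a default of set_compressed or set_rle the sketch returns 1, while set_basic or set_repeat on all three types makes it return 0, which is why the choice of default matters only through this predicate.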
If my understanding is correct, ZSTD_needSequenceEntropyTables() is invoked in only a single place:
zstd/lib/compress/zstd_compress_superblock.c
Line 521 in e7e4b74
    if (writeSeqEntropy && ZSTD_needSequenceEntropyTables(&entropyMetadata->fseMetadata)) {
This place seems to check if there is a discrepancy between needing entropy tables, and writing entropy tables.
This feels weird to me.
If nbSeq==0, then there should be no need to write any sequence entropy header.
Hence, we should have writeSeqEntropy == 0.
But it's set to 1 unconditionally at the beginning of the function.
The only place where it could be updated to 0 is here, and that's for an unrelated reason:
zstd/lib/compress/zstd_compress_superblock.c
Line 512 in e7e4b74
    writeSeqEntropy = 0;
This code is very complex, and difficult to follow.
It looks to me that there should be no need to write Sequences Entropy if there is no sequence.
So my first idea would be to change that initialization to 0 in this case.
Unfortunately, the "number of Sequences" is not provided in a straightforward manner, but is accessible through (unsigned)(send-sstart) (discovered from the traces).
Since the code is complex, it could be that I misinterpret the objective or hidden usages of this variable.
Given that it's then used in multiple places (ZSTD_estimateSubBlockSize() and ZSTD_compressSubBlock()), it feels like a dangerous change.
Worth testing, and maybe worth following what happens to this variable in these sub-functions.
Yeah, the code is confusing.
Actually, in both ZSTD_estimateSubBlockSize() and ZSTD_compressSubBlock(), we don't use writeSeqEntropy in the nbSeq == 0 case. In both functions, we return early if nbSeq == 0, leaving that variable unread.
So really, I think it's just this decision of whether or not to emit uncompressed that we end up affecting.
writeSeqEntropy == 0 if seqEntropyWritten == 1, so we can think of writeSeqEntropy == 1 as saying "entropy has not been written". So basically the check of
if (writeSeqEntropy && ZSTD_needSequenceEntropyTables(&entropyMetadata->fseMetadata))
is saying that if we did not write seqEntropyTables, but did need them, we should emit uncompressed. So my setting of llType to set_compressed/rle is not correct, since we're saying that we always need them.
But at the same time, setting writeSeqEntropy = (send-sstart) == 0 ? 0 : 1; doesn't seem to produce the correct results either, as it fails the tests.
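The reading of the flag given above can be sketched as follows. This is a hypothetical condensation, not the superblock code: the function name, the per-sub-block array seqEntropyWritten, and the loop shape are all assumptions made for illustration. It only encodes what the comments state: the flag starts at 1, is cleared once sequence entropy has been written, and a fallback to uncompressed is triggered when tables were needed but never written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the fallback decision.
 * seqEntropyWritten[i] is 1 if sub-block i wrote its sequence entropy tables. */
static int must_emit_uncompressed_sketch(const int* seqEntropyWritten,
                                         size_t nbSubBlocks,
                                         int needSeqEntropyTables)
{
    int writeSeqEntropy = 1;   /* set unconditionally at the start */
    size_t i;
    assert(nbSubBlocks == 0 || seqEntropyWritten != NULL);
    for (i = 0; i < nbSubBlocks; ++i) {
        if (seqEntropyWritten[i]) writeSeqEntropy = 0;   /* tables are out, no need to write again */
    }
    /* Still 1 here means "entropy has not been written". */
    return writeSeqEntropy && needSeqEntropyTables;
}
```

Under these assumptions, a block with no sequences never clears the flag, so reporting that tables are "needed" always forces the uncompressed fallback.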
It's a pretty bad situation.
It seems we don't understand well enough what's happening in this unit,
which becomes a maintenance liability.
And since it has never worked correctly, and is not used anywhere,
I'm wondering if we would be better off without it.
One thing I've figured out is that the test failures are a result of the added early return in ZSTD_buildBlockEntropyStats_sequences() for nbSeq==0, which causes us to not run ZSTD_buildSequencesStatistics() and therefore miss out on some side effects that are apparently necessary. It doesn't actually matter what llType, etc. are when nbSeq == 0. I'll modify the PR accordingly to see if I can get that working.
I've updated this with a fix that should make sense: basically selectEncodingType() should evaluate to set_basic if nbSeq==0, so we just set that explicitly here. Though yeah, I'm not sure whether this code is worth the cost of maintaining its general complexity, especially considering that I don't think anyone really uses it.
The added branch should be hardwired for the typical compression path - I measured no speed difference at level 1.
lib/compress/zstd_compress.c
@@ -2388,7 +2388,8 @@ ZSTD_buildSequencesStatistics(seqStore_t* seqStorePtr, size_t nbSeq,
                         const ZSTD_fseCTables_t* prevEntropy, ZSTD_fseCTables_t* nextEntropy,
                         BYTE* dst, const BYTE* const dstEnd,
                         ZSTD_strategy strategy, unsigned* countWorkspace,
-                        void* entropyWorkspace, size_t entropyWkspSize) {
+                        void* entropyWorkspace, size_t entropyWkspSize,
+                        const U32 selectEncodingTypes) {
I'm annoyed that this selectEncodingTypes concept gets externalized in the function prototype.
It looks to me that it would make more sense to keep these topics separated.
What you could have is something like ZSTD_symbolEncodingTypeStats_t ZSTD_buildDummyStatistics(), and it would do the same thing as what's inside the if (selectEncodingTypes) { } else { ... } statement.
It would then be selected directly within ZSTD_buildBlockEntropyStats_sequences(), instead of ZSTD_buildSequencesStatistics(), depending on nbSeq==0 or not.
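A rough sketch of the separation suggested above, under heavy assumptions: encoding_stats_sketch is a reduced, hypothetical version of ZSTD_symbolEncodingTypeStats_t, the set_basic default follows the earlier comment about nbSeq==0, and build_sequences_statistics_sketch is a placeholder that returns fixed values rather than real statistics. It also ignores the side effects of ZSTD_buildSequencesStatistics() mentioned earlier in the thread, which a real change would have to preserve.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { set_basic, set_rle, set_compressed, set_repeat } symbolEncodingType_e;

/* Hypothetical, reduced version of ZSTD_symbolEncodingTypeStats_t. */
typedef struct {
    symbolEncodingType_e LLtype;
    symbolEncodingType_e Offtype;
    symbolEncodingType_e MLtype;
    size_t size;   /* bytes of entropy description produced */
} encoding_stats_sketch;

/* Dummy statistics for a block without sequences: nothing to describe. */
static encoding_stats_sketch build_dummy_statistics_sketch(void)
{
    encoding_stats_sketch const stats = { set_basic, set_basic, set_basic, 0 };
    return stats;
}

/* Placeholder for the real builder; the returned values are illustrative only. */
static encoding_stats_sketch build_sequences_statistics_sketch(size_t nbSeq)
{
    encoding_stats_sketch const stats = { set_compressed, set_compressed, set_compressed, 0 };
    assert(nbSeq > 0);   /* the real builder would never see an empty block */
    return stats;
}

/* The caller picks the variant, so no selector flag leaks into the builder's prototype. */
static encoding_stats_sketch build_block_entropy_stats_sketch(size_t nbSeq)
{
    return (nbSeq == 0) ? build_dummy_statistics_sketch()
                        : build_sequences_statistics_sketch(nbSeq);
}
```

The design point is that the nbSeq == 0 decision lives in one place, in the caller, and the statistics builder keeps its original signature.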
Discovered in #2591.
It is not possible to have the division by 0, so we should assert() that.
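As an illustration of the kind of assertion described here, the sketch below guards the divisor of a histogram-based cost estimate. It is not the ZSTD_entropyCost() implementation: entropy_cost_sketch and floor_log2_sketch are hypothetical names, and the cost model (8 minus the floored log2 of a probability scaled to 256) is a deliberately coarse assumption chosen to keep the example integer-only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: floor(log2(v)) for v >= 1. */
static unsigned floor_log2_sketch(unsigned v)
{
    unsigned r = 0;
    assert(v >= 1);
    while (v >>= 1) ++r;
    return r;
}

/* Coarse estimate, in bits, of coding `total` symbols with histogram `count[0..max]`. */
static size_t entropy_cost_sketch(const unsigned* count, unsigned max, size_t total)
{
    size_t cost = 0;
    unsigned s;
    assert(total > 0);   /* callers must not pass an empty histogram: total is a divisor */
    for (s = 0; s <= max; ++s) {
        unsigned norm;
        if (count[s] == 0) continue;
        norm = (unsigned)((256 * (size_t)count[s]) / total);   /* probability scaled to 256 */
        if (norm == 0) norm = 1;   /* rare symbols still get a non-zero probability */
        cost += (size_t)count[s] * (8 - floor_log2_sketch(norm));
    }
    return cost;
}
```

The assert documents the precondition without adding a runtime branch in release builds, which matches the intent of asserting rather than handling the zero case.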