Assert no division by 0 in `ZSTD_entropyCost()`, fix superblocks no sequences case #2592

senhuang42 · 2021-04-29T17:27:16Z

Discovered in #2591

Not possible to have the division by 0, so we should assert() that.

senhuang42 · 2021-04-29T17:36:10Z

Aha, I guess not, for superblocks.

Cyan4973 · 2021-04-29T17:42:32Z

Well, even for superblocks,
with the objective of creating inner blocks with approximately constant compressed size,
assuming that one such inner block has zero sequence, hence is all literals,
I would have expected such case to not attempt to generate statistics for sequences.
Maybe something to improve there ...

Regarding the issue #2591 : what matters specifically is to ensure there is no division by zero.
Hence, the more important check is assert(total > 0).

terrelln · 2021-05-05T06:03:45Z

Yeah, looks like there is a bug in superblock mode. When it calls ZSTD_buildBlockEntropyStats() nbSeq may be zero, and it isn't checked later. But regular compression looks fine.

zstd/lib/compress/zstd_compress_superblock.c

Lines 555 to 560 in 10e5513

    
           FORWARD_IF_ERROR(ZSTD_buildBlockEntropyStats(&zc->seqStore, 
        
                 &zc->blockState.prevCBlock->entropy, 
        
                 &zc->blockState.nextCBlock->entropy, 
        
                 &zc->appliedParams, 
        
                 &entropyMetadata, 
        
                 zc->entropyWorkspace, ENTROPY_WORKSPACE_SIZE /* statically allocated in resetCCtx */), "");

zstd/lib/compress/zstd_compress.c

Lines 3159 to 3164 in 10e5513

    
           entropyMetadata->fseMetadata.fseTablesSize = 
        
               ZSTD_buildBlockEntropyStats_sequences(seqStorePtr, 
        
                                                     &prevEntropy->fse, &nextEntropy->fse, 
        
                                                     cctxParams, 
        
                                                     &entropyMetadata->fseMetadata, 
        
                                                     workspace, wkspSize);

zstd/lib/compress/zstd_compress.c

Lines 3125 to 3128 in 10e5513

    
           stats = ZSTD_buildSequencesStatistics(seqStorePtr, nbSeq, 
        
                                               prevEntropy, nextEntropy, op, oend, 
        
                                               strategy, countWorkspace, 
        
                                               entropyWorkspace, entropyWorkspaceSize);

zstd/lib/compress/zstd_compress.c

Line 2405 in 10e5513

ZSTD_buildSequencesStatistics(seqStore_t* seqStorePtr, size_t nbSeq,

Cyan4973 · 2021-05-06T15:30:52Z

I presume the suspected bug in superblock mode will be a separate issue / PR ?

senhuang42 · 2021-05-06T20:39:02Z

Updated to fix superblocks with nbSeq == 0

Cyan4973 · 2021-05-06T21:10:41Z

lib/compress/zstd_compress.c

@@ -3103,6 +3104,12 @@ static size_t ZSTD_buildBlockEntropyStats_sequences(seqStore_t* seqStorePtr,
    size_t entropyWorkspaceSize = wkspSize - (MaxSeq + 1) * sizeof(*countWorkspace);
    ZSTD_symbolEncodingTypeStats_t stats;

+    if (nbSeq == 0) {
+        /* Do not build sequences statistics if nbSeq is 0, just return a default */
+        fseMetadata->llType = fseMetadata->mlType = fseMetadata->ofType = set_compressed;


What are the consequences of settings these fseMedata to set_compressed ?
We know that there is no sequence, so it's not going to be used in this block.
It also will not be written into the compressed frame.

But what about the next block ?
Will it consider the previous block as featuring "compressed sequences", and therefore entropy headers ?
Asking as it might impact the algorithm which tries to decide between "repeating previous stats" or "issuing new stats".

Alternatively, this information might possibly be discarded later on and not even read (in which case, why setting them ?).

So basically regressiontest only passes with the default set to set_compressed or set_rle.
There are three places where the encodingType is read:

zstd/lib/compress/zstd_compress_superblock.c

Lines 193 to 194 in e7e4b74

if (writeEntropy) {

const U32 LLtype = fseMetadata->llType;

zstd/lib/compress/zstd_compress_superblock.c

Lines 368 to 371 in e7e4b74

size_t sequencesSectionHeaderSize = 3; /* Use hard coded size of 3 bytes */

size_t cSeqSizeEstimate = 0;

cSeqSizeEstimate += ZSTD_estimateSubBlockSize_symbolType(fseMetadata->ofType, ofCodeTable, MaxOff,

nbSeq, fseTables->offcodeCTable, NULL,

zstd/lib/compress/zstd_compress_superblock.c

Lines 405 to 407 in e7e4b74

static int ZSTD_needSequenceEntropyTables(ZSTD_fseCTablesMetadata_t const* fseMetadata)

{

if (fseMetadata->llType == set_compressed || fseMetadata->llType == set_rle)

The first two cases do not ever get read, due to nbSeq == 0. The last case still does, so basically whatever we decide to set fseMetadata->llType to, it's basically to affect how this function returns. It seems like even with nbSeq == 0, we'd want ZSTD_needSequenceEntropyTables() to evaluate to true right? At least the fuzzer tests seem to indicate so.

If my understanding is correct, ZSTD_needSequenceEntropyTables() is invoked in only a single place :

zstd/lib/compress/zstd_compress_superblock.c

Line 521 in e7e4b74

if (writeSeqEntropy && ZSTD_needSequenceEntropyTables(&entropyMetadata->fseMetadata)) {

This place seems to check if there is a discrepancy between needing entropy tables, and writing entropy tables.

This feels weird to me.
If nbSeq==0, then there should be no need to write any sequence entropy header.
Hence, we should have writeSeqEntropy == 0.

But it's set to 1 unconditionally at the beginning of the function.
The only place where it could be updated to 0 is here, and that's for an unrelated reason :

zstd/lib/compress/zstd_compress_superblock.c

Line 512 in e7e4b74

writeSeqEntropy = 0;

This code is very complex, and difficult to follow.
It looks to me that there should be no need to write Sequences Entropy is there is no sequence.
So my first idea would be to change that initialization to 0 in this case.
Unfortunately, the "number of Sequences" is not provided in a straightforward manner,
but is accessible through (unsigned)(send-sstart) (discovered from the traces) .

Since the code is complex, it could be that I misinterpret the objective or hidden usages of this variable.
Given that it's then used in multiple places (ZSTD_estimateSubBlockSize() and ZSTD_compressSubBlock())
it feels like a dangerous change.
Worth testing, and maybe worth following what happens to this variable in these sub-functions.

Yeah the code is confusing.

Actually, in both ZSTD_estimateSubBlockSize() and ZSTD_compressSubBlock(), for the nbSeq == 0 case, we don't use writeSeqEntropy. In both cases, we return early if nbSeq == 0, leaving that variable unread.

So really, I think it's just this decision of whether or not to emit uncompressed that we end up affecting.

writeSeqEntropy == 0 if seqEntropyWritten = 1, so we can think of writeSeqEntropy == 1 as saying "entropy has not been written". So basically the check of
if (writeSeqEntropy && ZSTD_needSequenceEntropyTables(&entropyMetadata->fseMetadata)) is saying that if we did not write seqEntropyTables, but did need them, we should emit uncompressed. So my setting of llType to set_compressed/rle is not correct, since we're saying that we always need them.

But at the same time, setting writeSeqEntropy = (send-sstart) == 0 ? 0 : 1; doesn't seem to produce the correct results either, as it fails the tests.

It's a pretty bad situation.
It seems we don't understand well enough what's happening in this unit,
which becomes a maintenance liability.
And since it has never worked correctly, and is not used anywhere,
I'm wondering if we would be better off without it.

One thing I've figured out is that the test failures are a result of the added early return in ZSTD_buildBlockEntropyStats_sequences() for nbSeq==0 causing us to not run ZSTD_buildSequencesStatistics(), which causes us to miss out on some side effects which are apparently necessary - it doesn't actually matter what the llType, etc. are when nbSeq == 0. I'll modify the PR accordingly to see if I can get that working.

I've updated this with a fix that should make sense (basically selectEncodingType() should evaluate to set_basic if nbSeq==0, so we just set that explicitly here. Though yeah, I'm not sure whether the cost of maintaining this code is worth its general complexity, especially considering that I don't think anyone really uses it.

The added branch should be hardwired for the typical compression path - I measured no speed difference at level 1.

Cyan4973 · 2021-05-07T01:58:50Z

lib/compress/zstd_compress.c

@@ -2388,7 +2388,8 @@ ZSTD_buildSequencesStatistics(seqStore_t* seqStorePtr, size_t nbSeq,
                        const ZSTD_fseCTables_t* prevEntropy, ZSTD_fseCTables_t* nextEntropy,
                              BYTE* dst, const BYTE* const dstEnd,
                              ZSTD_strategy strategy, unsigned* countWorkspace,
-                              void* entropyWorkspace, size_t entropyWkspSize) {
+                              void* entropyWorkspace, size_t entropyWkspSize,
+                              const U32 selectEncodingTypes) {


I'm annoyed that this selectEncodingTypes concept gets externalized in the function prototype.

It looks to me that it would make more sense to keep these topics separated.

What you could have is something like
ZSTD_symbolEncodingTypeStats_t ZSTD_buildDummyStatistics()
and it would do the same thing as what's inside the if (selectEncodingTypes) { } else { ... } statement.

It would then be selected directly within ZSTD_buildBlockEntropyStats_sequences(), instead of ZSTD_buildSequencesStatistics(), depending on nbSeq==0 or not.

facebook-github-bot added the CLA Signed label Apr 29, 2021

senhuang42 force-pushed the assert_nbseq_not_0 branch 2 times, most recently from b039be0 to b568354 Compare April 29, 2021 18:17

senhuang42 force-pushed the assert_nbseq_not_0 branch from b568354 to 8b94116 Compare May 6, 2021 19:04

senhuang42 changed the title ~~Assert that we aren't dividing by 0 in ZSTD_entropyCost()~~ Assert that we aren't dividing by 0 in ZSTD_entropyCost(), fix superblocks nbseq==0 case May 6, 2021

senhuang42 changed the title ~~Assert that we aren't dividing by 0 in ZSTD_entropyCost(), fix superblocks nbseq==0 case~~ Assert no division by 0 in ZSTD_entropyCost(), fix superblocks no sequences case May 6, 2021

Cyan4973 reviewed May 6, 2021

View reviewed changes

senhuang42 force-pushed the assert_nbseq_not_0 branch from 8b94116 to f612483 Compare May 6, 2021 23:23

Cyan4973 reviewed May 7, 2021

View reviewed changes

senhuang42 force-pushed the assert_nbseq_not_0 branch 2 times, most recently from 33aa84f to ecb7a20 Compare May 7, 2021 15:23

Assert no divison by 0, correct superblocks 0 sequences case

ecb7a20

Cyan4973 approved these changes May 7, 2021

View reviewed changes

senhuang42 merged commit 9e94b7c into facebook:dev May 7, 2021

senhuang42 mentioned this pull request May 11, 2021

🎉 Zstd 1.5.0 Release 🎉 #2636

Merged

terrelln mentioned this pull request Nov 24, 2021

A possible divide by zero bug? #2591

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assert no division by 0 in `ZSTD_entropyCost()`, fix superblocks no sequences case #2592

Assert no division by 0 in `ZSTD_entropyCost()`, fix superblocks no sequences case #2592

senhuang42 commented Apr 29, 2021

senhuang42 commented Apr 29, 2021

Cyan4973 commented Apr 29, 2021 •

edited

Loading

terrelln commented May 5, 2021 •

edited

Loading

Cyan4973 commented May 6, 2021

senhuang42 commented May 6, 2021

Cyan4973 May 6, 2021 •

edited

Loading

senhuang42 May 6, 2021 •

edited

Loading

Cyan4973 May 6, 2021

senhuang42 May 6, 2021

Cyan4973 May 6, 2021 •

edited

Loading

senhuang42 May 6, 2021

senhuang42 May 6, 2021 •

edited

Loading

Cyan4973 May 7, 2021

	size_t sequencesSectionHeaderSize = 3; /* Use hard coded size of 3 bytes */
	size_t cSeqSizeEstimate = 0;
	cSeqSizeEstimate += ZSTD_estimateSubBlockSize_symbolType(fseMetadata->ofType, ofCodeTable, MaxOff,
	nbSeq, fseTables->offcodeCTable, NULL,

	static int ZSTD_needSequenceEntropyTables(ZSTD_fseCTablesMetadata_t const* fseMetadata)
	{
	if (fseMetadata->llType == set_compressed \|\| fseMetadata->llType == set_rle)

Assert no division by 0 in ZSTD_entropyCost(), fix superblocks no sequences case #2592

Assert no division by 0 in ZSTD_entropyCost(), fix superblocks no sequences case #2592

Conversation

senhuang42 commented Apr 29, 2021

senhuang42 commented Apr 29, 2021

Cyan4973 commented Apr 29, 2021 • edited Loading

terrelln commented May 5, 2021 • edited Loading

Cyan4973 commented May 6, 2021

senhuang42 commented May 6, 2021

Cyan4973 May 6, 2021 • edited Loading

Choose a reason for hiding this comment

senhuang42 May 6, 2021 • edited Loading

Choose a reason for hiding this comment

Cyan4973 May 6, 2021

Choose a reason for hiding this comment

senhuang42 May 6, 2021

Choose a reason for hiding this comment

Cyan4973 May 6, 2021 • edited Loading

Choose a reason for hiding this comment

senhuang42 May 6, 2021

Choose a reason for hiding this comment

senhuang42 May 6, 2021 • edited Loading

Choose a reason for hiding this comment

Cyan4973 May 7, 2021

Choose a reason for hiding this comment

Assert no division by 0 in `ZSTD_entropyCost()`, fix superblocks no sequences case #2592

Assert no division by 0 in `ZSTD_entropyCost()`, fix superblocks no sequences case #2592

Cyan4973 commented Apr 29, 2021 •

edited

Loading

terrelln commented May 5, 2021 •

edited

Loading

Cyan4973 May 6, 2021 •

edited

Loading

senhuang42 May 6, 2021 •

edited

Loading

Cyan4973 May 6, 2021 •

edited

Loading

senhuang42 May 6, 2021 •

edited

Loading