Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert no division by 0 in ZSTD_entropyCost(), fix superblocks no sequences case #2592

Merged
merged 1 commit into from
May 7, 2021

Conversation

senhuang42
Copy link
Contributor

Discovered in #2591

Not possible to have the division by 0, so we should assert() that.

@senhuang42
Copy link
Contributor Author

Aha, I guess not, for superblocks.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Apr 29, 2021

Well, even for superblocks,
with the objective of creating inner blocks with approximately constant compressed size,
assuming that one such inner block has zero sequence, hence is all literals,
I would have expected such case to not attempt to generate statistics for sequences.
Maybe something to improve there ...

Regarding the issue #2591 : what matters specifically is to ensure there is no division by zero.
Hence, the more important check is assert(total > 0).

@senhuang42 senhuang42 force-pushed the assert_nbseq_not_0 branch 2 times, most recently from b039be0 to b568354 Compare April 29, 2021 18:17
@terrelln
Copy link
Contributor

terrelln commented May 5, 2021

Yeah, looks like there is a bug in superblock mode. When it calls ZSTD_buildBlockEntropyStats() nbSeq may be zero, and it isn't checked later. But regular compression looks fine.

FORWARD_IF_ERROR(ZSTD_buildBlockEntropyStats(&zc->seqStore,
&zc->blockState.prevCBlock->entropy,
&zc->blockState.nextCBlock->entropy,
&zc->appliedParams,
&entropyMetadata,
zc->entropyWorkspace, ENTROPY_WORKSPACE_SIZE /* statically allocated in resetCCtx */), "");

entropyMetadata->fseMetadata.fseTablesSize =
ZSTD_buildBlockEntropyStats_sequences(seqStorePtr,
&prevEntropy->fse, &nextEntropy->fse,
cctxParams,
&entropyMetadata->fseMetadata,
workspace, wkspSize);

stats = ZSTD_buildSequencesStatistics(seqStorePtr, nbSeq,
prevEntropy, nextEntropy, op, oend,
strategy, countWorkspace,
entropyWorkspace, entropyWorkspaceSize);

ZSTD_buildSequencesStatistics(seqStore_t* seqStorePtr, size_t nbSeq,

@Cyan4973
Copy link
Contributor

Cyan4973 commented May 6, 2021

I presume the suspected bug in superblock mode will be a separate issue / PR ?

@senhuang42 senhuang42 force-pushed the assert_nbseq_not_0 branch from b568354 to 8b94116 Compare May 6, 2021 19:04
@senhuang42 senhuang42 changed the title Assert that we aren't dividing by 0 in ZSTD_entropyCost() Assert that we aren't dividing by 0 in ZSTD_entropyCost(), fix superblocks nbseq==0 case May 6, 2021
@senhuang42 senhuang42 changed the title Assert that we aren't dividing by 0 in ZSTD_entropyCost(), fix superblocks nbseq==0 case Assert no division by 0 in ZSTD_entropyCost(), fix superblocks no sequences case May 6, 2021
@senhuang42
Copy link
Contributor Author

Updated to fix superblocks with nbSeq == 0

@@ -3103,6 +3104,12 @@ static size_t ZSTD_buildBlockEntropyStats_sequences(seqStore_t* seqStorePtr,
size_t entropyWorkspaceSize = wkspSize - (MaxSeq + 1) * sizeof(*countWorkspace);
ZSTD_symbolEncodingTypeStats_t stats;

if (nbSeq == 0) {
/* Do not build sequences statistics if nbSeq is 0, just return a default */
fseMetadata->llType = fseMetadata->mlType = fseMetadata->ofType = set_compressed;
Copy link
Contributor

@Cyan4973 Cyan4973 May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the consequences of settings these fseMedata to set_compressed ?
We know that there is no sequence, so it's not going to be used in this block.
It also will not be written into the compressed frame.

But what about the next block ?
Will it consider the previous block as featuring "compressed sequences", and therefore entropy headers ?
Asking as it might impact the algorithm which tries to decide between "repeating previous stats" or "issuing new stats".

Alternatively, this information might possibly be discarded later on and not even read (in which case, why setting them ?).

Copy link
Contributor Author

@senhuang42 senhuang42 May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically regressiontest only passes with the default set to set_compressed or set_rle.
There are three places where the encodingType is read:

if (writeEntropy) {
const U32 LLtype = fseMetadata->llType;

size_t sequencesSectionHeaderSize = 3; /* Use hard coded size of 3 bytes */
size_t cSeqSizeEstimate = 0;
cSeqSizeEstimate += ZSTD_estimateSubBlockSize_symbolType(fseMetadata->ofType, ofCodeTable, MaxOff,
nbSeq, fseTables->offcodeCTable, NULL,

static int ZSTD_needSequenceEntropyTables(ZSTD_fseCTablesMetadata_t const* fseMetadata)
{
if (fseMetadata->llType == set_compressed || fseMetadata->llType == set_rle)

The first two cases do not ever get read, due to nbSeq == 0. The last case still does, so basically whatever we decide to set fseMetadata->llType to, it's basically to affect how this function returns. It seems like even with nbSeq == 0, we'd want ZSTD_needSequenceEntropyTables() to evaluate to true right? At least the fuzzer tests seem to indicate so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my understanding is correct, ZSTD_needSequenceEntropyTables() is invoked in only a single place :

if (writeSeqEntropy && ZSTD_needSequenceEntropyTables(&entropyMetadata->fseMetadata)) {

This place seems to check if there is a discrepancy between needing entropy tables, and writing entropy tables.

This feels weird to me.
If nbSeq==0, then there should be no need to write any sequence entropy header.
Hence, we should have writeSeqEntropy == 0.

But it's set to 1 unconditionally at the beginning of the function.
The only place where it could be updated to 0 is here, and that's for an unrelated reason :

writeSeqEntropy = 0;

This code is very complex, and difficult to follow.
It looks to me that there should be no need to write Sequences Entropy is there is no sequence.
So my first idea would be to change that initialization to 0 in this case.
Unfortunately, the "number of Sequences" is not provided in a straightforward manner,
but is accessible through (unsigned)(send-sstart) (discovered from the traces) .

Since the code is complex, it could be that I misinterpret the objective or hidden usages of this variable.
Given that it's then used in multiple places (ZSTD_estimateSubBlockSize() and ZSTD_compressSubBlock())
it feels like a dangerous change.
Worth testing, and maybe worth following what happens to this variable in these sub-functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the code is confusing.

Actually, in both ZSTD_estimateSubBlockSize() and ZSTD_compressSubBlock(), for the nbSeq == 0 case, we don't use writeSeqEntropy. In both cases, we return early if nbSeq == 0, leaving that variable unread.

So really, I think it's just this decision of whether or not to emit uncompressed that we end up affecting.

writeSeqEntropy == 0 if seqEntropyWritten = 1, so we can think of writeSeqEntropy == 1 as saying "entropy has not been written". So basically the check of
if (writeSeqEntropy && ZSTD_needSequenceEntropyTables(&entropyMetadata->fseMetadata)) is saying that if we did not write seqEntropyTables, but did need them, we should emit uncompressed. So my setting of llType to set_compressed/rle is not correct, since we're saying that we always need them.

But at the same time, setting writeSeqEntropy = (send-sstart) == 0 ? 0 : 1; doesn't seem to produce the correct results either, as it fails the tests.

Copy link
Contributor

@Cyan4973 Cyan4973 May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pretty bad situation.
It seems we don't understand well enough what's happening in this unit,
which becomes a maintenance liability.
And since it has never worked correctly, and is not used anywhere,
I'm wondering if we would be better off without it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I've figured out is that the test failures are a result of the added early return in ZSTD_buildBlockEntropyStats_sequences() for nbSeq==0 causing us to not run ZSTD_buildSequencesStatistics(), which causes us to miss out on some side effects which are apparently necessary - it doesn't actually matter what the llType, etc. are when nbSeq == 0. I'll modify the PR accordingly to see if I can get that working.

Copy link
Contributor Author

@senhuang42 senhuang42 May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this with a fix that should make sense (basically selectEncodingType() should evaluate to set_basic if nbSeq==0, so we just set that explicitly here. Though yeah, I'm not sure whether the cost of maintaining this code is worth its general complexity, especially considering that I don't think anyone really uses it.

The added branch should be hardwired for the typical compression path - I measured no speed difference at level 1.

@senhuang42 senhuang42 force-pushed the assert_nbseq_not_0 branch from 8b94116 to f612483 Compare May 6, 2021 23:23
@@ -2388,7 +2388,8 @@ ZSTD_buildSequencesStatistics(seqStore_t* seqStorePtr, size_t nbSeq,
const ZSTD_fseCTables_t* prevEntropy, ZSTD_fseCTables_t* nextEntropy,
BYTE* dst, const BYTE* const dstEnd,
ZSTD_strategy strategy, unsigned* countWorkspace,
void* entropyWorkspace, size_t entropyWkspSize) {
void* entropyWorkspace, size_t entropyWkspSize,
const U32 selectEncodingTypes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm annoyed that this selectEncodingTypes concept gets externalized in the function prototype.

It looks to me that it would make more sense to keep these topics separated.

What you could have is something like
ZSTD_symbolEncodingTypeStats_t ZSTD_buildDummyStatistics()
and it would do the same thing as what's inside the if (selectEncodingTypes) { } else { ... } statement.

It would then be selected directly within ZSTD_buildBlockEntropyStats_sequences(), instead of ZSTD_buildSequencesStatistics(), depending on nbSeq==0 or not.

@senhuang42 senhuang42 force-pushed the assert_nbseq_not_0 branch 2 times, most recently from 33aa84f to ecb7a20 Compare May 7, 2021 15:23
@senhuang42 senhuang42 merged commit 9e94b7c into facebook:dev May 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants