Refactor offset+repcode sumtype #2962

Cyan4973 · 2021-12-29T01:32:28Z

The format of the offset parameter used by the ZSTD_storeSeq() interface is weird.
To begin with, it's not an offset: it's actually a sum type, combining offset | repcode.

The numerical format of this sum-type is fully exposed to callers of ZSTD_storeSeq(), which consist of many units within the zstd project. They must know this numerical format for the parameter to make sense.

Due to historical reasons, this format is as follows :

0,1,2 : represent repcodes 1,2,3 (respectively)
3+ : represent real offsets, but with a +2 (offset 1 ==> 3)
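
As a minimal sketch, the layout above can be written out in C. The helper names and the `REP_MOVE` constant are hypothetical, chosen for illustration; only the numeric mapping comes from the description.

```c
#include <assert.h>

/* Hypothetical constant: the shift applied to real offsets,
 * inferred from "offset 1 ==> 3" in the description. */
#define REP_MOVE 2u

/* Repcodes 1,2,3 are encoded as 0,1,2. */
static inline unsigned encode_repcode(unsigned rep)
{
    assert(rep >= 1 && rep <= 3);
    return rep - 1;
}

/* Real offsets (>= 1) are encoded as offset + 2, so they start at 3. */
static inline unsigned encode_offset(unsigned offset)
{
    assert(offset >= 1);
    return offset + REP_MOVE;
}

/* A value is a repcode when it falls in 0..2. */
static inline int is_repcode(unsigned value)
{
    return value <= REP_MOVE;
}
```

Every call site that builds or inspects such a value by hand has to repeat this arithmetic, which is the dependency the PR sets out to break.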

Not only must all call sites know and use this numerical format directly,
but due to its pervasive presence in many parts of the code,
it has also become the de-facto numerical format of the sum-type for a growing set of other usages,
such as integration of 3rd-party match finders.

And the funny part is : it's not even the format actually stored in the seqStore, called offBase,
which is slightly different (+1), making this format effectively transient.
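
Assuming the "+1" means the stored value is simply the transient value plus one, the relation between the two formats could be sketched as follows. The helper name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: converts the transient sum-type value
 * (0..2 = repcodes, 3+ = offset + 2) into the value kept in the seqStore,
 * ASSUMING offBase is the transient value + 1. */
static inline unsigned transient_to_offbase(unsigned transient)
{
    return transient + 1;
}
```

Under that assumption, repcodes 1,2,3 map to offBase values 1,2,3, and a real offset `o` maps to `o + 3`.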

This PR tries to break this invisible dependency
by using instead a centralized set of macros,
STORE_OFFSET() and STORE_REPCODE(),
to convert from raw repcode and offset values into the sumtype required by ZSTD_storeSeq().

Because the numerical format is also used for other purposes in other functions,
this set is complemented by accessor macros,
such as STORED_IS_OFFSET(),
so that future evolutions of the format can happen while preserving correctness of these other functions.
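
A short sketch of how a caller might combine the two families of macros without touching the numeric format directly. The macro bodies follow the definitions introduced by this PR (with the argument parenthesized); the value of `ZSTD_REP_MOVE` and the `raw_offset_or_zero()` helper are assumptions made for illustration.

```c
#include <assert.h>

#define ZSTD_REP_MOVE 2   /* assumed value, consistent with "offset 1 ==> 3" */

/* Conversion macros: raw repcode / offset -> sum-type value. */
#define STORE_REPCODE(r) (assert((r)>=1), assert((r)<=3), (r)-1)
#define STORE_OFFSET(o)  (assert((o)>0), (o) + ZSTD_REP_MOVE)

/* Accessor macros: inspect a sum-type value. */
#define STORED_IS_OFFSET(o)  ((o) > ZSTD_REP_MOVE)
#define STORED_IS_REPCODE(o) ((o) <= ZSTD_REP_MOVE)
#define STORED_OFFSET(o)     (assert(STORED_IS_OFFSET(o)), (o) - ZSTD_REP_MOVE)

/* Hypothetical consumer: returns the raw offset, or 0 for a repcode. */
static inline unsigned raw_offset_or_zero(unsigned stored)
{
    return STORED_IS_OFFSET(stored) ? STORED_OFFSET(stored) : 0;
}
```

Because the consumer only goes through the accessors, changing the numeric layout later would only require updating the macro definitions.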

The end result of this PR is functionally equivalent to the current version in the dev branch,
with the intermediate transient format still present exactly as it was.
However, with all interactions with this transient numerical format now under control,
it becomes possible to later update this format,
for example by removing the intermediate transient representation to target directly the offBase format.
This would make the life of this field similar to existing matchLength => mlBase => mlCode + mlBits.

Such a second step would be the topic of a later PR, as this one is already complex enough as it is.

This is meant to abstract the sumtype representation required to transfer `offcode` to `ZSTD_storeSeq()`. Unfortunately, the sumtype numeric representation is currently a leaky abstraction that has permeated many other parts of the code, especially within `zstd_lazy.c` and also within `zstd_opt.c` and `zstd_compress.c`. While this PR does a good job of transferring a large number of call sites to the new macros, there are still a few sites where this transformation is more complex, or where the numeric representation itself is used "as is". One of the problematic areas is the decision to use the numeric format of the sumtype within the match finders of `zstd_lazy`. This commit doesn't change behavior: it only introduces and employs the macros, and the resulting code remains identical. Ultimately, if the numeric representation of the sumtype can be completely abstracted so that no other part of the code depends on it, it will be possible to move it towards something slightly more efficient.

The new contracts seem to make more sense: updateRep() updates an array of repeat offsets _in place_, while newRep() generates a new structure with the updated repeat-offset array. Most callers actually expect the in-place variant, and a limited subset, mainly in `zstd_opt.c`, prefers `newRep()`.
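
The difference between the two contracts can be sketched as below. This is a simplified illustration, not the project's implementation: the type and function names are hypothetical, and only the case of a real offset is handled (repcode handling is left out).

```c
#include <assert.h>
#include <string.h>

#define REP_NUM 3

/* Hypothetical stand-in for a structure holding the repeat-offset history. */
typedef struct { unsigned rep[REP_NUM]; } repcodes_t;

/* In-place contract: the caller's array is modified.
 * Simplified: a new real offset is pushed to the front of the history. */
static inline void update_rep(unsigned rep[REP_NUM], unsigned offset)
{
    rep[2] = rep[1];
    rep[1] = rep[0];
    rep[0] = offset;
}

/* Value-returning contract: the input is left untouched and an
 * updated copy is returned. */
static inline repcodes_t new_rep(const unsigned rep[REP_NUM], unsigned offset)
{
    repcodes_t result;
    memcpy(result.rep, rep, sizeof(result.rep));
    update_rep(result.rep, offset);
    return result;
}
```

A caller that needs to evaluate several candidate sequences from the same starting history benefits from the value-returning form, while a caller that simply advances through the input can use the in-place form and avoid the copy.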

felixhandte

Overall this makes sense to me. I guess things will improve further in the next PR. I don't love the naming of the macros.

felixhandte · 2021-12-29T20:28:01Z

lib/compress/zstd_compress_internal.h

+#define STORE_REPCODE_2 STORE_REPCODE(2)
+#define STORE_REPCODE_3 STORE_REPCODE(3)
+#define STORE_REPCODE(r) (assert((r)>=1), assert((r)<=3), (r)-1)
+#define STORE_OFFSET(o)  (assert((o)>0), o + ZSTD_REP_MOVE)


Note that this pattern means evaluating the argument must not have any side effects.

Yes, this is a known side effect of macros.
Indeed, these macros shall only be used with scalar variables, not invoking function calls.
And it happens this condition is well respected, throughout the whole code base.

For more defensive properties, a possible alternative would be to swap these macros with inline functions,
although this second solution introduces its own set of potential casting issues.
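
To illustrate the trade-off, here is a hedged sketch comparing the two forms. The names `STORE_OFFSET_MACRO`, `store_offset_fn` and `REP_MOVE` are hypothetical and are not part of the PR.

```c
#include <assert.h>

#define REP_MOVE 2u   /* assumed shift applied to real offsets */

/* Macro form: the argument appears twice, so with assertions enabled
 * an expression such as i++ would be evaluated twice. */
#define STORE_OFFSET_MACRO(o) (assert((o) > 0), (o) + REP_MOVE)

/* Function form: the argument is evaluated exactly once, but the
 * parameter type is fixed, so callers passing other integer types
 * rely on implicit conversion or explicit casts. */
static inline unsigned store_offset_fn(unsigned o)
{
    assert(o > 0);
    return o + REP_MOVE;
}
```

With the function form, `store_offset_fn(i++)` increments `i` once regardless of whether assertions are compiled in; with the macro form, the number of evaluations depends on `NDEBUG`.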

felixhandte · 2021-12-29T20:28:05Z

lib/compress/zstd_compress_internal.h

+#define STORE_OFFSET(o)  (assert((o)>0), o + ZSTD_REP_MOVE)
+#define STORED_IS_OFFSET(o)  ((o) > ZSTD_REP_MOVE)
+#define STORED_IS_REPCODE(o) ((o) <= ZSTD_REP_MOVE)
+#define STORED_OFFSET(o)  (assert(STORED_IS_OFFSET(o)), (o)-ZSTD_REP_MOVE)


STORE_... and STORED_... are easily confused. Better antonyms of STORE_... might be LOAD_... or RETRIEVE_... or DECODE_....

Agreed.
Initially, only the STORE_*() macros were supposed to exist.
The other ones (STORED_*()) were added later, when realizing that the numeric format of the sum type was relied upon in various parts of the code base. So this was a quick way to bring these usages under control. And then, the number of STORED_*() macros kept increasing...

It's more a temporary situation.
In the following PR, macro names will be updated significantly.

Cyan4973 · 2021-12-29T21:33:24Z

Overall this makes sense to me. I guess things will improve further in the next PR. I don't love the naming of the macros.

The macro names will indeed be changed in the next PR.
They will likely be clearer, and it makes more sense to fine-tune the naming in that following PR.

Cyan4973 added 14 commits December 23, 2021 17:56

change seqDef.offset into seqDef.offBase

aeff128

to better reflect the value stored in this field.

Merge branch 'dev' into seqStore_off

bec7bbb

created STORED_*() macros

2068889

to act on values stored / expressed in the sumtype numeric representation required by `ZSTD_storeSeq()`. This makes it possible to abstract away this representation by using the macros to extract these values. First user : ZSTD_updateRep() .

fixed regression test assert

435f5a2

optLdm->offset might be == 0 in invalid case. Only use STORE_OFFSET() after validating it's a correct case.

abstracted usage of offBase sumtype within zstd_lazy.c

b7630a4

fixed minor typecast warnings

321583c

abstracted storeSeq() sumtype numeric representation from decodecorpus.c

681c81f

abstracted storeSeq() sumtype numeric representation from zstd_opt.c

e909fa6

abstracted storeSeq() sumtype numeric representation from zstd_lazy.c

92a08ee

fixed minor conversion warnings

a34ccad

regroup all mentions of ZSTD_REP_MOVE within zstd_compress_internal.h

de9f52e

found a few more places which were dependent on seqStore offcode sumtype numeric representation

8da4142

facebook-github-bot added the CLA Signed label Dec 29, 2021

use ZSTD_memcpy(), for proper redirection within Linux Kernel

ad7c9fc

felixhandte reviewed Dec 29, 2021

felixhandte approved these changes Dec 29, 2021

Cyan4973 merged commit fb14e22 into dev Dec 30, 2021

Cyan4973 mentioned this pull request Dec 30, 2021

Converge sumtype (offset | repcode) numeric representation towards offBase #2965

Merged

Cyan4973 deleted the seqStore_off branch January 13, 2023 04:27
