-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor offset+repcode sumtype #2962
Conversation
to better reflect the value stored in this field.
this meant to abstract the sumtype representation required to transfert `offcode` to `ZSTD_storeSeq()`. Unfortunately, the sumtype numeric representation is currently a leaky abstraction that has permeated many other parts of the code, especially within `zstd_lazy.c` and also within `zstd_opt.c` and `zstd_compress.c`. While this PR makes a good job a transfering a large nb of call sites to using the new macros, there are still a few sites where this transformation is more complex, or where the numeric representation itself it used "as is". One of the problematics area is the decision to use the numeric format of the sumtype within the match finders of `zstd_lazy`. This commit doesn't change the behavior, it only introduces and employes the macros, but eventually the resulting code remains identical. At target, if the numeric representation of the sumtype can be completely abstracted and no other part of the code depends on it, it will be possible to move it towards something slightly more efficient.
to act on values stored / expressed in the sumtype numeric representation required by `storedSeq()`. This makes it possible to abstract away this representation by using the macros to extract these values. First user : ZSTD_updateRep() .
optLdm->offset might be == 0 in invalid case. Only use STORE_OFFSET() after validating it's a correct case.
the new contracts seems to make more sense : updateRep() updates an array of repeat offsets _in place_, while newRep() generates a new structure with the updated repeat-offset array. Most callers are actually expecting the in-place variant, and a limited sub-section, in `zstd_opt.c` mainly, prefer `newRep()`.
…ype numeric representation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this makes sense to me. I guess things will improve further in the next PR. I don't love the naming of the macros.
#define STORE_REPCODE_2 STORE_REPCODE(2) | ||
#define STORE_REPCODE_3 STORE_REPCODE(3) | ||
#define STORE_REPCODE(r) (assert((r)>=1), assert((r)<=3), (r)-1) | ||
#define STORE_OFFSET(o) (assert((o)>0), o + ZSTD_REP_MOVE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this pattern means evaluating the argument must not have any side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a known side effect of macros.
Indeed, these macros shall only be used with scalar variables, not invoking function calls.
And it happens this condition is well respected, throughout the whole code base.
For more defensive properties, a possible alternative would be to swap these macros with inline
functions,
although this second solution introduces its own set of potential casting issues.
#define STORE_OFFSET(o) (assert((o)>0), o + ZSTD_REP_MOVE) | ||
#define STORED_IS_OFFSET(o) ((o) > ZSTD_REP_MOVE) | ||
#define STORED_IS_REPCODE(o) ((o) <= ZSTD_REP_MOVE) | ||
#define STORED_OFFSET(o) (assert(STORED_IS_OFFSET(o)), (o)-ZSTD_REP_MOVE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
STORE_...
and STORED_...
are easily confused. Better antonyms of STORE_...
might be LOAD_...
or RETRIEVE_...
or DECODE_...
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
Initially, only the STORE_*()
macros were supposed to exist.
The other ones (STORED_*()
) were added later, when realizing that the numeric format of the sum type was relied upon in various parts of the code base. So this was a quick way to bring these usages under control. And then, the nb of STORED_*()
macros kept increasing...
It's more a temporary situation.
In the following PR, macro names will be updated significantly.
The macro names will indeed be changed in next PR, |
The format of the
offset
parameter used byZSTD_storeSeq()
interface is weird.To begin, with it's not an
offset
, it's actually a sum type, combiningoffset
|repcode
.The numerical format of this sum-type is fully exposed to callers of
ZSTD_storeSeq()
, which consist of many units within thezstd
project. They must know this numerical format for the parameter to make sense.Due to historical reasons, this format is as follows :
Not only do all call sites must know and use this numerical format directly,
but due to its pervasive presence in many parts of the code,
it has also become the de-facto numerical format of the sum-type for a growing set of other usages,
such as integration of 3rd-party match finders.
And the funny part is : it's not even the format actually stored in the
seqStore
, calledoffBase
,which is slightly different (+1), making this format effectively transient.
This PR tries to break this invisible dependency
by using instead a centralized set of macros,
STORE_OFFSET()
andSTORE_REPCODE()
,to convert from raw
repcode
andoffset
values into the sumtype required byZSTD_storeSeq()
.Because the numerical format is also used for other purposes in other functions,
this set is complemented by accessor macros,
such as
STORED_IS_OFFSET()
,so that future evolutions of the format can happen while preserving correctness of these other functions.
The end result of this PR is functionally equivalent to current version in
dev
branch,with the intermediate transient format still present exactly as it were.
However, with all interactions with this transient numerical format now under control,
it makes it possible possible to later update this format,
for example by removing the intermediate transient representation to target directly the
offBase
format.This would make the life of this field similar to existing
matchLength => mlBase => mlCode + mlBits
.Such a second step would be the topic of a later PR, as this one is already complex enough as it is.