Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor offset+repcode sumtype #2962

Merged
merged 15 commits into from
Dec 30, 2021
Merged

Refactor offset+repcode sumtype #2962

merged 15 commits into from
Dec 30, 2021

Conversation

Cyan4973
Copy link
Contributor

@Cyan4973 Cyan4973 commented Dec 29, 2021

The format of the offset parameter used by ZSTD_storeSeq() interface is weird.
To begin, with it's not an offset, it's actually a sum type, combining offset | repcode.

The numerical format of this sum-type is fully exposed to callers of ZSTD_storeSeq(), which consist of many units within the zstd project. They must know this numerical format for the parameter to make sense.

Due to historical reasons, this format is as follows :

  • 0,1,2 : represent repcodes 1,2,3 (respectively)
  • 3+ : represent real offsets, but with a +2 (offset 1 ==> 3)

Not only do all call sites must know and use this numerical format directly,
but due to its pervasive presence in many parts of the code,
it has also become the de-facto numerical format of the sum-type for a growing set of other usages,
such as integration of 3rd-party match finders.

And the funny part is : it's not even the format actually stored in the seqStore, called offBase,
which is slightly different (+1), making this format effectively transient.

This PR tries to break this invisible dependency
by using instead a centralized set of macros,
STORE_OFFSET() and STORE_REPCODE(),
to convert from raw repcode and offset values into the sumtype required by ZSTD_storeSeq().

Because the numerical format is also used for other purposes in other functions,
this set is complemented by accessor macros,
such as STORED_IS_OFFSET(),
so that future evolutions of the format can happen while preserving correctness of these other functions.

The end result of this PR is functionally equivalent to current version in dev branch,
with the intermediate transient format still present exactly as it were.
However, with all interactions with this transient numerical format now under control,
it makes it possible possible to later update this format,
for example by removing the intermediate transient representation to target directly the offBase format.
This would make the life of this field similar to existing matchLength => mlBase => mlCode + mlBits.

Such a second step would be the topic of a later PR, as this one is already complex enough as it is.

to better reflect the value stored in this field.
this meant to abstract the sumtype representation required
to transfert `offcode` to `ZSTD_storeSeq()`.

Unfortunately, the sumtype numeric representation is currently a leaky abstraction
that has permeated many other parts of the code,
especially within `zstd_lazy.c` and also within `zstd_opt.c` and `zstd_compress.c`.

While this PR makes a good job a transfering a large nb of call sites
to using the new macros, there are still a few sites where this transformation is more complex,
or where the numeric representation itself it used "as is".

One of the problematics area is the decision to use the numeric format of the sumtype
within the match finders of `zstd_lazy`.

This commit doesn't change the behavior, it only introduces and employes the macros,
but eventually the resulting code remains identical.

At target, if the numeric representation of the sumtype can be completely abstracted
and no other part of the code depends on it,
it will be possible to move it towards something slightly more efficient.
to act on values stored / expressed in the sumtype numeric representation required by `storedSeq()`.

This makes it possible to abstract away this representation by using the macros to extract these values.

First user : ZSTD_updateRep() .
optLdm->offset might be == 0 in invalid case.
Only use STORE_OFFSET() after validating it's a correct case.
the new contracts seems to make more sense :
updateRep() updates an array of repeat offsets _in place_,
while newRep() generates a new structure with the updated repeat-offset array.

Most callers are actually expecting the in-place variant,
and a limited sub-section, in `zstd_opt.c` mainly, prefer `newRep()`.
Copy link
Contributor

@felixhandte felixhandte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this makes sense to me. I guess things will improve further in the next PR. I don't love the naming of the macros.

#define STORE_REPCODE_2 STORE_REPCODE(2)
#define STORE_REPCODE_3 STORE_REPCODE(3)
#define STORE_REPCODE(r) (assert((r)>=1), assert((r)<=3), (r)-1)
#define STORE_OFFSET(o) (assert((o)>0), o + ZSTD_REP_MOVE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this pattern means evaluating the argument must not have any side effects.

Copy link
Contributor Author

@Cyan4973 Cyan4973 Dec 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a known side effect of macros.
Indeed, these macros shall only be used with scalar variables, not invoking function calls.
And it happens this condition is well respected, throughout the whole code base.

For more defensive properties, a possible alternative would be to swap these macros with inline functions,
although this second solution introduces its own set of potential casting issues.

#define STORE_OFFSET(o) (assert((o)>0), o + ZSTD_REP_MOVE)
#define STORED_IS_OFFSET(o) ((o) > ZSTD_REP_MOVE)
#define STORED_IS_REPCODE(o) ((o) <= ZSTD_REP_MOVE)
#define STORED_OFFSET(o) (assert(STORED_IS_OFFSET(o)), (o)-ZSTD_REP_MOVE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STORE_... and STORED_... are easily confused. Better antonyms of STORE_... might be LOAD_... or RETRIEVE_... or DECODE_....

Copy link
Contributor Author

@Cyan4973 Cyan4973 Dec 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.
Initially, only the STORE_*() macros were supposed to exist.
The other ones (STORED_*()) were added later, when realizing that the numeric format of the sum type was relied upon in various parts of the code base. So this was a quick way to bring these usages under control. And then, the nb of STORED_*() macros kept increasing...

It's more a temporary situation.
In the following PR, macro names will be updated significantly.

@Cyan4973
Copy link
Contributor Author

Cyan4973 commented Dec 29, 2021

Overall this makes sense to me. I guess things will improve further in the next PR. I don't love the naming of the macros.

The macro names will indeed be changed in next PR,
they will likely be clearer, it will also make more sense to fine-tune naming in this following PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants