added documentation for LDM + dictionary compatibility #3553

Merged
merged 1 commit into from
Mar 16, 2023

@Cyan4973 Cyan4973 commented Mar 15, 2023

As mentioned by @reyqn in #2835 (comment), the Long Distance Mode (LDM) works best with dictionaries loaded with ZSTD_CCtx_refPrefix().
LDM is effectively incompatible with ZSTD_CDict, and by extension with ZSTD_CCtx_loadDictionary(), so results are disappointing when trying to combine them.

Added documentation to nudge users towards ZSTD_CCtx_refPrefix() when they want to use a dictionary as a large "reference image", which requires LDM for proper indexing.

@Cyan4973 Cyan4973 merged commit e220824 into dev Mar 16, 2023
@ghost

ghost commented Mar 18, 2023

May I ask why a prefix is only used once?
If there were an API that allowed a cctx to reuse a prefix indefinitely, what negative effects would it have?

From ZSTD_CCtx_refPrefix() doc:

  • Reference a prefix (single-usage dictionary) for next compressed frame.
  • A prefix is only used once. Tables are discarded at end of frame (ZSTD_e_end).

@Cyan4973

Cyan4973 commented Mar 18, 2023

A prefix must be loaded into the match finder tables.
The match finder tables are then mutated during the rest of the compression process,
so the initial "state" of the match finder tables, after loading the prefix, is effectively lost.

Therefore, using a prefix a second time requires loading its content into the match finder tables again. This is a non-trivial cost.

This is in contrast with a CDict, which is created once and is then immutable,
allowing it to be used by any number of CCtx afterwards,
without any initialization cost.

Employing a prefix rather than a full-featured CDict makes sense when it's only going to be used once.

@ghost

ghost commented Mar 24, 2023

Thanks for your explanation.

This sentence seems a bit misleading; maybe it should say "regarded as a prefix" rather than "called a prefix".

zstd/lib/zstd.h

Line 904 in 3e0550e

* A dictionary can be any arbitrary data segment (also called a prefix),

I have another question: if a prefix is loaded as ZSTD_dct_fullDict, how is it different from a dictionary?

zstd/lib/zstd.h

Lines 1895 to 1898 in 3e0550e

/*! ZSTD_CCtx_refPrefix_advanced() :
* Same as ZSTD_CCtx_refPrefix(), but gives finer control over
* how to interpret prefix content (automatic ? force raw mode (default) ? full mode only ?) */
ZSTDLIB_STATIC_API size_t ZSTD_CCtx_refPrefix_advanced(ZSTD_CCtx* cctx, const void* prefix, size_t prefixSize, ZSTD_dictContentType_e dictContentType);

@Cyan4973

> I have another question: if a prefix is loaded as ZSTD_dct_fullDict, how is it different from a dictionary?

It will still be loaded directly into the match search tables, and therefore the initial state will be lost during the compression process. So it's only suitable when used once.

The point of ZSTD_dct_fullDict is to explicitly tell the compressor to expect only a well-formed trained dictionary, with a conformant header. If not, it will error out.

The difference with the default auto mode is that, in auto mode, if the dictionary doesn't contain a well-formed header, it defaults to raw mode, where the entire input is treated as "raw content", and no statistics are present (nor any dictionary ID). This could be a missed opportunity to detect an error scenario earlier.

@Cyan4973 Cyan4973 deleted the ldm_dict branch March 29, 2023 23:44