Documentation: better explanations on compressor behaviour, compression levels and parameters are welcome #3698

Closed
zougloub opened this issue Jul 14, 2023 · 4 comments

@zougloub

When my kids are going to need to compress stuff, I will tell them to use zstd of course, and I will probably tell them to RTFM.
Now I wanted to double-check the zstd man page so as to be sure that the documentation will be straightforward, and I have some comments. Below I will sometimes use the word "should" but keep in mind it's all suggestions.

In the zstd executable man page, the introductory DESCRIPTION is missing some high-level information:

  • Fast modes are mentioned, and the strong modes' excellent compression ratios are mentioned, but maybe something should also be said about zstd's greatness on the metric of end-to-end latency. We all know it's zstd's strongest selling point, but for some reason it's written nowhere.
  • That decompression speed is preserved and remains roughly the same at all settings is obvious with background experience, but may not be to newcomers; this is written on the website, but should be written prominently right after decompression speed is mentioned.
  • Now, for all the kids reading this a few years down the line, in order to be timeless, maybe the documentation could include a script that patches the man page to include current performance numbers, or maybe measure the compression performance in instructions per uncompressed byte (I mean, with default options I get ~35 on ARMv8 or x86_64 and ~100 on ARMv6)...

Later the man page first mentions compression levels:

   Operation Modifiers
       ○   -#: selects # compression level [1-19] (default: 3)

One may find this a bit terse. I mean, there's not even a terminating period on this line.
And this bullet point in the manual, I think, should be augmented with a concise sentence or paragraph, mentioning:

  • that the compression level knob is a quick way to adjust the trade-off between compression effort (CPU time, memory) and compression ratio, and that, generally but not always, when the level is increased, the effort is increased and the data will be compressed tighter (but probably with diminishing returns as the level increases).
  • that the compression levels have been fixed in order to obtain an exponential progression in effort (is that actually the case?)
  • that the compression levels each correspond to a set of important advanced compression parameters, referring to the ADVANCED COMPRESSION OPTIONS § for more information on compression behavior.
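
To make the relationship between a level and its underlying parameters concrete, here is a minimal sketch of how one might build a command line that picks a level and then overrides some of its parameters through the `--zstd=` option. The helper name `zstd_command` is hypothetical, and the parameter names used in the usage note (`strategy`, `windowLog`) are taken from the man page's description of `--zstd=`; check them against your installed version before relying on them.

```python
def zstd_command(level, overrides=None):
    """Build a zstd argv list: a compression level plus optional --zstd= overrides.

    `overrides` maps advanced parameter names to integer values.
    This only builds the command; it does not run zstd.
    """
    if not 1 <= level <= 22:
        raise ValueError("regular compression levels range from 1 to 22")
    argv = ["zstd"]
    if level > 19:
        # Levels above 19 are only accepted together with --ultra.
        argv.append("--ultra")
    argv.append(f"-{level}")
    if overrides:
        joined = ",".join(f"{name}={value}" for name, value in overrides.items())
        argv.append(f"--zstd={joined}")
    return argv


if __name__ == "__main__":
    # Level 3 defaults, but with two parameters overridden by hand.
    print(" ".join(zstd_command(3, {"strategy": 2, "windowLog": 20})))
```

For example, `zstd_command(3, {"strategy": 2, "windowLog": 20})` yields `zstd -3 --zstd=strategy=2,windowLog=20`: the level supplies the starting parameter set, and the `--zstd=` part replaces only the parameters it names.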

Then, later, the manual has ADVANCED COMPRESSION OPTIONS, which currently says:

ADVANCED COMPRESSION OPTIONS
       ###  -B#:  Specify  the  size  of  each  compression  job. This parameter is only available when
       multi-threading is enabled. Each compression job is run in parallel, so  this  value  indirectly
       impacts the nb of active threads. Default job size varies depending on compression level (gener‐
       ally 4 * windowSize). -B# makes it possible to manually select a custom size. Note that job size
       must  respect a minimum value which is enforced transparently. This minimum is either 512 KB, or
       overlapSize, whichever is largest. Different job sizes will  lead  to  non-identical  compressed
       frames.

There must be a typo here, and I think that:

  • The following should be an introductory paragraph right after the section title: "zstd provides 22 predefined regular compression levels plus the fast levels. A compression level is translated internally into a number of specific parameters that actually control the behavior of the compressor. (You can see the result of this translation with --show-default-cparams.) These specific parameters can be overridden with advanced compression options."
  • The --zstd=options item should be the first entry in the section; it's arguably more important than -B.
  • I also think it would be nice to say that some "weaker" strategies may give better results than stronger ones. For example, here is what happens when we compress "\n".join(str(i) for i in range(1<<20)).encode() (7277497 bytes uncompressed):

    level   compressed size (bytes)   compression time
        0                    329820              0.073
        1                   2880396              0.032
        2                    727942              0.058
        3                    329820              0.074
        4                    273090              0.076
        5                    339937              0.063
        6                    307160              0.147
        7                    317692              0.180
        8                    523833              0.248
        9                    476452              0.242
       10                    444441              0.373
       11                    460774              0.541
       12                    496534              0.608
       13                    463175              0.336
       14                    463175              0.378
       15                    513365              0.414
       16                    507389              1.133
       17                    428186              1.921
       18                    339902              2.762
       19                    442419              3.857
       20                    442419              4.141
       21                    442419              4.026
       22                    442409              5.560

We can see that the compression time may decrease, and/or the compression ratio may decrease, as the compression level is raised.
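
For anyone wanting to repeat this experiment, here is a minimal sketch that rebuilds the sample from the expression quoted above and pipes it through the `zstd` command-line tool at a few levels. The helper names `make_sample` and `compressed_size` are hypothetical, and the sketch assumes a `zstd` binary on the PATH (it returns `None` otherwise). Sizes obtained this way may differ from the table, since the zstd version differs and compressing from stdin is not exactly the same as compressing a file.

```python
import shutil
import subprocess


def make_sample():
    """Rebuild the sample from the issue: the integers 0..2**20-1, one per line."""
    return "\n".join(str(i) for i in range(1 << 20)).encode()


def compressed_size(data, level):
    """Compress `data` with the zstd CLI at `level`; return the output size in bytes.

    Returns None when no zstd binary is found on the PATH.
    """
    zstd = shutil.which("zstd")
    if zstd is None:
        return None
    argv = [zstd, "-q", "-c"]
    if level > 19:
        argv.append("--ultra")  # needed for levels 20 to 22
    argv.append(f"-{level}")
    done = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
    return len(done.stdout)


if __name__ == "__main__":
    sample = make_sample()
    print(len(sample), "bytes uncompressed")
    for level in range(1, 9):
        print(level, compressed_size(sample, level))
```

The loop stops at level 8 only to keep the run short; widening the range to 1..22 covers the whole table above.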

The DICTIONARY BUILDER and BENCHMARK sections should be moved after the compression options.

The BENCHMARK section should feature an introductory statement, such as "the zstd CLI provides a benchmarking mode that can be used to easily find suitable compression parameters, or alternatively to benchmark a computer's performance". It could also be useful to state that benchmarking for the purpose of finding compression options should be performed on a representative data set.
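
As an illustration of what such an introduction could point to, here is a minimal sketch that builds (and optionally runs) a benchmark over a range of levels on a set of representative files. The helper names are hypothetical; the flags used are `-b#` (benchmark starting at level #), `-e#` (end level of the range) and `-i#` (minimum evaluation time in seconds) as described in the man page's BENCHMARK section, so check them against your installed version.

```python
import shutil
import subprocess


def benchmark_command(files, first_level=1, last_level=19, min_seconds=1):
    """Return an argv list for zstd's benchmark mode over a range of levels."""
    if not files:
        raise ValueError("benchmarking needs at least one representative file")
    if first_level > last_level:
        raise ValueError("first_level must not exceed last_level")
    return ["zstd", f"-b{first_level}", f"-e{last_level}", f"-i{min_seconds}", *files]


def run_benchmark(files, **kwargs):
    """Run the benchmark and return its text output, or None if zstd is not installed."""
    if shutil.which("zstd") is None:
        return None
    done = subprocess.run(benchmark_command(files, **kwargs),
                          capture_output=True, text=True, check=True)
    # Collect both streams, since the sketch does not assume which one zstd reports on.
    return done.stdout + done.stderr


if __name__ == "__main__":
    print(" ".join(benchmark_command(["sample.bin"])))
```

The file name `sample.bin` is a placeholder; the point made above is that the files passed here should resemble the data that will actually be compressed.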

SEE ALSO should point to the zstd manual, which should be installed with zstd, and maybe to the website, since later some other websites are mentioned.

@Chaython

Chaython commented Sep 7, 2023

In your test it seems level 6 is most compressed
I wonder how often that is reproducible? How many passes did you run? Singular?
I wonder if it varies more based on the content of the archive.

@Cyan4973
Contributor

Cyan4973 commented Mar 4, 2024

For reference, here is the current performance on the suggested sample:

 1#issue3698.txt     :   7277497 ->   2880400 (x2.527),  
 2#issue3698.txt     :   7277497 ->    718164 (x10.13),  
 3#issue3698.txt     :   7277497 ->    329820 (x22.07),  
 4#issue3698.txt     :   7277497 ->    273090 (x26.65),  
 5#issue3698.txt     :   7277497 ->    339937 (x21.41),  
 6#issue3698.txt     :   7277497 ->    307160 (x23.69),  
 7#issue3698.txt     :   7277497 ->    317692 (x22.91),   
 8#issue3698.txt     :   7277497 ->    523833 (x13.89),   
 9#issue3698.txt     :   7277497 ->    476452 (x15.27),   
10#issue3698.txt     :   7277497 ->    444441 (x16.37),   
11#issue3698.txt     :   7277497 ->    460774 (x15.79),   
12#issue3698.txt     :   7277497 ->    496534 (x14.66),  
13#issue3698.txt     :   7277497 ->    463175 (x15.71),   
14#issue3698.txt     :   7277497 ->    463175 (x15.71),   
15#issue3698.txt     :   7277497 ->    513365 (x14.18),   
16#issue3698.txt     :   7277497 ->    507389 (x14.34),   
17#issue3698.txt     :   7277497 ->    450909 (x16.14),   
18#issue3698.txt     :   7277497 ->    212426 (x34.26),  
19#issue3698.txt     :   7277497 ->    219162 (x33.21),   
20#issue3698.txt     :   7277497 ->    219162 (x33.21),   
21#issue3698.txt     :   7277497 ->    219162 (x33.21),   
22#issue3698.txt     :   7277497 ->    210478 (x34.58),   

Compression performance is still all over the place across most of the range,
with notably fast level 4 offering incredibly good performance,
but at least higher compression levels (18+) now perform best, instead of worse.
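
To check this kind of listing mechanically, here is a minimal sketch that parses result lines in the format shown above and reports where a higher level produced a larger output than the level before it. The function names are hypothetical, the regular expression simply mirrors the lines quoted in this comment (the output format is not assumed to be stable across versions), and `SAMPLE` holds a handful of those lines copied verbatim.

```python
import re

# Matches lines such as " 4#issue3698.txt     :   7277497 ->    273090 (x26.65),"
LINE = re.compile(r"^\s*(\d+)#\S+\s*:\s*(\d+)\s*->\s*(\d+)\s*\(x([\d.]+)\)")

SAMPLE = """\
 3#issue3698.txt     :   7277497 ->    329820 (x22.07),
 4#issue3698.txt     :   7277497 ->    273090 (x26.65),
 5#issue3698.txt     :   7277497 ->    339937 (x21.41),
18#issue3698.txt     :   7277497 ->    212426 (x34.26),
19#issue3698.txt     :   7277497 ->    219162 (x33.21),
"""


def parse_results(text):
    """Return {level: compressed_size_in_bytes} for every line that matches."""
    results = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            results[int(match.group(1))] = int(match.group(3))
    return results


def regressions(results):
    """List (lower, higher) pairs of consecutive listed levels where the higher one is larger."""
    levels = sorted(results)
    return [(a, b) for a, b in zip(levels, levels[1:]) if results[b] > results[a]]


if __name__ == "__main__":
    parsed = parse_results(SAMPLE)
    print("best level:", min(parsed, key=parsed.get))
    print("regressions:", regressions(parsed))
```

On the excerpt in `SAMPLE`, level 18 gives the smallest output and the regressions are 4 to 5 and 18 to 19; feeding in the full listing gives the complete picture.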

@Cyan4973 Cyan4973 self-assigned this Mar 12, 2024
Cyan4973 added a commit that referenced this issue Mar 12, 2024
following recommendations by @zougloub at #3698
@Cyan4973
Contributor

These are great recommendations @zougloub !

They have been employed to refactor the documentation at #3958 .

@Cyan4973
Contributor

documentation updated

hswong3i pushed a commit to alvistack/facebook-zstd that referenced this issue Mar 27, 2024
following recommendations by @zougloub at facebook#3698
hswong3i pushed a commit to alvistack/facebook-zstd that referenced this issue Jan 5, 2025
following recommendations by @zougloub at facebook#3698