Zstandard decompressor #14394
Conversation
The xxhash implementation could be split out into a separate PR if there is a desire to get that merged earlier, or just to separate the PRs/discussion.
Hi, thanks for working on this. Here is some early feedback:
I've exposed xxhash in
Are you saying that the plan should be to have users wrap functions via something like:

```zig
fn wrappedDecompress(dest: []u8, src: []const u8, verify: bool) !ReadWriteCount {
    @setRuntimeSafety(true);
    return decompressFrame(dest, src, verify);
}
```

if they want to compile in release-fast but not be susceptible to problems caused by bad input? The compressed input contains instructions for copying sections of previously decompressed data, and bad/malicious input could cause math under/overflow or buffer overruns. I was thinking that there should be additional checks and returned errors to catch this kind of malformed input; the current implementation relies on safe builds to crash on these malformed inputs.

I have added functionality to support streaming decompression (i.e. decoding blocks one at a time into a ring buffer) and a simple (allocating) wrapper utilising it that should support decompressing any Zstandard frame. The main building blocks are exposed, so users can design their own decompression functions.
Failing to handle malformed/malicious input safely is a bug. We absolutely need code to prevent overflow and memory errors caused by maliciously crafted input. One should never rely on Zig's safety checks triggering for the correctness of software; they exist to catch programming errors (such as failing to handle the full range of possible input while parsing). Perfect software should never trigger any safety checks, no matter what input it is given.
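For illustration, here is a minimal sketch of the kind of explicit validation being asked for. This is not the PR's actual code; `copyMatch` and `error.MalformedInput` are hypothetical names. Every bound derived from the input is checked up front, so malformed input becomes an error even in ReleaseFast, rather than a safety-check crash in Debug/ReleaseSafe.

```zig
const std = @import("std");

/// Copy `length` bytes from `offset` bytes behind the write position, the
/// way Zstandard sequence execution copies previously decompressed data.
/// All bounds are validated explicitly, so malicious input yields an error
/// instead of undefined behaviour in unsafe build modes.
fn copyMatch(dest: []u8, written: usize, offset: usize, length: usize) error{MalformedInput}!void {
    // An offset of zero or one pointing before the start of output is invalid.
    if (offset == 0 or offset > written) return error.MalformedInput;
    // Guard the addition itself against overflow before using the result.
    const end = std.math.add(usize, written, length) catch return error.MalformedInput;
    if (end > dest.len) return error.MalformedInput;
    var i: usize = 0;
    while (i < length) : (i += 1) {
        // Copy byte-by-byte: a match is allowed to overlap its own output.
        dest[written + i] = dest[written + i - offset];
    }
}
```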
Okay, I think this is ready, sans any changes people would like to see to the API first; the error sets might want some cleanup or to be made explicit in function signatures. Not sure why the CI is failing, but I don't think it's related to this PR.
Your changes have caused autodocs to crash. While this is a flaw in the autodoc system and not your code, it does need to be fixed before this can be merged. Perhaps @kristoff-it or @der-teufel-programming would be willing to help by taking a look and offering a bug fix or workaround to unblock these changes.
Hmm, odd - I guess there must have been a recent change to autodoc, as I think the CI passed a few days ago.
I will take a look at this PR and see if I can find and fix the problem, either here or in Autodoc.

Edit: #14456 was merged; rebase on latest master and everything should work smoothly. Thank you for your patience. Autodoc is still a work in progress and occasionally we find that we don't support new parts of the language yet.
Force-pushed from 0e61592 to 1501f5c
I had a look. I could change the ZStandard API (or add a new one) to work with readers directly, but it won't be possible to avoid internal allocations/copying.
I'd like to run some fuzz testing on this (a la #14500) but am a bit unsure how to implement the fuzzer. What should I be calling to do something like "try to decompress some arbitrary bytes"? Would it be something like:

```zig
// return on error or null, we're just looking for illegal behavior
const decompressed_size = (getFrameDecompressedSize(bytes) catch return) orelse return;
// skippable frame, just return
if (decompressed_size == 0) return;
var buf = try allocator.alloc(u8, decompressed_size);
const verify_checksum = false; // fuzzer would almost never generate a valid checksum I assume
_ = decodeZStandardFrame(buf, bytes, verify_checksum) catch return;
// maybe verify something about the return?
// should ReadWriteCount.write_count always match `decompressed_size`?
// should ReadWriteCount.read_count always match `bytes.len`?
```

I see there's also
Some fuzzing would be great! I was thinking about looking into it, but don't have experience fuzzing things. I guess there are a few approaches you could take. If it's hard to generate reasonable inputs for these major entry points, you could try fuzzing the inner parts of the API; I meant to design the API so that the core functionality can be used to implement more bespoke decoding systems. I would actually think it might be good to tell it to verify checksums (you can always catch that error).
Sounds good, thanks for the insight. I'll probably write a few different fuzzers. So far I've written an xxhash-specific fuzzer that compares the 32-bit and 64-bit hashes to the C implementation's hashes, and it's running cleanly so far.
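A differential xxhash fuzzer of that kind could look roughly like the sketch below. It assumes the PR exposes `std.hash.XxHash64.hash(seed, input)` and `std.hash.XxHash32.hash(seed, input)`, and that the harness links against the C xxhash library (`xxhash.h` provides `XXH64` and `XXH32`); the function name is illustrative.

```zig
const std = @import("std");
const c = @cImport(@cInclude("xxhash.h"));

// One fuzz iteration: hash the input with both implementations and compare.
fn fuzzOneInput(input: []const u8) !void {
    const seed64: u64 = 0;
    const zig64 = std.hash.XxHash64.hash(seed64, input);
    const c64 = c.XXH64(input.ptr, input.len, seed64);
    if (zig64 != c64) return error.HashMismatch;

    const seed32: u32 = 0;
    const zig32 = std.hash.XxHash32.hash(seed32, input);
    const c32 = c.XXH32(input.ptr, input.len, seed32);
    if (zig32 != c32) return error.HashMismatch;
}
```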
Wrote a basic fuzzer implementation. First, some general comments:
That said, I've currently mitigated the above by only running inputs that start with the magic number. Here are some preliminary results (minimized, but not fully de-duplicated, so it's very possible that many of these overlap in which crashes they trigger): zstandard-fuzzing-crashes-20230131.zip

The crashes can be reproduced with the following test code:

```zig
test "fuzzed input" {
    // This is the contents of 'id:000000,sig:06,src:000001,time:12,execs:122,op:havoc,rep:2'
    const input = "(\xb5/\xfd";
    const max_window_size = 256 * 1024 * 1024; // TODO: What is a reasonable value?
    const decompressed = try std.compress.zstandard.decompress.decodeZStandardFrameAlloc(std.testing.allocator, input, false, max_window_size);
    defer std.testing.allocator.free(decompressed);
}
```

Most of them seem to be caused by index-out-of-bounds and things like that, which, as mentioned by @ifreund in #14394 (comment), should be handled gracefully without ever invoking illegal behavior for any input.
Force-pushed from f0cdd61 to ea82ec2
Some API improvements have landed. There are now functions for handling input from readers, along with the high level `DecompressStream`.

@squeek502 Thanks for the feedback and crash cases; those crashes are now fixed. Doc comments have been updated to indicate that the first four bytes of the input must be the Zstandard frame magic number. There is now
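A hypothetical usage sketch of the reader-based API mentioned above, assuming a `decompressStream(allocator, reader)` entry point (per the rename discussed later in this thread) and an `example.zst` test fixture; both are assumptions, not confirmed signatures.

```zig
const std = @import("std");

test "decompress a frame via a reader" {
    const compressed: []const u8 = @embedFile("example.zst"); // hypothetical fixture
    var fbs = std.io.fixedBufferStream(compressed);
    // Wrap any reader in a decompressing stream.
    var stream = std.compress.zstandard.decompressStream(std.testing.allocator, fbs.reader());
    defer stream.deinit();
    const decompressed = try stream.reader().readAllAlloc(std.testing.allocator, std.math.maxInt(usize));
    defer std.testing.allocator.free(decompressed);
}
```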
Nice! Here's a new set of crashing inputs: zstandard-fuzzing-crashes-20230202.zip. With this set, each input should trigger a unique crash.
Fixed - this fuzzing business feels like cheating for finding bugs.
New set (the number in each set is getting smaller 👍): zstandard-fuzzing-crashes-20230202.1.zip
For things like this that it's well suited for, it pretty much is. At some point I might try setting up a fuzzer that compares the result of decompression to the zstandard reference implementation, and then the real cheat codes for finding bugs will be unlocked (since then we can find correctness bugs, too).
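A differential fuzzer along those lines might look like this sketch. It assumes linking against the reference libzstd (`zstd.h` provides `ZSTD_decompress` and `ZSTD_isError`) and reuses the `decodeZStandardFrameAlloc` call from earlier in the thread; the fixed reference buffer size is an arbitrary assumption.

```zig
const std = @import("std");
const c = @cImport(@cInclude("zstd.h")); // reference implementation

fn fuzzDifferential(allocator: std.mem.Allocator, input: []const u8) !void {
    const max_window_size = 256 * 1024 * 1024;

    // Decompress with the Zig implementation; treat any error as "rejected".
    const zig_out: ?[]u8 = std.compress.zstandard.decompress.decodeZStandardFrameAlloc(
        allocator,
        input,
        false,
        max_window_size,
    ) catch null;
    defer if (zig_out) |buf| allocator.free(buf);

    // Decompress with the reference implementation into a fixed buffer.
    var ref_buf: [1 << 20]u8 = undefined;
    const n = c.ZSTD_decompress(&ref_buf, ref_buf.len, input.ptr, input.len);
    const ref_out: ?[]const u8 = if (c.ZSTD_isError(n) != 0) null else ref_buf[0..n];

    // The implementations should agree on both accept/reject and output bytes.
    if (zig_out == null and ref_out == null) return;
    if (zig_out == null or ref_out == null) return error.Mismatch;
    if (!std.mem.eql(u8, zig_out.?, ref_out.?)) return error.Mismatch;
}
```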
Fixed those crashes.

The reference implementation's repo has a tool
Done
Will do some last-minute fuzzing just to confirm that we're good to go on that front, but here's a quick summary of what's happened with regard to the zstandard upstream:
I was just thinking it would be nice to have a summary like this.

Wow, great work, you two. I didn't realize you had discovered UB upstream.
Force-pushed from c6a89d1 to cbb8066
```zig
pub fn DecompressStream(
    comptime ReaderType: type,
    comptime options: DecompressStreamOptions,
) type {
```
The gzip, lzma, and xz implementations all expose the same `std.compress.foo.Decompress` signature:

```zig
pub fn Decompress(comptime ReaderType: type) type {
```

This is leveraged in commit d94613c, for example, to make some code generic over the decompression algorithm.

Is it strictly necessary to have the options comptime-known for the zstd implementation? If not, I think it would be better to conform to the existing interface and instead pass the options at runtime to the `init()` function. This would also reduce generic code bloat in some cases.

Furthermore, `std.compress.zstd` feels more consistent with the current members of the `std.compress` namespace to me:

```zig
pub const deflate = @import("compress/deflate.zig");
pub const gzip = @import("compress/gzip.zig");
pub const lzma = @import("compress/lzma.zig");
pub const lzma2 = @import("compress/lzma2.zig");
pub const xz = @import("compress/xz.zig");
pub const zlib = @import("compress/zlib.zig");
```
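As a hypothetical illustration of what that shared shape buys, code like the following can decompress with any conforming namespace; the availability of `decompress`, `deinit`, and `reader` on each stream is assumed to match the gzip/lzma/xz pattern described above.

```zig
const std = @import("std");

// Fully decompress a stream using any std.compress namespace that exposes
// decompress(allocator, reader) returning a stream with reader()/deinit().
fn readAllDecompressed(
    comptime namespace: type,
    allocator: std.mem.Allocator,
    reader: anytype,
) ![]u8 {
    var stream = try namespace.decompress(allocator, reader);
    defer stream.deinit();
    return stream.reader().readAllAlloc(allocator, std.math.maxInt(usize));
}
```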
I agree that `std.compress.zstd` is probably a better name to go with. As for the `std.compress.foo.Decompress` signature, the package manager is only using the `std.compress.foo.decompress()` function, which isn't affected by the comptime parameters, but it will probably require at least the `window_size_max` comptime parameter to `DecompressStream` to match this API.

It is not strictly necessary to have the options be comptime; the `verify_checksum` option being comptime just allows avoiding some work. The main thing is that not having them be comptime will either require us to hard-code the values (for `max_window_size` this is not a good solution, as someone may have Zstandard frames they cannot decode because of it), or some other API will have to diverge from the others in `std.compress`, most likely the `init()` function as you mentioned. That is a more intrusive divergence from the other decompression algorithms, which I don't think will really allow for code that is generic across them. In normal usage you can just use the default comptime parameters to get an API that matches that of the other decompression algorithms.

I have a branch that adds Zstandard support to the package manager generically, just like gzip and xz; the first commit on the branch renames all the `Decompress`/`decompress` functions to `DecompressStream`/`decompressStream`, which I think is a better name. As I mentioned earlier, if we really want to match the APIs we may need a separate issue to decide on one that works well for all of them (e.g. the Zstandard `init()` function has no need to return an error union, which the others currently do, though I don't know how necessary this is), and I don't think `Decompress`/`decompress` is a great name, especially when you have non-streaming decompression functions available.
FWIW, this is how the lzma decompressor currently handles options at runtime:

```zig
pub const decode = @import("lzma/decode.zig");

pub fn decompress(
    allocator: Allocator,
    reader: anytype,
) !Decompress(@TypeOf(reader)) {
    return decompressWithOptions(allocator, reader, .{});
}

pub fn decompressWithOptions(
    allocator: Allocator,
    reader: anytype,
    options: decode.Options,
) !Decompress(@TypeOf(reader)) {
    const params = try decode.Params.readHeader(reader, options);
    return Decompress(@TypeOf(reader)).init(allocator, reader, params, options.memlimit);
}

pub fn Decompress(comptime ReaderType: type) type {
    ...
```

It seems to me that the zstd implementation could use the same pattern here, if I understand correctly that the options aren't strictly required to be comptime-known.
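A sketch of what that pattern might look like for zstd, reusing the `DecompressStreamOptions` name from this PR; the `init` shape and the default values are assumptions for illustration, not the PR's actual code.

```zig
pub const DecompressStreamOptions = struct {
    verify_checksum: bool = true,
    window_size_max: usize = 1 << 23, // default chosen for illustration
};

pub fn decompressStream(
    allocator: Allocator,
    reader: anytype,
) DecompressStream(@TypeOf(reader)) {
    return decompressStreamWithOptions(allocator, reader, .{});
}

pub fn decompressStreamWithOptions(
    allocator: Allocator,
    reader: anytype,
    options: DecompressStreamOptions,
) DecompressStream(@TypeOf(reader)) {
    // Options become runtime fields of the stream instead of comptime
    // parameters of the type, matching the lzma shape above.
    return DecompressStream(@TypeOf(reader)).init(allocator, reader, options);
}

pub fn DecompressStream(comptime ReaderType: type) type {
    ...
```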
```zig
const Allocator = @import("std").mem.Allocator;
const assert = @import("std").debug.assert;

const RingBuffer = @This();
```
This API has quite a bit of overlap with `std.fifo.LinearFifo()`; I'm not sure we should have both in the standard library.
There is quite a bit of overlap (I wasn't aware of `LinearFifo`). However, if I'm reading correctly, `LinearFifo` as it currently is isn't designed to be used the way the ring buffer is used, so I'm not sure it would be a good option to adapt/extend `LinearFifo` and make use of it. If it would be confusing to have both, I can take it back out of the `std` namespace, or we can leave it pending the pre-1.0 stdlib review. Thoughts, @andrewrk?
To add to this discussion, it looks like the lzma implementation has its own `LzCircularBuffer` type as well, which is also a somewhat specialized ring buffer.
Then maybe `LzCircularBuffer` should be replaced by this `RingBuffer`, too.
Opened #19231 to track this
To follow up on this, nothing new was found. All the findings were things we've seen before, with known reasons for the behavior differing from the reference implementation.
Follow-up issue: |
This is an implementation of a Zstandard decompressor, intended to close #14183.
There are some things that should be done before this is mergeable:
If any of the above points shouldn't block merging an initial implementation and can be moved to follow-up issues, or if there are more issues I haven't listed above, let me know (e.g. the API might need to be cleaned up).
The current implementation of `std.compress.zstandard.decode()` avoids allocating, which is achieved by decompressing literals on the fly directly into the output buffer, but requires frames to declare their decompressed content size. It might be the case that decompressing literals ahead of time is faster (this could use 4 threads in most cases), though this would require a separate buffer, which could be up to 128KB.

Notes on items in the checklist:
The safety checks for math operations/casts/slice access are probably required for release-fast and release-small mode, as these disable checks that would make decompression safe (by crashing) in safe build modes; without manual checks or `@setRuntimeSafety(true)`, malicious/malformed data might cause undefined behaviour. Everything should have appropriate checks now (all math overflow and slice access issues fuzzing found so far have been fixed).

I'm a little unsure at the moment on the best way to implement streaming decompression. My initial plan for this is to decompress sequences into a ring buffer that has capacity equal to the frame window size, while clobbering any existing data; this would require the user to read at least `block_maximum_size` bytes before continuing decompression (to be sure unread data won't be clobbered), which is the whole window unless the window is bigger than 128KB. If anyone has ideas I'm all ears.
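To make that plan concrete, here is a minimal illustrative clobbering ring buffer (names are hypothetical, not the PR's `RingBuffer`): writes wrap around and overwrite the oldest bytes, which is why the caller must consume pending output before decompression continues.

```zig
/// Illustrative sketch of a clobbering ring buffer for sequence execution.
const ClobberingRing = struct {
    data: []u8, // capacity == frame window size
    write_index: usize = 0,

    fn writeByte(self: *ClobberingRing, byte: u8) void {
        // Unconditionally overwrite: the caller must have read this byte
        // already, or it is lost.
        self.data[self.write_index] = byte;
        self.write_index = (self.write_index + 1) % self.data.len;
    }

    /// Execute a match: copy `length` bytes starting `offset` bytes behind
    /// the write head, one byte at a time so overlapping matches work.
    fn copyMatch(self: *ClobberingRing, offset: usize, length: usize) void {
        var read_index = (self.write_index + self.data.len - offset) % self.data.len;
        var i: usize = 0;
        while (i < length) : (i += 1) {
            self.writeByte(self.data[read_index]);
            read_index = (read_index + 1) % self.data.len;
        }
    }
};
```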