
Streaming Decompressor v3 #58

Closed

Conversation

cccs-sadugas
Contributor

Pull Request Overview

Changes since last PR:

  • Added a feature gate for "stream".
  • Removed the code duplication caused by the _check() functions, using an update flag instead.
  • Merged the stream test module with lzma to remove duplication.
  • Changed the tmp and tmp_len buffers to use a Cursor.
  • Addressed other code review comments.

Testing Strategy

This pull request was tested by:

  • Adding relevant unit tests.
  • Adding relevant end-to-end tests (such as .lzma, .lzma2, and .xz files).

Supporting Documentation and References

Link to previous PR: #55

@cccs-sadugas cccs-sadugas force-pushed the streaming-decompressor-v3 branch from 333433f to c4b36d6 Compare August 10, 2020 20:33
@cccs-sadugas cccs-sadugas mentioned this pull request Aug 10, 2020
@cccs-sadugas cccs-sadugas force-pushed the streaming-decompressor-v3 branch 2 times, most recently from c2a4d53 to 693d572 Compare August 11, 2020 14:06
Create a new struct `Stream` that uses the `std::io::Write` interface
to read chunks of compressed data and write them to an output sink.

Add a streaming mode so processing can work with streaming chunks of
data. This is required because process() assumed the input reader
contained a complete stream.

Add update flags and try_process_next() to handle the case where the
decompressor requests more input bytes than are available. Data is
temporarily buffered in the DecoderState when more input bytes are
required to make progress.

This commit also adds utility functions to the rangecoder for working
with streaming data.

Add an allow_incomplete option to disable end-of-stream checks when
calling `finish()` on a stream, since some users may want to retrieve
partially decompressed data.
@cccs-sadugas cccs-sadugas force-pushed the streaming-decompressor-v3 branch from 693d572 to 00b4e3b Compare August 11, 2020 14:43
@cccs-sadugas
Contributor Author

Benchmarks are:

master

test compress_65536                  ... bench:   3,136,090 ns/iter (+/- 243,071)
test compress_empty                  ... bench:       1,473 ns/iter (+/- 170)
test compress_hello                  ... bench:       2,149 ns/iter (+/- 215)
test decompress_after_compress_65536 ... bench:   3,846,980 ns/iter (+/- 554,928)
test decompress_after_compress_empty ... bench:       3,412 ns/iter (+/- 396)
test decompress_after_compress_hello ... bench:       4,375 ns/iter (+/- 514)
test decompress_big_file             ... bench:   7,479,885 ns/iter (+/- 855,966)
test decompress_huge_dict            ... bench:       4,333 ns/iter (+/- 685)

streaming-decompressor-v3

test compress_65536                  ... bench:   3,196,185 ns/iter (+/- 307,679)
test compress_empty                  ... bench:       1,418 ns/iter (+/- 161)
test compress_hello                  ... bench:       2,113 ns/iter (+/- 416)
test decompress_after_compress_65536 ... bench:   3,796,298 ns/iter (+/- 392,350)
test decompress_after_compress_empty ... bench:       3,325 ns/iter (+/- 370)
test decompress_after_compress_hello ... bench:       4,329 ns/iter (+/- 663)
test decompress_big_file             ... bench:   7,563,810 ns/iter (+/- 792,639)
test decompress_huge_dict            ... bench:       4,271 ns/iter (+/- 456)
test decompress_stream_big_file      ... bench:   7,194,940 ns/iter (+/- 647,558)

@cccs-sadugas
Contributor Author

@gendx do you have an update on when you will be able to review this PR?

Owner

@gendx gendx left a comment


Hello,

Sorry again for the delay in reviewing; a lot of things came up that left me little time over the past couple of months, and this change is non-trivial to review.

I think we're getting there, most comments are for minor changes.

if let Err(error) = stream.write_all(&compressed) {
    // WriteZero could indicate that the unpacked_size was reached before the
    // end of the stream
    if std::io::ErrorKind::WriteZero != error.kind() {
Owner

Nit: error.kind() != std::io::ErrorKind::WriteZero is more readable.

{
    let mut stream = lzma_rs::decompress::Stream::new_with_options(decode_options, Vec::new());

    if let Err(error) = stream.write_all(&compressed) {
Owner

Why is there an error check here but not in round_trip_no_options?

Contributor Author

You are correct; a simple .unwrap() will do. The check was there because a test case would previously fail with WriteZero for some Options when unpacked_size was reached.

Contributor Author

I take that back. Running the tests with --all-features shows that unpacked_size_write_none_to_header_and_use_provided_on_read can return a WriteZero io error:

thread 'unpacked_size_write_none_to_header_and_use_provided_on_read' panicked
  at 'called `Result::unwrap()` on an `Err` value:
  Custom { kind: WriteZero, error: "failed to write whole buffer" }', tests/lzma.rs:88:39

This is expected behavior. We are encoding 0xFFFF_FFFF_FFFF in the header but using the provided unpacked size during decoding. Therefore decoding will stop before the 5 to 6 byte end-of-stream marker is read.

The check is here but not in round_trip_no_options because it only applies when certain options are provided like in unpacked_size_write_none_to_header_and_use_provided_on_read. The round_trip_no_options function only uses default options.

I added more documentation and a match statement to make this more clear.

Comment on lines +115 to +120
let mut input = Vec::new();
std::fs::File::open(compfile)
    .unwrap()
    .read_to_end(&mut input)
    .unwrap();
input
Owner

Since this pattern appears twice already in this function, it would make sense to create a utility function for it.

stream.write_all(&tmp[0..n]).unwrap();

n > 0
} {}
Owner

This logic is not readable. Can you use if together with break for the exit condition?

assert_eq!(decomp, b"")
assert_decomp_eq(
b"\x5d\x00\x00\x80\x00\xff\xff\xff\xff\xff\xff\xff\xff\x00\x83\xff\
\xfb\xff\xff\xc0\x00\x00\x00",
Owner

Nit: indentation should better align here.

Contributor Author

The rust-analyzer VS Code plugin displays inlay hints, and this was offsetting the first line in my IDE. Thankfully this can be disabled :).

let tmp = *self.tmp.get_ref();

// Check if we need more data to advance the decompressor
if Mode::Run == mode
Owner

Same comment about if variable == Constant being more readable than the reverse. There are a few more in the code.

break;
};
} else {
if (Mode::Run == mode) && (rangecoder.remaining()? < MAX_REQUIRED_INPUT) {
Owner

Same comment about if variable == Constant being more readable than the reverse. There are a few more in the code.

break;
};
} else {
if (Mode::Run == mode) && (rangecoder.remaining()? < MAX_REQUIRED_INPUT) {
Owner

The parentheses are not necessary.

Comment on lines +485 to +500
let range = rangecoder.range();
let code = rangecoder.code();
let buf = rangecoder.buf()?;

if self.try_process_next(buf, range, code).is_err() {
    let bytes_read = rangecoder.read_into(&mut self.tmp.get_mut()[..])?;
    let bytes_read = if bytes_read < std::u64::MAX as usize {
        bytes_read as u64
    } else {
        return Err(error::Error::LZMAError(
            "Failed to convert integer to u64.".to_string(),
        ));
    };
    self.tmp.set_position(bytes_read);
    return Ok(());
}
Owner

This code looks similar to the above. Can you separate it into a utility function to avoid code duplication?

@@ -0,0 +1,510 @@
use crate::decode::lzbuffer::{LZBuffer, LZCircularBuffer};
Owner

Note: I didn't look in detail at this file (yet). But it is restricted to the stream feature.

@cccs-sadugas cccs-sadugas mentioned this pull request Oct 5, 2020
@cccs-sadugas
Contributor Author

Thanks for the review :). I published a new PR #59 with the requested changes.

I had some issues with the suggested changes for the RangeDecoder::{from_parts, set} functions so those are still up for discussion.
