[zstd_stream] Don't block in reader.Read if a zstd block is available #96
Conversation
It could be interesting to run some benchmarks to see if that impacts performance and, if so, decide whether that should be the default or flag-dependent.
Force-pushed from cdb27d7 to 67c6b8e
Benchmarks:
After:
No significant change for compression (as expected), ~0.7% slowdown for decompression. IMO no need to add a flag for that; the performance difference is acceptable.
Force-pushed from 67c6b8e to 25758a4
Ready for review @Viq111
Gentle ping 😃
Hi! Sorry, I dropped the ball on this one; that's totally on me.
Thanks for the PR! It definitely makes a lot of sense.
Rereading the stream decompression path, I think we can handle and document all the EOF cases a bit better.
	return 0, fmt.Errorf("failed to read from underlying reader: %s", err)
}
if n == 0 {
	return 0, io.EOF
So I think you are correct that if r.decompSize == len(r.decompressionBuffer), then zstd should output a decompressed block without additional input (except if its estimation was not correct, but in that case the next loop will ask for more data).
But I don't think the opposite holds: r.decompSize < len(r.decompressionBuffer) does not mean that no output will be produced without additional input. From the documentation (https://github.com/DataDog/zstd/blob/1.x/zstd.h#L777), a previous input could actually produce many output blocks even if no additional input is given (and retCode is only a hint), so I think it would be safer to defer returning io.EOF until after the actual call to C zstd, if err == EOF and compressionLeft == 0 (no compressed input left).
You might have more background than me on non-EOFed pipes, but my understanding is that we should get (and return) EOF only when the pipe will never provide data again.
I'm initially thinking of changing the condition to:
remainingOut := r.decompSize - r.decompOff
if err == io.EOF && r.compressionLeft == 0 && remainingOut == 0 {
My understanding is that if r.decompSize < len(r.decompressionBuffer), i.e. if output.pos < output.size, the decoder flushed all it could (see https://github.com/DataDog/zstd/blob/1.x/zstd.h#L774); in other words, the decoder needs more input in order to output anything else.
The line you linked seems to refer to a special condition when len(r.decompressionBuffer) > ZSTD_BLOCKSIZE_MAX. However, this will never be the case, because the decompression buffer is allocated by a pool to a size of ZSTD_DStreamOutSize, which is equal to ZSTD_BLOCKSIZE_MAX. (That is the purpose of using ZSTD_DStreamOutSize().)
So the condition looks correct to me as is. What do you think?
Indeed! I hadn't read that ZSTD_DStreamOutSize == ZSTD_BLOCKSIZE_MAX, but yes, you are correct.
err = errShortRead
if r.decompOff > 0 {
	return r.decompOff, nil
}
Thinking also about an input stream that got accidentally cut off (so it starts returning EOF but we still have some partial zstd data), we could return an io.ErrUnexpectedEOF: https://golang.org/pkg/io/#pkg-variables
Functions should return EOF only to signal a graceful end of input. If the EOF occurs unexpectedly in a structured data stream, the appropriate error is either ErrUnexpectedEOF or some other error giving more detail.
Added some code & comment about that. https://github.com/DataDog/zstd/pull/96/files#diff-ff3e40bda515a4af0936733c5415920eb09cc903913a7243bdec1074266784b0R470 Thoughts?
For benchmarking, I think those changes are immaterial, so I would not worry about them.
Gentle ping 😀
reader.Read used to try to fully read an internal buffer until EOF or
the buffer was filled. That buffer was set to ZSTD_DStreamInSize,
which is larger than any zstd block.
This means that reader.Read could try to buffer much more data than
what was needed to process and return a single block from the Read
method.
This was an issue because we could miss an urgent Flush from a
corresponding Writer by blocking. (A typical use case is instant
messaging.) It was also against the general convention of io.Read that a
single call should return any immediately available data without
blocking, if any.
Interestingly enough, the test case should have caught this, but
because we used a bytes.Buffer, the Read returned io.EOF after reading
the entirety of the buffer, even if we appended to the buffer later on.
The test case is also fixed by this commit.
Fixes: #95