AWS / network store InputStream issues #15
-
Many thanks to Chris Slater. In the beginning I just wanted to find a general solution for reading from remote systems. My thinking was that if problems arise in such a case, they should be cleaned up as early as possible, at the beginning of the chain. In any case, an AWS-specific solution should not be built in at the end of the chain, i.e. in the compressor. My thought was: "What about the next remote system XXX? Should the next XXX-specific solution then also be installed in the compressor?" Unfortunately, I didn't have the time to really look into this at the beginning; hence the approach with an InputStreamCreatorStrategy in the FileSystemStore. But now I have taken the time to deal with it properly. Thanks for making me think about it again.
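To make the "beginning of the chain" idea concrete, here is a minimal sketch, assuming the store is free to buffer a whole chunk; this is not the actual code from the S3_AWS branch. The remote object is materialized once when the store reads it, so every downstream consumer, including the compressors, sees an in-memory stream with a working available() and no short reads before the end of the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class RemoteStreams {

    // Copies the remote stream into memory once, at the start of the chain.
    // The returned ByteArrayInputStream reports available() correctly and
    // returns all requested bytes until the end of the data, so downstream
    // code such as the compressors needs no network-specific handling.
    static InputStream materialize(InputStream remote) throws IOException {
        try (InputStream in = remote) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) >= 0) {
                buffer.write(chunk, 0, n);
            }
            return new ByteArrayInputStream(buffer.toByteArray());
        }
    }
}
```

The trade-off is memory: a whole compressed chunk is buffered before decompression, which is usually acceptable for typical Zarr chunk sizes.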
-
This discussion started with @SabineEmbacher in a pull request; moving it here.
Some background:
I ran into two issues when reading from an S3 store with the blosc compressor:

1. BloscCompressor.uncompress() uses InputStream.available(), which is not guaranteed to return the number of bytes actually remaining; the S3 implementation in fact always returned 0. This caused an EOFException.
2. After fixing the first issue in a branch, I found that BloscCompressor.uncompress() also calls InputStream.read(byte b[], int off, int len) without checking the number of bytes actually read. InputStream.read() does not guarantee that all requested bytes will be read in a single call. This was also the case with S3: its stream wraps an Apache HttpClient InputStream, which uses chunked network responses and can return partial data if the next chunk has not yet been read. As a result, the blosc header byte array was populated with payload data in the read loop. That invalid header was then fed to cbufferSizes(), which returned zero sizes and caused the output to be a zero-filled response.
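The second issue is the classic short-read pitfall. A minimal sketch of the read-fully pattern that addresses it (shown here for illustration, not the exact code from PR #13):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class StreamUtils {

    // Reads exactly len bytes into b[off..off+len), looping because a single
    // InputStream.read() call may return fewer bytes than requested, e.g.
    // one HTTP chunk at a time.
    static void readFully(InputStream in, byte[] b, int off, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(b, off + total, len - total);
            if (n < 0) {
                throw new EOFException("Expected " + len + " bytes but got " + total);
            }
            total += n;
        }
    }
}
```

The standard library offers the same behavior via java.io.DataInputStream.readFully(byte[], int, int); the loop is only spelled out here to show what goes wrong when it is skipped.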
Both of these issues were resolved in a pull request: #13.
@SabineEmbacher made a good point that my fix was specific to BloscCompressor.uncompress() and would not address similar issues elsewhere in the code base. @SabineEmbacher created the https://github.com/bcdev/jzarr/tree/S3_AWS branch with another possible solution. See #13 (comment)
@SabineEmbacher, I have tested your solution in the S3_AWS branch and it appears to work. It seems a little cumbersome, though, to require an InputStreamCreatorStrategy every time an S3 Zarr store is opened, since S3 seems like a common use case for Zarr.
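For readers following along, here is a rough guess at the shape of such a strategy. The name comes from the discussion above, but I have not copied the actual interface from the S3_AWS branch, so the signature and usage below are illustrative assumptions only.

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative shape only; the real interface in the S3_AWS branch may differ.
// The idea: the store delegates stream handling to a pluggable strategy, so a
// remote-specific wrapper (S3, or the next remote system XXX) lives outside
// the compressors.
interface InputStreamCreatorStrategy {

    InputStream createInputStream(InputStream raw) throws IOException;
}

// Every caller opening an S3-backed store would then need to supply one,
// e.g. (illustrative, not the real jzarr API):
//   array = open(s3Store, raw -> RemoteStreams.materialize(raw));
```

This pluggability is what makes the fix general, but it is also the source of the ceremony noted above: the caller must remember to pass the right strategy for each remote store.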
You can take or leave my pull request. It includes unit tests that demonstrate the behavior without using an actual S3 InputStream; maybe you can use them to test your solution.
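One way such a test can simulate the S3 behavior without any network (the tests in the PR may construct this differently) is a wrapper stream that reproduces both misbehaviors at once: available() always reports 0, and read() trickles out at most one byte per call, like a chunk boundary landing mid-read.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Test double mimicking the chunked S3 stream: available() lies (always 0)
// and read() delivers at most one byte per call, like a partial HTTP chunk.
final class TrickleInputStream extends FilterInputStream {

    TrickleInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int available() {
        return 0;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return super.read(b, off, Math.min(1, len));
    }
}

// Usage in a test, feeding compressed bytes through the misbehaving stream:
//   InputStream flaky = new TrickleInputStream(new ByteArrayInputStream(compressedBytes));
```

Any decompression path that only works with a cooperative stream will fail against this wrapper, which makes the partial-read bugs reproducible in a plain unit test.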