Serving datasets with bitshuffle compression in POSIX files #76
As you can imagine, HSDS doesn't support pluggable filters (I imagine security people wouldn't be happy with clients injecting code onto the server, for one thing...). So any supported filter needs to be implemented in HSDS. The nice aspect of this is that clients can utilize any supported filter without changes on their part. I'd be open to a PR to support bitshuffle in HSDS. The shuffle filter is implemented here: https://github.com/HDFGroup/hsds/blob/master/hsds/util/storUtil.py:L43. You can see it's not much code; I guess it wouldn't be too hard to do something similar for bitshuffle. You'll note the shuffle filter uses numba to speed up the shuffle/unshuffle. If I remember correctly, it's about 100x faster than the non-jitted code. Ideally it would be nice to have a cython implementation for the filters. That should be even faster.
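For context, the HDF5 shuffle filter just regroups the bytes of an array by byte position before compression. A rough sketch of a numba-jitted byte shuffle in that spirit (illustrative only, not the actual storUtil.py code; the function names here are made up):

import numba
import numpy as np

@numba.njit(nogil=True)
def _shuffle_bytes(element_size, src, dst):
    # Gather the j-th byte of every element into one contiguous run,
    # which is what the HDF5 shuffle filter does ahead of compression
    count = src.size // element_size
    for i in range(count):
        for j in range(element_size):
            dst[j * count + i] = src[i * element_size + j]

def shuffle_chunk(chunk, element_size):
    src = np.frombuffer(chunk, dtype=np.uint8)
    dst = np.empty_like(src)
    _shuffle_bytes(element_size, src, dst)
    return dst.tobytes()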
Hi, the HDF5 bitshuffle filter is actually bitshuffle+lz4 compression in a single step, and it is seen as a compression filter from an HDF5 point of view (see line 223 in 2d959f3).
Hi, @loichuder tested with … Still, gaining a factor of 4 in the docker image is good!
Ah, sorry, I was thinking regular shuffle, not bitshuffle. We'll look into adding bitshuffle support. |
Hi, since it is used by Dectris detectors, there are some datasets available on Zenodo, e.g.: That would definitely be nice to have, but don't add it especially for us. It will be complicated for us to use HSDS in production in the near future, since we can have HDF5 files with millions of entries, and I expect it will be an issue to store that structure in a POSIX file system. Best,
I don't know exactly what fixed it (perhaps #90) but serving bitshuffle compressed datasets now works 🙂 |
Nevermind, I went too fast: I looked at uncompressed datasets believing that they were compressed 😐 ... Sorry for the useless ping |
Supporting the Bitshuffle filter in HSDS will require some effort because both the bitshuffle and hdf5plugin packages are geared towards h5py and the HDF5 library. My understanding is that Bitshuffle consists of two independent operations: bit shuffling and then LZ4 compression. Hopefully there is a way to implement those in some other way than going through the HDF5 filter.
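For what it's worth, the bitshuffle Python package already exposes both steps independently of any HDF5 filter. A quick round-trip sketch (the array here is just arbitrary test data):

import numpy as np
import bitshuffle

data = np.arange(10000, dtype=np.uint32)
# compress_lz4 performs the bit shuffle and the LZ4 compression in one call
compressed = bitshuffle.compress_lz4(data)
restored = bitshuffle.decompress_lz4(compressed, data.shape, data.dtype)
assert np.array_equal(data, restored)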
BTW, here is a function I use to decompress bitshuffle+LZ4 compressed chunks when using HDF5 direct chunk read. It's using the bitshuffle Python package:

import struct
import numpy
import bitshuffle

def decompress_bslz4_chunk(payload, dtype, chunk_shape):
    # The 12-byte header stores the total uncompressed size (big-endian uint64)
    # followed by the block size in bytes (big-endian uint32)
    total_nbytes, block_nbytes = struct.unpack(">QI", payload[:12])
    block_size = block_nbytes // dtype.itemsize
    arr = numpy.frombuffer(payload, dtype=numpy.uint8, offset=12)  # No copy here
    chunk_data = bitshuffle.decompress_lz4(arr, chunk_shape, dtype, block_size)
    return chunk_data
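For illustration, the function above would be called on the raw chunk bytes obtained from a direct chunk read, along the lines of the sketch below (file name, dataset path and chunk offset are hypothetical; assumes h5py's low-level read_direct_chunk is available):

import h5py

with h5py.File("data.h5", "r") as h5file:
    dset = h5file["entry/data"]
    # Returns the filter mask and the still-compressed bytes of the chunk at offset (0, 0)
    filter_mask, payload = dset.id.read_direct_chunk((0, 0))
    chunk = decompress_bslz4_chunk(payload, dset.dtype, dset.chunks)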
Do you know if the bitshuffle package can be installed or built without h5py and HDF5 library? I did not notice that option. |
I don't think so, the HDF5 compression filter should be optional, but not the module |
That's my understanding, too. Thanks! |
If someone would like to create a package that could do bitshuffle without the HDF library dependency, that would be much appreciated! |
Bitshuffle should now be working in HSDS! I've put in updates to use the bitshuffle package and support bitshuffle+lz4 compression (so in the filter pipeline, it's sufficient to just use the bitshuffle filter and no compressor). In h5pyd (version 0.17.0 or later), you can use the --link option so the HSDS domain will read from the bitshuffled chunks in the HDF5 file as needed. If anyone has time to try this out with their HDF5 bitshuffled data, it would be most appreciated! Please let me know if you have questions or comments. I'll leave this issue open for a bit in case any problems come up.
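For anyone giving this a try, the read side after loading a file with hsload --link could look roughly like the sketch below (domain path and dataset name are hypothetical; assumes the HSDS endpoint and credentials are already configured, e.g. via .hscfg):

import h5pyd

# Open the HSDS domain created by hsload --link; chunk data is read from
# the original bitshuffled HDF5 file and decompressed server-side
with h5pyd.File("/home/myuser/compressed.h5", "r") as f:
    dset = f["bitshuffled"]
    data = dset[0:100, 0:100]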
Gave it a try but I get an error when running
I used the following script to create the test file:

import h5py
import numpy as np
import hdf5plugin

with h5py.File("compressed.h5", "w") as h5file:
    data = np.random.random((1000, 1000))
    h5file.create_dataset("gzip", data=data, compression="gzip")
    h5file.create_dataset("szip", data=data, compression="szip")
    h5file.create_dataset("scaleoffset", data=data, scaleoffset=4)
    h5file.create_dataset(
        "gzip_shuffle", data=data, compression="gzip", shuffle=True
    )
    h5file.create_dataset("bitshuffled", data=data, **hdf5plugin.Bitshuffle())
Thanks for trying it! |
But I pulled the latest version from this repo and used runall.sh.
No, the runall.sh will just run the last build. You need to do a build.sh first. :) |
Ah yes. Sorry, it has been a while 😅 |
All right, I could do the build. However, the data node seems to crash when trying to uncompress the data:
The nbytes and block_nbytes seem reasonable at least. Strange that you had an illegal instruction error. Is HSDS running on x86 or ARM?
Running on … I guess it is because I tried to specify a slice when trying to get the data (via …). EDIT: No, it is unrelated, sorry. Here is the file I used: https://cloud.esrf.fr/s/btA8C4aB8C9YMLH
@loichuder - I'm able to download the file you used, but it seems like it's not an HDF5 file:
Yeah, it seems so. You can do the following instead:
ok - I got that and it seems to load ok.
Copied the file to s3:
Ran hsload on it:
Checked out the loaded file:
Do you see the same content when you open the file with h5py? |
I get the same thing as you, up to the data = dset[:,:] part, which times out:
I guess it's progress that you at least got to the dset[:,:]. I wonder if the timeout is unrelated to bitshuffle, and just an effect of trying to fetch a large block of data in one request. Could you change the "data = dset[:,:]" to use a chunk iterator instead? Like this:
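A sketch of one such chunk-by-chunk read (assuming dset exposes an h5py-style iter_chunks(); the original snippet may differ):

import numpy as np

data = np.empty(dset.shape, dtype=dset.dtype)
for chunk_slices in dset.iter_chunks():
    # Each iteration fetches a single chunk-sized selection from HSDS
    data[chunk_slices] = dset[chunk_slices]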
It takes quite some time but I get another error:
Not sure if this is related, but
Could you take a look at the docker logs and see if anything obvious shows up? |
Good call, the service node indeed has an error:
See below for a more complete picture.

Full logs:
1702469460.050 INFO> read_chunk_hyperslab, chunk_id: c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0, bucket: files
1702469460.050 DEBUG> using chunk_map entry for c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0: {'s3path': 'files/compressed.h5', 's3offset': 4016, 's3size': 27699, 'chunk_sel': [slice(0, 63, 1), slice(0, 63, 1)], 'data_sel': (slice(0, 63, 1), slice(0, 63, 1))}
1702469460.050 DEBUG> read_chunk_hyperslab - chunk_sel: [slice(0, 63, 1), slice(0, 63, 1)]
1702469460.050 DEBUG> read_chunk_hyperslab - data_sel: (slice(0, 63, 1), slice(0, 63, 1))
1702469460.050 DEBUG> hyperslab selection - chunk_shape: [63, 63]
1702469460.051 DEBUG> getNodeCount for dn_urls: ['http://172.22.0.4:6101', 'http://172.22.0.5:6101', 'http://172.22.0.6:6101', 'http://172.22.0.7:6101']
1702469460.051 DEBUG> got dn_url: http://172.22.0.4:6101 for obj_id: c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.051 DEBUG> read_chunk_hyperslab - GET chunk req: http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.051 DEBUG> params: {'s3path': 'files/compressed.h5', 's3offset': 4016, 's3size': 27699, 'bucket': 'files', 'select': '[0:63,0:63]'}
1702469460.051 INFO> http_get('http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0')
1702469460.055 INFO> http_get status: 503 for req: http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.055 WARN> 503 error for http_get_Json http://172.22.0.4:6101/chunks/c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.055 WARN> HTTPServiceUnavailable for read_chunk_hyperslab(c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0): Service Unavailable
1702469460.056 ERROR> ChunkCrawler action: read_chunk_hyperslab failed after: 7 retries
1702469460.056 INFO> ChunkCrawler - worker status for chunk c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0: 503
1702469460.056 DEBUG> ChunkCrawler - task c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0 start: 1702469445.763 elapsed: 14.293
1702469460.056 INFO> ChunkCrawler - join complete - count: 1
1702469460.056 DEBUG> ChunkCrawler - workers canceled
1702469460.056 INFO> returning chunk_status: 503 for chunk: c-d6f45ee7-90e9c4fa-cb6d-10b780-7f110c_0_0
1702469460.056 INFO> doReadSelection complete - status: 503
1702469460.056 INFO> doReadSelection raising HTTPInternalServerError for status: 503
1702469460.056 ERROR> got exception doing getSelectionData: Internal Server Error
1702469460.056 INFO> streaming data for 1 pages complete, 0 bytes written
1702469460.056 DEBUG> ChunkCrawler - worker has been cancelled
Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 332, in data_received
    messages, upgraded, tail = self._request_parser.feed_data(data)
  File "aiohttp/_http_parser.pyx", line 557, in aiohttp._http_parser.HttpParser.feed_data
aiohttp.http_exceptions.BadStatusLine: 400, message: Bad status line "Invalid method encountered:\n\n b''\n ^"
I suspect the real error is occurring in one of the DN containers. Could you take a look at the DN logs as well? To make life easier when debugging this kind of issue, I will usually start the server with |
Hey @loichuder - have you run into problems using bitshuffle with packages that require numpy >= 2.0? It seems that the bitshuffle repo hasn't been updated in quite a while and this is causing problems moving to the new numpy release. See: #378.
I am trying to serve POSIX files that contain datasets compressed with the bitshuffle filter. hsload --link works without any trouble, as do requests to metadata and uncompressed datasets. However, requests to the compressed datasets fail with the following errors in the datanode:

How should I proceed to be able to request such datasets? Given that I do hsload --link, should I look into HSDS rather than h5pyd?