Documentation for Directly Writing to HDF5 #651
I guess adding a more minimal example in Python:

```python
import os
import blosc2
import h5py
import hdf5plugin
import numpy as np

shape = 10, 1024, 1024
data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)

comp = blosc2.compress(data, typesize=2, clevel=5,
                       filter=blosc2.Filter.SHUFFLE,
                       codec=blosc2.Codec.BLOSCLZ)

with h5py.File("blosc_test.h5", "w") as h5f:
    dset = h5f.create_dataset("data",
                              shape=data.shape,
                              dtype=data.dtype,
                              chunks=data.shape,
                              allow_unknown_filter=False,
                              compression=hdf5plugin.Blosc(cname="blosclz", clevel=5, shuffle=1),
                              )
    dset.id.write_direct_chunk((0, 0, 0), comp)
    comp_data = dset.id.read_direct_chunk((0, 0, 0))
    dset[:]  # (Blosc decompression error)

    # manually decompressing the data works, however...
    decomp = np.frombuffer(blosc2.decompress(comp_data[1]), dtype=data.dtype).reshape((10, 1024, 1024))
    np.all(np.equal(decomp, data))
```
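One way to check what the Blosc1 filter actually expects on disk is to let the filter pipeline compress the same data and then pull the chunk bytes back out with read_direct_chunk; a minimal sketch using only calls already shown in this thread (the file name and variable names are illustrative):

```python
import blosc2
import h5py
import hdf5plugin
import numpy as np

shape = 10, 1024, 1024
data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)

with h5py.File("blosc_filter_reference.h5", "w") as h5f:
    dset = h5f.create_dataset(
        "data",
        shape=data.shape,
        dtype=data.dtype,
        chunks=data.shape,
        compression=hdf5plugin.Blosc(cname="blosclz", clevel=5, shuffle=1),
    )
    dset[...] = data  # let the HDF5 filter pipeline do the compression

    # Bytes exactly as the Blosc1 filter stored them on disk.
    _, filter_bytes = dset.id.read_direct_chunk((0, 0, 0))

# Bytes produced by compressing "by hand" with python-blosc2.
manual_bytes = blosc2.compress(data, typesize=2, clevel=5,
                               filter=blosc2.Filter.SHUFFLE,
                               codec=blosc2.Codec.BLOSCLZ)

# If the leading bytes (the Blosc header) differ, a chunk written with
# write_direct_chunk from the manual buffer will not round-trip through
# the filter on read.
print(bytes(filter_bytes)[:16].hex())
print(bytes(manual_bytes)[:16].hex())
```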
Just a note that this seems to work with the newer blosc2 implementation:

```python
import os
import blosc2
import h5py
import hdf5plugin
import numpy as np

shape = 10, 1024, 1024
data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)

b_array = blosc2.asarray(
    data,
    chunks=data.shape,
    blocks=(1,) + data.shape[1:],  # compress slice by slice
)

with h5py.File("blosc2_test.h5", "w") as h5f:
    dset = h5f.create_dataset("data",
                              shape=b_array.shape,
                              dtype=b_array.dtype,
                              chunks=b_array.shape,
                              allow_unknown_filter=False,
                              compression=hdf5plugin.Blosc2(),
                              )
    dset.id.write_direct_chunk((0, 0, 0), b_array.schunk.to_cframe())
    print(dset._filters)
    comp_data = dset.id.read_direct_chunk((0, 0, 0))
    dset[:]  # (no Blosc decompression error)
```

I guess it's possible there is a bug in the hdf5plugin library which isn't reading the compressed data correctly...
If you want to accelerate HDF5 I/O with Blosc2, the direct chunking approach is best, as can be seen in e.g. https://www.blosc.org/posts/blosc2-pytables-perf/. BTW, we have made a package that allows this approach with h5py: https://github.com/Blosc/b2h5py, with performance similar to that in the blog post. If you want to write about this somewhere in the docs, you are welcome!
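For reference, a minimal sketch of what the b2h5py read path looks like with h5py (based on the B2Dataset wrapper that appears later in this thread; check the b2h5py README for the exact API):

```python
import b2h5py
import h5py
import hdf5plugin
import numpy as np

shape = 10, 1024, 1024
data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)

# Write a Blosc2-compressed dataset the ordinary way (filter pipeline).
with h5py.File("b2h5py_test.h5", "w") as h5f:
    h5f.create_dataset("data", data=data, chunks=data.shape,
                       compression=hdf5plugin.Blosc2())

# Read it back through b2h5py's optimized slicing, which accesses the
# Blosc2 super-chunks directly instead of going through the HDF5 filter
# pipeline.
with h5py.File("b2h5py_test.h5", "r") as h5f:
    b2dset = b2h5py.B2Dataset(h5f["data"])
    assert np.array_equal(b2dset[:], data)
```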
FWIW, here is some info on the Blosc2 format expected by an HDF5 plugin: https://github.com/Blosc/HDF5-Blosc2/blob/main/src/blosc2_plugin.c
@FrancescAlted Thank you so much for the information! How is the information for the blosc1 plugin different from that for the blosc2 plugin? My understanding is that the Blosc2 implementation expects super-chunks even if there is no sub-blocking in the underlying array. Is that a correct assumption? For reading data, ultimately the…
Well, I don't remember the details, but I think the blosc2/b2h5py plugin accepts both single chunks and super-chunks too. @ivilata may shed some light here.
Hi! Not completely fresh in my mind either, but b2h5py should accept in an HDF5 chunk whatever b2schunk_open() can read, and HDF5-Blosc2 whatever blosc_decompress() can. For instance, the second allows PyTables to read an HDF5 chunk which contains a Blosc2 super-chunk with either a single chunk or several of them (PyTables only creates super-chunks with one chunk when writing). But I don't remember having tried with an HDF5 chunk containing a standalone chunk created with Blosc1.
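A quick way to tell which of those two containers a given HDF5 chunk actually holds is to try both deserializations on the raw chunk bytes; a sketch, assuming python-blosc2's schunk_from_cframe and decompress raise on input they cannot parse:

```python
import blosc2

def describe_chunk_payload(raw: bytes) -> str:
    """Best-effort guess at what a Blosc-filtered HDF5 chunk contains."""
    try:
        # A Blosc2 super-chunk serialized as a contiguous frame (cframe)?
        schunk = blosc2.schunk_from_cframe(raw)
        return f"Blosc2 super-chunk with {schunk.nchunks} chunk(s)"
    except Exception:
        pass
    try:
        # A standalone Blosc chunk that blosc2.decompress can read?
        blosc2.decompress(raw)
        return "standalone Blosc chunk"
    except Exception:
        return "unrecognized payload"

# Usage, with comp_data as returned by dset.id.read_direct_chunk(...):
# print(describe_chunk_payload(comp_data[1]))
```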
@ivilata thanks! It seems like there is a bug in the blosc1 decompression implementation. How/where is the blosc1 plugin defined? The following still fails with a Blosc decompression error:

```python
import os
import blosc2
import h5py
import hdf5plugin
import numpy as np
import b2h5py

shape = 10, 1024, 1024
data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)

comp = blosc2.compress(data, typesize=2, clevel=5,
                       filter=blosc2.Filter.SHUFFLE,
                       codec=blosc2.Codec.BLOSCLZ)

with h5py.File("blosc_test.h5", "w") as h5f:
    dset = h5f.create_dataset("data",
                              shape=data.shape,
                              dtype=data.dtype,
                              chunks=data.shape,
                              allow_unknown_filter=False,
                              compression=hdf5plugin.Blosc(cname="blosclz", clevel=5, shuffle=1),
                              )
    dset.id.write_direct_chunk((0, 0, 0), comp)
    comp_data = dset.id.read_direct_chunk((0, 0, 0))  # no error...
    dset = b2h5py.B2Dataset(dset)
    dset[:]  # (Blosc decompression error)

    # manually decompressing the data works, however...
    decomp = np.frombuffer(blosc2.decompress(comp_data[1]), dtype=data.dtype).reshape((10, 1024, 1024))
    np.all(np.equal(decomp, data))
```
The plugin for blosc1 is around here: https://github.com/Blosc/hdf5-blosc. From what I can see in your example, you try to use blosc1 from hdf5plugin, and then use b2h5py, which uses blosc2. As blosc2 is backward compatible with blosc1, this should work, but it isn't in your case (for some reason). Just curious, does your script work if you use hdf5plugin.Blosc2 instead? FWIW, blosc2 comes with lots of advantages wrt blosc v1, and we encourage people to use the new version.
@FrancescAlted The Python script works for blosc2. When trying to write directly from C++, I'm still struggling to get the implementation to work properly and have the HDF5 file readable via Python, even when writing with blosc2. I can share the code for writing the dataset, or a small HDF5 file, if that helps. I keep getting an error:
But if I try to directly read the data:
Maybe a little more direct, but this code fails and I can't quite figure out why:

```python
from blosc2.schunk import open as blosc2_schunk_open_offset

es = blosc2_schunk_open_offset(f, "r", 17800)
```

When I directly call the same function in the C++ code, however, it works:

```cpp
blosc2_schunk* test_off = blosc2_schunk_open_offset(m_filePath.c_str(), 17800);
```
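If the 17800 above is a hand-measured file offset, one way to avoid guessing is to ask HDF5 where the chunk actually lives; a sketch using h5py's low-level chunk queries (needs h5py >= 3.0 built against HDF5 >= 1.10.5, so worth double-checking against your versions):

```python
import blosc2
import h5py

with h5py.File("blosc2_test.h5", "r") as h5f:
    dset = h5f["data"]

    # Ask the HDF5 library for the on-disk location of chunk 0.
    info = dset.id.get_chunk_info(0)
    print("chunk byte offset:", info.byte_offset, "size:", info.size)

    # Open the Blosc2 super-chunk stored at that offset directly.
    schunk = blosc2.open(h5f.filename, mode="r", offset=info.byte_offset)
    print("chunks in the embedded super-chunk:", schunk.nchunks)
```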
Interesting. You may want to try setting the |
Hi, long time lurker and big fan of Blosc! Apologies if this is not the right place to raise this issue. I've been working on trying to push the saving speed for HDF5 files. From my understanding, when trying to optimize for speed it is best to bypass the filter implementation and handle the compression/saving separately, e.g.:

https://www.blosc.org/posts/pytables-direct-chunking/
https://github.com/imaris/ImarisWriter

My impression was that this was largely done using the H5Dwrite_chunk function in the HDF5 library, which allows you to write directly to some chunk, bypassing the filter implementation. Doing this seems fairly straightforward. I've tried starting with the easier serial case: compressing with blosc and writing with H5Dwrite_chunk.

Currently, data is written to the HDF5 dataset, but reading it with something like hdf5plugin doesn't seem to work and a blosc decompression error is thrown. My first thought is that this is related to the filter parameters; is there documentation on what they should be? Currently I have something like:

I was wondering if there is a minimal example of this kind of workflow somewhere? If not, it would be great to add this to the documentation in some way. I can help with that :) assuming that I can figure out exactly how to save the data in a way that is readable.