Documentation for Directly Writing to HDF5 #651

Open
CSSFrancis opened this issue Feb 5, 2025 · 11 comments

@CSSFrancis

CSSFrancis commented Feb 5, 2025

Hi, long-time lurker and big fan of Blosc! Apologies if this is not the right place to raise this issue. I've been working on trying to push the saving speed for HDF5 files. From my understanding, when optimizing for speed it is best to bypass the filter pipeline and handle the compression and saving separately.

e.g.:

https://www.blosc.org/posts/pytables-direct-chunking/
https://github.com/imaris/ImarisWriter

My impression was that this is largely done using the H5Dwrite_chunk function in the HDF5 library, which allows you to write compressed data directly to a chunk, bypassing the filter pipeline.

Doing this seems fairly straightforward. I've started with the easier serial case:

  1. Compress some data using Blosc
  2. Write the compressed data directly to an HDF5 dataset using H5Dwrite_chunk

Currently, data is written to the HDF5 dataset, but reading it back with something like hdf5plugin doesn't work and a Blosc decompression error is thrown. My first thought is that this is related to the filter parameters; is there documentation on what they should be? Currently I have something like:

cd_values[0] = 2;
cd_values[1] = 2;
cd_values[2] = m_hdfImagesDataType.getSize(); // 2 for 16-bit int

/* Get the size of the chunk */
int bufsize = m_hdfImagesDataType.getSize();
for (int i = 0; i < 3; i++) {
	bufsize *= (unsigned int)m_hdfImagesChunkDimensions[i];
}
m_compressionLibrary = blosc1_set_compressor("blosclz");

cd_values[3] = bufsize;
cd_values[4] = m_clevel;              /* compression level */
cd_values[5] = m_bitshuffle;          /* 0: shuffle not active, 1: shuffle active */
cd_values[6] = m_compressionLibrary;  /* the actual compressor to use */


// Create the HDF5 data space and data set for images
m_hdfImagesDataSpace = H5::DataSpace(m_hdfImagesDataSpaceDimensionsCount, m_hdfImagesDataSpaceDimensions, NULL);
H5::DSetCreatPropList hdfImagesDataSetProperties; 
hdfImagesDataSetProperties.setChunk(m_hdfImagesDataSpaceDimensionsCount, m_hdfImagesChunkDimensions);
// 32001 for blosc1
hdfImagesDataSetProperties.setFilter(32001, H5Z_FLAG_OPTIONAL, 7, cd_values);

m_hdfImagesDataSet = m_hdfImagesGroup.createDataSet("patterns", m_hdfImagesDataType, m_hdfImagesDataSpace, hdfImagesDataSetProperties);

....

int compressed_size = bytesPerChunkUncompressed + BLOSC2_MAX_OVERHEAD;
int blosc_result = blosc1_compress(m_clevel, m_bitshuffle, type_bytes, bytesPerChunkUncompressed, (char*)imagePixelData, compressed_data, compressed_size);  
if (blosc_result <= 0) {
	delete[] compressed_data; // Clean up
	return false;
}
hsize_t hdfDataSpaceNewImageDimensions[3] = {static_cast<hsize_t>(framesInBuffer), static_cast<hsize_t>(m_imageHeight), static_cast<hsize_t>(m_imageWidth)};
// Get the offset Chunk for the data... 
hsize_t hdfOffset[3] = { static_cast<hsize_t>(framesInBuffer* m_outputImageCurrentCount), 0, 0};

compressed_size = blosc_result; // Update to actual compressed size
H5Dwrite_chunk(m_hdfImagesDataSet.getId(), H5P_DEFAULT, 0, hdfOffset, compressed_size, compressed_data); // filters = 0 --> no filter applied?
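
For reference, my current reading of the hdf5-blosc filter source (blosc_filter.c) is that the seven cd_values slots mean roughly the following; this is my assumption from reading that code, not from official documentation:

# Assumed meaning of the cd_values slots for HDF5 filter 32001 (hdf5-blosc);
# slot indices and meanings are my reading of blosc_filter.c, not official docs.
cd_values_meaning = {
    0: "filter revision (2)",
    1: "Blosc format version (2)",
    2: "typesize in bytes",
    3: "uncompressed chunk size in bytes",
    4: "compression level",
    5: "shuffle (0 = none, 1 = byte shuffle, 2 = bit shuffle)",
    6: "compressor code (0 = blosclz, 1 = lz4, ...)",
}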

Is there a minimal example of this kind of workflow somewhere? If not, it would be great to add one to the documentation. I can help with that :) assuming I can figure out exactly how to save the data in a way that is readable.

@CSSFrancis
Author

CSSFrancis commented Feb 5, 2025

I guess I should add a more minimal example in Python:

import os
import blosc2
import h5py
import hdf5plugin
import numpy as np

shape = 10, 1024, 1024
data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)

comp = blosc2.compress(data, typesize=2, clevel=5,
                filter=blosc2.Filter.SHUFFLE,
                codec=blosc2.Codec.BLOSCLZ)

with h5py.File("blosc_test.h5", "w") as h5f:
    dset = h5f.create_dataset("data", 
                       shape=data.shape, 
                       dtype=data.dtype,
                       chunks=data.shape,
                       allow_unknown_filter=False,
                       compression=hdf5plugin.Blosc(cname="blosclz", clevel=5, shuffle=1),
                      )
    dset.id.write_direct_chunk((0, 0, 0), comp)
    comp_data = dset.id.read_direct_chunk((0,0,0))
    dset[:] # (Blosc decompression error)

# manually decompressing the data works however....

decomp = np.frombuffer(blosc2.decompress(comp_data[1]), dtype=data.dtype).reshape((10,1024,1024))
np.all(np.equal(decomp, data))

With print(dset._filters) giving {'32001': (2, 2, 2, 20971520, 5, 1, 0)}, which is what I would expect.

@CSSFrancis
Author

Just a note that this seems to work with the newer blosc2 implementation:

import os
import blosc2
import h5py
import hdf5plugin
import numpy as np

shape = 10, 1024, 1024
data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)

b_array = blosc2.asarray(
        data,
        chunks=data.shape,
        blocks=(1,) + data.shape[1:],  # Compress slice by slice
    )

with h5py.File("blosc2_test.h5", "w") as h5f:
    dset = h5f.create_dataset("data", 
                       shape=b_array.shape, 
                       dtype=b_array.dtype,
                       chunks=b_array.shape,
                       allow_unknown_filter=False,
                       compression=hdf5plugin.Blosc2(),
                      )
    dset.id.write_direct_chunk((0, 0, 0), b_array.schunk.to_cframe())
    print(dset._filters)
    comp_data = dset.id.read_direct_chunk((0,0,0))
    dset[:] # (No Blosc decompression error)

I guess it's possible there is a bug in the hdf5plugin library which isn't reading the compressed data correctly...

@FrancescAlted
Member

If you want to accelerate HDF5 I/O with Blosc2, the direct chunking approach is best, as can be seen in e.g. https://www.blosc.org/posts/blosc2-pytables-perf/. BTW, we have made a package that enables this approach with h5py at https://github.com/Blosc/b2h5py, with performance similar to that shown in the blog post.
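
A minimal sketch of what this looks like with h5py (file and dataset names here are just placeholders):

import b2h5py
import h5py
import hdf5plugin
import numpy as np

data = np.arange(1024 * 1024, dtype=np.uint16).reshape(1024, 1024)

# Write through the Blosc2 filter as usual
with h5py.File("b2_example.h5", "w") as h5f:
    h5f.create_dataset("data", data=data, chunks=(64, 1024),
                       compression=hdf5plugin.Blosc2())

# Read back through b2h5py's optimized (direct chunk) path
with h5py.File("b2_example.h5", "r") as h5f:
    b2dset = b2h5py.B2Dataset(h5f["data"])
    print(b2dset[:4, :8])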

If you want to write about this somewhere in the docs, you are welcome to!

@FrancescAlted
Member

FWIW, here is some info on the Blosc2 parameters expected by the HDF5 plugin: https://github.com/Blosc/HDF5-Blosc2/blob/main/src/blosc2_plugin.c

@CSSFrancis
Author

@FrancescAlted Thank you so much for the information! How does the information for the blosc1 plugin differ from that for the blosc2 plugin? My understanding is that the Blosc2 implementation expects super-chunks even if there is no sub-blocking in the underlying array. Is that a correct assumption?

For reading data, the b2h5py library is ultimately a great tool and what we would like to use, but we are currently trying to write a stream of data, so we need slightly lower-level control. From a performance standpoint, using Blosc1 might make more sense as well, because we need to compress a stream of data and (from my understanding) the fastest compression still comes from 10-100 MB chunks. I still need to play around with it a little, but I can add some documentation once I get something working.

@FrancescAlted
Member

Well, I don't remember the details, but I think the blosc2/b2h5py plugin accepts both single chunks and super-chunks. @ivilata may shed some light here.

@ivilata
Contributor

ivilata commented Feb 10, 2025

Hi! Not completely fresh in my mind either, but b2h5py should accept in an HDF5 chunk whatever b2schunk_open() can read, and HDF5-Blosc2 whatever blosc_decompress() can.

For instance, the second allows PyTables to read an HDF5 chunk which contains a Blosc2 super-chunk with either a single chunk or several of them (PyTables only creates super-chunks with one chunk when writing). But I don't remember having tried it with an HDF5 chunk containing a standalone chunk created with Blosc1.
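
For reference, a one-chunk super-chunk like the ones PyTables writes can be built by hand along these lines (just a sketch with python-blosc2; the values are illustrative):

import blosc2
import numpy as np

data = np.arange(1024 * 1024, dtype=np.uint16)

# Build a super-chunk that holds a single chunk, then serialize it to a
# cframe; a buffer like this is what would sit inside one HDF5 chunk.
schunk = blosc2.SChunk(chunksize=data.nbytes, cparams={"typesize": 2})
schunk.append_data(data)
payload = schunk.to_cframe()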

@CSSFrancis
Author

@ivilata thanks!

It seems like there is a bug in the blosc1 decompression implementation. How/where is the blosc1 plugin defined? The decompress function works, but it seems like something isn't being defined properly somewhere.

import os
import blosc2
import h5py
import hdf5plugin
import numpy as np
import b2h5py
shape = 10, 1024, 1024
data = np.arange(np.prod(shape), dtype=np.uint16).reshape(*shape)

comp = blosc2.compress(data, typesize=2, clevel=5,
                filter=blosc2.Filter.SHUFFLE,
                codec=blosc2.Codec.BLOSCLZ)

with h5py.File("blosc_test.h5", "w") as h5f:
    dset = h5f.create_dataset("data", 
                       shape=data.shape, 
                       dtype=data.dtype,
                       chunks=data.shape,
                       allow_unknown_filter=False,
                       compression=hdf5plugin.Blosc(cname="blosclz", clevel=5, shuffle=1),
                      )
    dset.id.write_direct_chunk((0, 0, 0), comp)
    comp_data = dset.id.read_direct_chunk((0,0,0)) # no error...
    dset = b2h5py.B2Dataset(dset)
    dset[:] # (Blosc decompression error)

# manually decompressing the data works however....

decomp = np.frombuffer(blosc2.decompress(comp_data[1]), dtype=data.dtype).reshape((10,1024,1024))
np.all(np.equal(decomp, data))

@FrancescAlted
Member

The plugin for blosc1 is around here: https://github.com/Blosc/hdf5-blosc. From what I can see in your example, you try to use blosc1 from hdf5plugin and then use b2h5py, which uses blosc2. As blosc2 is backward compatible with blosc1, this should work, but it isn't in your case (for some reason). Just curious, does your script work if you use hdf5plugin.Blosc2 instead? FWIW, blosc2 comes with lots of advantages over blosc v1, and we encourage people to use the new version.

@CSSFrancis
Author

CSSFrancis commented Feb 25, 2025

@FrancescAlted The Python script works for blosc2. For writing directly from C++, I'm still struggling to get the implementation to work properly and have the HDF5 file readable via Python, even when writing with blosc2. I can share the code for writing the dataset, or a small HDF5 file, if that helps.

I keep getting an error:

..\..\..\venv\Lib\site-packages\b2h5py\blosc2.py:263: in __getitem__
    return opt_selection_read(self.__dataset, selection)
..\..\..\venv\Lib\site-packages\b2h5py\blosc2.py:143: in opt_selection_read
    chunk_slice_arr = _read_chunk_slice(
..\..\..\venv\Lib\site-packages\b2h5py\blosc2.py:90: in _read_chunk_slice
    schunk = b2schunk_open(path, mode='r', offset=offset)
..\..\..\venv\Lib\site-packages\blosc2\schunk.py:1531: in open
    res = blosc2_ext.open(urlpath, mode, offset, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   RuntimeError: blosc2_schunk_open_offset('file_we_are_trying_to_open.h5', 17800) returned NULL

But if I try to read the data directly, it works:

import blosc2
import h5py
import b2h5py
import numpy as np

f2 = h5py.File(f, "r")  # f is the file name
dset2 = f2["Scan 1/EBSD/Data/patterns"]
s_info = dset2.id.get_chunk_info_by_coord((0, 0, 0))
offset = s_info.byte_offset  # offset is 17800
size = s_info.size
with open(f, 'rb') as file:
    file.seek(offset)
    b = file.read(size)
decomp = blosc2.decompress2(b)
arr = np.frombuffer(decomp, dtype=np.int16)

Maybe a little more direct, but this code fails and I can't quite figure out why...

from blosc2.schunk import open as blosc2_schunk_open_offset
es = blosc2_schunk_open_offset(f, "r", 17800)

When I directly call the same function from the C++ code, however, it works...

blosc2_schunk* test_off = blosc2_schunk_open_offset(m_filePath.c_str(), 17800);

@FrancescAlted
Member

Interesting. You may want to try setting the BLOSC_TRACE environment variable to something. This usually provides some debugging info on errors.
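
For instance, something along these lines before the failing read should make Blosc print tracing messages (the script name is just a placeholder):

# From the shell:
#   BLOSC_TRACE=1 python read_patterns.py
# or from Python, before any blosc2/b2h5py calls:
import os
os.environ["BLOSC_TRACE"] = "1"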
