Chunked writing + compression proof of concept #673

Open
wants to merge 1 commit into master from chunkedWrite
Conversation

cstoeckl

No description provided.

Owner

@jamesmudd jamesmudd left a comment

Looks like there are a lot of unrelated changes at the moment, e.g. reverting the copyright year and whitespace changes.

Could you merge master into this and clean these up so the diff is smaller?

@cstoeckl cstoeckl force-pushed the chunkedWrite branch 3 times, most recently from 605b8ce to bce0499 on January 20, 2025 at 21:58
@cstoeckl
Author

cstoeckl commented Jan 20, 2025

Cleaned up the pull request to remove formatting-only changes.

Sorry about the confusion; this is my first time collaborating using Git and GitHub.
I'm still learning.

@cstoeckl cstoeckl changed the title from "Initial write" to "Chunked writing + compression proof of concept" on Jan 20, 2025
Owner

@jamesmudd jamesmudd left a comment

Lots of great work here. I would consider breaking it up; here are a few things I consider separate:

  • The implementation of btree writing
  • The extension of the Filter interface to add writing and impl of deflate
  • The addition of UFixed - this is a little debatable to me. As Java doesn't have unsigned types, by definition anything you want to write must be signed; if there is a reason it would be nice to flag data as unsigned in the file we could consider it, but that's a separate discussion IMO.

I would reconsider the use of object header v1 and btree v1. Just forward thinking, I don't really intend to support using object header v1. If you take a look at the spec for the latest format (i.e. object header v2), Appendix C (https://support.hdfgroup.org/documentation/hdf5/latest/_f_m_t3.html#AppendixC), you can see the supported chunk indexes; if you want a btree then you would look at btree v2, but first I would look at single chunk.

So it's great you have got this to work, but I think there is quite a bit to clean up here. Consider breaking the change up into stages. I would probably suggest looking at single chunk or btree v2 first to avoid the changes required for object header v1. To get something merged we would also want unit tests with decent coverage; there are quite a few examples for writing to follow, including read verification with h5dump to check that the file reads with the HDF5 Group lib as well.

@@ -45,4 +45,6 @@ public interface Filter {
*/
byte[] decode(byte[] encodedData, int[] filterData);

byte[] encode(byte[] data, int[] filterData);
Owner

I think we will need a default impl here that throws an exception, as this is public API and people might have implemented custom filters.

Author

@cstoeckl cstoeckl Jan 22, 2025

I've reverted this change and only added the encode method to the DeflatePipelineFilter.

Once you have decided on the right public API for the encode method for all filters, we can adjust this method accordingly.

Owner

I think the API itself is good, we just need to make it compatible with existing Filter implementations. If we add a default, e.g.

	default byte[] encode(byte[] data, int[] filterData) {
		throw new UnsupportedHdfException(String.format("[%s (%d)] does not support encoding", getName(), getId()));
	}

Then all the existing code will compile. Encode implementations can be added gradually as people want to support those filters, but all existing filters only used for reading will keep working.
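
For what it's worth, a deflate encode implementation could then look roughly like this. This is only a sketch using java.util.zip.Deflater, assuming the method sits in DeflatePipelineFilter and that the first filterData value carries the compression level; it is not the code actually in this PR.

	@Override
	public byte[] encode(byte[] data, int[] filterData) {
		// Assumption: the first client data value is the deflate compression level
		int level = (filterData != null && filterData.length > 0)
				? filterData[0]
				: java.util.zip.Deflater.DEFAULT_COMPRESSION;
		java.util.zip.Deflater deflater = new java.util.zip.Deflater(level);
		deflater.setInput(data);
		deflater.finish();

		// Compress into a growable buffer, 4 KiB at a time
		java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream(data.length);
		byte[] buffer = new byte[4096];
		while (!deflater.finished()) {
			int written = deflater.deflate(buffer);
			out.write(buffer, 0, written);
		}
		deflater.end();
		return out.toByteArray();
	}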

Author

Sounds like a very good approach. It's done and uploaded.

@@ -31,18 +31,18 @@ public enum FilterManager {

private static final Logger logger = LoggerFactory.getLogger(FilterManager.class);

private static final Map<Integer, Filter> ID_TO_FILTER = new HashMap<>();
public static final Map<Integer, Filter> ID_TO_FILTER = new HashMap<>();
Owner

Hoping with some refactoring we can avoid the need to expose this

Author

@cstoeckl cstoeckl Jan 22, 2025

I noticed that you prefer the getter/setter approach which does not expose internal variables.

I've added a getFilter method, which keeps ID_TO_FILTER private.

/**
 * Retrieves a filter.
 *
 * @param filterId the ID of the filter to retrieve
 * @return the registered filter
 * @throws HdfFilterException if the filterId is not valid
 */
public static Filter getFilter(int filterId) {
	Filter filter = ID_TO_FILTER.get(filterId);
	if (filter == null) {
		// Enforce the documented behaviour for unknown IDs instead of failing with an NPE below
		throw new HdfFilterException("No filter registered for ID " + filterId);
	}
	logger.info("Retrieved HDF5 filter '{}' with ID '{}'", filter.getName(), filter.getId());

	return filter;
}
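
A call site would then look something like this (assuming the deflate filter is registered under its standard HDF5 filter ID, 1, and that encode is available as discussed above; rawChunkBytes is a placeholder):

	Filter deflate = FilterManager.getFilter(1); // 1 is the HDF5 deflate (gzip) filter ID
	byte[] compressed = deflate.encode(rawChunkBytes, new int[] { 6 }); // 6 = compression level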

Comment on lines 41 to 45
// addFilter(new ByteShuffleFilter());
// addFilter(new FletcherChecksumFilter());
// addFilter(new LzfFilter());
// addFilter(new BitShuffleFilter());
// addFilter(new Lz4Filter());
Owner

Adding the default method should allow this to be reverted.

Author

I only commented these extra filters out to simplify my development environment. I'll revert this in my next pull request. As I said, I'm still learning all these tools.

@@ -80,6 +80,23 @@ public BufferBuilder writeInt(int i) {
}
}

public BufferBuilder writeInts(int[] ints) {
Owner

I think this can be simplified:

	public BufferBuilder writeInts(int[] ints) {
		for (int i=0; i < ints.length; i++) {
			writeInt(ints[i]);
		}
		return this;
	}

Author

Thanks!

Fixed.

@@ -92,6 +109,24 @@ public BufferBuilder writeLong(long l) {
}
}

public BufferBuilder writeLongs(long[] longs) {
Owner

see above

Author

Thanks!

Fixed.

@@ -87,6 +87,10 @@ public boolean isSigned() {
return signed;
}

public void setSigned(boolean sig) {
Owner

The design up to now favours immutable objects, and I think I would like to stick to this where possible. We might need to introduce some kind of DatasetBuilder though, as we would want the ability to specify more options like filters, filter options, and chunk size.
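
For illustration only, such a builder could look roughly like this; it is a hypothetical sketch and none of these names exist in jHDF today.

	// Hypothetical sketch; all names invented for illustration
	public final class DatasetBuilder {
		private int[] chunkDimensions;
		private final java.util.List<Filter> filters = new java.util.ArrayList<>();
		private final java.util.List<int[]> filterOptions = new java.util.ArrayList<>();

		public DatasetBuilder chunkDimensions(int... dims) {
			this.chunkDimensions = dims.clone();
			return this;
		}

		public DatasetBuilder filter(Filter filter, int... options) {
			this.filters.add(filter);
			this.filterOptions.add(options);
			return this;
		}

		// build() would assemble an immutable dataset description from these options,
		// e.g. new DatasetBuilder().chunkDimensions(64, 64).filter(deflateFilter, 6)
	}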

Author

Understood. Given that you prefer not to use unsigned fixed-point types, I'll work on removing this feature from the pull request. I'll update the comment once I'm done.

logger.debug("Reading implicit indexed dataset");
chunkIndex = new ImplicitChunkIndex(layoutMessage.getAddress(), datasetInfo);
break;
throw new UnsupportedHdfException("Implicit indexing is currently not supported");
Owner

This looks like a merge issue?

Author

@cstoeckl cstoeckl Jan 22, 2025

Merge issue, fixed.

@cstoeckl
Author

cstoeckl commented Jan 22, 2025

I'll look into the issues with the V1 vs. V2 headers. I agree it is preferable to write only the V2 versions, but I'm not sure that all applications that will read the H5 files support the new features.
As I said in my e-mail, I needed the example files to understand the differences between the documentation and the implementation, and the applications I have access to only write V1 headers.

I agree tests are important and valuable, but I'm not familiar with the unit test system.
It took me a while, but I was able to clear all errors in the test system.

I managed to switch to a V2 Superblock and use V2 ObjectHeaders.

I'm not sure if it is worth the effort to switch to V2 BTree, since I don't have any examples or code that need it.

The edits are pushed to GitHub.

@cstoeckl cstoeckl force-pushed the chunkedWrite branch 3 times, most recently from b390964 to 55f3a1a on January 24, 2025 at 14:04