Chunked writing + compression proof of concept #673

Open
wants to merge 1 commit into master from chunkedWrite
Conversation

cstoeckl

No description provided.

Owner

@jamesmudd jamesmudd left a comment

Looks like there are a lot of unrelated changes at the moment, e.g. reverting the copyright year and whitespace changes.

Could you merge master into this and clean these up so the diff is smaller?

@cstoeckl cstoeckl force-pushed the chunkedWrite branch 3 times, most recently from 605b8ce to bce0499 on January 20, 2025 at 21:58
@cstoeckl
Author

cstoeckl commented Jan 20, 2025

Cleaned up the pull request to remove formatting-only changes.

Sorry about the confusion; this is my first time collaborating using Git and GitHub.
I'm still learning.

@cstoeckl cstoeckl changed the title from "Initial write" to "Chunked writing + compression proof of concept" on Jan 20, 2025
Owner

@jamesmudd jamesmudd left a comment

Lots of great work here. I would consider breaking it up; here are a few things I consider separate:

  • The implementation of btree writing
  • The extension of the Filter interface to add writing and impl of deflate
  • The addition of UFixed - this is a little debatable to me. As Java doesn't have unsigned types, by definition anything you want to write must be signed; if there is a reason it would be nice to flag data as unsigned in the file we could consider it, but that's a separate discussion IMO.

I would reconsider the use of object header v1 and btree v1. Just forward thinking, I don't really intend to support using object header v1. If you take a look at the spec for the latest format (i.e. object header v2), Appendix C (https://support.hdfgroup.org/documentation/hdf5/latest/_f_m_t3.html#AppendixC), you can see the supported chunk indexes; if you want a btree then you would look at btree v2, but first I would look at single chunk.

So it's great you have got this to work, but I think there is quite a bit to clean up here. Consider breaking the change up into stages. I would probably suggest looking at single chunk or btree v2 first to avoid the changes required for object header v1. To get something merged we would also want unit tests with decent coverage; there are quite a few examples for writing to follow, including read verification with h5dump to check that the file reads with the HDF5 Group lib as well.

@@ -45,4 +45,6 @@ public interface Filter {
*/
byte[] decode(byte[] encodedData, int[] filterData);

byte[] encode(byte[] data, int[] filterData);
Owner

I think we will need a default impl here that throws an exception, as this is public API and people might have implemented custom filters.

Author

@cstoeckl cstoeckl Jan 22, 2025

I've reverted this change and only added the encode method to the DeflatePipelineFilter.

Once you have decided on the right public API for the encode method for all filters, we can adjust this method accordingly.

Owner

I think the API itself is good, we just need to make it compatible with existing Filter implementations. If we add a default, e.g.

	default byte[] encode(byte[] data, int[] filterData) {
		throw new UnsupportedHdfException(String.format("[%s (%d)] does not support encoding", getName(), getId()));
	}

Then all the existing code will compile. Encode implementations can be added gradually as people want to support those filters, but all existing filters only used for reading will keep working.
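
For what it's worth, a deflate encode implementation could then look roughly like this. This is only a sketch using java.util.zip.Deflater, assuming the method sits in DeflatePipelineFilter and that the first filterData value carries the compression level; it is not the code actually in this PR.

	@Override
	public byte[] encode(byte[] data, int[] filterData) {
		// Assumption: the first client data value is the deflate compression level
		int level = (filterData != null && filterData.length > 0)
				? filterData[0]
				: java.util.zip.Deflater.DEFAULT_COMPRESSION;
		java.util.zip.Deflater deflater = new java.util.zip.Deflater(level);
		deflater.setInput(data);
		deflater.finish();

		// Compress into a growable buffer, 4 KiB at a time
		java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream(data.length);
		byte[] buffer = new byte[4096];
		while (!deflater.finished()) {
			int written = deflater.deflate(buffer);
			out.write(buffer, 0, written);
		}
		deflater.end();
		return out.toByteArray();
	}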

Author

Sounds like a very good approach. It's done and uploaded.

@@ -31,18 +31,18 @@ public enum FilterManager {

private static final Logger logger = LoggerFactory.getLogger(FilterManager.class);

private static final Map<Integer, Filter> ID_TO_FILTER = new HashMap<>();
public static final Map<Integer, Filter> ID_TO_FILTER = new HashMap<>();
Owner

Hoping with some refactoring we can avoid the need to expose this

Author

@cstoeckl cstoeckl Jan 22, 2025

I noticed that you prefer the getter/setter approach which does not expose internal variables.

I've added a getFilter method, which keeps ID_TO_FILTER private.

/**
 * Retrieves a filter.
 *
 * @param filterId the ID of the filter to retrieve
 * @return the registered filter
 * @throws HdfFilterException if the filterId is not valid
 */
public static Filter getFilter(int filterId) {
	Filter filter = ID_TO_FILTER.get(filterId);
	if (filter == null) {
		// Enforce the documented behaviour for unknown IDs instead of failing with an NPE below
		throw new HdfFilterException("No filter registered for ID " + filterId);
	}
	logger.info("Retrieved HDF5 filter '{}' with ID '{}'", filter.getName(), filter.getId());

	return filter;
}
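
A call site would then look something like this (assuming the deflate filter is registered under its standard HDF5 filter ID, 1, and that encode is available as discussed above; rawChunkBytes is a placeholder):

	Filter deflate = FilterManager.getFilter(1); // 1 is the HDF5 deflate (gzip) filter ID
	byte[] compressed = deflate.encode(rawChunkBytes, new int[] { 6 }); // 6 = compression level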

Comment on lines 41 to 45
// addFilter(new ByteShuffleFilter());
// addFilter(new FletcherChecksumFilter());
// addFilter(new LzfFilter());
// addFilter(new BitShuffleFilter());
// addFilter(new Lz4Filter());
Owner

Adding the default method should allow this to be reverted.

Author

I only commented these extra filters out to simplify my development environment. I'll revert this in my next pull request. As I said, I'm still learning all these tools.

@@ -80,6 +80,23 @@ public BufferBuilder writeInt(int i) {
}
}

public BufferBuilder writeInts(int[] ints) {
Owner

I think this can be simplified:

	public BufferBuilder writeInts(int[] ints) {
		for (int i=0; i < ints.length; i++) {
			writeInt(ints[i]);
		}
		return this;
	}

Author

Thanks!

Fixed.

@@ -92,6 +109,24 @@ public BufferBuilder writeLong(long l) {
}
}

public BufferBuilder writeLongs(long[] longs) {
Owner

see above

Author

Thanks!

Fixed.

@@ -87,6 +87,10 @@ public boolean isSigned() {
return signed;
}

public void setSigned(boolean sig) {
Owner

The design up to now favours immutable objects, and I think I would like to stick to this where possible. We might need to introduce some kind of DatasetBuilder though, as we would want the ability to specify more options like filters, filter options, and chunk size.
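
For illustration only, such a builder could look roughly like this; it is a hypothetical sketch and none of these names exist in jHDF today.

	// Hypothetical sketch; all names invented for illustration
	public final class DatasetBuilder {
		private int[] chunkDimensions;
		private final java.util.List<Filter> filters = new java.util.ArrayList<>();
		private final java.util.List<int[]> filterOptions = new java.util.ArrayList<>();

		public DatasetBuilder chunkDimensions(int... dims) {
			this.chunkDimensions = dims.clone();
			return this;
		}

		public DatasetBuilder filter(Filter filter, int... options) {
			this.filters.add(filter);
			this.filterOptions.add(options);
			return this;
		}

		// build() would assemble an immutable dataset description from these options,
		// e.g. new DatasetBuilder().chunkDimensions(64, 64).filter(deflateFilter, 6)
	}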

Author

Understood. Given that you prefer not to use unsigned fixed-point types, I'll work on removing this feature from the pull request. I'll update the comment once I'm done.

logger.debug("Reading implicit indexed dataset");
chunkIndex = new ImplicitChunkIndex(layoutMessage.getAddress(), datasetInfo);
break;
throw new UnsupportedHdfException("Implicit indexing is currently not supported");
Owner

This looks like a merge issue?

Author

@cstoeckl cstoeckl Jan 22, 2025

Merge issue, fixed.

@cstoeckl
Author

cstoeckl commented Jan 22, 2025

I'll look into the issues with the V1 vs. V2 headers. I agree it is preferable to write only the V2 versions, but I'm not sure that all applications that will read the H5 files support the new features.
As I said in my e-mail, I needed the example files to understand the differences between the documentation and the implementation, and the applications I have access to only write V1 headers.

I agree tests are important and valuable, but I'm not familiar with the unit test system.
It took me a while, but I was able to clear all errors in the test system.

I managed to switch to a V2 Superblock and use V2 ObjectHeaders.

I'm not sure if it is worth the effort to switch to V2 BTree, since I don't have any examples or code that need it.

The edits are pushed to GitHub.

@cstoeckl cstoeckl force-pushed the chunkedWrite branch 3 times, most recently from b390964 to 55f3a1a on January 24, 2025 at 14:04