Chunked writing + compression proof of concept #673
base: master
Conversation
Looks like there are a lot of unrelated changes at the moment, e.g. reverting the copyright year and whitespace changes.
Could you merge master into this and clean these up so the diff is smaller?
Force-pushed from 605b8ce to bce0499
Cleaned up the pull request to remove formatting-only changes. Sorry about the confusion, this is my first time collaborating using GitHub.
Lots of great work here. I would consider breaking it up; here are a few things I consider separate:
- The implementation of btree writing
- The extension of the Filter interface to add writing, and the deflate implementation
- The addition of UFixed. This is a little debatable to me: as Java doesn't have unsigned types, by definition anything you want to write must be signed. If there is a reason it would be nice to flag unsigned in the file we could consider it, but that's a separate discussion IMO.
I would reconsider the use of object header v1 and btree v1. Just forward thinking, I don't really intend to support using object header v1. If you take a look at the spec for the latest format (i.e. object header v2), Appendix C https://support.hdfgroup.org/documentation/hdf5/latest/_f_m_t3.html#AppendixC, you can see the supported chunk indexes. If you want a btree then you would look at btree v2, but first I would look at single chunk.
So it's great you have got this to work, but I think there is quite a bit to clean up here. Consider breaking the change up into stages. I would probably suggest looking at single chunk or btree v2 first, to avoid the changes required for object header v1. To get something merged we would also want unit tests with decent coverage; there are quite a few examples for writing to follow, including read verification with h5dump to check that the file reads with the HDF5 Group lib as well.
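To make the testing suggestion concrete, a round-trip test could be sketched roughly as below. This assumes jhdf's existing writing entry points (HdfFile.write returning a WritableHdfFile with putDataset); how the chunking and compression options from this PR are specified is still open, but they would plug into the same flow, and h5dump can be run on the produced file as an extra check.

import java.nio.file.Files;
import java.nio.file.Path;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertArrayEquals;

import io.jhdf.HdfFile;
import io.jhdf.WritableHdfFile;

class ChunkedWritingRoundTripTest {

    @Test
    void chunkedDatasetRoundTrips() throws Exception {
        Path path = Files.createTempFile("jhdf-chunked", ".hdf5");
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8};

        // Write a dataset; once chunked writing lands, chunk size and
        // filters would be specified here as well.
        try (WritableHdfFile out = HdfFile.write(path)) {
            out.putDataset("ints", data);
        }

        // Read verification with jhdf itself.
        try (HdfFile in = new HdfFile(path)) {
            assertArrayEquals(data, (int[]) in.getDatasetByPath("/ints").getData());
        }
    }
}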
@@ -45,4 +45,6 @@ public interface Filter {
     */
    byte[] decode(byte[] encodedData, int[] filterData);
+
+   byte[] encode(byte[] data, int[] filterData);
Think we will need a default impl here that throws an exception, as this is public API and people might have implemented custom filters.
I've reverted this change and only added the encode method to the DeflatePipelineFilter.
Once you have decided on the right public API for the encode method for all filters, we can adjust this method accordingly.
I think the API itself is good, we just need to make it compatible with existing Filter implementations. If we add a default, e.g.

default byte[] encode(byte[] data, int[] filterData) {
    throw new UnsupportedHdfException(String.format("[%s (%d)] does not support encoding", getName(), getId()));
}

then all the existing code will compile. encode implementations can be gradually added as people want to support those filters, but all existing filters only used for reading will keep working.
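To illustrate the compatibility point, here is a sketch of a pre-existing read-only custom filter (MyCustomFilter is invented for this example); with the default encode in place it keeps compiling without changes:

import io.jhdf.filter.Filter;

// A read-only custom filter, as users may already have implemented.
// It overrides only decode; the default encode would throw if called.
public class MyCustomFilter implements Filter {

    @Override
    public int getId() {
        return 32000; // user-defined HDF5 filter IDs start at 32000
    }

    @Override
    public String getName() {
        return "my-custom";
    }

    @Override
    public byte[] decode(byte[] encodedData, int[] filterData) {
        return encodedData; // pass-through, purely for illustration
    }
}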
Sounds like a very good approach. It's done and uploaded.
@@ -31,18 +31,18 @@ public enum FilterManager {

    private static final Logger logger = LoggerFactory.getLogger(FilterManager.class);

-   private static final Map<Integer, Filter> ID_TO_FILTER = new HashMap<>();
+   public static final Map<Integer, Filter> ID_TO_FILTER = new HashMap<>();
Hoping with some refactoring we can avoid the need to expose this
I noticed that you prefer the getter/setter approach, which does not expose internal variables.
I've added a getFilter method, which keeps ID_TO_FILTER private.
/**
 * Retrieves a filter.
 *
 * @param filterId the ID of the filter to retrieve
 * @return the registered filter
 * @throws HdfFilterException if the filterId is not valid
 */
public static Filter getFilter(int filterId) {
    Filter filter = ID_TO_FILTER.get(filterId);
    if (filter == null) {
        throw new HdfFilterException("No filter registered for ID: " + filterId);
    }
    logger.info("Retrieved HDF5 filter '{}' with ID '{}'", filter.getName(), filter.getId());
    return filter;
}
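As a usage example (filter ID 1 is deflate/gzip in the HDF5 filter registry):

Filter deflate = FilterManager.getFilter(1); // deflate is filter ID 1 in the HDF5 spec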
// addFilter(new ByteShuffleFilter());
// addFilter(new FletcherChecksumFilter());
// addFilter(new LzfFilter());
// addFilter(new BitShuffleFilter());
// addFilter(new Lz4Filter());
Adding the default method should allow this to be reverted.
I only commented these extra filters out to simplify my development environment. I'll revert this in my next pull request. As I said, I'm still learning all these tools.
@@ -80,6 +80,23 @@ public BufferBuilder writeInt(int i) {
        }
    }
+
+   public BufferBuilder writeInts(int[] ints) {
Think this can be simplified:
public BufferBuilder writeInts(int[] ints) {
    for (int i = 0; i < ints.length; i++) {
        writeInt(ints[i]);
    }
    return this;
}
Thanks! Fixed.
@@ -92,6 +109,24 @@ public BufferBuilder writeLong(long l) {
        }
    }
+
+   public BufferBuilder writeLongs(long[] longs) {
See above.
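For reference, the analogous simplification for writeLongs would presumably be:

public BufferBuilder writeLongs(long[] longs) {
    for (int i = 0; i < longs.length; i++) {
        writeLong(longs[i]);
    }
    return this;
}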
Thanks! Fixed.
@@ -87,6 +87,10 @@ public boolean isSigned() {
        return signed;
    }
+
+   public void setSigned(boolean sig) {
The design I have up to now is to favour immutable objects, and I think I would like to stick to this where possible. We might need to introduce some kind of DatasetBuilder, though, as we would want the ability to specify more options like filters, filter options, and chunk size.
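As a purely hypothetical sketch of that direction (DatasetBuilder and WritableDataset are invented names, not part of jhdf), the builder would hold all the mutability and produce an immutable result:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import io.jhdf.filter.Filter;

// Hypothetical: collects chunking and filter options, then builds an
// immutable dataset description. Getters are omitted for brevity.
public final class DatasetBuilder {

    private int[] chunkDimensions = new int[0];
    private final List<Filter> filters = new ArrayList<>();

    public DatasetBuilder chunkDimensions(int... dims) {
        this.chunkDimensions = dims.clone();
        return this;
    }

    public DatasetBuilder filter(Filter filter) {
        this.filters.add(filter);
        return this;
    }

    public WritableDataset build(Object data) {
        return new WritableDataset(data, chunkDimensions.clone(),
                Collections.unmodifiableList(new ArrayList<>(filters)));
    }

    // Immutable holder, also invented for this sketch.
    public static final class WritableDataset {
        private final Object data;
        private final int[] chunkDimensions;
        private final List<Filter> filters;

        WritableDataset(Object data, int[] chunkDimensions, List<Filter> filters) {
            this.data = data;
            this.chunkDimensions = chunkDimensions;
            this.filters = filters;
        }
    }
}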
Understood. Given that you prefer not to use unsigned fixed points, I'll work on removing this feature from the pull request. I'll update the comment once I'm done.
logger.debug("Reading implicit indexed dataset"); | ||
chunkIndex = new ImplicitChunkIndex(layoutMessage.getAddress(), datasetInfo); | ||
break; | ||
throw new UnsupportedHdfException("Implicit indexing is currently not supported"); |
This looks like a merge issue?
Merge issue, fixed.
I'll look into the issues with the V1 vs. V2 headers. I agree it is preferable to write only the V2 versions, but I'm not sure that all applications that will read the H5 files support the new features. I agree tests are important and valuable, but I'm not familiar with the unit test system. I managed to switch to a V2 superblock and use V2 object headers. I'm not sure it is worth the effort to switch to the V2 btree, since I don't have any examples/code that needs it. The edits are pushed to GitHub.
Force-pushed from b390964 to 55f3a1a
Force-pushed from 55f3a1a to 343354a