Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update string equality tests for h5py 3.1 compatibility #263

Merged
merged 6 commits into from
Dec 15, 2020

Conversation

dwpaley
Copy link

@dwpaley dwpaley commented Nov 18, 2020

h5py version 3 updated the rules for strings: https://docs.h5py.org/en/stable/strings.html. Some attributes that were previously returned as regular strings are now returned as bytes, which breaks tests like if obj.handle["depends_on"][()] == ".":. This PR converts both sides of all string equality tests in format/nexus.py to bytes using numpy.string_.

@codecov
Copy link

codecov bot commented Nov 18, 2020

Codecov Report

Merging #263 (7fcfbe5) into master (982b60c) will increase coverage by 0.22%.
The diff coverage is 43.28%.

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
+ Coverage   45.16%   45.38%   +0.22%     
==========================================
  Files         229      229              
  Lines       19079    19096      +17     
  Branches     2756     2761       +5     
==========================================
+ Hits         8617     8667      +50     
+ Misses       9945     9921      -24     
+ Partials      517      508       -9     

@graeme-winter
Copy link
Collaborator

@dwpaley strangely enough, while I recognise the changes you are making here, I did not need them for h5py 3.1 to work correctly for me (been using this already)

Much 🤔

In no sense is this a comment on the change set which I think you will find mirrored already on several branches...

@graeme-winter
Copy link
Collaborator

Asked @noemifrisina @benjaminhwilliams to take a look as they have both been looking at this code closely lately

@phyy-nx
Copy link
Contributor

phyy-nx commented Nov 18, 2020

Could have something to do with reading files in different versions of h5py than they were written. Similar to issues we had when we were doing the python 3 conversion.

format/nexus.py Outdated
@@ -111,7 +111,7 @@ def visitor(name, obj):
numpy.string_("NXsubentry"),
]:
if "definition" in obj:
if obj["definition"][()] == numpy.string_("NXmx"):
if numpy.string_(obj["definition"][()]) == numpy.string_("NXmx"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know that these strings are returned as bytes wouldn't

Suggested change
if numpy.string_(obj["definition"][()]) == numpy.string_("NXmx"):
if obj["definition"][()] == b"NXmx":

be more natural?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively:

Suggested change
if numpy.string_(obj["definition"][()]) == numpy.string_("NXmx"):
if obj["definition"][()] in ("NXmx", b"NXmx"):

if they can be both. Which would lend itself to be wrapped in a helper function such as

def _h5match(obj, string):
    return obj[()] in (string, string.encode("utf-8"))

(...)
    if _h5match(obj["definition"], "NXmx"):
    (...)

@ndevenish
Copy link
Collaborator

ndevenish commented Dec 14, 2020

By my reading

Variable-length strings in attributes are read as str objects. These are decoded as UTF-8 with surrogate escaping for unrecognised bytes.

Implies that all attributes are read as str objects now, and the only time numpy arrays are returned is for fixed-length datasets? So moving almost everything to plain str might be better, unless we want to support older h5py versions

@ndevenish
Copy link
Collaborator

Implies that all attributes are read as str objects now, and the only time numpy arrays are returned is for fixed-length datasets? So moving almost everything to plain str might be better, unless we want to support older h5py versions

Okay, this was me being hopelessly optimistic. It looks like, now, attribute strings can be str, but can also be numpy.string_ (which I assume is what they mean by bytes) - depending on how the attribute was written (fixed length or not). Rules appear to be summarized in h5py/h5py#1338 (comment). Additionally, it looks like attrs keys can be both e.g. both "x" in node.attrs and b"x" in node.attrs work.

So it'd have to be this, or the latter of Markus' suggestions (although maybe another variant - say a h5str(something_out_of_h5py) that means that only the left side needs to be converted).

These can now sometimes be numpy.string_, sometimes str, depending
on how the value was written. Rather than converting everything to
numpy.string_, convert to str immediately when read, so we don't
need to convert literals all over the place. In the best case, this
is a no-op.
@ndevenish
Copy link
Collaborator

ndevenish commented Dec 14, 2020

I've got a prototype of this (on top of the work here) in nexus_h5py3...ndevenish:nexus_h5py3, which also seems to work

@ndevenish ndevenish merged commit 687e231 into master Dec 15, 2020
@ndevenish ndevenish deleted the nexus_h5py3 branch December 15, 2020 12:39
@ndevenish
Copy link
Collaborator

Since tests are breaking I've pulled this in now. Thanks!

@ndevenish ndevenish mentioned this pull request Dec 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants