Numpy string types in compound type #773

Closed
jacklovell opened this issue Mar 7, 2018 · 27 comments

Comments

@jacklovell

I have the following Numpy dtype for an array:

dtype([('Object_type', 'S30'), ('ID', 'S30'), ('Version', '<i4'), 
('basis_1', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]),
('basis_2', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), 
('centre_point', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), 
('width', '<f8'), ('height', '<f8'),
('slit_id', 'S30'), ('slit_no', '<i4')])

When I try to create a NetCDF compound type from this dtype, I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
netCDF4/_netCDF4.pyx in netCDF4._netCDF4._def_compound (netCDF4/_netCDF4.c:51191)()

KeyError: 'S30'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-247-c1ab3d42293b> in <module>()
     20     nc_coord = bolo_group.createCompoundType(cartesian_coord, "COORDINATE")
---> 21     nc_detector = bolo_group.createCompoundType(aperture_dtype_full, "DETECTOR")
     22     foil_no = bolo_group.createDimension("foil_no", foils.size)

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.createCompoundType (netCDF4/_netCDF4.c:16268)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.CompoundType.__init__ (netCDF4/_netCDF4.c:49971)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._def_compound (netCDF4/_netCDF4.c:51245)()

ValueError: Unsupported compound type element

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-247-c1ab3d42293b> in <module>()
     27     nc_slits = sxd_group.createVariable("slits", nc_detector, "slit_no")
     28     nc_slits[:] = slits_full
---> 29     nc_slits.units = "mm"

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__exit__ (netCDF4/_netCDF4.c:13090)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.close (netCDF4/_netCDF4.c:15045)()

RuntimeError: NetCDF: HDF error

I understand that it is not possible to create compound types which include variable-length arrays, but the Numpy '<S30' type should always be 30 bytes, not a variable length. This is actually why I'm using S30 and not U30, since the latter may have multi-byte characters.
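The size difference between the two NumPy string kinds can be checked directly. A minimal sketch using only NumPy; the sample value is made up for illustration:

```python
import numpy as np

# 'S30' is a byte string: exactly 30 bytes per element.
bytes_dtype = np.dtype("S30")
# 'U30' holds 30 Unicode code points, which NumPy stores as 4 bytes each.
unicode_dtype = np.dtype("U30")

print(bytes_dtype.itemsize)    # 30
print(unicode_dtype.itemsize)  # 120

# Shorter values are NUL-padded to the fixed width, never resized.
arr = np.array([b"SXD"], dtype="S30")  # illustrative value
print(arr.nbytes)  # 30
```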

Is it possible to add support for the S30 dtype in netcdf4-python? Otherwise, is there another way that I can include fixed length string types in a NetCDF compound type?

@jswhit
Collaborator

jswhit commented Mar 7, 2018

Problem is there is no fixed length string datatype in netcdf. The only workaround I know of is to create a character array ('S1') with a length of 30, and then use the stringtoarr and chartostring utilities to convert back and forth from strings to arrays of characters. There's an example of this for compound types at https://github.com/Unidata/netcdf4-python/blob/master/examples/tutorial.py.

In your case, you would use

dtype([('Object_type', 'S1',30), ('ID', 'S1',30), ('Version', '<i4'), 
('basis_1', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]),
('basis_2', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), 
('centre_point', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), 
('width', '<f8'), ('height', '<f8'),
('slit_id', 'S1',30), ('slit_no', '<i4')])
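To give a feel for the boilerplate this implies, here is a NumPy-only sketch of filling and reading one character-array field by hand. It is a stand-in for the idea, not the netCDF4 utilities themselves: `to_chars` and `from_chars` are hypothetical helper names, and the two-field dtype and the sample ID are made up to keep the example short.

```python
import numpy as np

# Char-array form of one string field plus an integer (cut-down dtype).
dtype_chars = np.dtype([("ID", "S1", 30), ("Version", "<i4")])

def to_chars(text, length):
    """Hypothetical helper: NUL-pad `text` and split it into S1 elements."""
    raw = text.encode("ascii")
    if len(raw) > length:
        raise ValueError("string longer than the fixed field width")
    return np.frombuffer(raw.ljust(length, b"\x00"), dtype="S1")

def from_chars(chars):
    """Hypothetical helper: join S1 elements back into a Python string."""
    return chars.tobytes().rstrip(b"\x00").decode("ascii")

record = np.zeros(1, dtype=dtype_chars)
record["ID"][0] = to_chars("SXD_slit_01", 30)  # illustrative value
record["Version"][0] = 2

print(record["ID"].shape)            # one row of 30 single characters
print(from_chars(record["ID"][0]))
```

Every string field needs this conversion in both directions, which is the overhead the rest of the thread tries to remove.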

I've asked for fixed length strings to be added to netcdf-c in the past, but the idea never gained any traction (Unidata/netcdf-c#132). If you really want this, please comment on that ticket.

@jswhit
Collaborator

jswhit commented Mar 8, 2018

BTW - it is technically possible to support vlen strings inside compound data types. I just haven't devoted the time to implement it (yet).

@jacklovell
Author

Thanks for this. Using the stringtoarr and chartostring utilities works well, although it does add some boilerplate to the code.

It would be nice if there was at least an option in netcdf4-python to perform this step internally. Xarray does this (http://xarray.pydata.org/en/stable/io.html#string-encoding) for its netcdf IO, although that doesn't seem to work for compound data types yet (pydata/xarray#1977). Enabling this at the netcdf4-python level would make it easier for other packages to support what would be a very useful feature.

@jswhit
Collaborator

jswhit commented Mar 9, 2018

We already do this for character arrays if the _Encoding attribute is set (that is what xarray is using under the hood), so it may not be too hard to support for compound types. I'll look into it.

@shoyer
Contributor

shoyer commented Mar 10, 2018

One option is to use .view() to convert the data from repeated single characters to strings. Consider:

In [9]: data = [('a', 'bb'), ('ccc', 'dddd')]

In [10]: arr1 = np.array(data, dtype=[('f0', 'S1', 3), ('f1', 'S1', 4)])

In [11]: arr1
Out[11]:
array([([b'a', b'a', b'a'], [b'b', b'b', b'b', b'b']),
       ([b'c', b'c', b'c'], [b'd', b'd', b'd', b'd'])],
      dtype=[('f0', 'S1', (3,)), ('f1', 'S1', (4,))])

In [12]: arr2 = np.array(data, dtype='S3,S4')

In [13]: arr2
Out[13]:
array([(b'a', b'bb'), (b'ccc', b'dddd')],
      dtype=[('f0', 'S3'), ('f1', 'S4')])

In [14]: arr3 = arr1.view(arr2.dtype)

In [15]: arr3
Out[15]:
array([(b'aaa', b'bbbb'), (b'ccc', b'dddd')],
      dtype=[('f0', 'S3'), ('f1', 'S4')])

@shoyer
Contributor

shoyer commented Mar 10, 2018

OK, numpy is doing something horrible with arr1 -- those elements should not be repeated like that.

But hopefully you get the idea that view() can be used to convert between data types.
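One way to sidestep the repeated elements is to build the string form first and only then view it as character arrays. A minimal sketch with the same sample data, using only NumPy:

```python
import numpy as np

data = [(b"a", b"bb"), (b"ccc", b"dddd")]

# Build the string form first, so each value is NUL-padded, not repeated.
strings = np.array(data, dtype=[("f0", "S3"), ("f1", "S4")])

# Reinterpret the same bytes as arrays of single characters.
char_dtype = np.dtype([("f0", "S1", 3), ("f1", "S1", 4)])
chars = strings.view(char_dtype)
print(chars["f0"].tolist())

# Viewing back with the string dtype recovers the original values.
roundtrip = chars.view(strings.dtype)
print(roundtrip.tolist())
```

Both dtypes have the same item size (7 bytes here), which is what makes the view legal in either direction.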

@jswhit
Collaborator

jswhit commented Mar 11, 2018

Yes, I think views are the simplest and safest way to do this. You just have to create two numpy datatypes, one with the string components represented as character arrays, and one with numpy string arrays. Just use the first (dtype1) to create the netcdf variable, and the second (dtype2) to create your data, and write the data to the variable using v[:] = data[:].view(dtype1). To read the data use data[:] = v[:].view(dtype2).

@jswhit
Collaborator

jswhit commented Mar 11, 2018

An example

from netCDF4 import Dataset
import numpy as np
f = Dataset('compound_example.nc','w')
dtype1 = np.dtype([('observation', 'f4'),
                    ('station_name','S1',80)])
dtype2 = np.dtype([('observation', 'f4'),
                    ('station_name','S80')])
station_data_t = f.createCompoundType(dtype1,'station_data')
f.createDimension('station',None)
statdat = f.createVariable('station_obs', station_data_t, ('station',))
data = np.empty(2,dtype2)
data['observation'][:] = (123.,3.14)
data['station_name'][:] = ('Boulder','New York')
statdat[:] = data.view(dtype1)
print(statdat[:])
print()
print(statdat[:].view(dtype2))
f.close()

[(123.  , ['B', 'o', 'u', 'l', 'd', 'e', 'r', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''])
 (  3.14, ['N', 'e', 'w', ' ', 'Y', 'o', 'r', 'k', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''])]

[(123.  , 'Boulder') (  3.14, 'New York')]

ncdump compound_example.nc

netcdf compound_example {
types:
  compound station_data {
    float observation ;
    char station_name(80) ;
  }; // station_data
dimensions:
	station = UNLIMITED ; // (2 currently)
variables:
	station_data station_obs(station) ;
data:

 station_obs = {123, {"Boulder"}}, {3.14, {"New York"}} ;

@jswhit
Collaborator

jswhit commented Mar 12, 2018

Pull request #778 enables this automatically, so now this works

from netCDF4 import Dataset
import numpy as np
f = Dataset('compound_example.nc','w')
dtype = np.dtype([('observation', 'f4'),
                  ('station_name','S1',80)])
station_data_t = f.createCompoundType(dtype,'station_data')
f.createDimension('station',None)
statdat = f.createVariable('station_obs', station_data_t, ('station',))
statdat._Encoding = 'ascii'
data = np.empty(2,station_data_t.dtype_view)
print(data.dtype)
data['observation'][:] = (123.,3.14)
data['station_name'][:] = ('Boulder','New York')
statdat[:] = data
print(statdat.dtype)
print(statdat[:].dtype)
print(statdat[:])
f.close()

{'names':['observation','station_name'], 'formats':['<f4','S80'], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
{'names':['observation','station_name'], 'formats':['<f4',('S1', (80,))], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
{'names':['observation','station_name'], 'formats':['<f4','S80'], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
[(123.  , 'Boulder') (  3.14, 'New York')]

@jswhit
Collaborator

jswhit commented Mar 12, 2018

I'm a little nervous about adding this extra magic. The pros are:

  1. it's only done if the _Encoding variable attribute is set, so the behavior is similar to what happens with character array variables except that a view is returned instead of a copy.
  2. simplifies user code a bit, since the user almost always wants numpy strings and not character arrays. If the user really wants the character arrays, they can just not set _Encoding.

Cons:

  1. may be confusing to get numpy data back that is not the same type as the netcdf variable.
  2. may break existing code (doubtful, since probably no one is setting _Encoding on compound types).
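The view-versus-copy point in the first pro can be illustrated with plain NumPy. This is only a sketch of the view semantics, not of what netcdf4-python does internally; the 8-character field width is chosen to keep the output short.

```python
import numpy as np

char_dtype = np.dtype([("observation", "f4"), ("station_name", "S1", 8)])
str_dtype = np.dtype([("observation", "f4"), ("station_name", "S8")])

raw = np.zeros(2, dtype=char_dtype)
as_strings = raw.view(str_dtype)  # no data is copied here

# Writing through the view changes the underlying character array.
as_strings["station_name"] = (b"Boulder", b"New York")
print(np.shares_memory(raw, as_strings))
print(raw["station_name"][0].tolist())

# An explicit copy is independent of the original buffer.
detached = as_strings.copy()
detached["station_name"][0] = b"Denver"
print(as_strings["station_name"][0])
```

Code that mutates the returned array in place would therefore also see the change in the character-array form, which is worth keeping in mind alongside the type mismatch listed under the cons.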

@jswhit
Collaborator

jswhit commented Mar 13, 2018

@jacklovell and @shoyer, I really would like your feedback on this proposed change.

@shoyer
Contributor

shoyer commented Mar 13, 2018

Remind me -- does the _Encoding attribute get set automatically?

I do think this is probably a win for usability. Most users want NumPy strings, not arrays of characters.

@jswhit
Collaborator

jswhit commented Mar 13, 2018

Pull request was updated so that if you specify 'S#' in a structured dtype when creating a netcdf compound type it automatically gets converted to ('S1',#). So now in the above example

dtype = np.dtype([('observation', 'f4'),
                  ('station_name','S1',80)])
station_data_t = f.createCompoundType(dtype,'station_data')

can be changed to

dtype = np.dtype([('observation', 'f4'),
                 ('station_name','S80')])
station_data_t = f.createCompoundType(dtype,'station_data')

_Encoding does not get set automatically to preserve backward compatibility. To get the new behavior, you have to explicitly set it.
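For anyone who wants to see what the 'S#' to ('S1',#) rewrite amounts to, here is a rough NumPy-only sketch. It is not the code from the pull request: `to_char_array_dtype` is a hypothetical name, and the sketch assumes a packed dtype, ignoring alignment and custom offsets.

```python
import numpy as np

def to_char_array_dtype(dtype):
    """Hypothetical helper: replace every 'S#' field with ('S1', #).

    Nested structured fields are handled recursively; all other
    fields (numbers, existing subarrays) are passed through unchanged.
    """
    dtype = np.dtype(dtype)
    fields = []
    for name in dtype.names:
        sub = dtype.fields[name][0]
        if sub.names is not None:
            # Nested compound type: convert its fields too.
            fields.append((name, to_char_array_dtype(sub)))
        elif sub.kind == "S" and sub.itemsize > 1 and sub.shape == ():
            # Fixed-width byte string: one S1 element per byte.
            fields.append((name, "S1", (sub.itemsize,)))
        else:
            fields.append((name, sub))
    return np.dtype(fields)

original = np.dtype([("observation", "f4"), ("station_name", "S80")])
converted = to_char_array_dtype(original)
print(converted)
```

Because the item size is unchanged, data created with `original` can be viewed as `converted` and back without copying.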

@jacklovell
Author

@jswhit This looks good. Under-the-hood conversion is better in my opinion, since it removes the need for boilerplate in users' code (particularly having to create 2 similar-but-not-quite-identical dtypes). Being able to pass in a Numpy dtype with strings and read back a Numpy dtype with strings, without having to manually convert 'S#' to ('S1', #), does make this more user friendly, I think.

@jswhit
Collaborator

jswhit commented Mar 14, 2018

Should we require the use of the _Encoding attribute to trigger the conversion (as we do for netcdf character arrays), or just make it the default for compound types?

@jacklovell
Author

Well, if _Encoding is not present we will get ValueError: Unsupported compound type element, won't we? This is the problem I initially encountered when I created this issue. So I'd suggest making it the default for compound types, if there isn't any way of doing it at the moment without the conversion to character arrays.

@jswhit
Collaborator

jswhit commented Mar 14, 2018

Right now the conversion is done if the _Encoding attribute is set - we could just remove that check and always do it. It can still be disabled using set_auto_chartostring(False).

@jacklovell
Author

I would suggest that the check should be removed. That way, the default usage (i.e. no _Encoding set by the user) would just work. It would be less user friendly to raise an Exception unless _Encoding was set, I think, as there doesn't seem to be any benefit to not setting _Encoding.

@jswhit
Collaborator

jswhit commented Mar 14, 2018

The only problem would be that if there is user code out there that is expecting numpy structured arrays with character array subtypes to be returned from the netcdf file, they will all of a sudden get structured arrays with strings back (unless they set set_auto_chartostring(False)).

@shoyer
Contributor

shoyer commented Mar 14, 2018

The work-around here is easy enough that it's probably worth adding this to set_auto_chartostring().

Note that xarray explicitly disables this by calling set_auto_chartostring(False), so we'll need a parallel fix there.

@jswhit
Collaborator

jswhit commented Mar 14, 2018

Not quite clear on what you mean by 'adding this to set_auto_chartostring' - do you mean don't check for _Encoding at all for compound types and rely on the flag set by set_auto_chartostring to control the behavior? BTW - set_auto_chartostring(True) is the library default...

@shoyer
Contributor

shoyer commented Mar 14, 2018

Not quite clear on what you mean by 'adding this to set_auto_chartostring' - do you mean don't check for _Encoding at all for compound types and rely on the flag set by set_auto_chartostring to control the behavior?

Yes, that's what I meant. If we only start writing _Encoding for structured arrays now, then I suspect the backwards-incompatibility impact will be minimal.

@jswhit
Collaborator

jswhit commented Mar 14, 2018

@dopplershift, if you have time to read over this issue I'd appreciate your input.

@jswhit
Collaborator

jswhit commented Mar 14, 2018

I updated the pull request to remove the check for _Encoding for compound types, so the conversion is always done unless it's turned off with set_auto_chartostring. Also added a section to the docs on dealing with strings.

@dopplershift
Member

I think the changes seem reasonable--especially since you can turn it off with set_auto_chartostring. I have no idea how much use compound types actually see in the wild, though.

@jswhit
Collaborator

jswhit commented Mar 15, 2018

OK, I'm going to go ahead and merge now.

jswhit added a commit that referenced this issue Mar 15, 2018
return views with numpy strings in compound types (issue #773)
@jacklovell
Author

Thanks. Just tested with the up-to-date master branch, and it works very nicely. I'll close this issue now.
