Numpy string types in compound type #773
Comments
The problem is that there is no fixed-length string datatype in netCDF. The only workaround I know of is to create a character array ('S1') with a length of 30, and then use the stringtoarr and chartostring utilities to convert back and forth between strings and arrays of characters. There's an example of this for compound types at https://github.com/Unidata/netcdf4-python/blob/master/examples/tutorial.py. In your case, you would use
dtype([('Object_type', 'S1',30), ('ID', 'S1',30), ('Version', '<i4'),
       ('basis_1', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]),
       ('basis_2', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]),
       ('centre_point', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]),
       ('width', '<f8'), ('height', '<f8'),
       ('slit_id', 'S1',30), ('slit_no', '<i4')])
I've asked for fixed-length strings to be added to netcdf-c in the past, but the idea never gained any traction (Unidata/netcdf-c#132). If you really want this, please comment on that ticket. |
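To make the character-array workaround concrete, here is a numpy-only sketch of how a 30-byte string and an array of 30 single characters relate to each other. It only illustrates the memory layout; it is not how stringtoarr or chartostring are implemented, and the value `slit_A` is made up for the example.

```python
import numpy as np

# A fixed-length byte string, NUL-padded to 30 bytes (illustrative value).
name = np.array(b"slit_A", dtype="S30")

# Reinterpret the same 30 bytes as 30 one-byte characters ('S1').
chars = name.reshape(1).view("S1")
print(chars.shape)
print(chars[:6])

# Reinterpret the characters as a single 30-byte string again.
restored = chars.view("S30")[0]
print(restored)
```

Both directions are plain reinterpretations of the same bytes, which is why the two representations can be swapped without losing data as long as the string fits in the declared length.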
BTW - it is technically possible to support vlen strings inside compound data types. I just haven't devoted the time to implement it (yet). |
Thanks for this. Using the It would be nice if there was at least an option in netcdf4-python to perform this step internally. Xarray does this (http://xarray.pydata.org/en/stable/io.html#string-encoding) for its netcdf IO, although that doesn't seem to work for compound data types yet (pydata/xarray#1977). Enabling this at the netcdf4-python level would make it easier for other packages to support what would be a very useful feature. |
We already do this for character arrays if the |
One option is to use
|
OK, numpy is doing something horrible with But hopefully you get the idea that |
Yes, I think views are the simplest and safest way to do this. You just have to create two numpy datatypes, one with the string components represented as character arrays, and one with numpy string arrays. Just use the first (dtype1) to create the netcdf variable, and the second (dtype2) to create your data, and write the data to the variable using |
An example:
from netCDF4 import Dataset
import numpy as np
f = Dataset('compound_example.nc','w')
# dtype1 holds the name as 80 single characters (what netCDF can store);
# dtype2 holds it as one 80-byte numpy string (what you work with).
dtype1 = np.dtype([('observation', 'f4'),
                   ('station_name','S1',80)])
dtype2 = np.dtype([('observation', 'f4'),
                   ('station_name','S80')])
station_data_t = f.createCompoundType(dtype1,'station_data')
f.createDimension('station',None)
statdat = f.createVariable('station_obs', station_data_t, ('station',))
data = np.empty(2,dtype2)
data['observation'][:] = (123.,3.14)
data['station_name'][:] = ('Boulder','New York')
# write through a dtype1 view of the same memory
statdat[:] = data.view(dtype1)
print(statdat[:])
print()
# read back and view the character arrays as strings
print(statdat[:].view(dtype2))
f.close()
[(123. , ['B', 'o', 'u', 'l', 'd', 'e', 'r', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''])
( 3.14, ['N', 'e', 'w', ' ', 'Y', 'o', 'r', 'k', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''])]
[(123. , 'Boulder') ( 3.14, 'New York')]
ncdump compound_example.nc
netcdf compound_example {
types:
compound station_data {
float observation ;
char station_name(80) ;
}; // station_data
dimensions:
station = UNLIMITED ; // (2 currently)
variables:
station_data station_obs(station) ;
data:
station_obs = {123, {"Boulder"}}, {3.14, {"New York"}} ; |
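A numpy-only sketch of why the view approach is safe: the two dtypes from the example above describe exactly the same bytes, so `.view()` only changes the interpretation and copies nothing. The variable names `as_chars` and `roundtrip` are just for illustration.

```python
import numpy as np

dtype1 = np.dtype([("observation", "f4"), ("station_name", "S1", 80)])
dtype2 = np.dtype([("observation", "f4"), ("station_name", "S80")])

# Same record size (4 + 80 bytes), so one can be viewed as the other.
print(dtype1.itemsize, dtype2.itemsize)

data = np.zeros(2, dtype2)
data["observation"] = (123.0, 3.14)
data["station_name"] = (b"Boulder", b"New York")

# View as character arrays: no copy is made.
as_chars = data.view(dtype1)
print(np.shares_memory(data, as_chars))
print(as_chars["station_name"].shape)

# View back as strings and check nothing was lost.
roundtrip = as_chars.view(dtype2)
print(roundtrip["station_name"])
```

Because the view shares memory with the original array, modifying one modifies the other; make a copy first if that is not what you want.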
Pull request #778 enables this automatically, so now this works:
from netCDF4 import Dataset
import numpy as np
f = Dataset('compound_example.nc','w')
dtype = np.dtype([('observation', 'f4'),
                  ('station_name','S1',80)])
station_data_t = f.createCompoundType(dtype,'station_data')
f.createDimension('station',None)
statdat = f.createVariable('station_obs', station_data_t, ('station',))
statdat._Encoding = 'ascii'
# dtype_view is the same compound type with 'S80' in place of ('S1',80)
data = np.empty(2,station_data_t.dtype_view)
print(data.dtype)
data['observation'][:] = (123.,3.14)
data['station_name'][:] = ('Boulder','New York')
statdat[:] = data
print(statdat.dtype)
print(statdat[:].dtype)
print(statdat[:])
f.close()
{'names':['observation','station_name'], 'formats':['<f4','S80'], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
{'names':['observation','station_name'], 'formats':['<f4',('S1', (80,))], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
{'names':['observation','station_name'], 'formats':['<f4','S80'], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
[(123. , 'Boulder') ( 3.14, 'New York')]
|
I'm a little nervous about adding this extra magic. The pros are:
Cons:
|
@jacklovell and @shoyer, I really would like your feedback on this proposed change. |
Remind me -- does the I do think this is probably a win for usability. Most users want NumPy strings, not arrays of characters. |
Pull request was updated so that if you specify 'S#' in a structured dtype when creating a netcdf compound type it automatically gets converted to ('S1',#). So now in the above example
can be changed to
|
@jswhit This looks good. Under-the-hood conversion is better in my opinion, since it removes the need for boilerplate in users' codes (particularly having to create 2 similar-but-not-quite-identical dtypes). Being able to pass in a Numpy dtype with strings and read back a Numpy dtype with strings, without having to manually convert 'S#' to ('S1', #), does make this more user friendly, I think. |
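For anyone wanting to see what the 'S#' to ('S1', #) conversion amounts to, here is a rough numpy-only sketch. `strings_to_chararrays` is a hypothetical helper written for this comment, not the code in the pull request, and it assumes a packed dtype (custom offsets and alignment padding are not preserved).

```python
import numpy as np

def strings_to_chararrays(dt):
    """Hypothetical helper: rebuild structured dtype `dt` with every 'S#'
    field replaced by ('S1', #), recursing into nested compound fields."""
    fields = []
    for name in dt.names:
        sub = dt.fields[name][0]
        if sub.names is not None:
            # nested compound type: convert its fields too
            fields.append((name, strings_to_chararrays(sub)))
        elif sub.kind == "S" and sub.itemsize > 1:
            # fixed-length byte string: N bytes become N single characters
            fields.append((name, "S1", (sub.itemsize,)))
        else:
            # numeric fields and existing character arrays pass through
            fields.append((name, sub))
    return np.dtype(fields)

src = np.dtype([("ID", "S30"), ("Version", "<i4"),
                ("centre_point", [("x", "<f8"), ("y", "<f8"), ("z", "<f8")])])
dst = strings_to_chararrays(src)
print(dst)
```

Since the converted dtype has the same item size as the original, an array created with `src` can be passed along as `arr.view(dst)` without copying.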
Should we require the use of the |
Well, if |
Right now the conversion is done if the |
I would suggest that the check should be removed. That way, the default usage (i.e. no |
The only problem would be that if there is user code out there that is expecting numpy structured arrays with character array subtypes to be returned from the netcdf file, they will all of a sudden get structured arrays with strings back (unless they set |
The work-around here is easy enough that it's probably worth adding this to Note that xarray does explicitly disable |
Not quite clear on what you mean by 'adding this to set_auto_chartostring' - do you mean don't check for _Encoding at all for compound types and rely on the flag set by set_auto_chartostring to control the behavior? BTW - set_auto_chartostring(True) is the library default... |
Yes, that's what I meant. If only start writing |
@dopplershift, if you have time to read over this issue I'd appreciate your input. |
I updated the pull request to remove the check for |
I think the changes seem reasonable--especially since you can turn it off with |
OK, I'm going to go ahead and merge now. |
return views with numpy strings in compound types (issue #773)
Thanks. Just tested with the up-to-date master branch, and it works very nicely. I'll close this issue now. |
I have the following Numpy dtype for an array:
When I try to create a NetCDF compound type from this dtype, I get the following error:
I understand that it is not possible to create compound types which include variable-length arrays, but the Numpy '<S30' type should always be 30 bytes, not a variable length. This is actually why I'm using S30 and not U30, since the latter may have multi-byte characters.
Is it possible to add support for the S30 dtype in netcdf4-python? Otherwise, is there another way that I can include fixed length string types in a NetCDF compound type?