JSON file cannot be written as HDF5 due to axis metadata #709

Closed
jairideout opened this issue Sep 23, 2016 · 7 comments

@jairideout
Member

I have a valid JSON .biom file that cannot be written in HDF5 format due to its axis metadata. I've attached a file to reproduce the issue. I had to name it with a .txt extension so I could upload it to GitHub. AFAIK, this JSON .biom file was generated with QIIME 1 (it's used as test data in q2-types). It has observation metadata containing taxonomy assignments, where each taxonomic assignment is stored as a single string.

Tested with biom-format 2.1.5.

In [1]: import biom

In [2]: import json

In [3]: import h5py

In [4]: with open('feature-table.biom', 'r') as fh:
   ...:     t = biom.Table.from_json(json.load(fh))
   ...:

In [5]: t
Out[5]: 658 x 104 <class 'biom.table.Table'> with 5670 nonzero entries (8% dense)

In [6]: t.metadata(axis='observation')[:5]
Out[6]:
(defaultdict(<function biom.table.Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda>>,
             {'taxonomy': 'k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Xanthomonadales; f__Xanthomonadaceae; g__Stenotrophomonas'}),
 defaultdict(<function biom.table.Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda>>,
             {'taxonomy': 'k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae; g__Dialister; s__'}),
 defaultdict(<function biom.table.Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda>>,
             {'taxonomy': 'k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__; g__; s__'}),
 defaultdict(<function biom.table.Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda>>,
             {'taxonomy': 'k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae'}),
 defaultdict(<function biom.table.Table._cast_metadata.<locals>.cast_metadata.<locals>.<lambda>>,
             {'taxonomy': 'k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Tepidimonas; s__'}))

In [7]: with h5py.File('hdf5.biom', mode='w') as fh:
   ...:     t.to_hdf5(fh, generated_by='me')
   ...:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-1b33a03b776d> in <module>()
      1 with h5py.File('hdf5.biom', mode='w') as fh:
----> 2     t.to_hdf5(fh, generated_by='me')
      3

/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/biom/table.py in to_hdf5(self, h5grp, generated_by, compress, format_fs)
   3603                   self.ids(axis='observation'),
   3604                   self.metadata(axis='observation'),
-> 3605                   self.group_metadata(axis='observation'), 'csr', compression)
   3606         axis_dump(h5grp.create_group('sample'), self.ids(),
   3607                   self.metadata(), self.group_metadata(), 'csc', compression)

/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/biom/table.py in axis_dump(grp, ids, md, group_md, order, compression)
   3575                     # Create the dataset for the current category,
   3576                     # putting values in id order
-> 3577                     formatter[category](grp, category, md, compression)
   3578
   3579             # Create the group for the group metadata

/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/biom/table.py in vlen_list_of_str_formatter(grp, header, md, compression)
    275             continue
    276         value = np.asarray(m[header])
--> 277         data[i, :len(value)] = [v.encode('utf8') for v in value]
    278     # Change the None entries on data to empty strings ""
    279     data = np.where(data == np.array(None), "", data)

TypeError: iteration over a 0-d array

In [8]:
@jairideout jairideout added the bug label Sep 23, 2016
@jairideout
Member Author

I also verified all observations have a "taxonomy" key in their metadata, and that each "taxonomy" value is a string. If taxonomy metadata was missing for an observation, the metadata defaultdict would have returned None, which is not the case here.

@wasade
Member

wasade commented Sep 26, 2016

The issue is that "taxonomy" is a special keyword for formatting and parsing with BIOM 2.1, and it expects the values associated with that key to be a list of str. A custom formatter can be passed to override the default handling.

The table's generated_by string is null, which implies the table was not written by QIIME 1, since its writer sets a defined generated_by string. Since the generated_by string is null, I'm assuming the table was generated manually, so I don't believe the scope of impact to be wide.

I'm removing the bug label: this isn't a bug, because the data are represented outside of the documented expectations. However, I'm marking this as a proposed addition, specifically: catch the case where taxonomy comes in as a str and handle it gracefully.

In [18]: t = biom.load_table('feature-table.biom')

In [19]: for md in t.metadata(axis='observation'):
    ...:     md['taxonomy'] = md['taxonomy'].split('; ')
    ...:

In [20]: with h5py.File('asd.foo', 'w') as fp:
    ...:     t.to_hdf5(fp, 'asd')
    ...:
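
For tables where it is not known ahead of time whether "taxonomy" is stored as a single string or as a list, the workaround above can be made a little more defensive. The following is only a sketch: `split_taxonomy` is a hypothetical helper, not part of biom-format, and the plain list of dicts stands in for what `t.metadata(axis='observation')` returns. Unlike the `split('; ')` call above, it splits on `;` and strips surrounding whitespace, which is an assumption about how the strings are delimited.

```python
def split_taxonomy(value, sep=';'):
    """Return taxonomy as a list of str (hypothetical helper, not a biom API).

    - None is passed through unchanged (no taxonomy for this observation).
    - A str is split on `sep` and each level is stripped of whitespace.
    - Any other iterable is copied into a list as-is.
    """
    if value is None:
        return None
    if isinstance(value, str):
        return [level.strip() for level in value.split(sep) if level.strip()]
    return list(value)


# Stand-in for t.metadata(axis='observation'): one dict per observation.
observation_md = [
    {'taxonomy': 'k__Bacteria; p__Firmicutes; c__Clostridia'},
    {'taxonomy': ['k__Bacteria', 'p__Proteobacteria']},
    {'taxonomy': None},
]

# Normalize in place, leaving entries that are already lists untouched.
for md in observation_md:
    md['taxonomy'] = split_taxonomy(md['taxonomy'])
```

With a real table, the same loop would run over `t.metadata(axis='observation')` before calling `to_hdf5`, as in the session above.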

@wasade
Member

wasade commented Sep 26, 2016

Relabeling as a bug. It turns out that vlen_list_of_str_formatter was treating a str as valid because it is Iterable, resulting in the obscure error message in the traceback rather than more graceful handling.
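
To illustrate why an Iterable check lets a str through, here is a minimal sketch. Both functions are hypothetical and are not the checks biom-format actually uses; they only show the difference between "is iterable" and "is a list of str". A str that passes the looser check later reaches `np.asarray`, which turns it into a 0-d array, and iterating over that raises the TypeError seen in the traceback.

```python
from collections.abc import Iterable


def looks_iterable(value):
    """Loose check: True for lists, tuples, and also for str."""
    return isinstance(value, Iterable)


def is_list_of_str(value):
    """Stricter check: reject a bare str (or bytes) before testing elements."""
    if isinstance(value, (str, bytes)):
        return False
    return isinstance(value, Iterable) and all(isinstance(v, str) for v in value)


taxonomy_str = 'k__Bacteria; p__Firmicutes; c__Clostridia'
taxonomy_list = taxonomy_str.split('; ')

# The loose check cannot tell the two representations apart.
loose_results = (looks_iterable(taxonomy_str), looks_iterable(taxonomy_list))

# The stricter check accepts only the list form.
strict_results = (is_list_of_str(taxonomy_str), is_list_of_str(taxonomy_list))
```

A check along the lines of `is_list_of_str` would let a writer raise a clear error, or split the string, before the value reaches the formatter.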

@wasade wasade mentioned this issue Sep 26, 2016
@jairideout
Member Author

Thanks for explaining, I didn't know the "taxonomy" key was a special case in 2.1. Maybe worth documenting in the 2.1 format spec?

@wasade
Member

wasade commented Sep 26, 2016

It's not special to the spec, but to this implementation.

@jairideout
Member Author

OK.

@wasade
Member

wasade commented Sep 26, 2016

Thanks for the quick review!
