
Accept metabolite (GNPS) feature metadata, and validate this in tests #49

Closed
fedarko opened this issue Feb 21, 2019 · 2 comments · Fixed by #245
Assignees
Labels
metabolomics ensuring that metabolite data is supported and useful

Comments

fedarko commented Feb 21, 2019

At least with the Red Sea metabolite data used in the QIIME 2 Songbird tutorial, this seems to just be a TSV file where each line contains a metabolite (feature) ID, a "cluster," and a description. I don't know if the "cluster" is important (maybe?), but in any case, making this work should be as simple as adjusting the code in rankratioviz.generate.process_input() that reads taxam (feature metadata) to create labels for each taxon/metabolite (and using the correct column names for the metabolites). It might be easiest to just have the user pass in a --metabolite-data flag or something, so we know how to read the feature metadata file properly [1].

Update: yeah, these features are KEGG orthologs, not metabolites. My bad for not reading that paper in detail. But it's still cool that this tool supports those types of features!

So for legit metabolomics feature metadata, from what I can tell, what we actually want to do involves handling the feature ID used in the BIOM table/ranks input as a combination of the mass-to-charge ratio and retention time, and extracting the Library ID from a feature metadata file in which both parts of the feature ID are their own columns.

To validate this, we should at least

  1. include metabolite data in the test suite (Add tests/validation #2) and check that labels are being properly created, and
  2. check this against lots of metabolite data to ensure that we're parsing things according to whatever standards for metabolite data file formats/metabolite feature IDs/etc. exist, and that this tool is robust enough to be useful in all of these contexts.
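A minimal sketch of the matching described above, assuming (hypothetically) that the GNPS feature metadata file has "parent mass", "RTConsensus", and "LibraryID" columns and that BIOM-table feature IDs look like "mz;rt" — the actual column names and ID format would need to be checked against real GNPS exports:

```python
import csv
import io

# Hypothetical GNPS-style feature metadata; real exports may use
# different column names.
SAMPLE_TSV = """parent mass\tRTConsensus\tLibraryID
200.01\t3.52\tCompound A
451.20\t7.88\tCompound B
"""

def match_gnps_metadata(tsv_text):
    """Map 'mz;rt'-style feature IDs to GNPS Library IDs.

    Assumes feature IDs in the BIOM table / ranks input are the
    mass-to-charge ratio and retention time joined by ';'.
    """
    id_to_library = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        feature_id = "{};{}".format(row["parent mass"], row["RTConsensus"])
        id_to_library[feature_id] = row["LibraryID"]
    return id_to_library
```

A test suite could build tables of expected (feature ID, Library ID) pairs from known inputs like this and assert the labels in the generated JSON match.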
@fedarko fedarko self-assigned this Feb 21, 2019
@fedarko fedarko added the metabolomics ensuring that metabolite data is supported and useful label Feb 21, 2019
fedarko added a commit that referenced this issue Feb 27, 2019
There's a lot to describe here. Let me try to go through it.

rankratioviz/generate.py
-Added more documentation to matchdf().
-Added various assertions that should prevent samples and features
 from being dropped (#54). I think that our use of matchdf(), in
 addition to these assertions, should mean that #54 is done -- but
 I'd like to add some tests that verify that.
-To support these assertions, sample metadata is now a parameter of
 process_input().
-Completely rewrote the feature metadata label creation code. It
 should generalize now to any sort of feature metadata, at the cost
 of not being as pretty as the old taxonomy label code. I might add
 that back in the future, and/or add some new label creation code
 to handle metabolite data more carefully (e.g. assuming that the
 feature description includes the feature ID) (#49).
        -The reason I rewrote this was in order to prevent features
         from being dropped just because they lacked feature metadata
         (this was the case until now) (#54). This is testable on the
         Red Sea dataset songbird uses, for which 172 (iirc) features
         have no associated metadata (those features are just assigned
         their ID by itself now, instead of being completely dropped).
-Didn't call .T on the table twice at the start of process_input() (I
 don't know why this was originally done, but it doesn't do anything,
 and not doing it saves a tiny bit of time).
-I think the changes to the feature metadata label creation code also
 may have made the new table column assignments more accurate -- it
 seems like the old code assumed that the table columns had the same
 order as the feature metadata "Taxon_" values -- but I'm not 100% sure
 of that. In any case, it's correct now. (We'll eventually double-check
 the correctness of all the count data in the sample JSON data when #2
 rolls around, so this will be doubly ensured to be correct.)
-Added a small sanity test to process_input() to ensure that adding
 feature metadata didn't make some feature IDs the same (should never
 happen but you never know).
-To fix a bug with the red sea data, made the default rank column
 explicitly a numeric dtype (see #62 for a retrospective and details).
 More work on this sorta front remains to be done.
-Some small aesthetic tweaks to the interface: "Ranks" -> "Feature
 Ranks" for the rank plot title, and adding back "x" to the feature
 tooltips.
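The generalized label-creation behavior above can be sketched roughly like this (function name and data shapes are illustrative, not the actual generate.py code): metadata fields get joined onto the ID with "|", features without metadata keep their bare ID instead of being dropped (#54), and a sanity check guards against label collisions.

```python
def make_feature_labels(feature_ids, feature_metadata):
    """Sketch: feature_metadata maps feature ID -> list of metadata values."""
    labels = {}
    for fid in feature_ids:
        fields = feature_metadata.get(fid)
        if fields:
            # Append all metadata fields to the ID, pipe-separated.
            labels[fid] = "|".join([fid] + [str(f) for f in fields])
        else:
            # No metadata for this feature: keep its ID, don't drop it.
            labels[fid] = fid
    # Sanity check: annotation shouldn't collapse two features into one label.
    assert len(set(labels.values())) == len(feature_ids)
    return labels
```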

rankratioviz/q2/_method.py
-Added logic to accept either a DataFrame or OrdinationResults as
 input, which promptly didn't work (I got the error discussed here:
 https://forum.qiime2.org/t/plugin-error-issubclass-arg-1-must-be-a-class/2774).
 Changing the ranks type from a Union[DF, OrdResults] to just one
 of the two works, so the problem is with the Union. I'll try to get it
 fixed soon.
-Added logic to process these files and generate a feature_ranks
 DataFrame. Nothing too fancy.
-Adjusted the code a tiny bit to pass df_sample_metadata to
 process_input(), as discussed above.
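Until the Union registration issue is sorted out, one hypothetical workaround is to normalize both input types to a DataFrame up front. The helper name is made up, and the use of a `.features` attribute (which is where scikit-bio's OrdinationResults keeps feature loadings) is an assumption about the input, not code from this commit:

```python
class FakeOrdinationResults:
    """Stand-in for scikit-bio's OrdinationResults, just for this sketch."""
    def __init__(self, features):
        self.features = features

def ranks_to_dataframe(ranks):
    # OrdinationResults-like objects carry feature loadings in .features;
    # songbird differentials are already a plain DataFrame, so pass those
    # through unchanged.
    if hasattr(ranks, "features"):
        return ranks.features
    return ranks
```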

rankratioviz/q2/plugin_setup.py
-Added a substantial amount of code to check if songbird is available
 and, if so, accept the FeatureData[Differential] input. I'd like for
 this to not be needed, but apparently it is (see
 https://github.com/mortonjt/songbird/pull/34). If we have to do this,
 I can add songbird as an "extras" dependency in setup.py. So that's
 another thing to do for #51.
-Also added some logic to update the --i-ranks description depending
 on whether or not songbird input is acceptable.
-This logic isn't super testable, since my conda/Q2 installation has a
 habit of keeping references to uninstalled plugins (even when I
 refresh the cache) -- so testing this is kind of frustrating.
 I'll think about it as I move past getting the basic functionality
 of #51 out of the way.
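The availability check plus the conditional --i-ranks description could look something like this optional-dependency guard (names are illustrative, not the actual plugin_setup.py code):

```python
import importlib.util

def songbird_available():
    """True if songbird is importable in the current environment."""
    return importlib.util.find_spec("songbird") is not None

# Hypothetical: adjust the ranks parameter description based on whether
# songbird's FeatureData[Differential] type can be accepted.
ranks_description = (
    "Feature rankings (songbird differentials are accepted)"
    if songbird_available()
    else "Feature rankings (ordination results only)"
)
```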

rankratioviz/scripts/_plot.py
-Some changes to work with the _rank_processing module.
-Renamed "taxonomy" to "feature_metadata" (makes the code a lot
 clearer).
-Adjustments to pass in sample metadata to generate.process_input().
-I ended up just treating the first column of the feature metadata
 file as the index by default, instead of searching for the 'feature
 id' column and using that -- this should be functionally equivalent
 for the old test data, and far more robust. Also, it follows QIIME 2's
 metadata file conventions.
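That first-column-as-index convention is easy to sketch with the standard library (with pandas, `read_csv(path, sep="\t", index_col=0)` would be the one-liner equivalent; the function name here is hypothetical):

```python
import csv
import io

def read_feature_metadata(tsv_text):
    """Parse a TSV, treating the first column as the feature ID index
    regardless of its header name (per QIIME 2's metadata conventions)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    index = {}
    for row in reader:
        # First cell is the feature ID; remaining cells map onto the
        # remaining header columns.
        index[row[0]] = dict(zip(header[1:], row[1:]))
    return index
```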

OK! I think that is everything for now. Sorry this message ended up
being so long. Probably should've done more atomic commits, but
everything sort of ended up relying on everything else and it got to the
point where any commit of individual files would've just broken the
tests. GRANTED: I think I already did that tonight, but whatever, you
get the point.
@fedarko fedarko changed the title Accept metabolite feature metadata Accept metabolite feature metadata, and validate this in tests Apr 25, 2019
fedarko commented Apr 25, 2019

Currently, rankratioviz "annotates" feature IDs by adding on all feature metadata fields to those IDs where feature metadata has been given, separated by | (pipe characters). The remaining TODO here is altering the python test suite to check that this is being done correctly (in the rank plot JSON, and in the sample plot JSON w/r/t feature column IDs and corresponding count info).

Some feature metadata annotation is sorta being done in the main matching integration test. See about moving that into the main testing_utilities so it's done on every integration test by default.

(This is also a good time to beef up the sample plot JSON validation.)

Once that's done, we will be able to close this.

fedarko added a commit that referenced this issue Apr 25, 2019
Need to validate that feature metadata annotation works properly,
then #49 will be taken care of.
@fedarko fedarko changed the title Accept metabolite feature metadata, and validate this in tests Accept metabolite (GNPS) feature metadata, and validate this in tests May 11, 2019
fedarko added a commit that referenced this issue May 11, 2019
Progress towards #49.

Might make this standalone-rrv-exclusive for now?
fedarko commented May 13, 2019

GNPS support is tentatively ready. Once more tests are in that verify that this works, I'm content to close this issue. (May be worth adding another issue for Q2 GNPS feature metadata support, but that's another story.)

fedarko added a commit that referenced this issue Jul 8, 2019
Obviously GNPS feature metadata support is still mostly
untested/experimental, but this is a decent start.
fedarko added a commit to fedarko/qurro that referenced this issue Sep 6, 2019
You can still use GNPS data with Qurro! Now you'll just have to supply
"normal" data.

This removes all of that experimental code, which is nice.