
Accept metabolite (GNPS) feature metadata, and validate this in tests #49

Closed
fedarko opened this issue Feb 21, 2019 · 2 comments · Fixed by #245
Assignees
Labels
metabolomics ensuring that metabolite data is supported and useful

Comments

fedarko commented Feb 21, 2019

At least with the Red Sea metabolite data used in the QIIME 2 Songbird tutorial, this seems to just be a TSV file where each line contains a metabolite (feature) ID, a "cluster," and a description. I don't know if the "cluster" is important (maybe?), but in any case, making this work should be as simple as adjusting the code in rankratioviz.generate.process_input() that reads taxam (feature metadata) to create labels for each taxon/metabolite (and using the correct column names for the metabolites). It might be easiest to just have the user pass in a --metabolite-data flag or something, so we know how to read the feature metadata file properly [1].

Update: yeah, these features are KEGG orthologs, not metabolites. My bad for not reading that paper in detail. But it's still cool that this tool supports those types of features!

So for legit metabolomics feature metadata, from what I can tell, what we actually want to do involves handling the feature ID used in the BIOM table/ranks input as a combination of the mass-to-charge ratio and retention time, and extracting the Library ID from a feature metadata file in which both parts of the feature ID are their own columns.

To validate this, we should at least

  1. include metabolite data in the test suite (Add tests/validation #2) and check that labels are being properly created, and
  2. check this against lots of metabolite data to ensure that we're parsing things according to whatever standards for metabolite data file formats/metabolite feature IDs/etc. exist, and that this tool is robust enough to be useful in all of these contexts.
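A minimal sketch of the matching described above, assuming (hypothetically) that the GNPS feature metadata file has "parent mass", "RTConsensus", and "LibraryID" columns and that BIOM-table feature IDs look like "mz;rt" — the actual column names and ID format would need to be checked against real GNPS exports:

```python
import csv
import io

# Hypothetical GNPS-style feature metadata; real exports may use
# different column names.
SAMPLE_TSV = """parent mass\tRTConsensus\tLibraryID
200.01\t3.52\tCompound A
451.20\t7.88\tCompound B
"""

def match_gnps_metadata(tsv_text):
    """Map 'mz;rt'-style feature IDs to GNPS Library IDs.

    Assumes feature IDs in the BIOM table / ranks input are the
    mass-to-charge ratio and retention time joined by ';'.
    """
    id_to_library = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        feature_id = "{};{}".format(row["parent mass"], row["RTConsensus"])
        id_to_library[feature_id] = row["LibraryID"]
    return id_to_library
```

A test suite could build tables of expected (feature ID, Library ID) pairs from known inputs like this and assert the labels in the generated JSON match.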
@fedarko fedarko self-assigned this Feb 21, 2019
@fedarko fedarko added the metabolomics ensuring that metabolite data is supported and useful label Feb 21, 2019
fedarko added a commit that referenced this issue Feb 27, 2019
There's a lot to describe here. Let me try to go through it.

rankratioviz/generate.py
-Added more documentation to matchdf().
-Added various assertions that should prevent samples and features
 from being dropped (#54). I think that our use of matchdf(), in
 addition to these assertions, should mean that #54 is done -- but
 I'd like to add some tests that verify that.
-To support these assertions, sample metadata is now a parameter of
 process_input().
-Completely rewrote the feature metadata label creation code. It
 should generalize now to any sort of feature metadata, at the cost
 of not being as pretty as the old taxonomy label code. I might add
 that back in the future, and/or add some new label creation code
 to handle metabolite data more carefully (e.g. assuming that the
 feature description includes the feature ID) (#49).
        -The reason I rewrote this was in order to prevent features
         from being dropped just because they lacked feature metadata
         (this was the case until now) (#54). This is testable on the
         Red Sea dataset songbird uses, for which 172 (iirc) features
         have no associated metadata (those features are just assigned
         their ID by itself now, instead of being completely dropped).
-Didn't call .T on the table twice at the start of process_input() (I
 don't know why this was originally done, but it doesn't do anything,
 and not doing it saves a tiny bit of time).
-I think the changes to the feature metadata label creation code also
 may have made the new table column assignments more accurate -- it
 seems like the old code assumed that the table columns had the same
 order as the feature metadata "Taxon_" values -- but I'm not 100% sure
 of that. In any case, it's correct now. (We'll eventually double-check
 the correctness of all the count data in the sample JSON data when #2
 rolls around, so this will be doubly ensured to be correct.)
-Added a small sanity test to process_input() to ensure that adding
 feature metadata didn't make some feature IDs the same (should never
 happen but you never know).
-To fix a bug with the red sea data, made the default rank column
 explicitly a numeric dtype (see #62 for a retrospective and details).
 More work on this sorta front remains to be done.
-Some small aesthetic tweaks to the interface: "Ranks" -> "Feature
 Ranks" for the rank plot title, and adding back "x" to the feature
 tooltips.
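The generalized label-creation behavior above can be sketched roughly like this (function name and data shapes are illustrative, not the actual generate.py code): metadata fields get joined onto the ID with "|", features without metadata keep their bare ID instead of being dropped (#54), and a sanity check guards against label collisions.

```python
def make_feature_labels(feature_ids, feature_metadata):
    """Sketch: feature_metadata maps feature ID -> list of metadata values."""
    labels = {}
    for fid in feature_ids:
        fields = feature_metadata.get(fid)
        if fields:
            # Append all metadata fields to the ID, pipe-separated.
            labels[fid] = "|".join([fid] + [str(f) for f in fields])
        else:
            # No metadata for this feature: keep its ID, don't drop it.
            labels[fid] = fid
    # Sanity check: annotation shouldn't collapse two features into one label.
    assert len(set(labels.values())) == len(feature_ids)
    return labels
```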

rankratioviz/q2/_method.py
-Added logic to accept either a DataFrame or OrdinationResults as
 input, which promptly didn't work (I got the error discussed here:
 https://forum.qiime2.org/t/plugin-error-issubclass-arg-1-must-be-a-class/2774).
 Changing the ranks type from a Union[DF, OrdResults] to just one
 of the two works, so the problem is with the Union. I'll try to get it
 fixed soon.
-Added logic to process these files and generate a feature_ranks
 DataFrame. Nothing too fancy.
-Adjusted the code a tiny bit to pass df_sample_metadata to
 process_input(), as discussed above.
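Until the Union registration issue is sorted out, one hypothetical workaround is to normalize both input types to a DataFrame up front. The helper name is made up, and the use of a `.features` attribute (which is where scikit-bio's OrdinationResults keeps feature loadings) is an assumption about the input, not code from this commit:

```python
class FakeOrdinationResults:
    """Stand-in for scikit-bio's OrdinationResults, just for this sketch."""
    def __init__(self, features):
        self.features = features

def ranks_to_dataframe(ranks):
    # OrdinationResults-like objects carry feature loadings in .features;
    # songbird differentials are already a plain DataFrame, so pass those
    # through unchanged.
    if hasattr(ranks, "features"):
        return ranks.features
    return ranks
```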

rankratioviz/q2/plugin_setup.py
-Added a substantial amount of code to check if songbird is available
 and, if so, accept the FeatureData[Differential] input. I'd like for
 this to not be needed, but apparently it is (see
 https://github.com/mortonjt/songbird/pull/34). If we have to do this,
 I can add songbird as an "extras" dependency in setup.py. So that's
 another thing to do for #51.
-Also added some logic to update the --i-ranks description depending
 on whether or not songbird input is acceptable.
-This logic isn't super testable, since my conda/Q2 installation has a
 habit of keeping references to uninstalled plugins (even when I
 refresh the cache) -- so testing this is kind of frustrating.
 I'll think about it as I move past getting the basic functionality
 of #51 out of the way.
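The availability check plus the conditional --i-ranks description could look something like this optional-dependency guard (names are illustrative, not the actual plugin_setup.py code):

```python
import importlib.util

def songbird_available():
    """True if songbird is importable in the current environment."""
    return importlib.util.find_spec("songbird") is not None

# Hypothetical: adjust the ranks parameter description based on whether
# songbird's FeatureData[Differential] type can be accepted.
ranks_description = (
    "Feature rankings (songbird differentials are accepted)"
    if songbird_available()
    else "Feature rankings (ordination results only)"
)
```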

rankratioviz/scripts/_plot.py
-Some changes to work with the _rank_processing module.
-Renamed "taxonomy" to "feature_metadata" (makes the code a lot
 clearer).
-Adjustments to pass in sample metadata to generate.process_input().
-I ended up just treating the first column of the feature metadata
 file as the index by default, instead of searching for the 'feature
 id' column and using that -- this should be functionally equivalent
 for the old test data, and far more robust. Also, it follows QIIME 2's
 metadata file conventions.
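That first-column-as-index convention is easy to sketch with the standard library (with pandas, `read_csv(path, sep="\t", index_col=0)` would be the one-liner equivalent; the function name here is hypothetical):

```python
import csv
import io

def read_feature_metadata(tsv_text):
    """Parse a TSV, treating the first column as the feature ID index
    regardless of its header name (per QIIME 2's metadata conventions)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    index = {}
    for row in reader:
        # First cell is the feature ID; remaining cells map onto the
        # remaining header columns.
        index[row[0]] = dict(zip(header[1:], row[1:]))
    return index
```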

OK! I think that is everything for now. Sorry this message ended up
being so long. Probably should've done more atomic commits, but
everything sort of ended up relying on everything else and it got to the
point where any commit of individual files would've just broken the
tests. GRANTED: I think I already did that tonight, but whatever, you
get the point.
@fedarko fedarko changed the title Accept metabolite feature metadata Accept metabolite feature metadata, and validate this in tests Apr 25, 2019
fedarko commented Apr 25, 2019

Currently, rankratioviz "annotates" feature IDs by adding on all feature metadata fields to those IDs where feature metadata has been given, separated by | (pipe characters). The remaining TODO here is altering the python test suite to check that this is being done correctly (in the rank plot JSON, and in the sample plot JSON w/r/t feature column IDs and corresponding count info).

Some feature metadata annotation is sorta being done in the main matching integration test. See about moving that into the main testing_utilities so it's done on every integration test by default.

(This is also a good time to beef up the sample plot JSON validation.)

Once that's done, we will be able to close this.

fedarko added a commit that referenced this issue Apr 25, 2019
Need to validate that feature metadata annotation works properly,
then #49 will be taken care of.
@fedarko fedarko changed the title Accept metabolite feature metadata, and validate this in tests Accept metabolite (GNPS) feature metadata, and validate this in tests May 11, 2019
fedarko added a commit that referenced this issue May 11, 2019
Progress towards #49.

Might make this standalone-rrv-exclusive for now?
fedarko commented May 13, 2019

GNPS support is tentatively ready. Once more tests are in that verify that this works, I'm content to close this issue. (May be worth adding another issue for Q2 GNPS feature metadata support, but that's another story.)

fedarko added a commit that referenced this issue Jul 8, 2019
Obviously GNPS feature metadata support is still mostly
untested/experimental, but this is a decent start.
fedarko added a commit to fedarko/qurro that referenced this issue Sep 6, 2019
You can still use GNPS data with Qurro! Now you'll just have to supply
"normal" data.

This removes all of that experimental code, which is nice.