double-check data/dataframe types in code #62

Open
fedarko opened this issue Feb 27, 2019 · 4 comments
Assignees
Labels
optimization Making code faster or cleaner

Comments

fedarko (Collaborator) commented Feb 27, 2019

It turns out the DataFrame has been storing all of the ranks as objects this whole time. The problem is that object columns are sorted in lexicographic (alphabetical) order rather than numerically, so you'd see this with the Red Sea ranks (at least until I fixed the problem)...
[image: legitimate_rank_plot]

I'm pretty sure this isn't a problem anywhere else in the application -- the data is converted to numeric types when it's needed -- but I'd like to be sure of that. This issue involves adding tests to validate the types of the various DataFrames used, and potentially adding some assertions as well.
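
As a minimal sketch of both the lexicographic-sort problem and the kind of dtype assertion this issue asks for (rank_df and its column name are made up for illustration):

```python
import pandas as pd

# Hypothetical feature-ranks DataFrame whose values were read in as
# strings, leaving the column with an object dtype.
rank_df = pd.DataFrame({"rank_0": ["10.5", "-2.1", "3.0"]})

# Object columns sort lexicographically, not numerically:
print(rank_df["rank_0"].sort_values().tolist())  # ['-2.1', '10.5', '3.0']

# Coerce to a numeric dtype, then assert it so regressions get caught:
rank_df["rank_0"] = pd.to_numeric(rank_df["rank_0"])
assert pd.api.types.is_numeric_dtype(rank_df["rank_0"])
print(rank_df["rank_0"].sort_values().tolist())  # [-2.1, 3.0, 10.5]
```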

@fedarko fedarko added the optimization Making code faster or cleaner label Feb 27, 2019
@fedarko fedarko self-assigned this Feb 27, 2019
fedarko added a commit that referenced this issue Feb 27, 2019
There's a lot to describe here. Let me try to go through it.

rankratioviz/generate.py
-Added more documentation to matchdf().
-Added various assertions that should prevent samples and features
 from being dropped (#54). I think that our use of matchdf(), in
 addition to these assertions, should mean that #54 is done -- but
 I'd like to add some tests that verify that. (A rough sketch of
 this matching-plus-assertion pattern is included after this commit
 message.)
-To support these assertions, sample metadata is now a parameter of
 process_input().
-Completely rewrote the feature metadata label creation code. It
 should generalize now to any sort of feature metadata, at the cost
 of not being as pretty as the old taxonomy label code. I might add
 that back in the future, and/or add some new label creation code
 to handle metabolite data more carefully (e.g. assuming that the
 feature description includes the feature ID) (#49).
        -The reason I rewrote this was to prevent features from being
         dropped just because they lacked feature metadata (which was
         the case until now) (#54). This is testable on the Red Sea
         dataset songbird uses, for which 172 (IIRC) features have no
         associated metadata (those features are now just assigned
         their ID by itself, instead of being completely dropped).
-Stopped calling .T on the table twice at the start of process_input()
 (I don't know why this was originally done; it doesn't do anything,
 and skipping it saves a tiny bit of time).
-I think the changes to the feature metadata label creation code also
 may have made the table's new column assignments more accurate -- it
 seems like the old code assumed that the table columns had the same
 order as the feature metadata "Taxon_" values -- but I'm not 100% sure
 of that. In any case, it's correct now. (We'll eventually double-check
 the correctness of all the count data in the sample JSON data when #2
 rolls around, so this will be doubly ensured to be correct.)
-Added a small sanity test to process_input() to ensure that adding
 feature metadata didn't make some feature IDs the same (should never
 happen but you never know).
-To fix a bug with the Red Sea data, made the default rank column
 explicitly a numeric dtype (see #62 for a retrospective and details).
 More work on this front remains to be done.
-Some small aesthetic tweaks to the interface: "Ranks" -> "Feature
 Ranks" for the rank plot title, and adding back "x" to the feature
 tooltips.

rankratioviz/q2/_method.py
-Added logic to accept either a DataFrame or OrdinationResults as
 input, which promptly didn't work (I got the error discussed here:
 https://forum.qiime2.org/t/plugin-error-issubclass-arg-1-must-be-a-class/2774).
 Changing the ranks type from a Union[DF, OrdResults] to just one
 of the two works, so the problem is with the Union annotation (see
 the sketch after this section). I'll try to get it fixed soon.
-Added logic to process these files and generate a feature_ranks
 DataFrame. Nothing too fancy.
-Adjusted the code a tiny bit to pass df_sample_metadata into
 process_input(), as discussed above.
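
A minimal sketch of the annotation difference mentioned above (the function names here are made up; only the ranks parameter matters):

```python
from typing import Union

import pandas as pd
from skbio import OrdinationResults

# Registering a function annotated with a Union triggers the
# "issubclass() arg 1 must be a class" error discussed above:
def plot_union(ranks: Union[pd.DataFrame, OrdinationResults]) -> None:
    ...

# ...while annotating with a single concrete type registers fine:
def plot_single(ranks: OrdinationResults) -> None:
    ...
```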

rankratioviz/q2/plugin_setup.py
-Added a substantial amount of code to check if songbird is available
 and, if so, accept the FeatureData[Differential] input (a rough
 sketch of the availability check follows this section). I'd like for
 this to not be needed, but apparently it is (see
 https://github.com/mortonjt/songbird/pull/34). If we have to do this,
 I can add songbird as an "extras" dependency in setup.py. So that's
 another thing to do for #51.
-Also added some logic to update the --i-ranks description depending
 on whether or not songbird input is acceptable.
-This logic isn't super testable, since my conda/Q2 installation has a
 habit of keeping references to uninstalled plugins (even when I
 refresh the cache) -- so testing this is kind of frustrating.
 I'll think about it as I move past getting the basic functionality
 of #51 out of the way.
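
A minimal sketch of the songbird availability check mentioned above, assuming nothing about the actual plugin_setup.py registration code (the description strings are placeholders):

```python
import importlib.util

# Hypothetical sketch: only advertise songbird-specific input when
# songbird is actually importable (i.e. installed as an "extra").
songbird_available = importlib.util.find_spec("songbird") is not None

if songbird_available:
    ranks_description = "An ordination, or songbird FeatureData[Differential]."
else:
    ranks_description = "An ordination containing feature rankings."
```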

rankratioviz/scripts/_plot.py
-Some changes to work with the _rank_processing module.
-Renamed "taxonomy" to "feature_metadata" (makes the code a lot
 clearer).
-Adjustments to pass in sample metadata to generate.process_input().
-I ended up just treating the first column of the feature metadata
 file as the index by default, instead of searching for the 'feature
 id' column and using that -- this should be functionally equivalent
 for the old test data, and far more robust. Also, it follows QIIME 2's
 metadata guidelines.

OK! I think that is everything for now. Sorry this message ended up
being so long. I probably should've done more atomic commits, but
everything sort of ended up relying on everything else, and it got to
the point where any commit of individual files would've just broken the
tests. Granted, I think I already did that tonight, but whatever, you
get the point.
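
As referenced in the generate.py notes above, here is a rough sketch of the matching-plus-assertion pattern (an illustrative reimplementation, not the actual matchdf() code):

```python
import pandas as pd

def match_df_sketch(df1: pd.DataFrame, df2: pd.DataFrame):
    """Illustrative stand-in for matchdf(): restrict both DataFrames to
    the index values they share, making any dropped rows explicit."""
    shared = df1.index.intersection(df2.index)
    assert len(shared) > 0, "inputs share no samples/features"
    return df1.loc[shared], df2.loc[shared]

# The sort of assertion added in this commit (#54): after matching the
# table against the sample metadata, verify nothing was silently dropped.
# (table and sample_metadata are hypothetical DataFrames.)
# matched_table, matched_md = match_df_sketch(table, sample_metadata)
# assert matched_table.shape[0] == table.shape[0], "samples were dropped!"
```
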
fedarko added a commit that referenced this issue Mar 7, 2019
Something silly -- since I was just fetching the rank values directly
from the JSON (without Vega in the middle), these values were getting
treated as strings, so the comparison function I defined when sorting
them didn't work. This sort of thing has happened before, so fixing
it was a lot easier this time around.

This emphasizes the importance of following through with #62. The less
there is to worry about, the better.
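
The failure mode, illustrated in Python (the actual comparison function lives in the JS, but the trap is the same):

```python
ranks = ["10.5", "2.0", "-3.1"]

# Comparing the raw JSON strings sorts them lexicographically...
print(sorted(ranks))             # ['-3.1', '10.5', '2.0']

# ...so they have to be parsed to numbers before comparison.
print(sorted(ranks, key=float))  # ['-3.1', '2.0', '10.5']
```
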
fedarko (Collaborator, Author) commented May 3, 2019

Boolean values also need to be addressed as part of this. pandas.read_csv(), json.loads(), json.dumps(), the QIIME 2 Metadata object, and Altair each handle (or don't handle) them in their own way, and we need to be able to ensure they're treated consistently.

IMO, the best way to do this is to just treat them as strings of True or False. Representing them as bools in Python would mean they'd be represented as bools in JS, which would mean the user sees true or false in the visualization (which IMO is ugly and counterintuitive).

Currently, what I've noticed is that pandas.read_csv() will convert True/False values to the type numpy.bool_, which threw a wrench into the in-progress sample plot JSON validation code.
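
A small sketch of the numpy.bool_ behavior and the strings-of-True/False workaround described above (the column names and values are made up):

```python
import io
import pandas as pd

tsv = "Sample ID\tAntibiotics\nS1\tTrue\nS2\tFalse\n"
md = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)

# read_csv() infers a bool dtype, so these values come out as numpy.bool_:
print(md["Antibiotics"].dtype)  # bool

# Convert any bool columns back to "True"/"False" strings:
bool_cols = [c for c in md.columns if md[c].dtype == bool]
md[bool_cols] = md[bool_cols].astype(str)
print(md["Antibiotics"].tolist())  # ['True', 'False']
```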

fedarko added a commit that referenced this issue May 3, 2019
See comments for justification. This came up while I was testing
stuff.

Related to #62. The easiest solution would probably be using
qiime2.Metadata.load(), but I don't want to introduce a heavy
dependency on QIIME 2 for people using rankratioviz in isolation.
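
For reference, a minimal sketch of what a QIIME 2-independent metadata reader can look like (an illustrative stand-in, not the actual read_metadata_file() code):

```python
import pandas as pd

def read_metadata_sketch(path: str) -> pd.DataFrame:
    """Illustrative metadata loader: treat the first column as the index
    (per QIIME 2's metadata conventions) and read every value as a string,
    sidestepping bool/NaN inference entirely."""
    return pd.read_csv(path, sep="\t", index_col=0, dtype=str,
                       keep_default_na=False)
```
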
fedarko added a commit that referenced this issue May 3, 2019
Should validate that at least the boolean nonsense in #62 is being
dealt with properly.

Also: as stuff comes up re: #62, read_metadata_file() can be adjusted
accordingly (and further test cases can be added here). The recent
changes to the codebase should make responding to these sorts of
problems easier.
fedarko (Collaborator, Author) commented May 22, 2019

One thing worth looking into: is it OK for sample/feature metadata fields to be named with just numbers? IIRC this was a problem with feature rank IDs for Altair. Not sure if it'd be a problem in the future.
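
One defensive option, sketched with a made-up column name:

```python
import pandas as pd

# Hypothetical metadata with a column named by a bare number.
md = pd.DataFrame({0: ["a", "b"]}, index=["S1", "S2"])

# Stringify the column names up front, since tools like Altair expect
# string field names in the generated spec.
md.columns = md.columns.astype(str)
print(md.columns.tolist())  # ['0']
```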

fedarko (Collaborator, Author) commented May 22, 2019

Also look at whether null in metadata is read as None by pandas, or just as the string "null". (It should be treated as the string "null" -- otherwise our solution to #132 will cause some weird corner-case problems.)
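
A quick way to check (and control) this behavior, sketched with a toy file:

```python
import io
import pandas as pd

tsv = "Sample ID\tNotes\nS1\tnull\nS2\tfine\n"

# By default, read_csv() treats "null" as a missing value (NaN)...
md = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)
print(md["Notes"].tolist())  # [nan, 'fine']

# ...so keeping the literal string "null" requires disabling that:
md = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0,
                 keep_default_na=False)
print(md["Notes"].tolist())  # ['null', 'fine']
```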

fedarko added a commit that referenced this issue May 22, 2019
This constitutes most of the Python progress on #132 (and by extension
also #125).

Next up for #132:

-Add tooltip defs in the rank plot for all feature metadata cols.
 I guess we could do this in the python side of things, in the
 rank_chart definition.

-Create a list of non-ranking feature data columns (all feature data
 fields, minus ranking columns, minus Feature ID and Classification)
 to populate two <select>s with all available feature metadata columns.

-Make the JS searching functionality look by selected feature metadata
 field (one select for num search, one select for den search).
        -Ignore features that have "null" for a given col (i.e. no row
         in the feature md file). This might cause some confusion if any
         metadata actually has "null" as a given string, but I *think*
         this should be ok. (Should add test cases; tagging #2 and #62.)
        -Support multiple searches (e.g. with multiple taxa).
                -MAYBE support searching by taxonomic rank, as is
                 currently done?
                -IDK.
                -Also maybe more specialized searches by field type
                 (e.g. if the field is numeric, limit to ranges).
                -Probably goes beyond the scope of #132 tho.
fedarko (Collaborator, Author) commented May 25, 2019

This, #116, and #101 (and maybe more) should be solvable if we treat all values as objects/strings to start off with when using pd.read_csv(), and then go through the columns to figure out which ones we can convert to numeric types. See #101 (comment) for what I think is a decent solution.
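
A sketch of that read-as-strings-then-upgrade approach (toy data; the real heuristics would live in the metadata reading code):

```python
import io
import pandas as pd

tsv = "Sample ID\tpH\tSubject\nS1\t7.1\tA\nS2\t6.8\tB\n"

# Read everything in as strings first, so nothing gets coerced behind
# our backs...
md = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0, dtype=str,
                 keep_default_na=False)

# ...then upgrade a column to numeric only if every value converts cleanly.
for col in md.columns:
    converted = pd.to_numeric(md[col], errors="coerce")
    if converted.notna().all():
        md[col] = converted

print(md.dtypes.tolist())  # [dtype('float64'), dtype('O')]
```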
