double-check data/dataframe types in code #62

Open
fedarko opened this issue Feb 27, 2019 · 4 comments
Assignees
Labels
optimization Making code faster or cleaner

Comments

fedarko (Collaborator) commented Feb 27, 2019

It turns out the DataFrame has been storing all of the ranks as objects this whole time. The problem is that object columns are sorted in lexicographic (alphabetical) order rather than numerically, so you'd see this with the Red Sea ranks (at least until I fixed the problem)...
[image: legitimate_rank_plot]

I'm pretty sure this isn't a problem anywhere else in the application -- the data is converted to numeric types when it's needed -- but I'd like to be sure of that. This issue involves adding tests to validate the types of the various DataFrames used, and potentially adding some assertions as well.
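
As a minimal sketch of both the lexicographic-sort problem and the kind of dtype assertion this issue asks for (rank_df and its column name are made up for illustration):

```python
import pandas as pd

# Hypothetical feature-ranks DataFrame whose values were read in as
# strings, leaving the column with an object dtype.
rank_df = pd.DataFrame({"rank_0": ["10.5", "-2.1", "3.0"]})

# Object columns sort lexicographically, not numerically:
print(rank_df["rank_0"].sort_values().tolist())  # ['-2.1', '10.5', '3.0']

# Coerce to a numeric dtype, then assert it so regressions get caught:
rank_df["rank_0"] = pd.to_numeric(rank_df["rank_0"])
assert pd.api.types.is_numeric_dtype(rank_df["rank_0"])
print(rank_df["rank_0"].sort_values().tolist())  # [-2.1, 3.0, 10.5]
```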

@fedarko fedarko added the optimization Making code faster or cleaner label Feb 27, 2019
@fedarko fedarko self-assigned this Feb 27, 2019
fedarko added a commit that referenced this issue Feb 27, 2019
There's a lot to describe here. Let me try to go through it.

rankratioviz/generate.py
-Added more documentation to matchdf().
-Added various assertions that should prevent samples and features
 from being dropped (#54). I think that our use of matchdf(), in
 addition to these assertions, should mean that #54 is done -- but
 I'd like to add some tests that verify that. (A rough sketch of
 this matching-plus-assertion pattern is included after this commit
 message.)
-To support these assertions, sample metadata is now a parameter of
 process_input().
-Completely rewrote the feature metadata label creation code. It
 should generalize now to any sort of feature metadata, at the cost
 of not being as pretty as the old taxonomy label code. I might add
 that back in the future, and/or add some new label creation code
 to handle metabolite data more carefully (e.g. assuming that the
 feature description includes the feature ID) (#49).
        -The reason I rewrote this was to prevent features from being
         dropped just because they lacked feature metadata (which was
         the case until now) (#54). This is testable on the Red Sea
         dataset songbird uses, for which 172 (IIRC) features have no
         associated metadata (those features are now just assigned
         their ID by itself, instead of being completely dropped).
-Stopped calling .T on the table twice at the start of process_input()
 (I don't know why this was originally done; it doesn't do anything,
 and skipping it saves a tiny bit of time).
-I think the changes to the feature metadata label creation code also
 may have made the table's new column assignments more accurate -- it
 seems like the old code assumed that the table columns had the same
 order as the feature metadata "Taxon_" values -- but I'm not 100% sure
 of that. In any case, it's correct now. (We'll eventually double-check
 the correctness of all the count data in the sample JSON data when #2
 rolls around, so this will be doubly ensured to be correct.)
-Added a small sanity test to process_input() to ensure that adding
 feature metadata didn't make some feature IDs the same (should never
 happen but you never know).
-To fix a bug with the Red Sea data, made the default rank column
 explicitly a numeric dtype (see #62 for a retrospective and details).
 More work on this front remains to be done.
-Some small aesthetic tweaks to the interface: "Ranks" -> "Feature
 Ranks" for the rank plot title, and adding back "x" to the feature
 tooltips.

rankratioviz/q2/_method.py
-Added logic to accept either a DataFrame or OrdinationResults as
 input, which promptly didn't work (I got the error discussed here:
 https://forum.qiime2.org/t/plugin-error-issubclass-arg-1-must-be-a-class/2774).
 Changing the ranks type from a Union[DF, OrdResults] to just one
 of the two works, so the problem is with the Union annotation (see
 the sketch after this section). I'll try to get it fixed soon.
-Added logic to process these files and generate a feature_ranks
 DataFrame. Nothing too fancy.
-Adjusted the code a tiny bit to pass df_sample_metadata into
 process_input(), as discussed above.
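
A minimal sketch of the annotation difference mentioned above (the function names here are made up; only the ranks parameter matters):

```python
from typing import Union

import pandas as pd
from skbio import OrdinationResults

# Registering a function annotated with a Union triggers the
# "issubclass() arg 1 must be a class" error discussed above:
def plot_union(ranks: Union[pd.DataFrame, OrdinationResults]) -> None:
    ...

# ...while annotating with a single concrete type registers fine:
def plot_single(ranks: OrdinationResults) -> None:
    ...
```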

rankratioviz/q2/plugin_setup.py
-Added a substantial amount of code to check if songbird is available
 and, if so, accept the FeatureData[Differential] input (a rough
 sketch of the availability check follows this section). I'd like for
 this to not be needed, but apparently it is (see
 https://github.com/mortonjt/songbird/pull/34). If we have to do this,
 I can add songbird as an "extras" dependency in setup.py. So that's
 another thing to do for #51.
-Also added some logic to update the --i-ranks description depending
 on whether or not songbird input is acceptable.
-This logic isn't super testable, since my conda/Q2 installation has a
 habit of keeping references to uninstalled plugins (even when I
 refresh the cache) -- so testing this is kind of frustrating.
 I'll think about it as I move past getting the basic functionality
 of #51 out of the way.
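
A minimal sketch of the songbird availability check mentioned above, assuming nothing about the actual plugin_setup.py registration code (the description strings are placeholders):

```python
import importlib.util

# Hypothetical sketch: only advertise songbird-specific input when
# songbird is actually importable (i.e. installed as an "extra").
songbird_available = importlib.util.find_spec("songbird") is not None

if songbird_available:
    ranks_description = "An ordination, or songbird FeatureData[Differential]."
else:
    ranks_description = "An ordination containing feature rankings."
```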

rankratioviz/scripts/_plot.py
-Some changes to work with the _rank_processing module.
-Renamed "taxonomy" to "feature_metadata" (makes the code a lot
 clearer).
-Adjustments to pass in sample metadata to generate.process_input().
-I ended up just treating the first column of the feature metadata
 file as the index by default, instead of searching for the 'feature
 id' column and using that -- this should be functionally equivalent
 for the old test data, and far more robust. Also, it follows QIIME 2's
 metadata guidelines.

OK! I think that is everything for now. Sorry this message ended up
being so long. I probably should've done more atomic commits, but
everything sort of ended up relying on everything else, and it got to
the point where any commit of individual files would've just broken the
tests. Granted, I think I already did that tonight, but whatever, you
get the point.
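
As referenced in the generate.py notes above, here is a rough sketch of the matching-plus-assertion pattern (an illustrative reimplementation, not the actual matchdf() code):

```python
import pandas as pd

def match_df_sketch(df1: pd.DataFrame, df2: pd.DataFrame):
    """Illustrative stand-in for matchdf(): restrict both DataFrames to
    the index values they share, making any dropped rows explicit."""
    shared = df1.index.intersection(df2.index)
    assert len(shared) > 0, "inputs share no samples/features"
    return df1.loc[shared], df2.loc[shared]

# The sort of assertion added in this commit (#54): after matching the
# table against the sample metadata, verify nothing was silently dropped.
# (table and sample_metadata are hypothetical DataFrames.)
# matched_table, matched_md = match_df_sketch(table, sample_metadata)
# assert matched_table.shape[0] == table.shape[0], "samples were dropped!"
```
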
fedarko added a commit that referenced this issue Mar 7, 2019
Something silly -- since I was just fetching the rank values directly
from the JSON (without Vega in the middle), these values were getting
treated as strings, so the comparison function I defined when sorting
them didn't work. This sort of thing has happened before, so fixing
it was a lot easier this time around.

This emphasizes the importance of following through with #62. The less
there is to worry about, the better.
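
The failure mode, illustrated in Python (the actual comparison function lives in the JS, but the trap is the same):

```python
ranks = ["10.5", "2.0", "-3.1"]

# Comparing the raw JSON strings sorts them lexicographically...
print(sorted(ranks))             # ['-3.1', '10.5', '2.0']

# ...so they have to be parsed to numbers before comparison.
print(sorted(ranks, key=float))  # ['-3.1', '2.0', '10.5']
```
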
fedarko (Collaborator, Author) commented May 3, 2019

Boolean values also need to be addressed as part of this. pandas.read_csv(), json.loads(), json.dumps(), the QIIME 2 Metadata object, and Altair each handle (or don't handle) them in their own way, and we need to be able to ensure they're treated consistently.

IMO, the best way to do this is to just treat them as strings of True or False. Representing them as bools in Python would mean they'd be represented as bools in JS, which would mean the user sees true or false in the visualization (which IMO is ugly and counterintuitive).

Currently, what I've noticed is that pandas.read_csv() will convert True/False values to the type numpy.bool_, which threw a wrench into the in-progress sample plot JSON validation code.
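
A small sketch of the numpy.bool_ behavior and the strings-of-True/False workaround described above (the column names and values are made up):

```python
import io
import pandas as pd

tsv = "Sample ID\tAntibiotics\nS1\tTrue\nS2\tFalse\n"
md = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)

# read_csv() infers a bool dtype, so these values come out as numpy.bool_:
print(md["Antibiotics"].dtype)  # bool

# Convert any bool columns back to "True"/"False" strings:
bool_cols = [c for c in md.columns if md[c].dtype == bool]
md[bool_cols] = md[bool_cols].astype(str)
print(md["Antibiotics"].tolist())  # ['True', 'False']
```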

fedarko added a commit that referenced this issue May 3, 2019
See comments for justification. This came up while I was testing
stuff.

Related to #62. The easiest solution would probably be using
qiime2.Metadata.load(), but I don't want to introduce a heavy
dependency on QIIME 2 for people using rankratioviz in isolation.
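
For reference, a minimal sketch of what a QIIME 2-independent metadata reader can look like (an illustrative stand-in, not the actual read_metadata_file() code):

```python
import pandas as pd

def read_metadata_sketch(path: str) -> pd.DataFrame:
    """Illustrative metadata loader: treat the first column as the index
    (per QIIME 2's metadata conventions) and read every value as a string,
    sidestepping bool/NaN inference entirely."""
    return pd.read_csv(path, sep="\t", index_col=0, dtype=str,
                       keep_default_na=False)
```
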
fedarko added a commit that referenced this issue May 3, 2019
Should validate that at least the boolean nonsense in #62 is being
dealt with properly.

Also: as stuff comes up re: #62, read_metadata_file() can be adjusted
accordingly (and further test cases can be added here). The recent
changes to the codebase should make responding to these sorts of
problems easier.
fedarko (Collaborator, Author) commented May 22, 2019

One thing worth looking into: is it OK for sample/feature metadata fields to be named with just numbers? IIRC this was a problem with feature rank IDs for Altair. Not sure if it'd be a problem in the future.
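
One defensive option, sketched with a made-up column name:

```python
import pandas as pd

# Hypothetical metadata with a column named by a bare number.
md = pd.DataFrame({0: ["a", "b"]}, index=["S1", "S2"])

# Stringify the column names up front, since tools like Altair expect
# string field names in the generated spec.
md.columns = md.columns.astype(str)
print(md.columns.tolist())  # ['0']
```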

fedarko (Collaborator, Author) commented May 22, 2019

Also look at whether null in metadata is read as None by pandas, or just as the string "null". (It should be treated as the string "null" -- otherwise our solution to #132 will cause some weird corner-case problems.)
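
A quick way to check (and control) this behavior, sketched with a toy file:

```python
import io
import pandas as pd

tsv = "Sample ID\tNotes\nS1\tnull\nS2\tfine\n"

# By default, read_csv() treats "null" as a missing value (NaN)...
md = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)
print(md["Notes"].tolist())  # [nan, 'fine']

# ...so keeping the literal string "null" requires disabling that:
md = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0,
                 keep_default_na=False)
print(md["Notes"].tolist())  # ['null', 'fine']
```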

fedarko added a commit that referenced this issue May 22, 2019
This constitutes most of the Python progress on #132 (and by extension
also #125).

Next up for #132:

-Add tooltip defs in the rank plot for all feature metadata cols.
 I guess we could do this in the python side of things, in the
 rank_chart definition.

-Create a list of non-ranking feature data columns (all feature data
 fields, minus ranking columns, minus Feature ID and Classification)
 to populate two <select>s with all available feature metadata columns.

-Make the JS searching functionality look by selected feature metadata
 field (one select for num search, one select for den search).
        -Ignore features that have "null" for a given col (i.e. no row
         in the feature md file). This might cause some confusion if any
         metadata actually has "null" as a given string, but I *think*
         this should be ok. (Should add test cases; tagging #2 and #62.)
        -Support multiple searches (e.g. with multiple taxa).
                -MAYBE support searching by taxonomic rank, as is
                 currently done?
                -IDK.
                -Also maybe more specialized searches by field type
                 (e.g. if the field is numeric, limit to ranges).
                -Probably goes beyond the scope of #132 tho.
fedarko (Collaborator, Author) commented May 25, 2019

This, #116, and #101 (and maybe more) should be solvable if we treat all values as objects/strings to start off with when using pd.read_csv(), and then go through the columns to figure out which ones we can convert to numeric types. See #101 (comment) for what I think is a decent solution.
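
A sketch of that read-as-strings-then-upgrade approach (toy data; the real heuristics would live in the metadata reading code):

```python
import io
import pandas as pd

tsv = "Sample ID\tpH\tSubject\nS1\t7.1\tA\nS2\t6.8\tB\n"

# Read everything in as strings first, so nothing gets coerced behind
# our backs...
md = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0, dtype=str,
                 keep_default_na=False)

# ...then upgrade a column to numeric only if every value converts cleanly.
for col in md.columns:
    converted = pd.to_numeric(md[col], errors="coerce")
    if converted.notna().all():
        md[col] = converted

print(md.dtypes.tolist())  # [dtype('float64'), dtype('O')]
```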
