Store sparse count data JSON in visualization #58
Also worth using a sparse DF for the table instead of the dense DF that's currently used. (Note that this would need to either 1) convert the BIOM DF to a sparse DF, since until biocore/biom-format#808 the default …
To allow more than 5k rows (either features or samples) in a spec. There *are* some limits we should probably enforce (see #58 for details), but for the time being I think this is a decent temporary solution.
Might be worth eventually making this JSON load separately, or in a web worker or something. But that's something for #58.
Another thing: as mentioned in 041aaaf, it'd be cool if possible to load count data in a web worker or something to minimize strain on the browser. That's the main scaling problem IMO: it's basically just the contents of a BIOM table. One silly option: could we just... not include samples that have 0 counts for a given feature? (Or vice versa; doesn't really matter, just depends on the way the counts JSON is laid out.) As long as we know that we have a list of all sample IDs, we can infer in the JS code which samples have "0" for a given feature (or vice versa). This would basically let us "cheat" and store a sparse representation of the BIOM table, and since BIOM tables for microbiome studies are usually super duper sparse, this could actually let us do silly crap like load a huge portion of the EMP in. Another option would be to use https://github.com/molbiodiv/biojs-io-biom …
Another good idea: improving DataFrame iteration. I know …
Turns out that we need to delay matching until after filtering out unextreme features, in order to avoid computing the intersection of two super-huge indices (EMP has on the order of 200k observations, and trying to match those up is going to be horrible). And it seems like _df_utils.remove_empty_samples() was super slow even on SparseDataFrames. To make things easier -- and because I know from experience that this approach worked, and let Qurro handle huge EMP-scale amounts of data -- I reorganized things so that we first filter out ranks and empty samples, THEN do matching. This works, but it breaks a lot of the unit tests (and some of the integration tests that rely on specific error messages that have now changed). Need to fix these, then double-check that everything works properly. This is a prereq for #58. I have an algorithm for that sketched out, also.
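The filter-first-then-match order described above could be sketched roughly like this in pandas (function and variable names here are hypothetical stand-ins, not Qurro's actual API):

```python
import pandas as pd

def filter_then_match(ranks_df, table_df, sample_metadata):
    # Filter first: drop empty features and empty samples before doing any
    # index intersection, so we never intersect two ~200k-element indices.
    table_df = table_df.loc[(table_df != 0).any(axis=1),
                            (table_df != 0).any(axis=0)]
    # Restrict the ranks to just the features that survived filtering.
    ranks_df = ranks_df.loc[ranks_df.index.intersection(table_df.index)]
    # Only NOW match samples between the (much smaller) table and metadata.
    shared = table_df.columns.intersection(sample_metadata.index)
    return ranks_df, table_df[shared], sample_metadata.loc[shared]
```

The key point is that the expensive `Index.intersection()` calls only ever run on already-filtered indices.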
Draft algorithm for how this would work:

```python
# to be added to the end of gen_sample_plot()
counts_dict = table.to_dict()
sparse_counts_dict = {}
for feature_id, sample_counts in counts_dict.items():
    fdict = {}
    for sample_id, ct in sample_counts.items():
        if ct != 0:
            fdict[sample_id] = ct
    if len(fdict) > 0:
        sparse_counts_dict[feature_id] = fdict
```

Now, in the JS, we'll need to somehow figure out a list of sample IDs (probably can just embed it in the sample plot JSON, analogously to how feature lists are embedded in the rank plot JSON) independent of the count JSON. Also, note how we might not even add …
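The consumer side of that draft would just invert the sparsification: any sample absent from a feature's dict is implicitly 0. Sketched here in Python for consistency (the real thing would be JS; the function name is made up):

```python
def dense_counts_for_feature(sparse_counts_dict, all_sample_ids, feature_id):
    # Any sample missing from the sparse dict implicitly has a count of 0 --
    # exactly the inference the JS code would make using the embedded
    # list of all sample IDs.
    feature_counts = sparse_counts_dict.get(feature_id, {})
    return {s: feature_counts.get(s, 0) for s in all_sample_ids}
```

This also handles features that were dropped entirely from the sparse dict (all-zero features): they come back as all zeros.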
Turns out this is a ton more efficient. Added bonus of now relying on pandas' implementation of this instead of ours. Turns out transposing huge dataframes is a pretty significant endeavor, so calling .T on the feature table for like the EMP dataset was taking a super long time. Fortunately, we can finesse our way around this by instead transposing the sample metadata and then aligning on the columns. I'm glad that we reached a solution for this that preserved all of the matching-up-front niceness re: testing. Solid stuff. Uh, next up are #171 and then #58? But we can def merge this back into master now.
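The transpose trick described above could look roughly like this (a hedged illustration with made-up names, not the actual Qurro code):

```python
import pandas as pd

def match_samples(table, sample_metadata):
    # table is (features x samples); sample_metadata is (samples x fields).
    # Transposing the small metadata DF is cheap, while transposing the huge
    # feature table is not -- so we transpose the metadata and align on
    # the table's columns instead of calling table.T.
    metadata_t = sample_metadata.T  # now (fields x samples)
    shared = table.columns.intersection(metadata_t.columns)
    return table[shared], metadata_t[shared].T
```

Same result as transposing the table, but the only `.T` calls touch the metadata.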
Closes #171. Turns out the reason q2_moving_pictures and sleep_apnea's main.js outputs were different is that removing empty features changed the order of features in their rank plot JSONs around a bit (the actual data was exactly the same since these datasets didn't have any empty features, though). I think it was just the use of .filter(). In any case, that's ok with me. We're done here! Now we can move on to #58.
Need to actually integrate this with Qurro, but close!
This isn't a problem yet, but as we start trying larger datasets here this will become an issue.
Action items:
- Disable Altair's "MaxRows check" -- this is more of a temporary measure than anything, but it'll prevent Altair from giving up immediately as soon as more than 5k features or 5k samples are in a dataset.
- Store datasets in their own JSON files in the data directory. Our use of column mapping in the sample plot means that the costs of storing data in memory/etc. aren't that severe, but it is still probably a good idea to separate datasets from the Vega-Lite JSON specs generated by Altair. This should also remove our need to disable the MaxRows check (if there are more than 5k features, would the column mapping stuff trigger a `MaxRowsError`? I should look into this, and consider storing that stuff in its own JSON file as well, although that sounds a bit excessive). NOTE that this will probably (?) require using cross-origin requests again, which will essentially nullify the benefits of "Remove need for cross-origin requests" (#72), i.e. users will again have to use view.qiime2.org or python3 -m http.server to view a visualization. It should be doable to make this an option: generating a "serverless" visualization should be the default, and if you want to make a viz that uses CORS but can potentially tolerate more data, you can specify a `--use-external-data` flag or something for the python script. This is probably a pretty far-off thing, but it's worth starting to think about now.
- Reduce the number of samples shown to the user (idea c/o @mortonjt). IMO we should approach this carefully, giving as much control to the user as possible (e.g. making the user very aware of how many samples have been excluded).
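The external-data option floated above could be sketched like this; everything here (function name, file layout, spec keys beyond standard Vega-Lite `data` fields) is hypothetical, with only the `--use-external-data` flag name taken from the discussion:

```python
import json
import os

def write_sample_plot_data(spec, counts, output_dir, use_external_data=False):
    # If the flag is set, write the counts to their own JSON file and have
    # the spec reference it by URL, keeping the spec itself small. This
    # requires serving the files (view.qiime2.org or python3 -m http.server),
    # which is the CORS tradeoff noted above.
    if use_external_data:
        os.makedirs(os.path.join(output_dir, "data"), exist_ok=True)
        with open(os.path.join(output_dir, "data", "counts.json"), "w") as f:
            json.dump(counts, f)
        spec["data"] = {"url": "data/counts.json"}
    else:
        # Default: embed the counts directly, producing a "serverless"
        # visualization that can be opened straight from the filesystem.
        spec["data"] = {"values": counts}
    return spec
```

Vega-Lite accepts both `data.values` (inline) and `data.url` (external), so the viewer-side code shouldn't need to care which mode was used.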