Store sparse count data JSON in visualization #58
Also worth using a sparse DF for the table instead of the dense DF that's currently used. (Note that this would need to either 1) convert the BIOM DF to a sparse DF, since until biocore/biom-format#808 the default …
To allow more than 5k rows (either features or samples) in a spec. There *are* some limits we should probably enforce (see #58 for details), but for the time being I think this is a decent temporary solution.
Might be worth eventually making this JSON load separately, or in a web worker or something. But that's something for #58.
Another thing: as mentioned in 041aaaf, it'd be cool if possible to load count data in a web worker or something to minimize strain on the browser. That's the main scaling problem IMO: it's basically just the contents of a BIOM table. One silly option: could we just... not include samples that have 0 counts for a given feature? (Or vice versa; doesn't really matter, just depends on the way the counts JSON is laid out.) As long as we know that we have a list of all sample IDs, we can infer in the JS code which samples have "0" for a given feature (or vice versa). This would basically let us "cheat" and store a sparse representation of the BIOM table, and since BIOM tables for microbiome studies are usually super duper sparse, this could actually let us do silly crap like load a huge portion of the EMP in. Another option would be to use https://github.com/molbiodiv/biojs-io-biom …
Another good idea: improving DataFrame iteration. I know …
Turns out that we need to delay matching until after filtering out unextreme features, in order to avoid computing the intersection of two super-huge indices (EMP has on the order of 200k observations, and trying to match those up is going to be horrible). And it seems like _df_utils.remove_empty_samples() was super slow even on SparseDataFrames. To make things easier -- and because I know from experience that this approach worked, and let Qurro handle huge EMP-scale amounts of data -- I reorganized things so that we first filter out ranks and empty samples, THEN do matching. This works, but it breaks a lot of the unit tests (and some of the integration tests that rely on specific error messages that have now changed). Need to fix these, then double-check that everything works properly. This is a prereq for #58. I have an algorithm for that sketched out, also.
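The filter-first-then-match order described above could be sketched roughly like this in pandas (function and variable names here are hypothetical stand-ins, not Qurro's actual API):

```python
import pandas as pd

def filter_then_match(ranks_df, table_df, sample_metadata):
    # Filter first: drop empty features and empty samples before doing any
    # index intersection, so we never intersect two ~200k-element indices.
    table_df = table_df.loc[(table_df != 0).any(axis=1),
                            (table_df != 0).any(axis=0)]
    # Restrict the ranks to just the features that survived filtering.
    ranks_df = ranks_df.loc[ranks_df.index.intersection(table_df.index)]
    # Only NOW match samples between the (much smaller) table and metadata.
    shared = table_df.columns.intersection(sample_metadata.index)
    return ranks_df, table_df[shared], sample_metadata.loc[shared]
```

The key point is that the expensive `Index.intersection()` calls only ever run on already-filtered indices.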
Draft algorithm for how this would work:

```python
# to be added to the end of gen_sample_plot()
counts_dict = table.to_dict()
sparse_counts_dict = {}
for feature_id, sample_counts in counts_dict.items():
    fdict = {}
    for sample_id, ct in sample_counts.items():
        if ct != 0:
            fdict[sample_id] = ct
    if len(fdict) > 0:
        sparse_counts_dict[feature_id] = fdict
```

Now, in the JS, we'll need to somehow figure out a list of sample IDs (probably can just embed it in the sample plot JSON, analogously to how feature lists are embedded in the rank plot JSON) independent of the count JSON. Also, note how we might not even add …
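The consumer side of that draft would just invert the sparsification: any sample absent from a feature's dict is implicitly 0. Sketched here in Python for consistency (the real thing would be JS; the function name is made up):

```python
def dense_counts_for_feature(sparse_counts_dict, all_sample_ids, feature_id):
    # Any sample missing from the sparse dict implicitly has a count of 0 --
    # exactly the inference the JS code would make using the embedded
    # list of all sample IDs.
    feature_counts = sparse_counts_dict.get(feature_id, {})
    return {s: feature_counts.get(s, 0) for s in all_sample_ids}
```

This also handles features that were dropped entirely from the sparse dict (all-zero features): they come back as all zeros.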
Turns out this is a ton more efficient. Added bonus of now relying on pandas' implementation of this instead of ours. Turns out transposing huge dataframes is a pretty significant endeavor, so calling .T on the feature table for like the EMP dataset was taking a super long time. Fortunately, we can finesse our way around this by instead transposing the sample metadata and then aligning on the columns. I'm glad that we reached a solution for this that preserved all of the matching-up-front niceness re: testing. Solid stuff. Uh, next up are #171 and then #58? But we can def merge this back into master now.
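The transpose trick described above could look roughly like this (a hedged illustration with made-up names, not the actual Qurro code):

```python
import pandas as pd

def match_samples(table, sample_metadata):
    # table is (features x samples); sample_metadata is (samples x fields).
    # Transposing the small metadata DF is cheap, while transposing the huge
    # feature table is not -- so we transpose the metadata and align on
    # the table's columns instead of calling table.T.
    metadata_t = sample_metadata.T  # now (fields x samples)
    shared = table.columns.intersection(metadata_t.columns)
    return table[shared], metadata_t[shared].T
```

Same result as transposing the table, but the only `.T` calls touch the metadata.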
Closes #171. Turns out the reason q2_moving_pictures and sleep_apnea's main.js outputs were different is that removing empty features changed the order of features in their rank plot JSONs around a bit (the actual data was exactly the same since these datasets didn't have any empty features, though). I think it was just the use of .filter(). In any case, that's ok with me. We're done here! Now we can move on to #58.
Need to actually integrate this with Qurro, but close!
This isn't a problem yet, but as we start trying larger datasets here this will become an issue.
Action items:
- Disable Altair's "MaxRows check" -- this is more of a temporary measure than anything, but it'll prevent Altair from giving up immediately as soon as more than 5k features or 5k samples are in a dataset.
- Store datasets in their own JSON files in the data directory. Our use of column mapping in the sample plot means that the costs of storing data in memory/etc. aren't that severe, but it is still probably a good idea to separate datasets from the Vega-Lite JSON specs generated by Altair. This should also remove our need to disable the MaxRows check (if there are more than 5k features, would the column mapping stuff trigger a `MaxRowsError`? I should look into this, and consider storing that stuff in its own JSON file as well, although that sounds a bit excessive). NOTE that this will probably (?) require using cross-origin requests again, which will essentially nullify the benefits of "Remove need for cross-origin requests" (#72), i.e. users will again have to use view.qiime2.org or python3 -m http.server to view a visualization. It should be doable to make this an option: generating a "serverless" visualization should be the default, and if you want to make a viz that uses CORS but can potentially tolerate more data, you can specify a `--use-external-data` flag or something for the python script. This is probably a pretty far-off thing, but it's worth starting to think about now.
- Reduce the number of samples shown to the user (idea c/o @mortonjt). IMO we should approach this carefully, giving as much control to the user as possible (e.g. making the user very aware of how many samples have been excluded).
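The external-data option floated above could be sketched like this; everything here (function name, file layout, spec keys beyond standard Vega-Lite `data` fields) is hypothetical, with only the `--use-external-data` flag name taken from the discussion:

```python
import json
import os

def write_sample_plot_data(spec, counts, output_dir, use_external_data=False):
    # If the flag is set, write the counts to their own JSON file and have
    # the spec reference it by URL, keeping the spec itself small. This
    # requires serving the files (view.qiime2.org or python3 -m http.server),
    # which is the CORS tradeoff noted above.
    if use_external_data:
        os.makedirs(os.path.join(output_dir, "data"), exist_ok=True)
        with open(os.path.join(output_dir, "data", "counts.json"), "w") as f:
            json.dump(counts, f)
        spec["data"] = {"url": "data/counts.json"}
    else:
        # Default: embed the counts directly, producing a "serverless"
        # visualization that can be opened straight from the filesystem.
        spec["data"] = {"values": counts}
    return spec
```

Vega-Lite accepts both `data.values` (inline) and `data.url` (external), so the viewer-side code shouldn't need to care which mode was used.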