Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast validation for remote AnnData headers (SCP-5770) #2118

Merged
merged 9 commits into from
Sep 3, 2024

Conversation

eweitz
Copy link
Member

@eweitz eweitz commented Aug 28, 2024

This extends up-front checks to cover adding an AnnData file from a bucket. It makes authoring smoother, and thus faster.

Impact

Previously, new AnnData validation caught errors upon selecting a local file. Validating before upload is a great idea. It eliminates long wait times, potentially reducing a disjointed 15-45 minute process into a cohesive 10 second process. But local AnnData validation doesn't help when the file is remote, i.e. in a Terra workspace GCP bucket.

Now, client-side file validation (CSFV) for AnnData headers also applies to remote files. When adding AnnData via "Use bucket path", header validation finishes in < 10 seconds rather than 5-10 minutes like before.

Demo

Here's how it looks!

AnnData_remote_header_CSFV__SCP_2024-08-28.mov

Technical notes

This required fixing a bug in the external HDF5 library that prevented using it with files needing an OAuth token. Details on that are in jrobinso/hdf5-indexed-reader#2. That HDF5 library is now also available as a canonical NPM package, which replaces our use of an org-scope package.

Next steps

  • Instant content validation for AnnData files
  • Show loading indicator to avoid confusion when validating remote files
  • Fix how "Replace" keeps bad file name if good file selected before

This satisfies SCP-5770.

@eweitz eweitz requested review from bistline and jlchang August 28, 2024 20:39
@eweitz eweitz changed the title Fast validation for remote AnnData headers Fast validation for remote AnnData headers (SCP-5770) Aug 28, 2024
Copy link

codecov bot commented Aug 28, 2024

Codecov Report

Attention: Patch coverage is 56.25000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 69.54%. Comparing base (20fdea2) to head (4c47c6b).
Report is 13 commits behind head on development.

Files with missing lines Patch % Lines
app/javascript/lib/validation/validate-anndata.js 62.50% 3 Missing ⚠️
...javascript/lib/validation/validate-file-content.js 50.00% 2 Missing ⚠️
app/javascript/lib/validation/validate-file.js 0.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development    #2118      +/-   ##
===============================================
- Coverage        69.54%   69.54%   -0.01%     
===============================================
  Files              327      327              
  Lines            27536    27544       +8     
  Branches          2287     2289       +2     
===============================================
+ Hits             19151    19156       +5     
- Misses            8251     8254       +3     
  Partials           134      134              
Files with missing lines Coverage Δ
app/javascript/lib/scp-api.jsx 42.15% <100.00%> (+0.28%) ⬆️
...javascript/lib/validation/validate-file-content.js 87.64% <50.00%> (-0.43%) ⬇️
app/javascript/lib/validation/validate-file.js 66.26% <0.00%> (-0.81%) ⬇️
app/javascript/lib/validation/validate-anndata.js 87.50% <62.50%> (-5.61%) ⬇️

... and 1 file with indirect coverage changes

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - works as advertised. This is getting us very close to a huge quality of life improvement for AnnData uploaders!

Copy link
Contributor

@jlchang jlchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that fast validation now runs on remote files.

Non-blocking comments (for content validation work):
Unsure what header validations (shown in the video) are currently implemented, the two tests I ran used local files I have, both were permitted to click Save

unconventional.h5ad (you'll need to remove the .txt) (has NO conventional metadata)

h5ls  unconventional.h5ad/obs
G2M_score                Dataset {700}
S_score                  Dataset {700}
bulk_labels              Group
index                    Dataset {700}
louvain                  Group
n_counts                 Dataset {700}
n_genes                  Dataset {700}
percent_mito             Dataset {700}
phase                    Group

header_w_period.h5ad.txt

h5ls header_w_period.h5ad/obs
biosample_id             Group
disease                  Group
disease__ontology_label  Group
donor_id                 Group
index                    Dataset {2638}
library_preparation_protocol Group
library_preparation_protocol__ontology_label Group
louvain                  Group
**n.counts**                 Dataset {2638}
n_genes                  Dataset {2638}
organ                    Group
organ__ontology_label    Group
percent_mito             Dataset {2638}
sex                      Group
species                  Group
species__ontology_label  Group

Happy to share those files (Github doesn't want to upload .h5ad files as attachments) - let me know where to put them for you.

@eweitz
Copy link
Member Author

eweitz commented Sep 3, 2024

Thanks for the enhanced testing, Jean! I see expected validation failure with unconventional.h5ad per screenshot, so my hunch is your finding is due to a discrepancy in local setup. The other file is a known separate case. So this seems good to merge. Let's pair when you've got a few minutes!

Screenshot 2024-09-03 at 12 58 21 PM

@eweitz eweitz merged commit 166e7e6 into development Sep 3, 2024
5 checks passed
@github-actions github-actions bot deleted the ew-anndata-remote-csfv branch September 3, 2024 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants