-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast validation for remote AnnData headers (SCP-5770) #2118
Conversation
…ortal_core into ew-anndata-remote-csfv
…ortal_core into ew-anndata-remote-csfv
…ortal_core into ew-anndata-remote-csfv
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## development #2118 +/- ##
===============================================
- Coverage 69.54% 69.54% -0.01%
===============================================
Files 327 327
Lines 27536 27544 +8
Branches 2287 2289 +2
===============================================
+ Hits 19151 19156 +5
- Misses 8251 8254 +3
Partials 134 134
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - works as advertised. This is getting us very close to a huge quality of life improvement for AnnData uploaders!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed that fast validation now runs on remote files.
Non-blocking comments (for content validation work):
Unsure what header validations (shown in the video) are currently implemented, the two tests I ran used local files I have, both were permitted to click Save
unconventional.h5ad (you'll need to remove the .txt
) (has NO conventional metadata)
h5ls unconventional.h5ad/obs
G2M_score Dataset {700}
S_score Dataset {700}
bulk_labels Group
index Dataset {700}
louvain Group
n_counts Dataset {700}
n_genes Dataset {700}
percent_mito Dataset {700}
phase Group
h5ls header_w_period.h5ad/obs
biosample_id Group
disease Group
disease__ontology_label Group
donor_id Group
index Dataset {2638}
library_preparation_protocol Group
library_preparation_protocol__ontology_label Group
louvain Group
**n.counts** Dataset {2638}
n_genes Dataset {2638}
organ Group
organ__ontology_label Group
percent_mito Dataset {2638}
sex Group
species Group
species__ontology_label Group
Happy to share those files (Github doesn't want to upload .h5ad files as attachments) - let me know where to put them for you.
This extends up-front checks to cover adding an AnnData file from a bucket. It makes authoring smoother, and thus faster.
Impact
Previously, new AnnData validation caught errors upon selecting a local file. Validating before upload is a great idea. It eliminates long wait times, potentially reducing a disjointed 15-45 minute process into a cohesive 10 second process. But local AnnData validation doesn't help when the file is remote, i.e. in a Terra workspace GCP bucket.
Now, client-side file validation (CSFV) for AnnData headers also applies to remote files. When adding AnnData via "Use bucket path", header validation finishes in < 10 seconds rather than 5-10 minutes like before.
Demo
Here's how it looks!
AnnData_remote_header_CSFV__SCP_2024-08-28.mov
Technical notes
This required fixing a bug in the external HDF5 library that prevented using it with files needing an OAuth token. Details on that are in jrobinso/hdf5-indexed-reader#2. That HDF5 library is now also available as a canonical NPM package, which replaces our use of an org-scope package.
Next steps
This satisfies SCP-5770.