-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Instant header validation for local AnnData files (SCP-5718) #2112
Conversation
…ortal_core into ew-anndata-header-csfv
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## development #2112 +/- ##
===============================================
+ Coverage 69.80% 69.83% +0.03%
===============================================
Files 324 325 +1
Lines 27276 27319 +43
Branches 2252 2267 +15
===============================================
+ Hits 19039 19079 +40
- Misses 8112 8115 +3
Partials 125 125
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see the specified error message. Instead I get the generic error message which indicates a TypeError
was thrown:
After putting in a breakpoint, I get see this error message in the Sources tab: TypeError: file.startsWith is not a function at getAnnDataHeaders
. I can't add any AnnData files, valid or not.
Thanks for the catch. Adaptations I made for the unit test broke the UX -- a downside of discrepant web and Node APIs. I've fixed the UX and ensured the unit test still works, and added another unit test to help ensure soundness, and refined the manual test steps. The 2 unit tests pass, and the 3 cases outlined in the manual test steps also work for me now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manual tests behave as described.
Two questions that, I hope, are non-blocking:
- When I run
yarn ui-test
, I get:
FAIL test/js/lib/validate-anndata.test.js (7.888 s)
● Client-side file validation for AnnData › Parses AnnData headers
: Timeout - Async callback was not invoked within the 5000 ms timeout specified by jest.setTimeout.Timeout - Async callback was not invoked within the 5000 ms timeout specified by jest.setTimeout.Error:
I recall some issues with setting up Jest so I'm assuming the issue with this test fail may be on my end.
- For manual test Step 9. Instead of using
Delete
I used thereplace
option (next to the AnnData filename) to choose a new file. When CSFV errors are encountered, the CSFV error reports the most recently chosen file but the form itself retains the previous filename, which is a little confusing, but not blocking since you won't get a successful upload unless you have a conformant file anyways...
Here's a screen capture of what the oddness looks like:
Screen.Recording.2024-08-19.at.4.42.57.PM.mov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix! Works as advertised.
@@ -2,6 +2,22 @@ | |||
* @fileoverview Functions used in multiple file types validation | |||
*/ | |||
|
|||
// from lib/assets/metadata_schemas/alexandria_convention_schema.json | |||
// (which in turn is from scp-ingest-pipeline/schemas) | |||
export const REQUIRED_CONVENTION_COLUMNS = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud... I know this is the same as we always did previously, but I wonder if it's possible to source this directly from lib/assets/metadata_schemas/alexandria_convention/alexandria_convention_schema.json
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! This required some configuration as that file is outside our frontend root (app/javascript), but it works.
Consolidating here helps avoid problems if we update required headers.
// Ultimately sourced from: scp-ingest-pipeline/schemas | ||
import * as data from 'lib/assets/metadata_schemas/alexandria_convention/alexandria_convention_schema.json'; | ||
|
||
export const REQUIRED_CONVENTION_COLUMNS = data.required.filter(c => c !== 'CellID') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool!
Thanks for the reviews!
All the CI test runs pass, and my local tests via
Useful find! It turns out this is reproducible already on staging, and affects that UX flow for all file types. So this is a pre-existing issue. I agree it's confusing when it occurs. Its incidence is likely very low, as selecting a valid file then an invalid file via "Replace" seems rare. Fixing this while we're refining the upload UI might be opportune, though. I've opened SCP-5771 so we can assess priority next sprint. |
This helps users avoid waiting 15-45 minutes to upload a big H5AD file, only to learn of a user error at the end.
Impact
Previously, authors would learn of any validation errors in their AnnData file only after uploading it and then checking their email. That's quite slow because .h5ad files in single cell science tend to be large: 2 GB, 6 GB, sometimes 60 GB. Needing to upload such big files to detect validation problems is frustrating.
Now, some problems are found before upload. Basic AnnData validation is done right in the web upload UI in less than a second. It validates AnnData metadata headers with a few existing rules, and reports any problems instantly and in context. The user journey is smoother, and thus faster.
We made AnnData upload generally available in July 2023, and have noticed a distinct uptick in its usage in the last few months. So improving this UX seems worthwhile.
Demo
Here's how it looks!
Instant_AnnData_headers_validation_CSFV__2024-08-16.mov
Technical notes
This uses hdf5-indexed-reader*, an efficient JavaScript library for HDF5 files. JS in the web browser parses the start of the AnnData file, reads
obs
keys, and applies some basic pre-existing client-side file validation (CSFV) rules. It reuses existing rules via minor rearrangement of parsed AnnData. So we can use the same validation rule code for both classic and AnnData files.Parsing is streaming and non-blocking. That's important because AnnData files can be larger than client memory. We recognized this as an underlying blocker for AnnData back in November 2022, and started a community discussion about how to solve it. Since then, building on @bmaranville's foundational work, @jrobinso made
hdf5-indexed-reader
and we now have a library that neatly fits this need.This changeset also includes a standalone HTML page and spinning DNA graphic. That enables nimbly experimenting with HDF5 and JS using something like Simple Web Server.
* Because hdf5-indexed-reader isn't on NPM, it can't be used as a conventional package. So I went ahead and published an org-scoped version of it, @single-cell-portal/hdf5-indexed-reader. The package.json source trivially adds a scope to the original name. This ensures the canonical version can use a simpler top-level name (presumably
hdf5-indexed-reader
) if / when it's published to NPM.Next steps
Test
A new automated test confirms some behavior. To manually test:
yarn install
single_cell_portal_core/test/test_data/
anndata_test_bad_header_no_species.h5ad
anndata_test.h5ad
anndata_test_bad_header_no_species.h5ad
andanndata_test.h5ad
to your study's GCP bucketFurther details
For more context, see 2024-08-13 SCP demo video. This satisfies SCP-5718.