Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instant header validation for local AnnData files (SCP-5718) #2112

Merged
merged 23 commits into from
Aug 20, 2024

Conversation

eweitz
Copy link
Member

@eweitz eweitz commented Aug 16, 2024

This helps users avoid waiting 15-45 minutes to upload a big H5AD file, only to learn of a user error at the end.

Impact

Previously, authors would learn of any validation errors in their AnnData file only after uploading it and then checking their email. That's quite slow because .h5ad files in single cell science tend to be large: 2 GB, 6 GB, sometimes 60 GB. Needing to upload such big files to detect validation problems is frustrating.

Now, some problems are found before upload. Basic AnnData validation is done right in the web upload UI in less than a second. It validates AnnData metadata headers with a few existing rules, and reports any problems instantly and in context. The user journey is smoother, and thus faster.

We made AnnData upload generally available in July 2023, and have noticed a distinct uptick in its usage in the last few months. So improving this UX seems worthwhile.

Demo

Here's how it looks!

Instant_AnnData_headers_validation_CSFV__2024-08-16.mov

Technical notes

This uses hdf5-indexed-reader*, an efficient JavaScript library for HDF5 files. JS in the web browser parses the start of the AnnData file, reads obs keys, and applies some basic pre-existing client-side file validation (CSFV) rules. It reuses existing rules via minor rearrangement of parsed AnnData. So we can use the same validation rule code for both classic and AnnData files.

Parsing is streaming and non-blocking. That's important because AnnData files can be larger than client memory. We recognized this as an underlying blocker for AnnData back in November 2022, and started a community discussion about how to solve it. Since then, building on @bmaranville's foundational work, @jrobinso made hdf5-indexed-reader and we now have a library that neatly fits this need.

This changeset also includes a standalone HTML page and spinning DNA graphic. That enables nimbly experimenting with HDF5 and JS using something like Simple Web Server.

* Because hdf5-indexed-reader isn't on NPM, it can't be used as a conventional package. So I went ahead and published an org-scoped version of it, @single-cell-portal/hdf5-indexed-reader. The package.json source trivially adds a scope to the original name. This ensures the canonical version can use a simpler top-level name (presumably hdf5-indexed-reader) if / when it's published to NPM.

Next steps

  • Instant header validation for remote AnnData files
  • Instant content validation for AnnData files

Test

A new automated test confirms some behavior. To manually test:

  1. yarn install
  2. Ensure Vite frontend service worker cache is disabled
  3. Create a new study "AnnData CSFV"
  4. Select AnnData upload UX
  5. In AnnData tab, click "Choose file"
  6. In Finder, go to single_cell_portal_core/test/test_data/
  7. Select anndata_test_bad_header_no_species.h5ad
  8. Confirm error message is shown, including "File is missing required obs keys: species"
  9. Click "Delete" to remove errant file, then click "Choose file"
  10. In same directory as before, select anndata_test.h5ad
  11. Confirm no error message is shown, and "Save" button is enabled
  12. Upload anndata_test_bad_header_no_species.h5ad and anndata_test.h5ad to your study's GCP bucket
  13. Click "Use bucket path"
  14. Enter "anndata_test_bad_header_no_species.h5ad" in text input
  15. Click off input to blur / change browser focus
  16. Confirm no error message is shown (remote AnnData CSFV will be added next) and "Save" button is enabled

Further details

For more context, see 2024-08-13 SCP demo video. This satisfies SCP-5718.

@eweitz eweitz requested review from bistline and jlchang August 16, 2024 21:18
Copy link

codecov bot commented Aug 16, 2024

Codecov Report

Attention: Patch coverage is 90.76923% with 6 lines in your changes missing coverage. Please review.

Project coverage is 69.83%. Comparing base (809a8ae) to head (c256082).
Report is 29 commits behind head on development.

Files Patch % Lines
app/javascript/lib/validation/shared-validation.js 93.10% 2 Missing ⚠️
app/javascript/lib/validation/validate-anndata.js 93.10% 2 Missing ⚠️
...javascript/lib/validation/validate-file-content.js 60.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development    #2112      +/-   ##
===============================================
+ Coverage        69.80%   69.83%   +0.03%     
===============================================
  Files              324      325       +1     
  Lines            27276    27319      +43     
  Branches          2252     2267      +15     
===============================================
+ Hits             19039    19079      +40     
- Misses            8112     8115       +3     
  Partials           125      125              
Files Coverage Δ
app/javascript/components/upload/upload-utils.js 91.15% <100.00%> (ø)
...script/components/validation/ValidationMessage.jsx 96.29% <ø> (ø)
app/javascript/lib/validation/validate-file.js 67.07% <100.00%> (ø)
app/javascript/lib/validation/shared-validation.js 95.48% <93.10%> (-0.67%) ⬇️
app/javascript/lib/validation/validate-anndata.js 93.10% <93.10%> (ø)
...javascript/lib/validation/validate-file-content.js 88.06% <60.00%> (-1.49%) ⬇️

... and 7 files with indirect coverage changes

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see the specified error message. Instead I get the generic error message which indicates a TypeError was thrown:
Screenshot 2024-08-19 at 9 55 42 AM

After putting in a breakpoint, I get see this error message in the Sources tab: TypeError: file.startsWith is not a function at getAnnDataHeaders. I can't add any AnnData files, valid or not.

@eweitz eweitz requested a review from bistline August 19, 2024 16:51
@eweitz
Copy link
Member Author

eweitz commented Aug 19, 2024

Thanks for the catch. Adaptations I made for the unit test broke the UX -- a downside of discrepant web and Node APIs.

I've fixed the UX and ensured the unit test still works, and added another unit test to help ensure soundness, and refined the manual test steps. The 2 unit tests pass, and the 3 cases outlined in the manual test steps also work for me now.

Copy link
Contributor

@jlchang jlchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manual tests behave as described.
Two questions that, I hope, are non-blocking:

  1. When I run yarn ui-test, I get:

FAIL test/js/lib/validate-anndata.test.js (7.888 s)
● Client-side file validation for AnnData › Parses AnnData headers

: Timeout - Async callback was not invoked within the 5000 ms timeout specified by jest.setTimeout.Timeout - Async callback was not invoked within the 5000 ms timeout specified by jest.setTimeout.Error:

I recall some issues with setting up Jest so I'm assuming the issue with this test fail may be on my end.

  1. For manual test Step 9. Instead of using Delete I used the replace option (next to the AnnData filename) to choose a new file. When CSFV errors are encountered, the CSFV error reports the most recently chosen file but the form itself retains the previous filename, which is a little confusing, but not blocking since you won't get a successful upload unless you have a conformant file anyways...

Here's a screen capture of what the oddness looks like:

Screen.Recording.2024-08-19.at.4.42.57.PM.mov

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix! Works as advertised.

@@ -2,6 +2,22 @@
* @fileoverview Functions used in multiple file types validation
*/

// from lib/assets/metadata_schemas/alexandria_convention_schema.json
// (which in turn is from scp-ingest-pipeline/schemas)
export const REQUIRED_CONVENTION_COLUMNS = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud... I know this is the same as we always did previously, but I wonder if it's possible to source this directly from lib/assets/metadata_schemas/alexandria_convention/alexandria_convention_schema.json?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! This required some configuration as that file is outside our frontend root (app/javascript), but it works.

Consolidating here helps avoid problems if we update required headers.

// Ultimately sourced from: scp-ingest-pipeline/schemas
import * as data from 'lib/assets/metadata_schemas/alexandria_convention/alexandria_convention_schema.json';

export const REQUIRED_CONVENTION_COLUMNS = data.required.filter(c => c !== 'CellID')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool!

@eweitz
Copy link
Member Author

eweitz commented Aug 20, 2024

Thanks for the reviews!

When I run yarn ui-test

All the CI test runs pass, and my local tests via yarn test pass (modulo two pre-existing local-only unrelated failures). So I think your hunch is right this is an issue on your end. Ping me if you'd like to pair on it!

When CSFV errors are encountered, the CSFV error reports the most recently chosen file but the form itself retains the previous filename

Useful find! It turns out this is reproducible already on staging, and affects that UX flow for all file types. So this is a pre-existing issue. I agree it's confusing when it occurs. Its incidence is likely very low, as selecting a valid file then an invalid file via "Replace" seems rare. Fixing this while we're refining the upload UI might be opportune, though. I've opened SCP-5771 so we can assess priority next sprint.

@eweitz eweitz merged commit 1ee877e into development Aug 20, 2024
5 checks passed
@github-actions github-actions bot deleted the ew-anndata-header-csfv branch August 20, 2024 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants