Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Snapshot errors include source in Display #34799

Closed

Conversation

brooksprumo
Copy link
Contributor

@brooksprumo brooksprumo commented Jan 16, 2024

Problem

When printing with Display, some snapshot errors do not include their source due to fs-err. This results in errors like this:

[2024-01-10T10:41:17.250961063Z ERROR solana_runtime::accounts_background_service] Stopping AccountsBackgroundService! Fatal error while handling snapshot requests: failed to add bank snapshot for slot 271529984: failed to hard link storages: failed to hard link storage: failed to hardlink file from /mnt/accounts/accountsdb/run/271301292.375938 to /mnt/accounts/accountsdb/snapshot/271529984/271301292.375938

Note that this error message does not include why the hard linking failed...

Summary of Changes

For all snapshot errors that use fs-err, we know they include a populated source field. Thus, we know that .source() will always return Some.

I don't love this solution, but it is simple and safe. I'd like to backport this to v1.17, and then do a follow-up PR to fix this properly in master. This plan allows us to have the smallest patch to v1.17.

@brooksprumo brooksprumo self-assigned this Jan 16, 2024
Copy link

codecov bot commented Jan 16, 2024

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (e9a6bb3) 81.8% compared to head (29c1a8f) 81.8%.
Report is 27 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #34799     +/-   ##
=========================================
- Coverage    81.8%    81.8%   -0.1%     
=========================================
  Files         823      823             
  Lines      222713   222713             
=========================================
- Hits       182337   182254     -83     
- Misses      40376    40459     +83     

@brooksprumo brooksprumo marked this pull request as ready for review January 16, 2024 23:53
@@ -413,7 +414,7 @@ pub enum AddBankSnapshotError {
#[error("bank snapshot dir already exists: {0}")]
SnapshotDirAlreadyExists(PathBuf),

#[error("failed to create snapshot dir: {0}")]
#[error("failed to create snapshot dir: {0}: {}", .0.source().unwrap())]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these unwraps safe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors from fs-err always contains a source, so calling .source() on any error generated from fs-err will always be Some. I went through and audited all uses of these snapshot errors and confirmed they all come from fs-err functions. Therefore they'll always have a source, and the unwrap is safe.

Would it be good to add that into the code as a comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the api returns Option<T>. a comment in our code won't stop upstream from starting to return None in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. My thought was that this PR would be backported to v1.17, which if I understand correctly, we do not bump dependencies anymore. So the upstream fs-err won't change and thus won't start returning None.

Then in master I can do a more thorough fix, so as to avoid larger code changes needing to be backported.

I'm happy to go either way here! Do you have a suggested game plan for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, I'd rather avoid backporting something with these unwraps, same concerns as Trent. Better error messaging is great, but don't think it's a critical fix that needs immediate backporting - especially given the possibility of unwrapping a none.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll close this PR. Thanks for the input, folks!

@brooksprumo
Copy link
Contributor Author

Closing. This PR is not the desired solution to the problem.

@t-nelson
Copy link
Contributor

i'd still like to see something done about the useless error messages that fs-err gives us. ideally that something is completely removing the dependency from the code base

@brooksprumo
Copy link
Contributor Author

i'd still like to see something done about the useless error messages that fs-err gives us. ideally that something is completely removing the dependency from the code base

The fs-err messages aren't useless though, as it indicates the fs/io operation that failed and the paths. Previously, this was either (1) entirely missing from the error, or (2) manually added via a .map_err(|err| SnapshotError::IoWithSourceAndFile("the fs operation", the_path)) at every fs call site. Since the operation was often missing, errors tended to both not include the path(s) involved, and only say the underlying fs error (i.e. EPERM/etc)

The fs-err Errors still contain the source error, but are not output as part of Display. So when we changed the snapshot processing/AccountsBackgroundService to return Errors and log them, instead of panicking, we changed the output from Debug to Display. Those two changes combined are why now we don't get the underlying fs error in the log.

The right fix is to include the underlying (aka source) error when logging. The std lib actually has a way to do exactly that, std::error::Report, but it is unfortunately unstable. So another good option for v1.17 is to conditionally include the source errors when logging at the end of ABS/AHV/SPS.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants