Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #20859 added unsafe shutdown detector #21133

Closed
wants to merge 2 commits into from

Conversation

muhammednagy
Copy link
Contributor

@muhammednagy muhammednagy commented May 27, 2020

Added unsafe shutdown detector in case the app crashed.

fixes: #20859

@muhammednagy
Copy link
Contributor Author

Not sure why it's failing on ARM or OSX.
I am using OSX myself.

@MariusVanDerWijden
Copy link
Member

@muhammednagy Don't worry, these tests are known to be flaky and fail sometimes

cmd/geth/main.go Outdated
@@ -378,13 +378,25 @@ func startNode(ctx *cli.Context, stack *node.Node) {
}
ethClient := ethclient.NewClient(rpcClient)

var ethService *eth.Ethereum
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @muhammednagy this is unfortunately not valid. You can run a light-node without having the eth.Ethereum service started. @ligi correct me if I'm wrong but we mostly care about crashes in full client not in light clients, so I think its valid to move this code into the if ctx.GlobalInt... on line 383.

To repro failure:

> make geth
> ./build/bin/geth --syncmode=light --goerli
...
Fatal: Failed to retrieve ethereum service: unknown service

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea - it would mainly be interesting for full nodes - so I think we should go with what you proposed. But if there is a way to have it for both this would also be nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using a different database? is it possible?
We could probably store them in a file instead which will work for whatever type of node we are running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely possible, but I think we want to have it in the same database. Otherwise we just wind up with a separate set of problems, where we need to care about consistency over many files, and keeping that file in sync with when e.g people do removedb or removes the db manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay i think in that case a file would work best for all regardless of the type of node.
Should i go ahead with having a file?
If not then do you know which db i can use that would work for all kinds of nodes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use the leveldb instance we already have throughout the code for everything.

@MariusVanDerWijden

This comment has been minimized.

@MariusVanDerWijden

This comment has been minimized.

@muhammednagy
Copy link
Contributor Author

What about if it's not serving LES requests?
I am just running it locally with no flags and db is not getting set.

@holiman
Copy link
Contributor

holiman commented Jun 2, 2020

You most definitely have a database. It's in the Ethereum object, instantiated in eth/backend.go, line 139.

@muhammednagy muhammednagy requested a review from zsfelfoldi as a code owner June 2, 2020 15:09
@muhammednagy
Copy link
Contributor Author

Hey,
Yes i do.
The example was relying on getting the database if we are running a light sync mode or light chain.
thus it wasn't setting db in normal times.
I've gone about it in a different way.
Tell me what do you think and if you think it's good i can move it to a function in utils or somewhere else to save the duplicate code.

@holiman
Copy link
Contributor

holiman commented Jun 9, 2020

We've discussed this today. This PR is a good first step in the right direction, but it is a bit lacking. We would like to be able to show, in the startup log, when the last unclean shutdown happened.

In order to do that, we cannot delete the unclean-shutdown when we boot up next time, but need to have either two different keys, or extend it to be a collection of timestamps when unclean shutdowns were performed.

@@ -567,6 +577,9 @@ func (s *Ethereum) Stop() error {
s.lesServer.Stop()
}

if err := s.chainDb.Delete([]byte("unsafe-shutdown")); err != nil {
log.Error("err", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to move this down, right above chaindb.Close(). The biggest offender for an unclean shutdown is the blockchain shutdown mechanism.

// Create an invalid shutdown in database in case the app crashed
if err = chainDb.Put([]byte("unsafe-shutdown"), []byte(time.Now().String())); err != nil {
log.Warn("Failed to record possible future unsafe shutdown", "err", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR uses a single unsafe-shutdown key to store a single timestamp. That's a good start, but we need a bit more information, otherwise quirky scenarios get lost. Notably, it's important to retain past crashes too, so a user might see if there was a fault recently, not just in the last restart (user's tend to restart their nodes a few times until attempting to investigate an issue, so a restart cannot delete the cause).

A simplistic suggestion would be to use a slice of timestamps instead of a single timestamp as the value of the entry. When Geth starts, you can append the current time, and on shutdown, you can pop off the tail. That way on startup you can actually list all the recent crashes, not just the last one.

}
// Create an invalid shutdown in database in case the app crashed
if err = chainDb.Put([]byte("unsafe-shutdown"), []byte(time.Now().String())); err != nil {
log.Warn("Failed to record possible future unsafe shutdown", "err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safe to return nil, err here. Although it's not a critical system, if writing to the database failed something's going to go belly up either way.

// Check for invalid shutdown
invalidShutdown, _ := chainDb.Get([]byte("unsafe-shutdown"))
if invalidShutdown != nil {
log.Error("unsafe shutdown detected", "time", string(invalidShutdown))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please capitalize the log message. Also please use a proper time value, not a pre-stringified version.

log.Error("unsafe shutdown detected", "time", string(invalidShutdown))
}
// Create an invalid shutdown in database in case the app crashed
if err = chainDb.Put([]byte("unsafe-shutdown"), []byte(time.Now().String())); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls use time.Now().Unix() to store the current timestamp instead of a localized string

@@ -231,6 +232,15 @@ func New(ctx *node.ServiceContext, config *Config) (*Ethereum, error) {
return nil, err
}

// Check for invalid shutdown
invalidShutdown, _ := chainDb.Get([]byte("unsafe-shutdown"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're using unsafe-shutdown in quite a few places and in different packages, it might make sense to move this into core/rawdb/schema.go. We have a lot of other metadata keys defined there.

@muhammednagy
Copy link
Contributor Author

Hey @holiman @karalabe
Thank you so much for your comments.
I was actually waiting for your comments.
I will do the changes as requested and let you know here once am done or have any questions.
It's my first issue to solve here.
Thanks for your understanding.

@muhammednagy
Copy link
Contributor Author

muhammednagy commented Jun 22, 2020

working on this.
Should I show only the last error message on startup? or all of them?
@karalabe @holiman

@holiman
Copy link
Contributor

holiman commented Jul 7, 2020

Should I show only the last error message on startup? or all of them?

You can't show the error message per se, because you don't have it, but I'm thinking you can show something like Unclean shutdown detected "unclean shutdown"="5d3h ago"

There's no need to list every single one

@adamschmideg
Copy link
Contributor

This PR seems to be abandoned. If you feel like taking it from here, please go ahead.

@muhammednagy
Copy link
Contributor Author

Hey
Sorry, I got swamped in other stuff.
I can try to work on it over the weekend.

@holiman holiman mentioned this pull request Nov 23, 2020
1 task
@holiman
Copy link
Contributor

holiman commented Nov 23, 2020

Thanks for getting started on this @muhammednagy! I took your work and extended it with the additional requirements we had, which is implemented in #21893. I'm closing this PR in favour of that one

@holiman holiman closed this Nov 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Detect unclean shutdowns
6 participants