Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement unclean-shutdown marker #21893

Merged
merged 9 commits into from
Dec 11, 2020
Merged

implement unclean-shutdown marker #21893

merged 9 commits into from
Dec 11, 2020

Conversation

holiman
Copy link
Contributor

@holiman holiman commented Nov 23, 2020

Replaces #21133

Some todos :

  • instead of single marker, use a list of markers and store every unclean shutdown

https://github.com/ethereum/go-ethereum/pull/21133/files#r437253799:

The PR uses a single unsafe-shutdown key to store a single timestamp. That's a good start, but we need a bit more information, otherwise quirky scenarios get lost. Notably, it's important to retain past crashes too, so a user might see if there was a fault recently, not just in the last restart (user's tend to restart their nodes a few times until attempting to investigate an issue, so a restart cannot delete the cause).

A simplistic suggestion would be to use a slice of timestamps instead of a single timestamp as the value of the entry. When Geth starts, you can append the current time, and on shutdown, you can pop off the tail. That way on startup you can actually list all the recent crashes, not just the last one.

@holiman holiman marked this pull request as ready for review November 23, 2020 17:03
@holiman
Copy link
Contributor Author

holiman commented Nov 23, 2020

Example execution

[user@work go-ethereum]$ ./build/bin/geth --goerli --nodiscover --maxpeers 0
...
INFO [11-23|18:59:36.189] Loaded most recent local header          number=3798046 hash="6bbd9e…118ec3" td=5559455 age=1d6h55m
INFO [11-23|18:59:36.189] Loaded most recent local full block      number=0       hash="bf7e33…b88c1a" td=1       age=1y10mo3d
INFO [11-23|18:59:36.189] Loaded most recent local fast block      number=3798046 hash="6bbd9e…118ec3" td=5559455 age=1d6h55m
INFO [11-23|18:59:36.189] Loaded last fast-sync pivot marker       number=3800031
INFO [11-23|18:59:36.193] Loaded local transaction journal         transactions=0 dropped=0
INFO [11-23|18:59:36.193] Regenerated local transaction journal    transactions=0 accounts=0
INFO [11-23|18:59:36.207] Allocated fast sync bloom                size=512.00MiB
WARN [11-23|18:59:36.207] Unclean shutdown(s) detected             latest="1h23m26s ago" count=3
INFO [11-23|18:59:36.207] Starting peer-to-peer node               instance=Geth/v1.9.25-unstable-bf2aedd9-20201123/linux-amd64/go1.15.5
...

And after yet another SIGKILL

WARN [11-23|19:01:20.512] Unclean shutdown(s) detected             latest="16s ago" count=4

@karalabe
Copy link
Member

I think in general it would be more useful to log each shutdown separately.

E.g.

WARN [..] Unclean shutdown detected    time=timestamp age=1m ago

You can also use PrettyDuration to not have too big of a age string there, and I'd also use age=... vs latest=... ago because it allows you to get rid of a quotation mark :)

@karalabe
Copy link
Member

Otherwise we only knwo that there was 5 shutdown, but we don;t know if they are recent or old, so it's not particularly useful.

We might also add some form of cleanup so we don't keep too many old events?

@holiman
Copy link
Contributor Author

holiman commented Nov 23, 2020

Yes, if we show many we definitely need to clean them up. So, keep the most recent 5? 10?

@karalabe
Copy link
Member

10 recent + a counter to how many older ones there were?

@holiman
Copy link
Contributor Author

holiman commented Nov 23, 2020

Ok!
Now it looks like this:

WARN [11-23|23:05:35.803] Old unclean shutdowns found              count=1
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T22:56:03+0100 age=9m32s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T22:56:15+0100 age=9m20s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:00:49+0100 age=4m46s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:12+0100 age=4m23s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:23+0100 age=4m12s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:27+0100 age=4m8s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:31+0100 age=4m4s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:36+0100 age=3m59s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:42+0100 age=3m53s
WARN [11-23|23:05:35.803] Unclean shutdown detected                time=2020-11-23T23:01:47+0100 age=3m48s
INFO [11-23|23:05:35.803] Starting peer-to-peer node               instance=Geth/v1.9.25-unstable-bf2aedd9-20201123/linux-amd64/go1.15.5

@holiman
Copy link
Contributor Author

holiman commented Nov 23, 2020

The very first time it runs, it prints out

WARN [11-23|23:12:10.152] Error reading USM                        error="leveldb: not found"

Copy link
Member

@rjl493456442 rjl493456442 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicks

core/rawdb/accessors_metadata.go Outdated Show resolved Hide resolved
var uncleanShutdowns ucmList
// Read old data
if data, err := db.Get(uncleanShutdownKey); err != nil {
log.Warn("Error reading USM", "error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps: Failed to read the unclean shutdown markers?

core/rawdb/accessors_metadata.go Outdated Show resolved Hide resolved
core/rawdb/accessors_metadata.go Outdated Show resolved Hide resolved
core/rawdb/accessors_metadata.go Outdated Show resolved Hide resolved
core/rawdb/schema.go Show resolved Hide resolved
core/rawdb/accessors_metadata.go Outdated Show resolved Hide resolved
core/rawdb/accessors_metadata.go Outdated Show resolved Hide resolved
if discards > 0 {
log.Warn("Old unclean shutdowns found", "count", discards)
}
for _, tstamp := range uncleanShutdowns {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's inaccurate actually. It's the timestart of the START, not the CRASH. Usually geth will run for a very long time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have an underlying thread to update the shutdown marker e.g. every 1 minute? For the cost wise, it's basically no overhead but at least we can decrease the bias

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could do that in a follow-up PR?
For now I can replace time with booted, to clarify that it's the startup timestamp, not the crash timestamp

// Add a new (but cap it)
uncleanShutdowns.Recent = append(uncleanShutdowns.Recent, uint64(time.Now().Unix()))
if l := len(uncleanShutdowns.Recent); l > crashesToKeep+1 {
uncleanShutdowns.Recent = uncleanShutdowns.Recent[l-crashesToKeep-1:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a pretty fancy way to discard the first element, wouldn't Recent[1:] suffice?
(I assume that only one element can be added/discarded, since Discarded++ is not Discarded += len(Recents)-crashesToKeep)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Myeah, I intended for it to handle the case where we discard more than one, e.g. if we decide to change the limit to 5 or there's some mishap.
So the proper thing to do is rather to uncrease uncleanShutdowns.Discarded appropriately. Good catch!

Copy link
Member

@MariusVanDerWijden MariusVanDerWijden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@holiman holiman merged commit 4d48980 into ethereum:master Dec 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants