roachtest: add failover/liveness
#93039
aac268c to 4900a2f
failover/liveness
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/cmd/roachtest/tests/failover.go
line 274 at r1 (raw file):
// Setup the prometheus instance and client. We don't collect metrics from n4
// (the failing node) because it's occasionally offline, and StatsCollector
cc @kvoli that seems like an annoying limitation (if I'm reading this right Erik is saying that a missed scrape throws StatsCollector off?), is this difficult to fix?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/cmd/roachtest/tests/failover.go
line 274 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
cc @kvoli that seems like an annoying limitation (if I'm reading this right Erik is saying that a missed scrape throws StatsCollector off?), is this difficult to fix?
Yeah, it errors out here if the time series have different numbers of samples:
cockroach/pkg/cmd/roachtest/clusterstats/exporter.go
Lines 254 to 261 in ee0fa07
if streamSize != len(series) {
	return ret, errors.Newf(
		"Differing lengths on stream size on query %s, expected %d, actual %d",
		summaryQuery.Stat.Query,
		streamSize,
		len(series),
	)
}
4900a2f to 8136a2c
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/cmd/roachtest/tests/failover.go
line 274 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yeah, it errors out here if the time series have different numbers of samples:
cockroach/pkg/cmd/roachtest/clusterstats/exporter.go
Lines 254 to 261 in ee0fa07
if streamSize != len(series) {
	return ret, errors.Newf(
		"Differing lengths on stream size on query %s, expected %d, actual %d",
		summaryQuery.Stat.Query,
		streamSize,
		len(series),
	)
}
It is an annoying limitation. We could fix it by matching timestamps, but there are some manual aggregation functions that probably don't handle this. It would be a medium-sized change.
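As a rough illustration of the timestamp-matching idea, here is a minimal sketch. It is not the actual `clusterstats` code: the `sample` type and the `alignByTimestamp` helper are hypothetical, and it assumes the samples of every series fall on a shared step grid (as in Prometheus range-query results), so equal timestamps can be compared directly.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical (timestamp, value) pair, not the clusterstats type.
type sample struct {
	ts    int64 // sample time, Unix seconds
	value float64
}

// alignByTimestamp keeps only the timestamps that appear in every series,
// so a series with a missed scrape ends up with the same length as the rest.
func alignByTimestamp(series [][]sample) [][]sample {
	// Count in how many series each timestamp occurs (once per series).
	counts := make(map[int64]int)
	for _, s := range series {
		seen := make(map[int64]bool)
		for _, p := range s {
			if !seen[p.ts] {
				seen[p.ts] = true
				counts[p.ts]++
			}
		}
	}

	out := make([][]sample, len(series))
	for i, s := range series {
		aligned := []sample{}
		seen := make(map[int64]bool)
		for _, p := range s {
			// Keep the first sample for each timestamp shared by all series.
			if counts[p.ts] == len(series) && !seen[p.ts] {
				seen[p.ts] = true
				aligned = append(aligned, p)
			}
		}
		sort.Slice(aligned, func(a, b int) bool { return aligned[a].ts < aligned[b].ts })
		out[i] = aligned
	}
	return out
}

func main() {
	n1 := []sample{{10, 1}, {20, 2}, {30, 3}}
	n4 := []sample{{10, 5}, {30, 7}} // the sample at t=20 is missing
	aligned := alignByTimestamp([][]sample{n1, n4})
	fmt.Println(len(aligned[0]), len(aligned[1]))
}
```

Dropping unmatched timestamps is only one option; any aggregation that expects a fixed sample count per interval would still need to be checked against the shortened series.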
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @tbg)
pkg/cmd/roachtest/tests/failover.go
line 274 at r1 (raw file):
Previously, kvoli (Austen) wrote…
It is an annoying limitation. We could fix it by matching timestamps, but there are some manual aggregation functions that probably don't handle this. It would be a medium-sized change.
Not important for my purposes.
This patch adds a roachtest that measures the duration of *user* range unavailability following a liveness leaseholder failure, as well as the number of expired leases. When the liveness range is unavailable, other nodes are unable to heartbeat and extend their leases, which can cause them to expire and these ranges to become unavailable as well.
The test sets up a 4-node cluster with all other ranges on n1-n3, and the liveness range on n1-n4 with the lease on n4. A kv workload is run against n1-n3 while n4 fails and recovers repeatedly (both with process crashes and network outages). Workload latency histograms are recorded, where the pMax latency is a measure of the failure impact, as well as the `replicas_leaders_invalid_lease` metric over time.
Epic: none
Release note: None
8136a2c to a2f9adb
bors r+
Build failed:
bors retry
Build succeeded:
This patch adds a roachtest that measures the duration of user range unavailability following a liveness leaseholder failure, as well as the number of expired leases. When the liveness range is unavailable, other nodes are unable to heartbeat and extend their leases, which can cause them to expire and these ranges to become unavailable as well.
The test sets up a 4-node cluster with all other ranges on n1-n3, and the liveness range on n1-n4 with the lease on n4. A kv workload is run against n1-n3 while n4 fails and recovers repeatedly (both with process crashes and network outages). Workload latency histograms are recorded, where the pMax latency is a measure of the failure impact, as well as the `replicas_leaders_invalid_lease` metric over time.
Touches #88443.
Epic: none
Release note: None