
roachtest: add failover/liveness #93039

Merged: 1 commit into cockroachdb:master on Dec 8, 2022

Conversation

erikgrinaker
Contributor

This patch adds a roachtest that measures the duration of user range unavailability following a liveness leaseholder failure, as well as the number of expired leases. When the liveness range is unavailable, other nodes are unable to heartbeat and extend their leases, which can cause them to expire and these ranges to become unavailable as well.

The test sets up a 4-node cluster with all other ranges on n1-n3, and the liveness range on n1-n4 with the lease on n4. A kv workload is run against n1-n3 while n4 fails and recovers repeatedly (both with process crashes and network outages). Workload latency histograms are recorded, where the pMax latency is a measure of the failure impact, as well as the replicas_leaders_invalid_lease metric over time.
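The failure loop at the heart of the test roughly follows the shape sketched below. This is a minimal, hypothetical sketch rather than the actual test code: the failer interface and helper names stand in for the abstractions in pkg/cmd/roachtest/tests/failover.go.

package tests

import (
	"context"
	"time"
)

// failer abstracts the two failure modes the test exercises:
// process crashes and network outages. Hypothetical interface.
type failer interface {
	Fail(ctx context.Context, nodeID int)    // kill the process or drop its traffic
	Recover(ctx context.Context, nodeID int) // restart the process or restore traffic
}

// runFailureLoop repeatedly fails and recovers n4 while a kv workload
// runs against n1-n3, so the workload's latency histograms capture the
// impact of each liveness leaseholder failure.
func runFailureLoop(ctx context.Context, f failer, iterations int) {
	for i := 0; i < iterations; i++ {
		// A real test would first relocate the liveness lease back to
		// n4 here, since recovery may have moved it elsewhere.
		time.Sleep(time.Minute) // let the workload settle
		f.Fail(ctx, 4)          // fail the liveness leaseholder
		time.Sleep(time.Minute) // measure unavailability while n4 is down
		f.Recover(ctx, 4)
	}
}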

Touches #88443.

Epic: none
Release note: None

@erikgrinaker erikgrinaker self-assigned this Dec 5, 2022
@erikgrinaker erikgrinaker requested a review from tbg December 5, 2022 15:21
@cockroach-teamcity
Member

This change is Reviewable

@erikgrinaker erikgrinaker requested review from a team December 5, 2022 15:21
@erikgrinaker erikgrinaker force-pushed the roachtest-failover-liveness branch from aac268c to 4900a2f Compare December 5, 2022 15:44
@erikgrinaker erikgrinaker changed the title from "roachtest: add `failover/liveness" to "roachtest: add failover/liveness" on Dec 5, 2022
Member

@tbg tbg left a comment


Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/cmd/roachtest/tests/failover.go line 274 at r1 (raw file):

	// Setup the prometheus instance and client. We don't collect metrics from n4
	// (the failing node) because it's occasionally offline, and StatsCollector

cc @kvoli that seems like an annoying limitation (if I'm reading this right Erik is saying that a missed scrape throws StatsCollector off?), is this difficult to fix?

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)


pkg/cmd/roachtest/tests/failover.go line 274 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

cc @kvoli that seems like an annoying limitation (if I'm reading this right Erik is saying that a missed scrape throws StatsCollector off?), is this difficult to fix?

Yeah, it errors out here if the time series have different numbers of samples:

if streamSize != len(series) {
	return ret, errors.Newf(
		"Differing lengths on stream size on query %s, expected %d, actual %d",
		summaryQuery.Stat.Query,
		streamSize,
		len(series),
	)
}

@erikgrinaker erikgrinaker force-pushed the roachtest-failover-liveness branch from 4900a2f to 8136a2c Compare December 6, 2022 16:43
Collaborator

@kvoli kvoli left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/cmd/roachtest/tests/failover.go line 274 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yeah, it errors out here if the time series have different numbers of samples:

if streamSize != len(series) {
	return ret, errors.Newf(
		"Differing lengths on stream size on query %s, expected %d, actual %d",
		summaryQuery.Stat.Query,
		streamSize,
		len(series),
	)
}

It is an annoying limitation. We could fix it by matching timestamps, but there are some manual aggregation functions that probably don't handle this. It would be a medium-sized change.
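For reference, a rough sketch of what matching timestamps could look like, using a simplified local sample type instead of the real StatsCollector types (purely illustrative, not the actual fix):

package tests

// samplePair is a simplified stand-in for a Prometheus sample.
type samplePair struct {
	TimestampMS int64
	Value       float64
}

// alignSeries keeps only the timestamps present in every series, so
// downstream aggregation sees equal-length streams even when one
// series missed a scrape. Assumes timestamps are unique per series.
func alignSeries(series map[string][]samplePair) map[string][]samplePair {
	// Count how many series contain each timestamp.
	counts := make(map[int64]int)
	for _, s := range series {
		for _, p := range s {
			counts[p.TimestampMS]++
		}
	}
	// Keep a sample only if its timestamp appears in all series.
	aligned := make(map[string][]samplePair, len(series))
	for name, s := range series {
		for _, p := range s {
			if counts[p.TimestampMS] == len(series) {
				aligned[name] = append(aligned[name], p)
			}
		}
	}
	return aligned
}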

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @tbg)


pkg/cmd/roachtest/tests/failover.go line 274 at r1 (raw file):

Previously, kvoli (Austen) wrote…

It is an annoying limitation. We could fix it by matching timestamps but there's some manual aggregation functions that probably don't handle this. It would be a med size change.

Not important for my purposes.

roachtest: add failover/liveness (final commit message, 8136a2c):

This patch adds a roachtest that measures the duration of *user* range
unavailability following a liveness leaseholder failure, as well as the
number of expired leases. When the liveness range is unavailable, other
nodes are unable to heartbeat and extend their leases, which can cause
them to expire and these ranges to become unavailable as well.

The test sets up a 4-node cluster with all other ranges on n1-n3, and
the liveness range on n1-n4 with the lease on n4. A kv workload is run
against n1-n3 while n4 fails and recovers repeatedly (both with process
crashes and network outages). Workload latency histograms are recorded,
where the pMax latency is a measure of the failure impact, as well as
the `replicas_leaders_invalid_lease` metric over time.

Epic: none
Release note: None
@erikgrinaker
Contributor Author

bors r+

@craig
Contributor

craig bot commented Dec 8, 2022

Build failed:

@erikgrinaker
Contributor Author

bors retry

@craig
Contributor

craig bot commented Dec 8, 2022

Build succeeded:

@craig craig bot merged commit 6b7290d into cockroachdb:master Dec 8, 2022
@erikgrinaker erikgrinaker deleted the roachtest-failover-liveness branch December 28, 2022 11:26