Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include last-committed data in publication #92259

Conversation

DaveCTurner
Copy link
Contributor

@DaveCTurner DaveCTurner commented Dec 9, 2022

The cluster coordination consistency layer relies on a couple of fields within Metadata which record the last committed values on each node. In contrast, the rest of the cluster state can only be changed at accept time.

In the past we would copy these fields over from the master on every publication, but since #90101 we don't copy anything at all if the Metadata is unchanged on the master. However, the master computes the diff against the last committed state whereas the receiving nodes apply the diff to the last accepted state, and this means if the master sends a no-op Metadata diff then the receiving node will revert its last-committed values to the ones included in the state it last accepted.

With this commit we include the last-committed values alongside the cluster state diff so that they are always copied properly.

Closes #90158

@DaveCTurner DaveCTurner added >bug WIP :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.7.0 labels Dec 9, 2022
@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Dec 9, 2022

Alternative to #92258 that doesn't involve changing the safety algorithm. Not as bad as I thought it would be.

Repeating here: I think this bug is more of a liveness concern than a safety one: it means that an election may incorrectly wait to collect votes from the previous configuration as well as the current one.

NB no tests yet, I was to see this run through CI first. (edit: now has tests)

The cluster coordination consistency layer relies on a couple of fields
within `Metadata` which record the last _committed_ values on each node.
In contrast, the rest of the cluster state can only be changed at
_accept_ time.

In the past we would copy these fields over from the master on every
publication, but since elastic#90101 we don't copy anything at all if the
`Metadata` is unchanged on the master. However, the master computes the
diff against the last _committed_ state whereas the receiving nodes
apply the diff to the last _accepted_ state, and this means if the
master sends a no-op `Metadata` diff then the receiving node will revert
its last-committed values to the ones included in the state it last
accepted.

With this commit we include the last-committed values alongside the
cluster state diff so that they are always copied properly.
@DaveCTurner DaveCTurner force-pushed the 2022-12-09-include-committed-fields-in-publication branch from 173d526 to ad1574f Compare December 11, 2022 19:50
@DaveCTurner DaveCTurner removed the WIP label Dec 11, 2022
@DaveCTurner DaveCTurner marked this pull request as ready for review December 11, 2022 20:37
@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Dec 11, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Copy link
Contributor Author

DaveCTurner commented Dec 11, 2022

We should backport this to 8.6, but I'm unsure if it'll land in 8.6.0 or 8.6.1. It's not serious enough to call it a blocker for 8.6.0 IMO. However the 8.6.1 version constant doesn't exist yet so I am going to have to set the BwC version to 8.6.0 while backporting, and then we may have to adjust this when bumping versions after the 8.6.0 release.

cc @henningandersen who may have a different opinion about calling this a blocker for 8.6.0.
cc @kingherc who I believe will be doing the 8.6.0 version bump.

@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, probably to be considered questions.

final var hasClusterUuid = lastAcceptedState.metadata().clusterUUID().equals(Metadata.UNKNOWN_CLUSTER_UUID) == false;
assert hasClusterUuid : "received cluster state with empty cluster uuid: " + lastAcceptedState;

if (hasClusterUuid && lastAcceptedState.metadata().clusterUUIDCommitted() == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels odd to protect the logging by something we assert on above?

Suggested change
if (hasClusterUuid && lastAcceptedState.metadata().clusterUUIDCommitted() == false) {
if (lastAcceptedState.metadata().clusterUUIDCommitted() == false) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehh I just kept the same logic as before but I don't think I've ever seen this assertion trip. I'd be ok to drop it.

setLastAcceptedState(ClusterState.builder(lastAcceptedState).metadata(metadataBuilder).build());

final var adjustedMetadata = lastAcceptedState.metadata()
.withLastCommittedValues(hasClusterUuid, lastAcceptedState.getLastAcceptedConfiguration());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass in hasClusterUuid || lastAcceptedState.metadata().clusterUUIDCommitted() here? I mean, we are outside assertions already, so could also pass in true. But assuming assertions have some case we are trying to protect against, passing in the original committed value seems safer and also in line with original code.

@@ -556,27 +556,18 @@ public interface PersistedState extends Closeable {
* marked as committed.
*/
default void markLastAcceptedStateAsCommitted() {
final ClusterState lastAcceptedState = getLastAcceptedState();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code is cleaner, but I failed to spot the important difference, is it just the logging you wanted out of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this does anything different, there's just no need to use a Metadata.Builder here any more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. In that case, let us revert this and move this to a follow-up, where we can also assume that assertions are held (i.e., use true instead of hasClusterUuid later in the code).

@@ -86,6 +86,8 @@ public class PublicationTransportHandler {
TransportRequestOptions.Type.STATE
);

public static final Version INCLUDES_LAST_COMMITTED_DATA_VERSION = Version.V_8_7_0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveCTurner should this be 8.6.0 since it's intended to be backported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll adjust that while backporting. If I set this to 8.6.0 now then the BwC test suite would fail on this PR, but it's important for those tests to pass because they tell us whether there are any other more subtle BwC problems here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Joining the dots here:

@DaveCTurner
Copy link
Contributor Author

I reverted the tidy-up to CoordinationState in 768a864 because it's not necessary for the backport. I'll follow-up with that change in a separate PR

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Dec 12, 2022
@elasticsearchmachine elasticsearchmachine merged commit fb756a9 into elastic:main Dec 12, 2022
@DaveCTurner DaveCTurner deleted the 2022-12-09-include-committed-fields-in-publication branch December 12, 2022 11:56
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Dec 12, 2022
The cluster coordination consistency layer relies on a couple of fields
within `Metadata` which record the last _committed_ values on each node.
In contrast, the rest of the cluster state can only be changed at
_accept_ time.

In the past we would copy these fields over from the master on every
publication, but since elastic#90101 we don't copy anything at all if the
`Metadata` is unchanged on the master. However, the master computes the
diff against the last _committed_ state whereas the receiving nodes
apply the diff to the last _accepted_ state, and this means if the
master sends a no-op `Metadata` diff then the receiving node will revert
its last-committed values to the ones included in the state it last
accepted.

With this commit we include the last-committed values alongside the
cluster state diff so that they are always copied properly.

Closes elastic#90158
Backport of elastic#92259 to 8.6
DaveCTurner added a commit that referenced this pull request Dec 12, 2022
The cluster coordination consistency layer relies on a couple of fields
within `Metadata` which record the last _committed_ values on each node.
In contrast, the rest of the cluster state can only be changed at
_accept_ time.

In the past we would copy these fields over from the master on every
publication, but since #90101 we don't copy anything at all if the
`Metadata` is unchanged on the master. However, the master computes the
diff against the last _committed_ state whereas the receiving nodes
apply the diff to the last _accepted_ state, and this means if the
master sends a no-op `Metadata` diff then the receiving node will revert
its last-committed values to the ones included in the state it last
accepted.

With this commit we include the last-committed values alongside the
cluster state diff so that they are always copied properly.

Closes #90158
Backport of #92259 to 8.6
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Dec 12, 2022
Adjusts the BWC version for the wire protocol and re-enables the BWC
tests.
elasticsearchmachine pushed a commit that referenced this pull request Dec 12, 2022
Adjusts the BWC version for the wire protocol and re-enables the BWC
tests.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Dec 12, 2022
DaveCTurner added a commit that referenced this pull request Dec 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport pending >bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.6.0 v8.7.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] CoordinatorTests testUnhealthyLeaderIsReplaced failing
4 participants