[DPE-4575][DPE-4886][DPE-4983] Add voting exclusions management #367
While before this cleanup was done on every `update_status`, now it's only done when the health is `green`. Is this on purpose?

That is not the case... it is done in every case except `HealthColors.UNKNOWN`. Indeed, we defer the event if it is not green. I put it down there because I need the API to be responsive before configuring the voting exclusions; if it is not responsive, we will get UNKNOWN anyway and retry later. I will add some comments to clarify that.

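A minimal sketch of the ordering described above, assuming hypothetical helper names (`cleanup_voting_exclusions`, `defer`) rather than the charm's actual handlers: the cleanup runs for every health colour except UNKNOWN, and a non-green health only defers the event afterwards.

```python
from enum import Enum


class HealthColors(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"
    UNKNOWN = "unknown"


def on_update_status(event, health, cleanup_voting_exclusions, defer):
    # Hypothetical handler, not the charm's real signature.
    if health == HealthColors.UNKNOWN:
        # API not responsive yet; nothing useful can be done, retry later.
        defer(event)
        return

    # Reached for GREEN, YELLOW and RED alike.
    cleanup_voting_exclusions()

    if health != HealthColors.GREEN:
        # Not green yet: keep the event around and try again on the next hook.
        defer(event)
```
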
Is there a reason why the shards allocation exclusion cleanup is postponed until later in the hook? As long as there is connectivity to a host, we should be able to clean up.

Yes, the health checks below let anything pass unless the cluster is in a really bad state (i.e. UNKNOWN). So I moved the check below these first health checks, because it makes more sense there.

What will happen here if no other node is online (`self.alt_hosts == []`)? I believe this will try for 5 min and eventually crash with a RetryError.

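To illustrate the failure mode being asked about, here is a hedged sketch; the helper name, the timings, and the use of tenacity are assumptions, not the charm's code. An empty `alt_hosts` list would keep retrying for roughly five minutes and then surface as a `RetryError`:

```python
from tenacity import Retrying, stop_after_delay, wait_fixed


def first_reachable_host(alt_hosts: list[str]) -> str:
    # Illustrative only: retries for ~5 minutes, then tenacity raises RetryError.
    for attempt in Retrying(stop=stop_after_delay(300), wait=wait_fixed(10)):
        with attempt:
            if not alt_hosts:
                # No alternative host to query; keep retrying until the deadline.
                raise ConnectionError("no alternative hosts available")
            return alt_hosts[0]


# first_reachable_host([])  # raises tenacity.RetryError after roughly 5 minutes
```
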
@Mehdi-Bendriss so, `self.alt_hosts` will not return the local host in the list? How can I get it then?

Changed the way we call both ClusterTopology's `elected_manager` and `nodes`. They will now do more checks on `self.alt_hosts`.

Could you put back those comments? The description is still current and valid.

The first description, yes. The second, "TODO: we should probably NOT have any exclusion on restart", is outdated. Now we only add nodes to the voting exclusions if we are scaling between 3 and 1 nodes, or vice versa.

This is not needed. `self.alt_hosts` is enough; it already contains all possible alternative hosts, incl. from the large-deployment relations. Note: please do not rely on `rel_data` for as long as you can - we should aim for a clear separation of concerns. The two checks can be replaced more robustly with `self.opensearch_peer_cm.is_consumer()`.

If `self.alt_hosts` returns the right value in any case, do I still need that if and `self.opensearch_peer_cm.is_consumer()` then?

Can you explain this? If we call `settle_voting_exclusions` from `post_start_init`, why should we have this check + exception?

First, the settle-voting logic must be executed synchronously with a (re)start. This is why I am adding these two retry loops (one now, another at its end). If we have a problem here, it may cause the entire cluster to get stuck in the "metadata error" we were seeing before.
Because it takes a while between powering up and actually having the service up, this retry loop ensures we wait long enough for the OpenSearch service to come up.

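A minimal sketch of that "wait for the service, then settle voting" ordering, assuming illustrative names (`wait_until_responsive`, `settle_voting_exclusions`) and a plain HTTP health probe rather than the charm's own request helpers:

```python
import time

import requests


def wait_until_responsive(host: str, attempts: int = 30, delay: float = 10.0) -> None:
    """Block until the local OpenSearch node answers its health endpoint."""
    for _ in range(attempts):
        try:
            requests.get(f"https://{host}:9200/_cluster/health", verify=False, timeout=5)
            return
        except requests.RequestException:
            time.sleep(delay)
    raise RuntimeError("opensearch did not come up in time")


def post_start_init(host: str, settle_voting_exclusions) -> None:
    # The service is started asynchronously and the API only becomes usable a
    # bit later, so the settle logic waits for it instead of failing immediately.
    wait_until_responsive(host)
    settle_voting_exclusions()
```
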
Can you explain why this is needed? If the cluster scales down from 2 to 1 unit, doesn't that mean that an election will happen anyway?

If we made it to this point, it means any previous voting exclusions have already been removed, so we have 2x voting units available in this cluster.
The main problem now is that with 2x voting units, the minimum quorum for us to keep working is 2: if I remove one unit, the other cannot re-elect. So I apply this exclusion to reduce the voting number down to 1x unit, and also move the cluster manager if the unit going away was the manager.

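As a worked example of the quorum arithmetic behind this (the standard majority-of-voters rule, not code from the charm):

```python
def quorum(voting_units: int) -> int:
    # A cluster manager election needs a strict majority of the voting configuration.
    return voting_units // 2 + 1


print(quorum(2))  # 2 -> losing one of two voters leaves 1 < 2: no re-election possible
print(quorum(1))  # 1 -> after excluding the departing unit, the survivor can elect itself
```
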
It must happen before stopping, so we are sure the metadata will be updated on the unit that is staying.
This is the only way I found in my own tests to safely downscale from 2 -> 1 units.

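A rough sketch of that ordering against the cluster voting-exclusions REST endpoint; the helper names and the plain `requests` calls are illustrative assumptions, not the charm's implementation:

```python
import requests


def exclude_from_voting(host: str, node_name: str) -> None:
    # The departing unit is added to the voting exclusions while it is still
    # running, so the surviving unit's cluster metadata is updated first.
    resp = requests.post(
        f"https://{host}:9200/_cluster/voting_config_exclusions",
        params={"node_names": node_name},
        verify=False,
        timeout=30,
    )
    resp.raise_for_status()


def scale_down_to_one(host: str, departing_node: str, stop_service) -> None:
    exclude_from_voting(host, departing_node)
    stop_service()  # only stop once the exclusion has been acknowledged
```
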
I was wondering why you chose to exclude node [0] from voting and remove it from the list? This will result in a new cluster manager node being elected when you scale up from 1 unit to more, and in the process of removing the application. I just saw this locally when testing; not that it creates that much latency, but wouldn't it be better to use the last one from the list instead?

So, @reneradoi, yes, I noticed that as well. But when I was discussing this with @Mehdi-Bendriss, we agreed to make this list predictable instead of keeping track of the cluster manager. I could add a check here for that, but then the other check, right before it, gets slightly more complicated.
Given this is quite an exception (i.e. going from 1 -> 2 or 2 -> 1), I took the simpler approach.

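A small sketch of the "predictable over manager-aware" choice; sorting by node name is an assumption used here only to show the deterministic pick:

```python
def node_to_exclude(node_names: list[str]) -> str:
    # Always pick the same node regardless of which one is the elected manager.
    # This may trigger a manager switchover, which is accepted because
    # re-election is much faster than a Juju hook.
    return sorted(node_names)[0]


print(node_to_exclude(["opensearch-1", "opensearch-0"]))  # opensearch-0, every time
```
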
I can improve the comments around here; this logic is pretty brittle, to be honest.

I think I overlooked and missed a safety component there: if we remain with 2 units and the one being removed is not the current elected CM, maybe it makes sense to add the other unit to the voting exclusion instead.
This should reduce switchovers and the risk they entail.

@Mehdi-Bendriss I would not recommend that. The main reason is that moving the elected manager between nodes is far faster than Juju hooks. We need to be predictable in this specific case, even if it means moving the elected manager.