Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Domain-specific reboot minimisation enhancement #159

Merged
merged 19 commits into from
Jul 13, 2020
Merged
Changes from 18 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
247 changes: 247 additions & 0 deletions enhancements/mcd-reboot-exceptions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,247 @@
---
title: mcd-reboot-dictionary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "dictionary" does not appear anywhere else in this document. It is unclear what the name "mcd-reboot-dictionary" means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that was the old title. Will look for any other occurrences too

authors:
- "@beekhof"
reviewers:
- "@crawford"
- "@eparis"
- "@rphillips"
- "@runcom"
- "@yuqi-zhang"
- "@kikisdeliveryservice"
- "@sinnykumari"
- "@ericavonb"
approvers:
- "@crawford"
- "@eparis"
- "@rphillips"
- "@runcom"
creation-date: 2019-12-15
last-updated: 2020-06-18
status: provisional
see-also:
- NA
replaces:
- NA
superseded-by:
- NA
---

# Domain specific minimisation of config driven cluster reboots

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [x] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Open Questions [optional]

- Should the parts of the configuration that do and don't require reboot be
part of user facing documentation?

## Summary

A node's configuration is managed by the MCO. When the configuation is changed
by the system or an admin, the MCO and MCD write a new version of the file to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MCD does the write (nit) the MCO just starts the whole process in particular cases (eg the network config changes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully the new wording is more precise :)

disk and reboot the system to ensure it is applied.

This proposal is to allow drain and reboot to be optional and independant
actions, to implement the ability to restart a systemd service, and to add a
function to MCD that compares two MachineConfigs and can decide which actions
are needed to apply it.


## Motivation

The current MCO mechanism for applying config changes, is to reboot the machine.

While it is expected and normal that any cluster node can reboot at any time,
clusters have a practical limit to the number of full cluster reboots they can
achieve in a given time period.

This means we need to weigh the correctness and simplicity of using full cluster
reboots, against their frequency and user impact.

In environments where the limit is generally lower, such as bare metal, the
balance can be different - creating a need to find alternative ways to apply
configuration changes for some scenarios.

### Goals

Provide a mechanism for avoiding a full cluster reboot in domain specific
scenarios where engineering has verified that the configuration change:

1. can be proven to be safely and immediately applied by some other mechanism,
1. is not combined with other changes that do require a reboot
1. has a demonstrable negative user impact when applied via reboot

### Non-Goals

1. Removing all reboots from the system
1. Allowing the admin to configure additional domains or strategies for applying
configuration updates

## Proposal

1. Add a function to MCD that compares two MachineConfigs, and outputs an
ordered list of known actions that are needed to apply it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it know what changes require what actions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the absence of any other information, the function would normally return drain+reboot if there is any difference detected.

The intention is that in very specific cases where an alternative has been proven to be safe (such as drain + service restart for changes that only affect ICSP), the function will modified to return a different set of actions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the answer here would be that we're going to base this on which files changed between 2 rendered machine configs and compute the actions based on that - we can add whatever complexity we want here also. I think the concern has been captured here #159 (comment) (addition vs deletion mainly)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I mean, how does it know that modifications to file X require a reboot but modifications to file Y don't? Especially, is this something which is hardcoded into the MCO by us, or is it something configured at runtime by the administrator?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I mean, how does it know that modifications to file X require a reboot but modifications to file Y don't?

the MCO only knows that a modification to a file X "doesn't require a reboot" - anything else will fall into "reboot as always"

Especially, is this something which is hardcoded into the MCO by us, or is it something configured at runtime by the administrator?

The aim is to avoid leaking this choice to the admin so this logic is gonna be computed by us in the MCO (and we drive that for future cases where we might want to avoid a reboot)

1. Modify any parts of MCD that assume drain and reboot always happens after
writing to disk
1. Modify MCD to perform only the list of actions produced in step 1

The full list of post-write actions required for this feature to be useful are:

- drain
- reboot
- restart a named systemd service

If the action results in an error, or takes too long, a reboot will be
performed as in prior 4.x versions.

### User Stories [optional]

The user stories below are intended to be illustrative of the different types of
services we need to build capabilities for.

#### Story 1

As an administrator of a disconnected bare metal cluster, I want the ability to
update the image catalog mirror[1] without potentially causing a full cluster
reboot, so that I can obtain the latest fixes quickly and without needing to be
concerned about the cumulative disruption to the cluster.

[1] As well as the ICSP associated with it, which is sensitive to the addition,
subtraction, or renaming of images.

#### Story 2

As an adminstrator of a large bare metal cluster, I want the addition or removal
of SSH keys to happen without a full cluster reboot, so that the credentials are
usable (or revoked) in a timeframe comparable with traditional environments not
hours.

#### Story 3

As an adminstrator of a bare metal cluster that takes a day to reboot, I want
the system to use the least invasive method for applying configuration changes,
so that I have periods of relative stability where I can accomodate the
maintenance windows of less enlightened vendors.

#### Story 4

As an admin, I want to continue to force reboots for network configuration
updates and changes to kernel flags, so that they are applied in a timely
manner.


### Implementation Details/Notes/Constraints [optional]

How the comparision determines what constitutes a domain, whether by target file
or by configuration element, is left for later discussion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear what "domain" means here. It seems to mean something much more specific than in, eg, "Domain specific minimisation of config driven cluster reboots"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded to explain the intended usage/meaning


### Risks and Mitigations

The risks are of false positives (an update that requires a reboot does not
trigger one), and false negatives (an update that does not require a reboot
triggers one anyway). While the latter may be unexpected by an admin that knows
of this feature, at worst it is a performance issue and cannot be considered a
regression over the current behaviour.

False negatives can be mitigated by defaulting to the current behaviour
(reboot), preventing the admin from tampering with the list of domains with
exceptions, and only adding new domains after Engineering and QE validate that
the changes are safe.

## Design Details

### Test Plan

- Unit tests for the comparision function that cover
- Identical, missing, and invalid MachineConfigs
- Changes to domains for which no exception exists
- Changes to only domains for which a 'do nothing' exception exists
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no previous explanation of "exceptions"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an explanation: the case when unit of configuration can be applied by a means other than the default drain+reboot behaviour

- Changes to only domains for which a 'service restart' exception exists
- Changes to only domains for which a 'drain + service restart' exception exists
- Changes to domains for which exceptions exists, combined with changes which require reboot
- Changes to only domains for which an 'do nothing' exception exists
- Performance testing of the comparison function
- Unit tests for handling unknown/invalid actions from the comparision function
- Unit tests for handling actions that fail or takes too long
- Testing that covers failures during the write phase
- Testing that covers on-disk changes detected during the write phase

### Graduation Criteria

##### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- Sufficient test coverage
- Support for handling ICSP changes

##### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback

##### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

As per the current MCD strategy.

### Version Skew Strategy

If the MCD does not contain this feature, or has an outdated view of how an
change needs to be applied, then the first change that affects a new entry will
trigger a reboot (which can not be a regression) and subsequent changes will be
acted on with the new information. Any changes affecting entries that have been
removed will have the old behaviour (also not a regression).

If the MCD relies on capabilities that the rest of the system doesn't have yet,
then there is a risk that a reboot will not occur for a version of the component
that still requires it. This could be mitigated through proper error handling
and/or ensuring that the making an exception for a new domain always happens N
versions after any engineering work that was required to make it possible.

## Implementation History

- Initial version: 15-12-2019
- Updated: 18-06-2020

## Drawbacks

The risks primarily fall into 4 categories:

- configuration: that domains could be be added to the comparision function even
though they require a reboot in unidentified scenarios,

- behavioural: that some config changes will result in a reboot, others not, and
that admins cannot determine which in advance

- support: that the implementation could introduce bugs that prevent necessary
reboots from being triggered

- tech debt: that the implementation may need to be replaced if a more holistic
approach is identified

## Alternatives

- Improve the MCD's logic for handling changes to CRDs. This may require a
lot of things that need to be plumbed through in the MCO and may not be
possible in the short term.

- Define a higher level syntax that would contain additional information about
the action a field needs in order for any changes to be applied; and was be
able to be transformed into the appropriate Ignition format. This would be a
much larger body of work that would not be possible to complete in the short
term.

## Infrastructure Needed [optional]

None