openshift · openshift-merge-robot · Jul 13, 2020 · Dec 18, 2019 · Jan 6, 2020 · Jan 6, 2020
diff --git a/enhancements/mcd-reboot-exceptions.md b/enhancements/mcd-reboot-exceptions.md
@@ -0,0 +1,247 @@
+---
+title: mcd-reboot-dictionary
+authors:
+  - "@beekhof"
+reviewers:
+  - "@crawford"
+  - "@eparis"
+  - "@rphillips"
+  - "@runcom"
+  - "@yuqi-zhang"
+  - "@kikisdeliveryservice"
+  - "@sinnykumari"
+  - "@ericavonb"
+approvers:
+  - "@crawford"
+  - "@eparis"
+  - "@rphillips"
+  - "@runcom"
+creation-date: 2019-12-15
+last-updated: 2020-06-18
+status: provisional
+see-also:
+  - NA
+replaces:
+  - NA
+superseded-by:
+  - NA
+---
+
+# Domain specific minimisation of config driven cluster reboots
+
+## Release Signoff Checklist
+
+- [x] Enhancement is `implementable`
+- [x] Design details are appropriately documented from clear requirements
+- [x] Test plan is defined
+- [x] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Open Questions [optional]
+
+- Should the parts of the configuration that do and don't require reboot be
+  part of user facing documentation?
+
+## Summary
+
+A node's configuration is managed by the MCO.  When the configuation is changed
+by the system or an admin, the MCO and MCD write a new version of the file to
+disk and reboot the system to ensure it is applied.
+
+This proposal is to allow drain and reboot to be optional and independant
+actions, to implement the ability to restart a systemd service, and to add a
+function to MCD that compares two MachineConfigs and can decide which actions
+are needed to apply it.
+
+
+## Motivation
+
+The current MCO mechanism for applying config changes, is to reboot the machine.
+
+While it is expected and normal that any cluster node can reboot at any time,
+clusters have a practical limit to the number of full cluster reboots they can
+achieve in a given time period.
+
+This means we need to weigh the correctness and simplicity of using full cluster
+reboots, against their frequency and user impact.
+
+In environments where the limit is generally lower, such as bare metal, the
+balance can be different - creating a need to find alternative ways to apply
+configuration changes for some scenarios.
+
+### Goals
+
+Provide a mechanism for avoiding a full cluster reboot in domain specific
+scenarios where engineering has verified that the configuration change:
+
+1. can be proven to be safely and immediately applied by some other mechanism,
+1. is not combined with other changes that do require a reboot
+1. has a demonstrable negative user impact when applied via reboot
+
+### Non-Goals
+
+1. Removing all reboots from the system
+1. Allowing the admin to configure additional domains or strategies for applying
+   configuration updates
+
+## Proposal
+
+1. Add a function to MCD that compares two MachineConfigs, and outputs an
+   ordered list of known actions that are needed to apply it
+1. Modify any parts of MCD that assume drain and reboot always happens after
+   writing to disk
+1. Modify MCD to perform only the list of actions produced in step 1
+
+The full list of post-write actions required for this feature to be useful are:
+
+- drain
+- reboot
+- restart a named systemd service
+
+If the action results in an error, or takes too long, a reboot will be
+performed as in prior 4.x versions.
+
+### User Stories [optional]
+
+The user stories below are intended to be illustrative of the different types of 
+services we need to build capabilities for.
+
+#### Story 1
+
+As an administrator of a disconnected bare metal cluster, I want the ability to
+update the image catalog mirror[1] without potentially causing a full cluster
+reboot, so that I can obtain the latest fixes quickly and without needing to be
+concerned about the cumulative disruption to the cluster.
+
+[1] As well as the ICSP associated with it, which is sensitive to the addition,
+subtraction, or renaming of images.
+
+#### Story 2
+
+As an adminstrator of a large bare metal cluster, I want the addition or removal
+of SSH keys to happen without a full cluster reboot, so that the credentials are
+usable (or revoked) in a timeframe comparable with traditional environments not
+hours.
+
+#### Story 3
+
+As an adminstrator of a bare metal cluster that takes a day to reboot, I want
+the system to use the least invasive method for applying configuration changes,
+so that I have periods of relative stability where I can accomodate the
+maintenance windows of less enlightened vendors.
+
+#### Story 4
+
+As an admin, I want to continue to force reboots for network configuration 
+updates and changes to kernel flags, so that they are applied in a timely 
+manner.
+
+
+### Implementation Details/Notes/Constraints [optional]
+
+How the comparision determines what constitutes a domain, whether by target file
+or by configuration element, is left for later discussion.
+
+### Risks and Mitigations
+
+The risks are of false positives (an update that requires a reboot does not
+trigger one), and false negatives (an update that does not require a reboot
+triggers one anyway).  While the latter may be unexpected by an admin that knows
+of this feature, at worst it is a performance issue and cannot be considered a
+regression over the current behaviour.
+
+False negatives can be mitigated by defaulting to the current behaviour
+(reboot), preventing the admin from tampering with the list of domains with
+exceptions, and only adding new domains after Engineering and QE validate that
+the changes are safe.
+
+## Design Details
+
+### Test Plan
+
+- Unit tests for the comparision function that cover
+  - Identical, missing, and invalid MachineConfigs
+  - Changes to domains for which no exception exists
+  - Changes to only domains for which a 'do nothing' exception exists
+  - Changes to only domains for which a 'service restart' exception exists
+  - Changes to only domains for which a 'drain + service restart' exception exists
+  - Changes to domains for which exceptions exists, combined with changes which require reboot
+  - Changes to only domains for which an 'do nothing' exception exists
+- Performance testing of the comparison function
+- Unit tests for handling unknown/invalid actions from the comparision function
+- Unit tests for handling actions that fail or takes too long
+- Testing that covers failures during the write phase
+- Testing that covers on-disk changes detected during the write phase
+
+### Graduation Criteria
+
+##### Dev Preview -> Tech Preview
+
+- Ability to utilize the enhancement end to end
+- Sufficient test coverage
+- Support for handling ICSP changes
+
+##### Tech Preview -> GA 
+
+- More testing (upgrade, downgrade, scale)
+- Sufficient time for feedback
+
+##### Removing a deprecated feature
+
+- Announce deprecation and support policy of the existing feature
+- Deprecate the feature
+
+### Upgrade / Downgrade Strategy
+
+As per the current MCD strategy.
+
+### Version Skew Strategy
+
+If the MCD does not contain this feature, or has an outdated view of how an
+change needs to be applied, then the first change that affects a new entry will
+trigger a reboot (which can not be a regression) and subsequent changes will be
+acted on with the new information.  Any changes affecting entries that have been
+removed will have the old behaviour (also not a regression).
+
+If the MCD relies on capabilities that the rest of the system doesn't have yet,
+then there is a risk that a reboot will not occur for a version of the component
+that still requires it.  This could be mitigated through proper error handling
+and/or ensuring that the making an exception for a new domain always happens N
+versions after any engineering work that was required to make it possible.
+
+## Implementation History
+
+- Initial version: 15-12-2019
+- Updated: 18-06-2020
+
+## Drawbacks 
+
+The risks primarily fall into 4 categories:
+
+- configuration: that domains could be be added to the comparision function even
+  though they require a reboot in unidentified scenarios,
+
+- behavioural: that some config changes will result in a reboot, others not, and
+  that admins cannot determine which in advance
+
+- support: that the implementation could introduce bugs that prevent necessary
+  reboots from being triggered
+
+- tech debt: that the implementation may need to be replaced if a more holistic
+  approach is identified
+
+## Alternatives
+
+- Improve the MCD's logic for handling changes to CRDs.  This may require a
+  lot of things that need to be plumbed through in the MCO and may not be 
+  possible in the short term.
+
+- Define a higher level syntax that would contain additional information about
+  the action a field needs in order for any changes to be applied; and was be
+  able to be transformed into the appropriate Ignition format.  This would be a
+  much larger body of work that would not be possible to complete in the short 
+  term.
+
+## Infrastructure Needed [optional]
+
+None