Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Domain-specific reboot minimisation enhancement #159

Merged
merged 19 commits into from
Jul 13, 2020
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
215 changes: 215 additions & 0 deletions enhancements/mcd-reboot-dictionary.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,215 @@
---
title: mcd-reboot-dictionary
authors:
- "@beekhof"
reviewers:
- "@crawford"
- "@eparis"
- "@rphillips"
- "@runcom"
beekhof marked this conversation as resolved.
Show resolved Hide resolved
approvers:
- "@crawford"
- "@eparis"
- "@rphillips"
- "@runcom"
creation-date: 2019-12-15
last-updated: 2019-12-15
status: provisional
see-also:
- NA
replaces:
- NA
superseded-by:
- NA
---

# MCD Reboot Dictionary

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [x] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Open Questions [optional]

- Should we allow the whitelist to apply recursively to a directory, or make
use of pattern matching?

- Is there any way to mitigate the version skew problem?

- Should the parts of the configuration that do and don't require reboot be
part of user facing documentation?


## Summary

A node's configuration is managed by the MCO. When the configuation is changed
by the system or an admin, the MCO and MCD write a new version of the file to
disk and reboot the system to ensure it is applied.

This proposal is to create a dictionary that defines a whitelist of file paths
for which alternative application strategies should be used.


## Motivation

Always rebooting a node is reasonable for cloud based systems, since they can
be rebooted very quickly. Additionally, it is the only way for kernel
parameter changes to take effect.

However on baremetal systems, rebooting a node takes significantly longer (in
the order of minutes) and the machine configuration includes files that need
either no action to take effect, or can be more quickly applied by restarting a
systemd service.

### Goals

- Creation of a minimally invasive solution that can be delivered in the
short-term, allowing time for a more comprehensive approach to be developed
- A system defined whitelist of file paths that do not trigger a reboot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be thinking of this in terms of white-listing paths or supporting optimized reboots for certain functions? im not convinced that whitelisting paths won't cause some other issues and is overbroad, whereas iterating on optimizing specific reboots for specific functionalities ie "don't reboot a node when you update ssh keys" is more scalpel-like with more predictable results and a clear articulation of what is supported and what is not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear what you're saying about focusing on specific functions, but don't they all eventually decompose into "i wrote this file(s) to disk, what do I do about it?"

Surely something, somewhere, needs know which file(s) a given CRD field is associated with, and what needs to be done to apply it.

Is there a practical difference between a lookup table that only has an entry for ${XYZ} (authkeys for the least complex example), and adding conditionals through the code that check if the current file is ${XYZ}? Or is the thinking that responsibility for maintaining ${XYZ} would be transferred to another operator?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting perspective @beekhof but one I'm not sold on. Adding a file to /etc/containers/registry.d/ is something I could agree is safe to do without a reboot. Removing such a file I'd argue is not. Thus you can't say /etc/container/registry.d/* can be manipulated without reboot.

It takes domain specific knowledge. And it's really tricky because I'm arguing for 'immediate consistency' in an 'eventual consistency' system. I know I'm arguing out of 2 sides of my mouth, but apply a change that won't cause problems until some time in the future is much worse than a reboot. Weekly reboots are normal. We design for that to be the norm.

- Changes to systemd service configurations are applied by restarting the service
- Specific files can be updated with no action (eg. ssh keys)
- Any file not covered by the whitelist triggers a reboot

### Non-Goals

- Allowing the admin to specify additional paths or strategies
- Allowing different behaviour based on which option or part of the file has
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried this needs to be a goal to be safe...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I would like the solution to avoid the tight coupling of having to understand the contents of the files. So the thought process would be that if there is any part of a file that requires a reboot to be applied, then take the pessimistic approach and always reboot if that file changes.

However, I assume you're thinking about ensuring that changes are purely additive in the ICSP case.

I think that would be possible to do reasonable generically, but do we need to do so if we still drain the node and remove any cached images? Would this approach result in a worse experience if a broken ICSP is supplied than we have today?

been updated

## Proposal

The proposal calls for a whitelist that MCD uses to decide which action to take
after updating a configuration file for the MCO. To prevent the admin from
tampering with the whitelist, it will be included as part of the MCD image as a
static file in yaml format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file going to be versioned, or there is a strict compatibility with the current MCD only?

Copy link
Contributor Author

@beekhof beekhof Jan 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the reason for including it with the MCD was to avoid the need for versioning.
But I don't have strong feelings about it.


The information could be baked into the the MCD itself, but assuming we treat it
as data, the file will be a list of entries containing:

- filename
- path
- action to perform
- action specific data (eg. the service name, timeout)
beekhof marked this conversation as resolved.
Show resolved Hide resolved

After the MCD writes out a configuation change, it will consult the whitelist
before deciding if a reboot is required. If the filename is present, has a
valid action and any associated data, the action will be performed. Otherwise,
a reboot will be performed as in prior 4.x versions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was a case where MCD was expected to drain the node even before any configuration change is applied... and here proposal is regarding actions after configuration has been applied.. so it seems like skip drain is not an action that can be supported?? would love to see details wrt to drain before reboot...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing is, if the configuration change (which is usually multiple file/units) has conflicting actions ie some files require reboot, some say skip reboot.. what should be the expected behavior. Would like to see those details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any pointers to additional information around the drain-before-reboot usecase would be great. It would be good to understand the motivation before making any claims about how the two features should interact.


If the action results in an error, or takes too long, a reboot will be
performed as in prior 4.x versions.
beekhof marked this conversation as resolved.
Show resolved Hide resolved


### User Stories [optional]

Detail the things that people will be able to do if this is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.

beekhof marked this conversation as resolved.
Show resolved Hide resolved
#### Story 1

As a telco admin, I want the system to use the least invasive method for
applying configuration file updates, to reduce periods of degraded system and
application availability.
beekhof marked this conversation as resolved.
Show resolved Hide resolved

#### Story 2

None

### Implementation Details/Notes/Constraints [optional]

This is an interim solution designed to quickly address a market need and
gather data for a long term approach.

### Risks and Mitigations

The risks are of false positives (an update that requires a reboot does not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't reboot for something that it is absolutely required, then that update is a failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont see how this would not be considered a big problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we already have a long-term intention to optimize reboots [...]

I'd rather the MCO team, when we are ready work towards that [...]

That in a nutshell is why this enhancement exists, we're looking for an interim solution because promises have been made that are not especially compatible with MCO timelines :-)

I really don't know about a dictionary/lookup table being the best mechanism for this...

I really think that this enhancement should be focused on the general idea of optimizing reboots, etc... as opposed to this very specific implementation.

These points are related.

I hold no illusions that a lookup table is the best mechanism for this, it's definitely not. However the enhancement is advocating for a specific implementation because it is something that can be done in the timeframe customers are requesting (ie. yesterday).

The hope is that we can show implement something quickly, that is minimally disruptive to the MCO codebase, and because it's not user facing, can be thrown away once the MCO team has identified and implemented this feature the right way.

Part of the rationale for doing a PoC is to determine if "minimally disruptive" is actually achievable.

trigger one), and false negatives (an update that does not require a reboot
triggers one anyway). While the latter may be unexpected by an admin that
knows of this feature, at worst it is a performance issue and cannot be
considered a regression over the current behaviour.

False negatives can be mitigated by defaulting to the current behaviour
(reboot), preventing the admin from tampering with the whitelist, and only
adding entries to the whitelist after Engineering and QE validate that no
possible contents of the file could require a reboot.

## Design Details

### Test Plan

- Missing and empty whitelist
- Changes to files not on the whitelist
- Changes to files on the whitelist requiring no action
- Changes to files on the whitelist requiring a service restart
- Changes to files on the whitelist with invalid actions/data
- Changes to files on the whitelist with valid actions/data but the action fails or takes too long
- Large whitelist performance

### Graduation Criteria

##### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- Whitelist format stability
- Sufficient test coverage
- Provisional whitelist contents

##### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- A supportable whitelist

##### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

As per the current MCD strategy.

### Version Skew Strategy

If the MCD does not contain this feature, or only knows about the old
whitelist, then the first change that affects a new entry will trigger a reboot
(which can not be a regression) and subsequent changes will be acted on with
the new list. Any changes affecting entries that have been removed will have
the old behaviour (also not a regression).

If the MCD knows about a newer whitelist than the rest of the system, then
there is a risk that a reboot will not occur for a version of the component
that still requires it. This could be mitigated by ensuring the adding a file
to the whitelist always happens N versions after any engineering work that was
required to make it possible.

## Implementation History

- Initial version: 15-12-2019
beekhof marked this conversation as resolved.
Show resolved Hide resolved

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.
beekhof marked this conversation as resolved.
Show resolved Hide resolved
beekhof marked this conversation as resolved.
Show resolved Hide resolved

## Alternatives

- Improve the MCD's logic for handling changes to CRDs. This may require a
lot of things that need to be plumbed through in the MCO and may not be
possible in the short term.

- Define a higher level syntax that would contain additional information about
the action a field needs in order for any changes to be applied; and was be
able to be transformed into the appropriate Ignition format. This would be a
much larger body of work that would not be possible to complete in the short
term.


## Infrastructure Needed [optional]

None