
Add a flag to control CRD management #3445

Merged
1 commit merged into Azure:main on Oct 25, 2023

Conversation

vincepri (Contributor)

Closes #[issue number]

What this PR does / why we need it:

Currently there is no way to choose whether the ASO manager should manage all CRDs itself, or simply use the ones already installed in the cluster without modification.

This PR adds a flag called enable-crd-management, enabled by default (to retain backward compatibility).
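
For illustration only, a minimal sketch of how such a boolean flag might be declared with Go's standard flag package; the actual wiring in v2/cmd/controller/app/flags.go may differ:

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Sketch: defaulting to true keeps today's behaviour, where the
	// operator installs and upgrades the ASO CRDs itself.
	enableCRDManagement := flag.Bool(
		"enable-crd-management",
		true,
		"enable CRD management; when false, use the CRDs already installed in the cluster as-is")
	flag.Parse()
	fmt.Println("CRD management enabled:", *enableCRDManagement)
}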

Special notes for your reviewer:

More information was captured in the upstream slack thread: https://kubernetes.slack.com/archives/C046DEVLAQM/p1697665840941649.


If applicable:

  • this PR contains documentation
  • this PR contains tests

// By default, assume the existing CRDs are the goal CRDs. If CRD management is enabled, we will
// load the goal CRDs from disk and apply them.
goalCRDs := existingCRDs
if flgs.EnableCRDManagement {
Member

This will lead to some very hard-to-diagnose failure modes.

The CRD versions in the cluster must exactly match the version of ASO that's being run. Upgrading either one in isolation will create an environment where some, if not all, resources will not work.

The underlying issue is to do with which version of each resource is tagged as the storage version - if an ASO release introduces a new version of a resource, the storage version changes (e.g. from v1api20200401storage to v1api20230801storage); a mismatch of versions between the CRDs and the running ASO operator is a fatal error.
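
As background: a CRD marks exactly one served version as its storage version. A minimal sketch of reading it with the upstream apiextensions types (illustrative only, not ASO's actual check):

package main

import (
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// storageVersion returns the version a CRD persists objects as; every CRD
// has exactly one version with Storage: true.
func storageVersion(crd *apiextensionsv1.CustomResourceDefinition) string {
	for _, v := range crd.Spec.Versions {
		if v.Storage {
			return v.Name
		}
	}
	return ""
}

func main() {
	// Hypothetical CRD with one served version and one storage version.
	crd := &apiextensionsv1.CustomResourceDefinition{
		Spec: apiextensionsv1.CustomResourceDefinitionSpec{
			Versions: []apiextensionsv1.CustomResourceDefinitionVersion{
				{Name: "v1api20200401", Served: true},
				{Name: "v1api20200401storage", Served: true, Storage: true},
			},
		},
	}
	fmt.Println("storage version:", storageVersion(crd))
}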

I think we need to retain the check, but provide an option on what to do if a CRD mismatch is discovered. I can think of three options:

  • auto-upgrade - the existing behaviour
  • abort - the operator refuses to run, with suitable logging to inform users
  • degrade - the operator runs, but only supporting correctly versioned resources

Member

We considered using a central "hub" storage version during design, but that had some problems.

The approach we've taken is chained storage versions; that case study goes into a great deal of detail.

@theunrepentantgeek (Member) Oct 20, 2023

After discussions on Slack, I better understand the usefulness of this flag.

Based on your comments, perhaps we rename the flag to better indicate that the user is disabling standard functionality, rather than enabling optional functionality?

For example, instead of --enable-crd-management with values true (the default) and false, what if it was --crd-compatibility-check with values of auto-upgrade (the default), verify and none.

We'd document them as follows (a parsing sketch follows the list):

  • auto-upgrade - ASO CRDs installed in the cluster will be checked to verify they are up to date and automatically updated by the operator as required. (Recommended).
  • verify - ASO CRDs installed in the cluster will be checked to verify they are up to date, with the operator exiting if any are outdated.
  • none - no checks are done to verify that ASO CRDs are up to date.
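
A minimal sketch of how that tri-state flag might be parsed and validated, assuming Go's standard flag package (names and messages are illustrative, not the eventual implementation):

package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Sketch: a string flag replaces the boolean so further modes can be
	// added later without a breaking change.
	mode := flag.String(
		"crd-compatibility-check",
		"auto-upgrade",
		"one of: auto-upgrade, verify, none")
	flag.Parse()

	switch *mode {
	case "auto-upgrade", "verify", "none":
		fmt.Println("CRD compatibility check mode:", *mode)
	default:
		fmt.Fprintf(os.Stderr, "invalid --crd-compatibility-check %q\n", *mode)
		os.Exit(1)
	}
}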

Member

I agree with the reversed phrasing (disabling a feature versus enabling one) if we were leaving this as a boolean. I'm not sold on the crd-compatibility-check naming; IMO "upgrading the CRDs" isn't really a compatibility check, and it's the default mode.

Maybe crd-management-mode auto|none, with the default being auto? I also agree that a verify mode might make sense, but I'm not sure we need to add it in this PR.

Member

I like crd-management-mode with the options of auto|none - if we were to add verify in a future PR, that wouldn't be a breaking change.
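
For illustration, a sketch of how setup code might branch on such a mode; selectGoalCRDs, the []string stand-in for CRDs, and the error text are all hypothetical, not ASO's actual implementation:

package main

import "fmt"

// selectGoalCRDs shows how a string-valued mode could replace the boolean
// EnableCRDManagement while leaving room for a future verify mode.
func selectGoalCRDs(mode string, existing, fromDisk []string) ([]string, error) {
	switch mode {
	case "auto":
		// Load the goal CRDs from disk and apply them, upgrading as required.
		return fromDisk, nil
	case "none":
		// Use the CRDs already installed in the cluster without modification.
		return existing, nil
	default:
		return nil, fmt.Errorf("unknown --crd-management-mode %q (expected auto or none)", mode)
	}
}

func main() {
	crds, err := selectGoalCRDs("auto", []string{"existing"}, []string{"fromDisk"})
	fmt.Println(crds, err)
}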

v2/cmd/controller/app/flags.go (outdated review comments, resolved)

@matthchr (Member)

/ok-to-test sha=2b60a67

@matthchr (Member)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@matthchr (Member)

@vincepri, looks like this is failing a golint check:

[controller:lint] cmd/controller/app/setup.go:169:7: shadow: declaration of "err" shadows declaration at line 108 (govet)
[controller:lint] var err error
[controller:lint] ^

Can you fix this and then we're happy to merge.
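
For context, a minimal illustration of the pattern govet's shadow check flags; the names and line references here are illustrative, not the actual setup.go:

package main

import "fmt"

func main() {
	var err error // outer declaration, analogous to setup.go line 108
	if true {
		var err error // govet shadow: this redeclares err inside the block
		_ = err
	}
	fmt.Println(err) // the fix: assign to the outer err with = rather than redeclaring it
}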

Signed-off-by: Vince Prignano <vince@prigna.com>
@vincepri (Contributor, Author)

@matthchr done!

@matthchr (Member)

/ok-to-test sha=b0f4d08

@matthchr (Member)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@matthchr matthchr added this pull request to the merge queue Oct 24, 2023
@codecov-commenter

Codecov Report

Merging #3445 (b0f4d08) into main (ee2e124) will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #3445      +/-   ##
==========================================
+ Coverage   54.31%   54.34%   +0.02%     
==========================================
  Files        1545     1545              
  Lines      651527   651527              
==========================================
+ Hits       353882   354055     +173     
+ Misses     240078   239897     -181     
- Partials    57567    57575       +8     

see 29 files with indirect coverage changes

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 24, 2023
@matthchr matthchr added this pull request to the merge queue Oct 24, 2023
Merged via the queue into Azure:main with commit 57b4a7a Oct 25, 2023
8 checks passed
@theunrepentantgeek theunrepentantgeek added this to the v2.4.0 milestone Nov 6, 2023