Prevent creating a controller if a CRD does not exist. #840

sophieliu15 · 2020-03-05T23:54:46Z

I hope to create a controller for each GVK that I added dynamically. Previously I noticed that if I started the manager and then installed a new CRD on the fly, I would get an error saying the CRD did not exist when I called NewControllerManagedBy(mgr)...Complete(r). Specifically, this line would return an error. The error message was the following, where CronTab was a new CRD that I installed after starting the controller manager:

Error while trying to create ObjectReconciler","gvk":"stable.example.com/v1, Kind=CronTab","error":"no matches for kind \"CronTab\" in version \"stable.example.com/v1\"

Then I found #554 and updated controller-runtime version to the latest one (v0.5.0) to use DynamicRESTMapper. After updated, I started manager without installing CRD for CronTab. I found that I could create controller for CronTab successfully (i.e., this line does not return any error). However, when the Reconcile method for the controller of CronTab was called, I got the error from pkg/source/source.go complaining the CRD was not found. Following was the error:

{"level":"error","ts":1583448155.5018284,"logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"CronTab.stable.example.com","error":"no matches for kind \"CronTab\" in version \"stable.example.com/v1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/source/source.go:88\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:165\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:198\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Add.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/manager/internal.go:226"}

My questions are:

Is there any way to prevent creating controller if a specific CRD does not exist in the apiserver? I am currently using:

_, err := r.Manager.GetRESTMapper().RESTMapping(gvk.GroupKind(), gvk.Version)
if err != nil {
  // Do not create controller
}

Not sure if there is a better way (or common practice) to do that?

After updated to the latest controller-runtime, I would expect the controller creation to fail if I did not install the CRD and succeed after I created the CRD on the fly. I am curious why I was able to create the controller without installing CRD after the update? Is this change expected?

The text was updated successfully, but these errors were encountered:

sophieliu15 · 2020-03-05T23:55:00Z

/help

k8s-ci-robot · 2020-03-05T23:55:01Z

@sophieliu15:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help
/triage/support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2020-03-05T23:55:02Z

@sophieliu15: The label(s) triage/ cannot be applied, because the repository doesn't have them

In response to this:

/help
/triage/support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

DirectXMan12 · 2020-03-06T21:43:35Z

Using the RESTMapper is a reasonable choice if you want to continue on, but not create the controller.

As for the "no longer fails on .Complete", I have a suspicion that this is due to 0fdf465, which moved around when we start Watches to avoid leaks.

DirectXMan12 · 2020-03-06T21:48:57Z

Based on the linked issue, this is mainly a "don't start this optional controller till this thing is installed", right?

DirectXMan12 · 2020-03-06T21:53:58Z

As for the "no longer fails on .Complete"

Specifically, it should now fail when you start the manager instead.

DirectXMan12 · 2020-03-06T21:55:46Z

or if the manager is already started (which it looks like from the stack trace), this error should induce manager shutdown

DirectXMan12 · 2020-03-06T21:58:39Z

In retrospect this is semi, expected behavior, but it was not intended to break workflows. I've added a note to the v0.4.0 release. Technically, we're okay because that was a breaking release, but we shouldn't have done it without at least a note.

Reason for upgrade: The new version uses [DynamicRESTMapper](https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/client/apiutil/dynamicrestmapper.go) as default `RESTMapper` for controller runtime manager. `DynamicRESTMapper` will "reload the delegated `meta.RESTMapper` on a cache miss" (see kubernetes-sigs/controller-runtime#554), which can solve problem that we need to restart HNC after adding a new CRD to create corresponding object reconciler when using controller-runtime v0.2.2 (see details in kubernetes-retired#488). Incompatibility issues addressed in this PR: - Upgrade go version in `go.mod` and `Dockerfile` to 1.13. The [errors.As](https://golang.org/pkg/errors/#As) in [dynamicrestmapper.go](https://github.com/kubernetes-sigs/controller-runtime/blob/bfc982769615817ee15db8a543662652d900d27b/pkg/client/apiutil/dynamicrestmapper.go#L48) requires go 1.13. - A higher version for k8s.io/cli-runtime and k8s.io/client-go are required after upgrading controller-runtime to v0.5.0 - Version changes of other packages in `go.mod` are updated automatically. - serializer.DirectCodecFactory was renamed to [serializer.WithoutConversionCodecFactory](https://godoc.org/k8s.io/apimachinery/pkg/runtime/serializer#WithoutConversionCodecFactory) after k8s.io/apimachinery 1.16 (see [here](kubernetes/apimachinery@ed8af17), and [here](kubernetes/apimachinery@4fac835)) - Default [LeaderElectionID](https://github.com/kubernetes-sigs/controller-runtime/blob/bfc982769615817ee15db8a543662652d900d27b/pkg/leaderelection/leader_election.go#L46) in controller-runtime manager is removed after kubernetes-sigs/controller-runtime#446. - [NewlineReporter](https://godoc.org/sigs.k8s.io/controller-runtime/pkg/envtest/printer) is moved from `sigs.k8s.io/controller-runtime/pkg/envtest/` to `https://godoc.org/sigs.k8s.io/controller-runtime/pkg/envtest/printer` by [this](kubernetes-sigs/controller-runtime@748f55d#diff-42de1d59fbe8f8b90154f01dd01f5748) commit. - In controller-runtime v0.2.2, if a resource does not exist, a controller cannot be created successfully. After update controller-runtime to v0.5.0, a controller can be created without error. However, when the `Reconcile` method is triggered, there will be an error complaining the resource does not exist. Therefore, we will explicitly check if a resource exists before creating the corresponding object reconciler in `createObjectReconciler` in `hnc_config.go` (see details in kubernetes-sigs/controller-runtime#840) Tested: - Unit tests. - Went through [demo script](https://docs.google.com/document/d/1tKQgtMSf0wfT3NOGQx9ExUQ-B8UkkdVZB6m4o3Zqn64/edit#) to make sure HNC behaves as expected on a GKE cluster. - Manually test if the PR solves the restart problem as described in kubernetes-retired#488 with following workflow: - Install HNC - Install a new CRD - Config the new type in `config` singleton Before this PR, corresponding object reconciler for the new type will not be created unless we restart HNC. After the change, corresponding object reconciler can be created and it reconciles objects of the new type as expected without restarting HNC. This partly solve: kubernetes-retired#488

DirectXMan12 · 2020-03-06T22:06:39Z

I do feel mildly uncomfortable about inducing manager failure here, but the general policy is that individual controller failure causes general manager failure, so it's in line with that.

sophieliu15 · 2020-03-06T22:15:54Z

Hi @DirectXMan12

Thanks so much for the explanation! See my replies below.

Based on the linked issue, this is mainly a "don't start this optional controller till this thing is installed", right?

Yes that is correct!

or if the manager is already started (which it looks like from the stack trace), this error should induce manager shutdown

The manager is already started. Looks like in v0.2.2, if the kind does not exist, we cannot create the new controller, but the manager is also not shutted down. In the latest version, the controller can be created, but then it will fail at controller start time and cause manager shutdown (basically what you said in the release notes).

Thanks for the answers and updating the release notes.

adrianludwin · 2020-03-09T14:37:12Z

or if the manager is already started (which it looks like from the stack trace), this error should induce manager shutdown

Hey @DirectXMan12 , wdyt about just having NewControllerManagedBy(mgr)...Complete(r) fail if the type doesn't exist, as opposed to having the manager as a whole terminate? There's now several use cases of users being able to effectively start new reconcilers on the fly (HNC and Gatekeeper), so it might be nice to have controller-runtime provide an error rather than just shutting down if there's a problem.

Obviously this can wait for a future release of controller-runtime.

… reconciler After upgrading sigs.k8s.io/controller-runtime version to v0.5.0, we can create reconciler successfully even when the resource does not exist in the cluster. Therefore, we explicitly check if the resource exists before creating the reconciler. See detailed discussion in: kubernetes-sigs/controller-runtime#840 Tested: unit tests, GKE cluster

After upgrading sigs.k8s.io/controller-runtime version to v0.5.0, we can create reconciler successfully even when the resource does not exist in the cluster. Therefore, we explicitly check if the resource exists before creating the reconciler. See detailed discussion in: kubernetes-sigs/controller-runtime#840 Tested: unit tests, GKE cluster

DirectXMan12 · 2020-03-12T21:07:04Z

Hmm...

The whole weird thing here is that now non-existence is potentially transient, and that it doesn't actually matter if the type doesn't exist on a call to New -- it only matters when we go to run the watches.

I think this is a question of semantics of New -- it's traditionally in CR (mostly) been "prepare a thing", while Start is "make the thing go". I generally see New as "struct initialization + extra setup" (since Go can't do RAII), so I'm not sure I'd expect "type doesn't exist on the server" to fail, since that's more of a "make the thing go"-type error.

On the other hand, delaying it till start doesn't really make sense when the manager is the one in charge of calling start on all the things -- there's no good way to capture the specific failure without wrapping the controller is a new runnable, and there's currently no good way to say "run this and don't stop the manager if it fails".

All of this is a long-winded way of saying "I'm not sure". I can't help but think this a symptom of something being wrong with the overall architecture of manager, but I'm not quite certain what (maybe error handling in Start and the stoppability of controllers? something with context? unclear).

Anyone have any thoughts? Making .Complete go back to failing, while somewhat semantically troubling, is a pretty straightforward solution, so it's possible I'm just overcomplicating things.

DirectXMan12 · 2020-03-12T21:08:54Z

FWIW, doing non-initialization things in New has led to leaks in the past. This obv wouldn't, but making the semantic distinction is potentially useful for avoiding that.

fejta-bot · 2020-06-24T16:50:44Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2020-07-24T17:31:58Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2020-08-23T18:12:49Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2020-08-23T18:12:56Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 5, 2020

sophieliu15 mentioned this issue Mar 6, 2020

HNC: Avoid restart HNC after adding a new CRD kubernetes-retired/multi-tenancy#488

Closed

DirectXMan12 added kind/bug Categorizes issue or PR as related to a bug. kind/support Categorizes issue or PR as a support question. and removed help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Mar 6, 2020

sophieliu15 mentioned this issue Mar 6, 2020

Upgrade controller-runtime from v0.2.2 to v0.5.0 and its dependencies. kubernetes-retired/multi-tenancy#504

Merged

sophieliu15 mentioned this issue Mar 10, 2020

Explicitly check if a GVK exists before creating corresponding object reconciler kubernetes-retired/multi-tenancy#512

Merged

vincepri added this to the Next milestone Mar 26, 2020

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2020

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 24, 2020

k8s-ci-robot closed this as completed Aug 23, 2020

skonto mentioned this issue Nov 5, 2020

[SRVKS-187] Add telemetry support openshift-knative/serverless-operator#657

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent creating a controller if a CRD does not exist. #840

Prevent creating a controller if a CRD does not exist. #840

sophieliu15 commented Mar 5, 2020 •

edited

Loading

sophieliu15 commented Mar 5, 2020 •

edited

Loading

k8s-ci-robot commented Mar 5, 2020

k8s-ci-robot commented Mar 5, 2020

DirectXMan12 commented Mar 6, 2020

DirectXMan12 commented Mar 6, 2020

DirectXMan12 commented Mar 6, 2020

DirectXMan12 commented Mar 6, 2020

DirectXMan12 commented Mar 6, 2020 •

edited

Loading

DirectXMan12 commented Mar 6, 2020

sophieliu15 commented Mar 6, 2020

adrianludwin commented Mar 9, 2020

DirectXMan12 commented Mar 12, 2020

DirectXMan12 commented Mar 12, 2020

fejta-bot commented Jun 24, 2020

fejta-bot commented Jul 24, 2020

fejta-bot commented Aug 23, 2020

k8s-ci-robot commented Aug 23, 2020

Prevent creating a controller if a CRD does not exist. #840

Prevent creating a controller if a CRD does not exist. #840

Comments

sophieliu15 commented Mar 5, 2020 • edited Loading

sophieliu15 commented Mar 5, 2020 • edited Loading

k8s-ci-robot commented Mar 5, 2020

k8s-ci-robot commented Mar 5, 2020

DirectXMan12 commented Mar 6, 2020

DirectXMan12 commented Mar 6, 2020

DirectXMan12 commented Mar 6, 2020

DirectXMan12 commented Mar 6, 2020

DirectXMan12 commented Mar 6, 2020 • edited Loading

DirectXMan12 commented Mar 6, 2020

sophieliu15 commented Mar 6, 2020

adrianludwin commented Mar 9, 2020

DirectXMan12 commented Mar 12, 2020

DirectXMan12 commented Mar 12, 2020

fejta-bot commented Jun 24, 2020

fejta-bot commented Jul 24, 2020

fejta-bot commented Aug 23, 2020

k8s-ci-robot commented Aug 23, 2020

sophieliu15 commented Mar 5, 2020 •

edited

Loading

sophieliu15 commented Mar 5, 2020 •

edited

Loading

DirectXMan12 commented Mar 6, 2020 •

edited

Loading