-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upgrade: retry if default DSCI creation fails #1008
upgrade: retry if default DSCI creation fails #1008
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
/retest |
1 similar comment
/retest |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: AjayJagan, zdtsw The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
After removing leader election, operator fails to start if it is instructed to create default DSCI. Looks like webhook is not ready by the time: ``` create default DSCI CR. {"level":"error","ts":"2024-05-13T09:25:58Z","logger":"setup","msg":"unable to create initial setup for the operator","error":"Internal error occurred: failed calling webhook \"operator.opendatahub.io\": failed to call webhook: Post \"https://opendatahub-operator-controller-manager-service.oo-2ts9m.svc:443/validate-opendatahub-io-v1?timeout=10s\": no endpoints available for service \"opendatahub-operator-controller-manager-service\"","stacktrace":"main.main.func1\n\t/workspace/main.go:200\nsigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/manager.go:336\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"} ``` Leader election added some delay. The problem does not happen in default configuration since it explicitly disables DSCI creation in the manifests: ``` containers: - command: - /manager env: - name: DISABLE_DSC_CONFIG value: 'true' args: - --operator-name=opendatahub image: controller:latest ``` Make a wrapper function cluster.CreateWithRetry for client.Object creation with timeout. Use hardcoded 5s interval, just seems reasonable, and timeout in minutes as the parameter. It requires disable linter nilerr since for the polling function error in creation is a valid condition, something the function wait to disappear. Fixes: 3610b0b ("feat: remove leader election for operator (opendatahub-io#1000)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
a7889fb
to
a892b10
Compare
Actually, generic is overthinking, taking interface here is enough. |
/retest |
/retest-required |
/lgtm |
e26100e
into
opendatahub-io:incubation
* Update version to v2.12.0 (#1007) * upgrade: retry if default DSCI creation fails (#1008) After removing leader election, operator fails to start if it is instructed to create default DSCI. Looks like webhook is not ready by the time: ``` create default DSCI CR. {"level":"error","ts":"2024-05-13T09:25:58Z","logger":"setup","msg":"unable to create initial setup for the operator","error":"Internal error occurred: failed calling webhook \"operator.opendatahub.io\": failed to call webhook: Post \"https://opendatahub-operator-controller-manager-service.oo-2ts9m.svc:443/validate-opendatahub-io-v1?timeout=10s\": no endpoints available for service \"opendatahub-operator-controller-manager-service\"","stacktrace":"main.main.func1\n\t/workspace/main.go:200\nsigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/manager.go:336\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"} ``` Leader election added some delay. The problem does not happen in default configuration since it explicitly disables DSCI creation in the manifests: ``` containers: - command: - /manager env: - name: DISABLE_DSC_CONFIG value: 'true' args: - --operator-name=opendatahub image: controller:latest ``` Make a wrapper function cluster.CreateWithRetry for client.Object creation with timeout. Use hardcoded 5s interval, just seems reasonable, and timeout in minutes as the parameter. It requires disable linter nilerr since for the polling function error in creation is a valid condition, something the function wait to disappear. Fixes: 3610b0b ("feat: remove leader election for operator (#1000)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> --------- Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> Co-authored-by: Yauheni Kaliuta <ykaliuta@redhat.com>
DSC creation is checked by the webhook so will not work from the main before manager starts if webhook is enabled. The same as DSCI. It also requires CreateWithRetry() for the same reasons as e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
DSC creation is checked by the webhook so will not work from the main before manager starts if webhook is enabled. The same as DSCI. It also requires CreateWithRetry() for the same reasons as e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
* main: move DSC creation after DSCIs one Reorder a bit to keep definition of cleanExistingResourceFunc next to the Add() call. Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> * main: convert DSC creation to RunableFunc DSC creation is checked by the webhook so will not work from the main before manager starts if webhook is enabled. The same as DSCI. It also requires CreateWithRetry() for the same reasons as e26100e ("upgrade: retry if default DSCI creation fails (#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> --------- Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
…1008) After removing leader election, operator fails to start if it is instructed to create default DSCI. Looks like webhook is not ready by the time: ``` create default DSCI CR. {"level":"error","ts":"2024-05-13T09:25:58Z","logger":"setup","msg":"unable to create initial setup for the operator","error":"Internal error occurred: failed calling webhook \"operator.opendatahub.io\": failed to call webhook: Post \"https://opendatahub-operator-controller-manager-service.oo-2ts9m.svc:443/validate-opendatahub-io-v1?timeout=10s\": no endpoints available for service \"opendatahub-operator-controller-manager-service\"","stacktrace":"main.main.func1\n\t/workspace/main.go:200\nsigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/manager.go:336\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"} ``` Leader election added some delay. The problem does not happen in default configuration since it explicitly disables DSCI creation in the manifests: ``` containers: - command: - /manager env: - name: DISABLE_DSC_CONFIG value: 'true' args: - --operator-name=opendatahub image: controller:latest ``` Make a wrapper function cluster.CreateWithRetry for client.Object creation with timeout. Use hardcoded 5s interval, just seems reasonable, and timeout in minutes as the parameter. It requires disable linter nilerr since for the polling function error in creation is a valid condition, something the function wait to disappear. Fixes: 3610b0b ("feat: remove leader election for operator (red-hat-data-services#1000)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
* main: move DSC creation after DSCIs one Reorder a bit to keep definition of cleanExistingResourceFunc next to the Add() call. Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> * main: convert DSC creation to RunableFunc DSC creation is checked by the webhook so will not work from the main before manager starts if webhook is enabled. The same as DSCI. It also requires CreateWithRetry() for the same reasons as e26100e ("upgrade: retry if default DSCI creation fails (red-hat-data-services#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> --------- Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
* main: move DSC creation after DSCIs one Reorder a bit to keep definition of cleanExistingResourceFunc next to the Add() call. Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> * main: convert DSC creation to RunableFunc DSC creation is checked by the webhook so will not work from the main before manager starts if webhook is enabled. The same as DSCI. It also requires CreateWithRetry() for the same reasons as e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> --------- Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> (cherry picked from commit 835959a)
Intention of the function was to ensure that the object is created and retry in case of webhook is temporary not available. The status error looks like: ``` error(*k8s.io/apimachinery/pkg/api/errors.StatusError) *{ ErrStatus: k8s.io/apimachinery/pkg/apis/meta/v1.Status { TypeMeta: (*"k8s.io/apimachinery/pkg/apis/meta/v1.TypeMeta")(0xc000504500), ListMeta: (*"k8s.io/apimachinery/pkg/apis/meta/v1.ListMeta")(0xc000504520), Status: "Failure", Message: "Internal error occurred: failed calling webhook \"operator.openda...+194 more", Reason: "InternalError", Details: *(*"k8s.io/apimachinery/pkg/apis/meta/v1.StatusDetails")(0xc00046c9c0), Code: 500,},} ``` which makes filtering of it tricky, other real errors can be in theory reported as "InternalError" code 500. So, the original code considered any error as a retry case. Unfortunately, it made AlreadyExists as a retry condition which is acutally the real goal of the function (make the object exists). Ignore the error in error check making it return successfully in that case. Fixes: e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Intention of the function was to ensure that the object is created and retry in case of webhook is temporary not available. The status error looks like: ``` error(*k8s.io/apimachinery/pkg/api/errors.StatusError) *{ ErrStatus: k8s.io/apimachinery/pkg/apis/meta/v1.Status { TypeMeta: (*"k8s.io/apimachinery/pkg/apis/meta/v1.TypeMeta")(0xc000504500), ListMeta: (*"k8s.io/apimachinery/pkg/apis/meta/v1.ListMeta")(0xc000504520), Status: "Failure", Message: "Internal error occurred: failed calling webhook \"operator.openda...+194 more", Reason: "InternalError", Details: *(*"k8s.io/apimachinery/pkg/apis/meta/v1.StatusDetails")(0xc00046c9c0), Code: 500,},} ``` which makes filtering of it tricky, other real errors can be in theory reported as "InternalError" code 500. So, the original code considered any error as a retry case. Unfortunately, it made AlreadyExists as a retry condition which is acutally the real goal of the function (make the object exists). Ignore the error in error check making it return successfully in that case. Fixes: e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Intention of the function was to ensure that the object is created and retry in case of webhook is temporary not available. The original code lacks handling already existing object. Handling with and without webhook configurations make the things more complicated. Let's check Create() errors for different cases: If webhook enabled: - no error (err == nil) - 500 InternalError likely if webhook is not available (yet) - 403 Forbidden if webhook blocks creation (check of existance) - some problem (real error) else, if webhook disabled: - no error (err == nil) - 409 AlreadyExists if object exists - some problem (real error) Check already existing object after Create() with a call to Get(). It covers both with and without webhook configurations. Reuse the same object for Get() to avoid fetching Gvk for it. It doesn't make harm, the object structure is not used after creation. 500 InternalError is not 100% webhook problem, but consider it as a reason for retry. Fixes: e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Intention of the function was to ensure that the object is created and retry in case of webhook is temporary not available. The original code lacks handling already existing object. Handling with and without webhook configurations make the things more complicated. Let's check Create() errors for different cases: If webhook enabled: - no error (err == nil) - 500 InternalError likely if webhook is not available (yet) - 403 Forbidden if webhook blocks creation (check of existance) - some problem (real error) else, if webhook disabled: - no error (err == nil) - 409 AlreadyExists if object exists - some problem (real error) Check already existing object after Create() with a call to Get(). It covers both with and without webhook configurations. Reuse the same object for Get() to avoid fetching Gvk for it. It doesn't make harm, the object structure is not used after creation. 500 InternalError is not 100% webhook problem, but consider it as a reason for retry. Fixes: e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Intention of the function was to ensure that the object is created and retry in case of webhook is temporary not available. The original code lacks handling already existing object. Handling with and without webhook configurations make the things more complicated. Let's check Create() errors for different cases: If webhook enabled: - no error (err == nil) - 500 InternalError likely if webhook is not available (yet) - 403 Forbidden if webhook blocks creation (check of existance) - some problem (real error) else, if webhook disabled: - no error (err == nil) - 409 AlreadyExists if object exists - some problem (real error) Check already existing object after Create() with a call to Get(). It covers both with and without webhook configurations. Reuse the same object for Get() to avoid fetching Gvk for it. It doesn't make harm, the object structure is not used after creation. 500 InternalError is not 100% webhook problem, but consider it as a reason for retry. Fixes: e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
* cluster: CreateWithRetry: make work for already existing object Intention of the function was to ensure that the object is created and retry in case of webhook is temporary not available. The original code lacks handling already existing object. Handling with and without webhook configurations make the things more complicated. Let's check Create() errors for different cases: If webhook enabled: - no error (err == nil) - 500 InternalError likely if webhook is not available (yet) - 403 Forbidden if webhook blocks creation (check of existance) - some problem (real error) else, if webhook disabled: - no error (err == nil) - 409 AlreadyExists if object exists - some problem (real error) Check already existing object after Create() with a call to Get(). It covers both with and without webhook configurations. Reuse the same object for Get() to avoid fetching Gvk for it. It doesn't make harm, the object structure is not used after creation. 500 InternalError is not 100% webhook problem, but consider it as a reason for retry. Fixes: e26100e ("upgrade: retry if default DSCI creation fails (#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> * upgrade: cleanup CreateWithRetry usage CreateWithRetry() now checks AlreadyExists condition inside, so skip its handling. sleep() is not needed for DSCI with proper working CreateWithRetry since it is the actual point of existance of the function. Keep checking number of DSCI instances for non-webhook configuration. Basically, using CreateWithRetry for DSC is redundant from webhook unavailability point of view since it is created after DSCI which garantees working webhook. But it handles already existed object. Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> --------- Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
* cluster: CreateWithRetry: make work for already existing object Intention of the function was to ensure that the object is created and retry in case of webhook is temporary not available. The original code lacks handling already existing object. Handling with and without webhook configurations make the things more complicated. Let's check Create() errors for different cases: If webhook enabled: - no error (err == nil) - 500 InternalError likely if webhook is not available (yet) - 403 Forbidden if webhook blocks creation (check of existance) - some problem (real error) else, if webhook disabled: - no error (err == nil) - 409 AlreadyExists if object exists - some problem (real error) Check already existing object after Create() with a call to Get(). It covers both with and without webhook configurations. Reuse the same object for Get() to avoid fetching Gvk for it. It doesn't make harm, the object structure is not used after creation. 500 InternalError is not 100% webhook problem, but consider it as a reason for retry. Fixes: e26100e ("upgrade: retry if default DSCI creation fails (red-hat-data-services#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> * upgrade: cleanup CreateWithRetry usage CreateWithRetry() now checks AlreadyExists condition inside, so skip its handling. sleep() is not needed for DSCI with proper working CreateWithRetry since it is the actual point of existance of the function. Keep checking number of DSCI instances for non-webhook configuration. Basically, using CreateWithRetry for DSC is redundant from webhook unavailability point of view since it is created after DSCI which garantees working webhook. But it handles already existed object. Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> --------- Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> (cherry picked from commit 5c755e6)
* cluster: CreateWithRetry: make work for already existing object Intention of the function was to ensure that the object is created and retry in case of webhook is temporary not available. The original code lacks handling already existing object. Handling with and without webhook configurations make the things more complicated. Let's check Create() errors for different cases: If webhook enabled: - no error (err == nil) - 500 InternalError likely if webhook is not available (yet) - 403 Forbidden if webhook blocks creation (check of existance) - some problem (real error) else, if webhook disabled: - no error (err == nil) - 409 AlreadyExists if object exists - some problem (real error) Check already existing object after Create() with a call to Get(). It covers both with and without webhook configurations. Reuse the same object for Get() to avoid fetching Gvk for it. It doesn't make harm, the object structure is not used after creation. 500 InternalError is not 100% webhook problem, but consider it as a reason for retry. Fixes: e26100e ("upgrade: retry if default DSCI creation fails (#1008)") Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> * upgrade: cleanup CreateWithRetry usage CreateWithRetry() now checks AlreadyExists condition inside, so skip its handling. sleep() is not needed for DSCI with proper working CreateWithRetry since it is the actual point of existance of the function. Keep checking number of DSCI instances for non-webhook configuration. Basically, using CreateWithRetry for DSC is redundant from webhook unavailability point of view since it is created after DSCI which garantees working webhook. But it handles already existed object. Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> --------- Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com> (cherry picked from commit 5c755e6)
Description
After removing leader election, operator fails to start if it is instructed to create default DSCI. Looks like webhook is not ready by the time:
Leader election added some delay.
The problem does not happen in default configuration since it explicitly disables DSCI creation in the manifests:
Make a wrapper function cluster.CreateWithRetry for client.Object creation with timeout. Use 5s interval, just seems reasonable, and timeout in minutes as the parameter.
It requires disable linter nilerr since for the polling function error in creation is a valid condition, something the function wait to disappear.
Fixes: 3610b0b ("feat: remove leader election for operator (#1000)")
How Has This Been Tested?
Merge criteria: