upgrade: retry if default DSCI creation fails
After leader election was removed, the operator fails to start if it is
instructed to create the default DSCI. It looks like the webhook is not
ready by that time:

```
create default DSCI CR.
{"level":"error","ts":"2024-05-13T09:25:58Z","logger":"setup","msg":"unable to create initial setup for the operator","error":"Internal error occurred: failed calling webhook \"operator.opendatahub.io\": failed to call webhook: Post \"https://opendatahub-operator-controller-manager-service.oo-2ts9m.svc:443/validate-opendatahub-io-v1?timeout=10s\": no endpoints available for service \"opendatahub-operator-controller-manager-service\"","stacktrace":"main.main.func1\n\t/workspace/main.go:200\nsigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/manager.go:336\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/manager/runnable_group.go:219"}
```
Previously, leader election added some delay before this point.

The problem does not occur in the default configuration, since it
explicitly disables DSCI creation in the manifests:

```
       containers:
       - command:
         - /manager
         env:
           - name: DISABLE_DSC_CONFIG
             value: 'true'
         args:
         - --operator-name=opendatahub
         image: controller:latest
```

Add a wrapper function, cluster.CreateWithRetry, that creates a
client.Object and retries until a timeout. It uses a hardcoded 5s
interval, which just seems reasonable, and takes the timeout in
minutes as a parameter.

This requires disabling the nilerr linter, since for the polling
function an error from creation is a valid condition, something the
function waits to disappear.

Fixes: 3610b0b ("feat: remove leader election for operator (#1000)")

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
ykaliuta committed May 14, 2024
1 parent 308ed86 commit a892b10
Showing 2 changed files with 14 additions and 1 deletion.
13 changes: 13 additions & 0 deletions pkg/cluster/resources.go
@@ -174,3 +174,16 @@ func WaitForDeploymentAvailable(ctx context.Context, c client.Client, componentN
 		return true, nil
 	})
 }
+
+func CreateWithRetry(ctx context.Context, cli client.Client, obj client.Object, timeoutMin int) error {
+	interval := time.Second * 5 // arbitrary value
+	timeout := time.Duration(timeoutMin) * time.Minute
+
+	return wait.PollUntilContextTimeout(ctx, interval, timeout, true, func(ctx context.Context) (bool, error) {
+		err := cli.Create(ctx, obj)
+		if err != nil {
+			return false, nil //nolint:nilerr
+		}
+		return true, nil
+	})
+}
2 changes: 1 addition & 1 deletion pkg/upgrade/upgrade.go
@@ -170,7 +170,7 @@ func CreateDefaultDSCI(ctx context.Context, cli client.Client, _ cluster.Platfor
 		return nil
 	case len(instances.Items) == 0:
 		fmt.Println("create default DSCI CR.")
-		err := cli.Create(ctx, defaultDsci)
+		err := cluster.CreateWithRetry(ctx, cli, defaultDsci, 1) // 1 min timeout
 		if err != nil {
 			return err
 		}