
Subresource status failing to stop infinite loop #2795

Closed
cacois opened this issue Apr 7, 2020 · 7 comments
Labels
triage/support Indicates an issue that is a support question.

cacois commented Apr 7, 2020

Bug Report

What did you do?

As documented here, in order to avoid CR status changes causing an infinite loop of the reconcile() function, we should make the status block a subresource, following steps outlined here.

However, we have found that even though we successfully set the CRD status attribute as a subresource, we still see infinite looping of reconcile() when the status is changed. A simple example to reproduce, using the canonical PodSet operator demo, would use the following type definition:

// +kubebuilder:subresource:status
// +kubebuilder:resource:path=podsets,scope=Namespaced
type PodSet struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PodSetSpec   `json:"spec,omitempty"`
	Status PodSetStatus `json:"status,omitempty"`
}

which results in a CRD with a subresource, as expected:

...
spec:
  group: app.example.com
  names:
    kind: PodSet
    listKind: PodSetList
    plural: podsets
    singular: podset
  scope: Namespaced
  subresources:
    status: {}
  validation:
...

Now place the following code in the controller's reconcile() function:

// NOTE: the 'Updated' timestamp will change each time reconcile() is called
// if status is not treated correctly as a subresource.
podSet.Status.Updated = time.Now().String()
err = r.client.Status().Update(context.TODO(), podSet)
if err != nil {
	reqLogger.Error(err, "failed to update the podSet status")
	return reconcile.Result{}, err
}

This updates the 'Updated' timestamp on the status object every time reconcile() is called. According to the documentation, this change should not trigger another call to reconcile() because the status object is a subresource, thus avoiding an infinite loop. However, the infinite loop still occurs. Is the documentation wrong, is there a bug, or am I missing something?

What did you expect to see?

When setting the 'status' attribute as a subresource of a CRD, I should be able to update status in the reconcile() function without re-invoking the reconcile() loop.

What did you see instead? Under which circumstances?

When updating the status attribute, even though it's a subresource, I see an infinite loop of the reconcile() function.

Environment

  • operator-sdk version: v0.15.2
  • go version: 1.13.8
  • Kubernetes version information: 1.16.2
  • Kubernetes cluster kind: openshift

  • Are you writing your operator in ansible, helm, or go? go


Additional context
I've created a functional reproduction of the issue in this repo

@joelanford joelanford self-assigned this Apr 13, 2020
@joelanford joelanford added the triage/support Indicates an issue that is a support question. label Apr 13, 2020
@joelanford
Member

This is the way Kubernetes control loops are intended to work. Whenever a change occurs in your CR, it will trigger an event to reconcile your CR again. To solve your problem, you can either use a predicate (e.g. the GenerationChangedPredicate, which drops update events that do not change metadata.generation, such as status updates), or you can make sure you only ever update your CR status Updated field when your reconcile loop actually makes a change to converge the current state to the desired state.

We'll take an action to see if some documentation needs to be updated to make this more clear.

@cacois
Author

cacois commented Apr 13, 2020

@joelanford I'm very confused then. The docs specifically say:

Under normal circumstances, If we were updating our resource every time we execute the reconcile cycle, this would trigger an update event which in turn would trigger a reconcile cycle in an endless loop.

For this reason, Status should be modeled as a subresource as explained here.

Is this not specifically saying that this is not the way the control loop should work when updating a subresource (as opposed to a normal resource)?

I can't see any other way to read this, and I have found the same sentiment documented in a number of places.

@joelanford
Member

joelanford commented Apr 13, 2020

@cacois Yeah that seems misleading at best.

Status updates, even if using status subresources, bump the generation on CRs. So while changing the status as a subresource doesn’t directly change the main resource, it does for all intents and purposes because the Kubernetes control plane will make that change automatically.

I’ll look into getting that doc updated.

EDIT: This was not correct. Thanks @raffaelespazzoli for the clarification.

@cacois
Author

cacois commented Apr 14, 2020

@joelanford Thanks for the explanation! I assume those docs aren't open source, otherwise I'd offer to help. I'll look into the other options you provided.

Thanks again!

@raffaelespazzoli
Contributor

raffaelespazzoli commented Apr 14, 2020

Let me try to clarify. By using status as a subresource, when we update just the status with this syntax: r.Status().Update(context.Background(), instance), the operator still receives an update event.
However, in this update event the generation field is not changed. So if we have a predicate that discards update events where the generation field has not changed, we avoid an infinite loop.

@joelanford
Member

@raffaelespazzoli Thanks for the clarification. I was getting mixed up in my previous comments.

So TL;DR (and to make sure my understanding is now correct) 🙂

  1. Using status subresource prevents resourceVersion from getting bumped in the main resource.
  2. Status changes do not affect generation (even if not using the status subresource)
  3. Updates to status subresources still do cause watchers to see an update event for the main resource.
  4. To reconcile events where only the spec has changed, use the GenerationChangedPredicate

@raffaelespazzoli
Contributor

Point #2 is wrong. generation is increased by the apiserver every time something changes in a resource (but not if the change goes through a subresource).
