Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a delay for transition of Ready condition from Unknown to False #6784

Closed
rhuss opened this issue Feb 8, 2020 · 7 comments
Closed
Labels
kind/feature Well-understood/specified features, ready for coding. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@rhuss
Copy link
Contributor

rhuss commented Feb 8, 2020

Describe the feature

Please reconcile to a ConditionReady to ConditionFalse only when it's clear that multi-reconcile action fails overall. The use case is when a user creates a Service then multiple other dependency resources are created in parallel, like a Route and a Revision. However, the route can be only ready when the revision is ready, so there is the chance of a race. Currently, when the route can't find the referenced revision because of this race, the overall Service goes from ConditionUnknown to ConditionFalse for the ready condition, but switches to ConditionTrue as soon as the revision is ready and the route is reconciled.

This confuses clients who are waiting synchronously on a service creation (or update) and return immediately with "ok" for a transition unknown -> true, or an error for unknown -> false. In the situation above this would falsely detect an error as the overall action very quickly reconciles to ready == true (but with the temporary false state).

For the Knative client, this caused a 50% flake in the E2E tests which is solved now by introducing an error window to wait on an eventual true ready state.

It would be very helpful also for other clients if exercising this kind of patience on the server-side, so that the first transition to false indicates the error of a combined reconciliation step.

// cc: @dprotaso @evankanderson

@rhuss rhuss added the kind/feature Well-understood/specified features, ready for coding. label Feb 8, 2020
@dgerd
Copy link

dgerd commented Feb 11, 2020

ConditionReady to ConditionFalse only when it's clear that multi-reconcile action fails overall

From a high-level this is not something that we will be able to guarantee in all cases. I would rather like to focus this discussion on the current race and flake at hand.

However, the route can be only ready when the revision is ready, so there is the chance of a race. Currently, when the route can't find the referenced revision because of this race, the overall Service goes from ConditionUnknown to ConditionFalse for the ready condition, but switches to ConditionTrue as soon as the revision is ready and the route is reconciled.

What is the 'Spec' when this occurs? Do you see this happen on updates or only on Service creates? We currently serialize changes to prevent this when using BYO Revision name since a Revision can be directly referenced in the same YAML it is created, but have not done so for normal Revisions as the name is randomized.

Looking at the reconcilers I would suspect that this might happen on Service creation when using 'latestRevision', but I would not expect it to happen on an update if the previous Revision reached 'Ready=True'. Does this match what you are seeing?

@rhuss
Copy link
Contributor Author

rhuss commented Feb 12, 2020

Thanks for coming back on this issue. In the meantime we increased our debugging in case of an client E2E error, and it turns out that the issue is probably something completely different: Its not so much of a race, but an optimistic locking error when updating the Configuration.

You find the full log here but the events which lead to a Revision "hello-hnzcf-1" referenced in traffic not found error are:

 Events:
  Type     Reason         Age                From                Message
  ----     ------         ----               ----                -------
  Normal   Created        10m                service-controller  Created Configuration "hello"
  Normal   Created        10m                service-controller  Created Route "hello"
  Normal   Updated        10m (x3 over 10m)  service-controller  Updated Service "hello"
  Warning  InternalError  10m                service-controller  failed to reconcile Configuration: Operation cannot be fulfilled on configurations.serving.knative.dev "hello": the object has been modified; please apply your changes to the latest version and try again 

This happens just when creating a service with kn service create hello --image gcr.io/knative-samples/helloworld-go -n kne2etests1 (the simplest possible way to create a service).

@rhuss
Copy link
Contributor Author

rhuss commented Feb 13, 2020

I'm creating a new issue for what we found.

Feel free to close this issue, if you think this kind of delay doesn't make sense.

@rhuss
Copy link
Contributor Author

rhuss commented Feb 13, 2020

I carried over the issue to #6837

@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot knative-prow-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 13, 2020
@dprotaso
Copy link
Member

/remove-lifecycle stale

@knative-prow-robot knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2020
@github-actions
Copy link

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen.Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Well-understood/specified features, ready for coding. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

5 participants