Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CR with finalizer hang when the namespace is deleted because of Ansible operator is allowing the deletion of the operator before the deletion of the CR be accomplished. #1503

Closed
camilamacedo86 opened this issue May 31, 2019 · 15 comments
Labels
language/ansible Issue is related to an Ansible operator project

Comments

@camilamacedo86
Copy link
Contributor

camilamacedo86 commented May 31, 2019

Bug Report

CR with finalizer hang when the namespace is deleted because of Ansible operator is allowing the deletion of the operator before the deletion of the CR to be accomplished.

NOTE: Issue opened in order to make clear the scenario/bug raised in #1493

What did you do?

---
- version: v1alpha1
  group: cache.example.com
  kind: Memcached
  role: /opt/ansible/roles/memcached
  finalizer:
    name: finalizer.cache.example.com
  • Create an namespace and apply the example.
$ oc new-project memcached

$ kubectl create -f deploy/crds/cache_v1alpha1_memcached_crd.yaml
$ kubectl create -f deploy/service_account.yaml
$ kubectl create -f deploy/role.yaml
$ kubectl create -f deploy/role_binding.yaml
$ kubectl create -f deploy/operator.yaml
$ kubectl create -f deploy/crds/cache_v1alpha1_memcached_cr.yaml
  • Try to delete the namespace
$ oc project delete memcached

What did you expect to see?
The CR + Operator + Namespace be deleted with success.

What did you see instead? Under which circumstances?
The namespace is marked to be deleted, the operator is deleted, but the CR is not which not allows the namespace to be deleted as well and is hanging it.

  • Reason: The operator has been deleted before the CR then it cannot remove the finalizer metadata from it which causes the bug.

  • Workarround: manual deletion of the finalizer metadata from the CR which would be made by the operator if it was not deleted first. E.g oc patch memcached example-memcached -p '{"metadata":{"finalizers": []}}' --type=merge
    OR

Delete the CR before deleting the namespace for the operator be able to remove the finalizer metadata.
oc delete deploy/crds/cache_v1alpha1_memcached_cr.yaml

Environment

  • operator-sdk version: 0.8.1
  • go version: go version go1.12.5 darwin/amd64
  • Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-03-01T23:34:27Z", GoVersion:"go1.12", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2019-05-02T11:52:09Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind: Minishift
  • Are you writing your operator in ansible, helm, or go? Ansible

Additional context
Following the images to illustrate the bug.

  • Namespaced marked to be deleted and hanged by the CR with finalizer.

Screenshot 2019-05-31 at 10 22 53

  • See that the operator was deleted before the CR.

Screenshot 2019-05-31 at 10 23 27

  • See that the CR still there with the finalizer metadata which should be removed by the operator.

Screenshot 2019-05-31 at 10 23 20

@camilamacedo86 camilamacedo86 changed the title CR with finalizer hang when the namespace is deleted because of Ansible operator is allowing the deletion of the operator before the CR to be deleted. CR with finalizer hang when the namespace is deleted because of Ansible operator is allowing the deletion of the operator before the deletion of the CR be accomplished. May 31, 2019
@camilamacedo86
Copy link
Contributor Author

c/c @jmazzitelli @rcernich

@jmazzitelli
Copy link

My original replication test code/procedures for this issue can also be found here, in case it is helpful: #1493 (comment)

@rcernich
Copy link

The operator being deleted before the CR is a different issue. If you configure the operator to watch a different namespace, create a CR, delete the namespace, you will see that the namespace and CR are not deleted. This will only occur if the finalization process takes longer than the grace period.

The reason for this is that k8s updates the deletion timestamp on the CR, which updates the resource version, so when you go to update the CR (i.e. remove the finalizer), you get a conflict error. The solution, as shown in the linked example from the old issue, is to get a fresh version of the CR, remove the finalize, and send the update. Note too, that the update loop for the namespace deletion is scheduled for every second, which means you have less than a second to get and update the CR. This is why you have to use oc patch and cannot use oc edit.

@jmazzitelli
Copy link

The operator being deleted before the CR is a different issue. If you configure the operator to watch a different namespace, create a CR, delete the namespace, you will see that the namespace and CR are not deleted.

this is an important point - the title of this github issue is not entirely accurate because it isn't about deleting the operator, its about deleting the namespace where the CR lives

@camilamacedo86
Copy link
Contributor Author

camilamacedo86 commented Jun 1, 2019

Hi @rcernich,

The operator being deleted before the CR is a different issue. If you configure the operator to watch a different namespace, create a CR, delete the namespace, you will see that the namespace and CR are not deleted. This will only occur if the finalization process takes longer than the grace period.

  • Please, check that in the above steps to reproduce the scenario is NOT using a sleep/timer. Has no action to be accomplished at all. The operator will just check the finalizer and update the instance to remove it
  • If the operator is NOT watching the namespace where the CR is applied how it can update the respective CR to remove the metadata? It never will works.

The reason for this is that k8s updates the deletion timestamp on the CR, which updates the resource version, so when you go to update the CR (i.e. remove the finalizer), you get a conflict error. The solution, as shown in the linked example from the old issue, is to get a fresh version of the CR, remove the finalize, and send the update. Note too, that the update loop for the namespace deletion is scheduled for every second, which means you have less than a second to get an update the CR. This is why you have to use oc patch and cannot use oc edit.

  • It is not the scenario faced at all. If your affirmation is true, we will see the conflict errors in the logs of the operator when it will try to do it. However, the Operator was removed before so your affirmation here is impossible since it NO LONGER EXIST and in this way cannot try to perform any update or face any issue.

  • It works when the deletion is applied directly to the CR specifically because the operator still there to update the CR in order to remove the finalizer which proves that your affirmation is not true at all. Note that if this scenario described for you was true the same behaviour would be faced when we run oc delete -f deploy/crds/cache_v1alpha1_memcached_cr.yaml since the operator will perform the actions to update the CR and face constantly the issues which not allows it to be accomplished as you describe.

@camilamacedo86
Copy link
Contributor Author

camilamacedo86 commented Jun 1, 2019

@lilic could you please give a hand here? Please, could you check and address this issue for the Ansible Operator?

@rcernich
Copy link

rcernich commented Jun 3, 2019

@jmazzitelli, do you see errors in your operator log, when it goes to remove the finalizer? https://github.com/operator-framework/operator-sdk/blob/master/pkg/ansible/controller/reconcile.go#L104

@camilamacedo86, you're missing the point. That same block of code, posted above will fail under the following circumstances:

  • The operator is installed in a separate namespace from the CR (i.e. the operator is watching all or another namespace)
  • The namespace containing the CR is deleted prior to deleting the CR.
  • The deletion process takes longer than 15-60 seconds

In that case, you will get a conflict error and the finalizer will most likely never be removed. (Even though the reconcile will be called again, if the finalization process now takes longer than one second, the update will fail.)

@camilamacedo86
Copy link
Contributor Author

camilamacedo86 commented Jun 3, 2019

@jmazzitelli and @rcernich,

Note that in the POC created and described in the first comment has NOT any time/sleep to delete the CR and it will take milliseconds.

I did this POC to isolate your problem and I just opened it to try to help you since its really hard to understand the #1493. I'd like to suggest you read with attention the first comment and reproduce the steps to understand the issue better as check that is possible in both projects remove the CR directly which proves that the assumption made for you folks is not the case at all.

However, please, feel free to open any other issue if you still think that it is not the case.

@rcernich
Copy link

rcernich commented Jun 3, 2019

@camilamacedo86 removing the CR directly is not a problem. The problem exhibits specifically when the namespace containing the CR is deleted without deleting the CR. When the namespace is deleted, k8s deletes all the resources in the namespace as part of the namespace's finalization. Because our CR contains a finalizer, the namespace finalizer needs to wait until the CR finalizer is removed. The problem, and upstream issue that was fixed in k8s 1.14+, is that the namespace finalization process continues to update the deletion timestamp (CR.metadata.deletionTimestamp), which updates the resource version (CR.metadata.resourceVersion) on the CR, which results in a conflict error when the controller goes to remove the finalizer from the CR.

I can sympathize with you, as this was a very difficult issue to track down. It's also intermittent, in that if the finalization process is quick enough, you never see the problem. It's also hard for a user to fix, because they have to use the patching mechanism, which most users probably haven't ever used.

@camilamacedo86
Copy link
Contributor Author

camilamacedo86 commented Jun 3, 2019

Hi @rcernich

My point is:

  • Note that when we deleted the CR directly or the namespaces in both cases the process is the same: the operator needs to remove the finalizer metadata to allow the CR to be deleted as its parents.

  • When we mark to delete the namespace the operator has been deleted BEFORE the CR and because of it nothing can remove the finalizer or face any issue. (Just for the Ansible ones, the same issue is not faced with Golang which is NOT using k8s 1.14+)

So, I'd like to make a few questions based on your comments:

The problem, and upstream issue that was fixed in k8s 1.14+, is that the namespace finalization process continues to update the deletion timestamp (CR.metadata.deletionTimestamp), which updates the resource version (CR.metadata.resourceVersion) on the CR, which results in a conflict error when the controller goes to remove the finalizer from the CR.

  1. Why the same bug is not faced in the Golang operators? (both are NOT using the k8s 1.14+)

  2. How can results in a conflict error when the controller goes to remove the finalizer from the CR. be true if the operator/controller does NOT exist? How a project/controller that was REMOVED and NO LONGER EXIST can face some issue or try to do something? Did you perform the test and try to check the logs?

It's also intermittent, in that if the finalization process is quick enough, you never see the problem.

We are able to face the same issue in both projects(#1493 (comment)) ALWAYS

  1. Did you read the first comment and try to follow the steps? Did you check that in the POC created has no action to perform just the finalizer so has no way that something is quicker than that?

@estroz estroz added the language/ansible Issue is related to an Ansible operator project label Jun 3, 2019
@jmazzitelli
Copy link

I am going to tell our QE folks that there is nothing we can do if they delete the namespace (which deletes the operator and attempts to delete the CR). Anyone who deletes the namespace that houses the ansible operator and CR will see this problem - so the solution is to not do that. I'll tell them not to delete the namespace without first deleting the operator and CR, let that delete finish, and then delete the namespace (where the namespace also houses the CR).

I suspect we need to have some kind of FAQ or doc in operator-sdk explicitly telling the user not to blindly delete the namespace where the operator / CR is.

@rcernich
Copy link

rcernich commented Jun 3, 2019

  • Note that when we deleted the CR directly or the namespaces in both cases the process is the same: the operator needs to remove the finalizer metadata to allow the CR to be deleted as its parents.

Correct.

  • When we mark to delete the namespace the operator has been deleted BEFORE the CR and because of it nothing can remove the finalizer or face any issue. (Just for the Ansible ones, the same issue is not faced with Golang which is NOT using k8s 1.14+)

Correct.

The problem, and upstream issue that was fixed in k8s 1.14+, is that the namespace finalization process continues to update the deletion timestamp (CR.metadata.deletionTimestamp), which updates the resource version (CR.metadata.resourceVersion) on the CR, which results in a conflict error when the controller goes to remove the finalizer from the CR.

  1. Why the same bug is not faced in the Golang operators? (both are NOT using the k8s 1.14+)

It does exist, see my response to your second question.

  1. How can results in a conflict error when the controller goes to remove the finalizer from the CR. be true if the operator/controller does NOT exist? How a project/controller that was REMOVED and NO LONGER EXIST can face some issue or try to do something? Did you perform the test and try to check the logs?

I think this is the point you are missing. In the scenario we're talking about, the operator exists in a different namespace from the one containing the CR and is able to perform finalization.

It's also intermittent, in that if the finalization process is quick enough, you never see the problem.

We are able to face the same issue in both projects(#1493 (comment)) ALWAYS

If you delete the operator first, yes, you will always see the problem. The issue we're trying to get addressed is when the operator exists and can remove the finalizer, but fails to because of the conflict error.

  1. Did you read the first comment and try to follow the steps? Did you check that in the POC created has no action to perform just the finalizer so has no way that something is quicker than that?

Yes. That's why I commented that the steps were incorrect for the issue @jmazzitelli was originally reporting.

@rcernich
Copy link

rcernich commented Jun 3, 2019

@camilamacedo86, sorry, I didn't realize that @jmazzitelli was collocating his operator with the CR. I don't believe there's a fix or workaround for that use case.

That said, the issue that I was talking about is a real issue and does affect all operators that use finalizers. However, the issue will only appear if their finalization processing exceeds the termination grace period used by the namespace deletion controller (which I think is 15s).

@camilamacedo86
Copy link
Contributor Author

camilamacedo86 commented Jun 3, 2019

Hi @rcernich and @jmazzitelli

Following some clarifications.

  • The same behaviour is faced ALWAYS and no matter the time
  • Golang operators are not facing this scenario and they are not using 1.14+ ks8. Just the ansible ones.
  • I did NOT delete the operator manually, the problem is that it has been deleted when should NOT before of the CR
  • Regards I think this is the point you are missing. In the scenario we're talking about, the operator exists in a different namespace from the one containing the CR and is able to perform finalization. It is NOT the scenario of the issue addressed here in the first comment or in finalizer deletion bug in k8s needs to be worked around in ansible operator #1493 (comment).

@camilamacedo86
Copy link
Contributor Author

camilamacedo86 commented Jun 3, 2019

Hi @rcernich and @jmazzitelli,

Unfortunately, shows that my comments still not clear. I think this issue open here has too many comments which are NOT helpful at all for who will try to solve it.

IMHO the best way for us to move forward is:

I will close this issue and open a new one to address this scenario/bug which is described here and it is the same that I could check in your project in #1493. Please, feel free to follow up but I'd like to kindly ask for leave this new one there for the maintainers be able to check.

Also, please feel free to raise new issues as you wish.
Thank you in advance for your understanding and collaboration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
language/ansible Issue is related to an Ansible operator project
Projects
None yet
Development

No branches or pull requests

4 participants