CR with finalizer hangs when the namespace is deleted because the Ansible operator allows the operator to be deleted before the deletion of the CR is complete. #1503
Comments
My original replication test code/procedures for this issue can also be found here, in case it is helpful: #1493 (comment) |
The operator being deleted before the CR is a different issue. If you configure the operator to watch a different namespace, create a CR, and then delete the namespace, you will see that the namespace and CR are not deleted. This only occurs if the finalization process takes longer than the grace period. The reason is that k8s updates the deletion timestamp on the CR, which updates the resource version, so when you go to update the CR (i.e. remove the finalizer), you get a conflict error. The solution, as shown in the linked example from the old issue, is to get a fresh version of the CR, remove the finalizer, and send the update. Note too that the update loop for the namespace deletion is scheduled for every second, which means you have less than a second to get and update the CR. This is why you have to use oc patch and cannot use oc edit. |
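To make the conflict described above concrete, here is a minimal sketch of the "get a fresh copy, remove the finalizer, update, retry on conflict" loop. It runs against a hypothetical in-memory stand-in for the API server, not against a real cluster and not against the operator-sdk code; `FakeAPIServer`, `ConflictError`, `remove_finalizers`, and the finalizer name are all made up for illustration.

```python
import copy


class ConflictError(Exception):
    """Raised when an update carries a stale resourceVersion."""


class FakeAPIServer:
    """Hypothetical in-memory stand-in for the API server, holding one CR."""

    def __init__(self):
        self._obj = {
            "metadata": {
                "name": "example-memcached",
                "resourceVersion": 1,
                "finalizers": ["finalizer.example.com"],  # assumed name
            }
        }

    def get(self):
        # Return a copy so callers cannot mutate the stored object directly.
        return copy.deepcopy(self._obj)

    def touch(self):
        # Simulate another writer (e.g. namespace deletion) bumping the version.
        self._obj["metadata"]["resourceVersion"] += 1

    def update(self, obj):
        # Optimistic concurrency: reject writes based on an outdated version.
        current = self._obj["metadata"]["resourceVersion"]
        if obj["metadata"]["resourceVersion"] != current:
            raise ConflictError("resourceVersion is stale")
        stored = copy.deepcopy(obj)
        stored["metadata"]["resourceVersion"] = current + 1
        self._obj = stored


def remove_finalizers(server, max_attempts=5, before_update=None):
    """Re-read the CR on every attempt so the update uses a fresh version.

    `before_update` is a test hook called with the attempt number just
    before the write; it lets a caller simulate a concurrent modification.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        fresh = server.get()
        fresh["metadata"]["finalizers"] = []
        if before_update is not None:
            before_update(attempt)
        try:
            server.update(fresh)
            return attempt
        except ConflictError:
            continue  # someone else wrote in between; fetch again and retry
    raise ConflictError("gave up after %d attempts" % max_attempts)
```

The point of the sketch is only that the read and the write have to happen close together and be retried as a pair; retrying the write alone with the same stale object would conflict every time.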
This is an important point: the title of this GitHub issue is not entirely accurate, because it isn't about deleting the operator; it's about deleting the namespace where the CR lives |
Hi @rcernich,
|
@lilic could you please give a hand here? Please, could you check and address this issue for the Ansible Operator? |
@jmazzitelli, do you see errors in your operator log when it goes to remove the finalizer? https://github.com/operator-framework/operator-sdk/blob/master/pkg/ansible/controller/reconcile.go#L104 @camilamacedo86, you're missing the point. That same block of code posted above will fail under the following circumstances:
In that case, you will get a conflict error and the finalizer will most likely never be removed. (Even though the reconcile will be called again, if the finalization process now takes longer than one second, the update will fail.) |
@jmazzitelli and @rcernich, note that the POC created and described in the first comment does NOT have any time/sleep when deleting the CR, so the deletion takes milliseconds. I did this POC to isolate your problem, and I opened it to try to help you, since #1493 is really hard to understand. I'd suggest reading the first comment carefully and reproducing the steps to understand the issue better. You can check that in both projects it is possible to remove the CR directly, which shows that the assumption you folks made is not the case. However, please feel free to open another issue if you still disagree. |
@camilamacedo86 removing the CR directly is not a problem. The problem exhibits specifically when the namespace containing the CR is deleted without deleting the CR. When the namespace is deleted, k8s deletes all the resources in the namespace as part of the namespace's finalization. Because our CR contains a finalizer, the namespace finalizer needs to wait until the CR finalizer is removed. The problem, and upstream issue that was fixed in k8s 1.14+, is that the namespace finalization process continues to update the deletion timestamp (CR.metadata.deletionTimestamp), which updates the resource version (CR.metadata.resourceVersion) on the CR, which results in a conflict error when the controller goes to remove the finalizer from the CR. I can sympathize with you, as this was a very difficult issue to track down. It's also intermittent, in that if the finalization process is quick enough, you never see the problem. It's also hard for a user to fix, because they have to use the patching mechanism, which most users probably haven't ever used. |
Hi @rcernich My point is:
So, I'd like to ask a few questions based on your comments:
We can ALWAYS reproduce the same issue in both projects (#1493 (comment))
|
I am going to tell our QE folks that there is nothing we can do if they delete the namespace (which deletes the operator and attempts to delete the CR). Anyone who deletes the namespace that houses the ansible operator and CR will see this problem - so the solution is to not do that. I'll tell them not to delete the namespace without first deleting the operator and CR, let that delete finish, and then delete the namespace (where the namespace also houses the CR). I suspect we need to have some kind of FAQ or doc in operator-sdk explicitly telling the user not to blindly delete the namespace where the operator / CR is. |
Correct.
Correct.
It does exist, see my response to your second question.
I think this is the point you are missing. In the scenario we're talking about, the operator exists in a different namespace from the one containing the CR and is able to perform finalization.
If you delete the operator first, yes, you will always see the problem. The issue we're trying to get addressed is when the operator exists and can remove the finalizer, but fails to because of the conflict error.
Yes. That's why I commented that the steps were incorrect for the issue @jmazzitelli was originally reporting. |
@camilamacedo86, sorry, I didn't realize that @jmazzitelli was collocating his operator with the CR. I don't believe there's a fix or workaround for that use case. That said, the issue that I was talking about is a real issue and does affect all operators that use finalizers. However, the issue will only appear if their finalization processing exceeds the termination grace period used by the namespace deletion controller (which I think is 15s). |
Hi @rcernich and @jmazzitelli, some clarifications follow.
|
Hi @rcernich and @jmazzitelli, unfortunately it seems that my comments are still not clear. I think this issue has too many comments that are NOT helpful for whoever will try to solve it. IMHO the best way for us to move forward is: I will close this issue and open a new one to address the scenario/bug described here, which is the same one that I could verify in your project in #1493. Please feel free to follow up, but I'd kindly ask that you leave the new issue as it is so the maintainers can check it. Also, please feel free to raise new issues as you wish. |
Bug Report
CR with finalizer hangs when the namespace is deleted because the Ansible operator allows the operator to be deleted before the deletion of the CR is complete.
NOTE: Issue opened in order to make clear the scenario/bug raised in #1493
What did you do?
What did you expect to see?
The CR, operator, and namespace are all deleted successfully.
What did you see instead? Under which circumstances?
The namespace is marked for deletion and the operator is deleted, but the CR is not, which prevents the namespace from being deleted and leaves it hanging.
Reason: The operator is deleted before the CR, so it can no longer remove the finalizer metadata from the CR, which causes the bug.
Workaround: manually remove the finalizer metadata from the CR, which the operator would have done had it not been deleted first. E.g.
oc patch memcached example-memcached -p '{"metadata":{"finalizers": []}}' --type=merge
OR
Delete the CR before deleting the namespace, so that the operator is still able to remove the finalizer metadata.
oc delete -f deploy/crds/cache_v1alpha1_memcached_cr.yaml
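For anyone who has not used the patch workaround before: `--type=merge` sends a JSON merge patch, where a list in the patch replaces the existing list wholesale, so the empty `finalizers` array clears every finalizer and leaves the rest of the CR alone. Below is a rough sketch of those merge semantics. The `merge_patch` helper and the sample CR (including its finalizer name and `spec`) are illustrative assumptions, not what oc or the API server actually run.

```python
def merge_patch(target, patch):
    """Apply a JSON merge patch to `target` and return the result.

    Dicts are merged key by key, a null (None) value deletes the key, and
    any other value, including a list, replaces the existing value.
    """
    if not isinstance(patch, dict):
        return patch
    if not isinstance(target, dict):
        target = {}
    result = dict(target)  # shallow copy; nested dicts are rebuilt below
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical CR stuck with a finalizer after the operator is gone.
cr = {
    "metadata": {
        "name": "example-memcached",
        "finalizers": ["finalizer.cache.example.com"],
    },
    "spec": {"size": 3},
}

# Same patch body as in the oc patch command above.
patched = merge_patch(cr, {"metadata": {"finalizers": []}})
```

After the call, `patched["metadata"]["finalizers"]` is an empty list, while `metadata.name` and `spec` are carried over unchanged.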
Environment
0.8.1
go version go1.12.5 darwin/amd64
Additional context
The following images illustrate the bug.