BUG 1836141: Assert VM exists if VM state is Creating #147
@JoelSpeed: This pull request references Bugzilla bug 1836141, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
/lgtm
Currently, when a VM is being created in Azure, we first check whether the VM exists and then attempt to create it if not. On the first attempt to create, we always get an asynchronous timeout:

```
E0709 16:44:25.752572 1 actuator.go:78] Machine error: failed to reconcile machine "jspeed-test-8cnpg-worker-centralus3-g6htz"s: failed to create vm jspeed-test-8cnpg-worker-centralus3-g6htz: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
```

This causes the Machine to be requeued. Because `Exists` currently does not consider the machine to exist, we attempt to create it again and get the following error:

```
E0709 16:44:26.642473 1 actuator.go:78] Machine error: failed to reconcile machine "jspeed-test-8cnpg-worker-centralus3-g6htz"s: failed to create vm jspeed-test-8cnpg-worker-centralus3-g6htz: vm jspeed-test-8cnpg-worker-centralus3-g6htz is still in provisioning state Creating, reconcile
```

The VM exists immediately after the first call to create it, but we attempt to recreate it because `Exists` claims that it does not exist. This can lead to issues if there is a transient error after the first `Create` but before the VM becomes `Running`. We could see an error, determine that the creation failed and move the Machine to the `Failed` phase. If this happens, the VM will still start, but because the Machine is Failed, we do not track it and we do not remove it if the Machine is deleted. Therefore we can leak VMs.

This PR fixes `Exists` such that if the VM exists on the API, but is in the `Creating` phase, it is considered to exist. Now we do not attempt to create the VM more than once and, if the first VM creation is successful (even with the async error), the Machine is considered to exist, so we will not fail a Machine while the VM is in the `Creating` phase.
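The existence rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `reconciler.go`: the `virtualMachine` type, the `provisioningState` constants and the `vmExists` helper are hypothetical names, and treating `Succeeded` as the "running" state is an assumption.

```go
package main

import "fmt"

// provisioningState is a hypothetical stand-in for the provisioning
// state reported by the cloud API.
type provisioningState string

const (
	stateCreating  provisioningState = "Creating"
	stateSucceeded provisioningState = "Succeeded" // assumed "up and running" state
	stateFailed    provisioningState = "Failed"
)

// virtualMachine is a hypothetical, minimal view of the VM returned by the API.
type virtualMachine struct {
	Name              string
	ProvisioningState provisioningState
}

// vmExists reports whether the Machine should be treated as backed by a VM.
// A nil vm models "not found on the API".
func vmExists(vm *virtualMachine) bool {
	if vm == nil {
		return false
	}
	switch vm.ProvisioningState {
	case stateSucceeded, stateCreating:
		// A VM that is still being created already exists on the API,
		// so the caller must not try to create it a second time.
		return true
	default:
		// Any other state (for example Failed) is not treated as existing.
		return false
	}
}

func main() {
	for _, s := range []provisioningState{stateCreating, stateSucceeded, stateFailed} {
		vm := &virtualMachine{Name: "example-vm", ProvisioningState: s}
		fmt.Printf("state=%s exists=%t\n", s, vmExists(vm))
	}
	fmt.Printf("not found exists=%t\n", vmExists(nil))
}
```

The point of the sketch is only that `Creating` joins the set of states that count as "exists"; which other states belong in that set is left to the real implementation.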
e8d6568 to 9a4d561
Just noticed that this change will prevent us from entering this block of code: cluster-api-provider-azure/pkg/cloud/azure/actuators/machine/reconciler.go, lines 607 to 632 in fa840d1.
However, if the VM goes to Failed, `Exists` will return false while the Machine is already provisioned, so the Machine will go to Failed anyway (which is arguably more idiomatic behaviour than retrying), and the other details are updated regardless.
I'd be tempted to keep the scope of this narrow and create a separate BZ for the fact that we should have a deleting check here too. Currently we remove the node object before the VM is gone, which is not the intention of the code in the machine controller: it checks that the VM exists and is meant to wait for it to go before removing the node and finalizer. So we are artificially saying the VM is gone before it actually is.
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: enxebre
/lgtm
@JoelSpeed: All pull requests linked via external trackers have merged: openshift/cluster-api-provider-azure#147. Bugzilla bug 1836141 has been moved to the MODIFIED state.
What this PR does / why we need it:
Currently, when a VM is being created in Azure, we first check whether the VM exists and then attempt to create it if not.
On the first attempt to create, we always get an asynchronous timeout (`asynchronous operation has not completed`).
This causes the Machine to be requeued. Because `Exists` currently does not consider the machine to exist, we attempt to create it again and get an error reporting that the VM is still in provisioning state `Creating`.
The VM exists immediately after the first call to create it, but we attempt to recreate it because `Exists` claims that it does not exist. This can lead to issues if there is a transient error after the first `Create` but before the VM becomes `Running`. We could see an error, determine that the creation failed and move the Machine to the `Failed` phase.
If this happens, the VM will still start, but because the Machine is Failed, we do not track it and we do not remove it if the Machine is deleted. Therefore we can leak VMs.
This PR fixes `Exists` such that if the VM exists on the API, but is in the `Creating` phase, it is considered to exist. Now we do not attempt to create the VM more than once and, if the first VM creation is successful (even with the async error), the Machine is considered to exist, so we will not fail a Machine while the VM is in the `Creating` phase.