proposal: Live migration for bridged pod network #182
/assign @maiqueb @rthallisey
(force-pushed af65532 to 17fb66c)
/cc @AlonaKaplan
Thanks for taking the time to come up w/ this proposal.
I'm mostly worried / concerned about the workloads that are bound to certain addresses.
I'm unsure we can accept this without offering a way to protect those.
@AlonaKaplan would you review this proposal as well?
The live migration of virtual machines between pods with different MAC addresses will invoke a NIC reattach procedure. This might affect applications that bind to a specific interface inside the VM.
The live migration of virtual machines between pods with the same MAC address will invoke a link down/up procedure so that the VM renews its DHCP lease, IP address, and routes. This is a less destructive operation, but it may still affect some workloads.
I think we care a lot about this type of workload. I don't think you can have this as a non-goal; you need to account for this in your design, imo.
@AlonaKaplan can you chime in ?
Well yeah, the IP is changed, so processes inside the guest that are using the IP will be affected, same as external processes. Documenting it as a first stage should be enough.
(force-pushed 47b1b30 to eb0d5dd)
Thank you for this interesting proposal!
I like it and at the same time am a bit worried.
It is not very clear to me how much potential adoption it may have due to the disruptions and limitations. But perhaps this is a good opportunity to try it and see.
I have commented inline about a few questions and concerns.
Thank you for bringing this to the table.
# Design
To add a new feature gate `NetworkAwareLiveMigration`, which enables two additional methods in the live-migration procedure:
- I think the name better be specific here, e.g. `DisruptiveBridgeBindingMigration`.
- Could you please elaborate on when in the (migration) flow you plan to execute the detach/attach operation?
- Could you please elaborate on when in the (migration) flow you plan to execute the link down/up operation?
- If I understand correctly, the IP address is always expected to change, no matter if the MAC is changed or not. Right?
I think the name better be specific here, e.g. DisruptiveBridgeBindingMigration.
I wouldn't encode any specific mode in the feature gate name. A follow-up PR is upcoming; it will bring a macvtap binding mode which will also work with this feature.
- Could you please elaborate on when in the (migration) flow you plan to execute the detach/attach operation?
- Could you please elaborate on when in the (migration) flow you plan to execute the link down/up operation?
That was described here: community/design-proposals/live-migration-for-bridged-pod-network.md, lines 26 to 32 in eb0d5dd:
- If the MAC address changed: detach / attach the interface after live migration so the correct MAC address is set.
  The live migration of virtual machines between pods with different MAC addresses will invoke a NIC reattach procedure. This may affect applications that bind to a specific interface inside the VM.
- If the MAC address is not changed: link down / link up the interface to force the VM to request a new IP address and routes from DHCP.
  The live migration of virtual machines between pods with the same MAC address will invoke a link down/up procedure so that the VM renews its DHCP lease, IP address, and routes. This is a less destructive operation, but it may still affect workloads listening on particular IP addresses.
If you are asking about when exactly this is going to happen: sure, I will add the information that this will happen after a successful live migration.
If I understand correctly, the IP address is always expected to change, no matter if the mac is changed or not. Right?
Not exactly. Some CNIs allow specifying the IP address in an annotation; that way the IP address will not change, but we still need to link down and link up the interface to make sure the VM has the correct routes.
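To make the two flows a bit more concrete, here is a purely illustrative sketch of what they amount to at the libvirt / guest level. The domain name, interface names and MAC addresses below are made up, and in the actual design KubeVirt would drive these steps itself after a successful migration:

```sh
# MAC changed: reattach the NIC so the guest sees the new MAC address
# (roughly what a detach / attach amounts to through libvirt).
virsh detach-interface my-vm bridge --mac 02:00:00:00:00:01
virsh attach-interface my-vm bridge k6t-eth0 --mac 02:00:00:00:00:02 --model virtio

# MAC unchanged: bounce the link so the guest's DHCP client renews its lease
# and re-learns the IP address and routes. Seen from inside the guest, the
# effect is roughly equivalent to:
sudo ip link set eth0 down && sudo ip link set eth0 up
sudo dhclient -r eth0 && sudo dhclient eth0
```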
What's the criteria for this feature to graduate from a feature flag to being default enabled?
What's the criteria for this feature to graduate from a feature flag to being default enabled?
It would need to function for the majority of guest OSs and not cause undesirable behavior for existing installs. I'm unsure if this will reach a point where we can make it on by default, but it's a possibility.
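For reference, enabling the proposed gate on a test cluster might look roughly like the sketch below. The gate name follows the proposal and could still change per the naming discussion above; the KubeVirt CR name/namespace and the presence of an existing featureGates list are assumptions.

```sh
# Append the proposed feature gate to the KubeVirt CR (assumes the CR is named
# "kubevirt" in the "kubevirt" namespace and that the featureGates list exists;
# the gate name follows the proposal and may change).
kubectl patch kubevirt kubevirt -n kubevirt --type=json -p \
  '[{"op":"add","path":"/spec/configuration/developerConfiguration/featureGates/-","value":"NetworkAwareLiveMigration"}]'
```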
## Motivation
Masquerade and slirp are not so performant compared to bridge mode.
We want to use the lower-latency modes with the opportunity to live-migrate the virtual machines.
I would be interested to understand the motivation to have a disruptive migration.
If such a migration can cause up to a minute of disruption, aren't there other means to recover a VM on a different node? I was under the impression that live migration with high disruption of connectivity can cause more harm than good (applications keep running while the disruption occurs, vs. the whole VM freezing and being restored).
If such a migration can cause up to a minute of disruption, aren't there other means to recover a VM on a different node?
From my tests this happens within a few seconds, not minutes, so even TCP sessions do not have time to break.
How do TCP connections survive an IP address change?
Some CNIs support specifying the IP address per pod. In that case the IP address does not change, but the VM still needs to renew its DHCP lease to update the routes inside the VM.
They won't survive. They will not survive even link flickering... those TCP connections will just reset.
But this is the case with masquerade as well, so this is not something new.
I was more worried about the duration of the downtime (e.g. in the case of attach/detach).
Well, I just performed a test. I used the following commands:
server vm:
mkdir cgi-bin
printf '%s\n' '#!/bin/sh' 'cat /dev/urandom' > cgi-bin/1.sh
chmod +x cgi-bin/1.sh
python -m http.server --cgi
client vm:
curl 10.10.10.1:8000/cgi-bin/1.sh > /dev/null
Now I can confirm that the TCP connections survive if the IP address is not changed.
I also checked flood ping in both cases, when the MAC address changes and when it does not. Results:
When the MAC address is the same (link down and link up of the interface):
47 packets lost
[fedora@vm1 ~]$ sudo ping -f 10.10.10.2
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.
...............................................
--- 10.10.10.2 ping statistics ---
72544 packets transmitted, 72497 received, 0.0647883% packet loss, time 14943ms
rtt min/avg/max/mdev = 0.069/0.117/8.138/0.160 ms, ipg/ewma 0.205/0.106 ms
When the MAC address is changed (reattach of the network card):
54 packets lost
[fedora@vm1 ~]$ sudo ping -f 10.10.10.2
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.
......................................................
--- 10.10.10.2 ping statistics ---
102443 packets transmitted, 102389 received, 0.0527122% packet loss, time 20785ms
rtt min/avg/max/mdev = 0.069/0.116/39.293/0.222 ms, pipe 4, ipg/ewma 0.202/0.355 ms
- live-migrate client VM
- Run ping from client to server
# Implementation Phases |
There is the topic of maintaining this special flow.
Unfortunately, our code is not well suited to introducing a new option while keeping it isolated and centralized. When the additions are scattered across many areas, it becomes harder to maintain.
Well, it may well be worth sig-network investing in maintaining it if it gets enough attention, interest, and, in the end, real usage.
I think it is worth raising these points now, so we can be clear that even if this feature is accepted as an Alpha, its existence in the long run depends on adoption and usage. I guess this is similar to the Kubernetes process, but we should have one as well, especially for such controversial features.
## User Stories
* As a user / admin, I want to have an opportunity for live-migration of a VM with bridged pod-network.
I'd like to have some user stories that focus on why we'd want to live migrate with bridge mode, given the limitations this functionality imposes on applications.
So for example, what kinds of applications and scenarios tolerate this kind of live migration, where the IP and MAC change for a VM during its runtime.
I would actually be more interested in what apps / scenarios do not tolerate this kind of live migration.
This may affect some applications which are binding to the specific interface inside the VM.
Isn't that enough? Or should I specify cases with exact applications (e.g. an apache2 server configured to bind to the pod IP instead of 0.0.0.0, and so on)?
- Create two VMs: client and server in bridge mode
- Wait for launch
- Get IP address for server VM
- Run ping from client to server
- live-migrate client VM
- Run ping from client to server
Like I mentioned in the PR, I'm curious how pinging the same server IP after the live migration works. My expectation was that the VMI responds on a new IP address after the migration.
Likewise. I share your expectation, and I think all established TCP connections will go down.
This is the same approach as we have in the tests for macvtap-cni. It's not the same ping command, but two different ones, e.g.:
- Run ping from client to server (client has an IP 10.10.10.1, server has IP 10.10.10.20, server responding to 10.10.10.1)
- live-migrate client VM (client has an IP 10.10.10.2, server has IP 10.10.10.20, ping is not running)
- Run ping from client to server (client has an IP 10.10.10.2, server has IP 10.10.10.20, server responding to 10.10.10.2)
- The server is not the one migrated, it is the client (line 49). So initiating a ping from the client should work fine.
- The test is on ICMP packets, not TCP.
- This is not much different from the masquerade binding tests. In masquerade binding, the visible IP also changes, so this is not something new.
- TCP connections are dropped/reset in masquerade binding as well; our existing migration is not smooth either.
I would add TCP/UDP tests here as well, including ones with a service (to show that even if the IP changes, access through the service is eventually restored).
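A rough sketch of the kind of TCP-through-a-Service check suggested here, under the assumption that the server VM from the scenario above is the one being migrated; the VM, Service and port names are hypothetical:

```sh
# Expose the server VM through a Service, check reachability, migrate, re-check.
virtctl expose virtualmachine server-vm --name server-svc --port 8000

# Reachability check through the Service from a throwaway pod in the cluster.
kubectl run curl-check --rm -i --restart=Never --image=curlimages/curl -- \
  curl -sS --max-time 5 http://server-svc:8000/ -o /dev/null

virtctl migrate server-vm

# After the migration settles, the same check should eventually succeed again,
# even if the pod IP behind the Service has changed.
kubectl run curl-check --rm -i --restart=Never --image=curlimages/curl -- \
  curl -sS --max-time 5 http://server-svc:8000/ -o /dev/null
```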
Hi @kvaps, thank you very much for the proposal and the PR. The idea generally looks good. @maiqueb is currently working on it and we will update this PR once moving the bridge binding to a CNI is ready.
I want to make it clear that you should adjust your expectations: I started to look into this very recently, and currently can't provide a timeline for it. I'll CC you both on the design document once I have a plan set up for it. Unfortunately, this is not my current primary focus, meaning it will take some time. The motivation behind it is to enable you (and anyone interested in providing another binding type / custom binding behavior) to do so and experiment with it, while keeping the maintainability of the KubeVirt project manageable. I would welcome any sort of input / collaboration on this task going forward, @kvaps.
@AlonaKaplan @maiqueb this PR does not bring any new binding modes to KubeVirt; it just allows live migration in case you use bridge.
## Motivation
Masquerade and slirp are not so performant compared to bridge and macvtap modes.
Masquerade and slirp are not so performant compared to bridge mode.
To be honest, I never thought about performance as a reason to use bridge binding on the primary network.
If you need better performance, why not use a secondary network?
Can you please elaborate on the use case?
We have no other option to expose a VM as a Kubernetes service.
We still need to communicate with the other pods in the cluster and reuse cloud load balancers or MetalLB.
We are forced to use the pod network in any case where we want to communicate with Kubernetes.
The pod network also provides unified security policies which just work out of the box.
Did you try to use masquerade with multiqueue?
These are the results of a comparison we did in the past (masquerade binding, 8 vCPU, 8 multiqueue). Masquerade overhead is minimized with multiqueue.
Masquerade vs. bridge binding:
- TCP Stream: -37% (no mq), -13% (mq)
- UDP Stream: -2% (no mq), -6% (mq)
- TCP Latency: -24% (no mq), -3% (mq)
- UDP Latency: -6% (no mq), -2% (mq)
@AlonaKaplan which tool did you use to benchmark this?
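The tool used for these numbers isn't stated here, but the four test names map directly onto netperf's TCP_STREAM, UDP_STREAM, TCP_RR and UDP_RR tests, so a comparable measurement could be reproduced roughly as below (illustrative only; the peer address is hypothetical and assumes netserver is running in the other VM):

```sh
# Run the four netperf tests against a peer VM; repeat once with masquerade
# binding and once with bridge binding to compare the two.
for t in TCP_STREAM UDP_STREAM TCP_RR UDP_RR; do
  netperf -H 10.10.10.20 -t "$t" -l 30
done
```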
This and the other PR made us think about it, and we realized there is a need to allow easily adding new bindings / extending the existing ones. Bridge binding on the primary network is a legacy binding. We still support it and we are fixing bugs in it, but we try to avoid adding new functionality to it. However, once each binding is implemented as a separate CNI, it will be possible to add whatever binding CNI you need or extend an existing one, without the need to get approval from the KubeVirt team.
@kvaps throwing in another idea that you may use until the CNI binding mechanism is introduced.
/remove-lifecycle stale
@kvaps I think this is close. Do you plan on doing another revision?
This patch works nicely in our fork. It can't be implemented using hooks, as KubeVirt blocks all migrations with bridge binding on pod networking. So I'm waiting for the plug-in interface to implement this as an external plugin for KubeVirt. The design of this feature is not going to change anyway, so I'd like to keep this proposal open.
(force-pushed eb0d5dd to f8fc617)
[APPROVALNOTIFIER] This PR is NOT APPROVED.
I would like this too... We are using Calico with a dedicated IPPool for VMs and specifying a fixed IP from it via a Calico annotation. So the VM has the same IP as the virt-launcher Pod, and this IP stays the same even if it is recreated on a new node. We don't want masquerade because of the extra NAT added even between two VMs talking on the same network, and because VMs won't know the actual IP configured inside them.
With Multus, an L2 model, DHCP and kubemacpool, this is fine for some cases, but it doesn't let us scale network creation, as it requires us to tag infra and add physical gateways for each L2 network. We tried kube-ovn to allow virtual networks, but hit many issues; apparently it doesn't work well and isn't tested as a secondary CNI nor with KubeVirt. Multus networks also aren't first-class K8s citizens, and there are no real firewalling/security-group features.
So we'd really like to use a primary CNI like Calico for its feature set and L3 networking, use a fixed IP for VMs, and have it in bridge mode so the VM doesn't have a private IP and extra NAT. Why would live migration be an issue in this case?
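For illustration, the fixed-IP part of the setup described above might look roughly like this. The VM name and address are made up, the address must come from the dedicated IPPool mentioned, and it assumes (as the comment describes) that annotations on the VM's pod template end up on the virt-launcher pod:

```sh
# Pin the virt-launcher pod IP with Calico's cni.projectcalico.org/ipAddrs
# annotation on the VM's pod template (hypothetical VM name and address).
kubectl patch vm my-vm --type=merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"cni.projectcalico.org/ipAddrs":"[\"10.240.0.15\"]"}}}}}'
```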
Does anybody know why a pod network with bridge mode would be a problem, or why Multus with bridge works fine? What if I use a Multus default interface with bridge mode?
In the current KubeVirt code, the below usage is allowed to live migrate.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
I migrated my fork from
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
@kvaps Heya - now that plugins have landed, what's the state of this proposal, and do you need something to get it over the line?
Hi @aburdenthehand, I think this approach can be implemented using the new pluggable API, but this will not work with the default in-tree plugins and third-party out-of-tree plugins. Right now I have no sponsors interested in this feature, so I think it can be closed.
Okay I'll close it for now. /close
@kvaps: Closed this PR.
https://kubernetes.slack.com/archives/C8ED7RKFE/p1652296916563589?thread_ts=1650371688.593359&cid=C8ED7RKFE