proposal: Live migration for bridged pod network #182
I would be interested to understand the motivation for having a disruptive migration.
If such a migration can cause up to a minute of disruption, aren't there other means to recover a VM on a different node? I was under the impression that live migration with high connectivity disruption can cause more harm than good (applications keep running while the disruption occurs, versus applications where the whole VM freezes and is restored).
From my tests this happens within a few seconds, not minutes, so even TCP sessions do not have time to break.
How do TCP connections survive an IP address change?
Some CNIs support specifying an IP address per pod. In that case the IP address does not change, but the VM still needs to renew its DHCP lease to update the routes inside the guest.
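For illustration only, renewing the lease from inside the guest could look roughly like the sketch below; the interface name `eth0` and the use of `dhclient` are assumptions, not something the proposal prescribes:

```sh
# Hypothetical in-guest steps after the migration completes (assumes eth0 and dhclient).
sudo dhclient -r eth0   # release the current DHCP lease
sudo dhclient eth0      # request a new lease so the advertised routes are reinstalled
ip route show           # verify the default route points at the expected gateway
```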
They won't survive. They will not survive even link flickering... those TCP connections will just reset.
But this is the case with masquerade as well, so this is not something new.
I was more worried about the duration of the downtime (e.g. in the case of attach/detach).
Well, I just performed a test using the following commands:
Server VM:
Client VM:
Now I can confirm that the TCP connections survive as long as the IP address is not changed.
I also checked a flood ping in both cases, with the MAC address unchanged and with it changed:
When the MAC address stays the same (interface link brought down and up again): 47 packets lost.
When the MAC address changes (network card reattached): 54 packets lost.
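For anyone who wants to reproduce a similar measurement, here is a rough sketch; these are not the original commands (which are omitted above), and the use of netcat, the port number and the placeholder addresses are assumptions:

```sh
# Sketch only: hold a long-lived TCP session across the migration and count lost packets.
# Server VM: keep a TCP listener open (OpenBSD netcat syntax assumed).
nc -lk 8080
# Client VM: open a connection to the server VM's address and keep it alive during migration.
nc <server-vm-ip> 8080
# Separate host: flood ping the migrating VM (requires root) and read the loss summary.
sudo ping -f <migrating-vm-ip>
```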
I think we care a lot about this type of workload. I don't think you can have this as a non-goal - you need to account for it in your design, IMO.
@AlonaKaplan can you chime in?
Well yeah, the IP changes, so processes inside the guest that are using the IP will be affected, same as external processes. Documenting it as a first stage should be enough.
I'd like to have some user stories that focus on why we'd want to live migrate with bridge mode, given the limitations this functionality imposes on applications.
For example, what kinds of applications and scenarios tolerate this kind of live migration, where the IP and MAC of a VM change at runtime?
I would actually be more interested in what apps / scenarios do not tolerate this kind of live migration.
Isn't it enough? Or should I specify cases with exact applications (e.g. an apache2 server configured to bind on the pod IP instead of 0.0.0.0, and so on)?
There is also the topic of maintaining this special flow.
Unfortunately, our code is not well suited to introducing a new option while keeping it isolated and centralized. When the additions are scattered across many areas, it becomes harder to maintain.
That said, it may well be worth the investment for sig-network to maintain it if it gets enough traction, interest and, in the end, real usage.
I think it is worth raising these points now, so we can be clear that even if this feature is accepted as an Alpha, its existence in the long run depends on adoption and usage. I guess this is similar to the Kubernetes process, but we should have one as well, especially for such controversial features.