Use VxLAN overlay tunnels for inter-cluster traffic #140
Conversation
@mangelajo, @skitt, @tpantelis, @mpeterson Can you please review the patch, thank you.
I haven't finished my review yet; I'm pushing my initial comments.
(Please also shorten the commit message; since this is Submariner, “Use VxLAN overlay tunnels for inter-cluster traffic” works fine.)
Still reviewing, some basic comments while I finish.
Thanks for reviewing
Still haven't finished, but more comments.
klog.V(4).Infof("Insert rule to allow traffic over %s interface in FORWARDing Chain", VXLAN_IFACE)
ruleSpec = []string{"-o", VXLAN_IFACE, "-j", "ACCEPT"}
if err = r.insertUnique(ipt, "filter", "FORWARD", 1, ruleSpec); err != nil {
I suspect inserting/expecting rules at specific positions could be problematic when CNIs also interact with those rules. Do we have alternatives to this? Maybe creating our own chain, and handling the order inside our specific chain. Then we only need to ensure a single rule in FORWARD/INPUT/etc. that calls our chain.
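A minimal sketch of that approach, assuming the coreos/go-iptables package the route agent already uses (the SUBMARINER-FORWARD chain name is hypothetical):

    import "github.com/coreos/go-iptables/iptables"

    // ensureSubmarinerChain creates a dedicated chain plus a single jump rule
    // from FORWARD, so rule ordering only matters inside our own chain.
    func ensureSubmarinerChain(ipt *iptables.IPTables, vxlanIface string) error {
        if err := ipt.NewChain("filter", "SUBMARINER-FORWARD"); err != nil {
            // iptables exits with status 1 when the chain already exists.
            if e, ok := err.(*iptables.Error); !ok || e.ExitStatus() != 1 {
                return err
            }
        }
        // One jump rule is enough; AppendUnique avoids adding it twice.
        if err := ipt.AppendUnique("filter", "FORWARD", "-j", "SUBMARINER-FORWARD"); err != nil {
            return err
        }
        // Inside our own chain we fully control the ordering.
        return ipt.AppendUnique("filter", "SUBMARINER-FORWARD", "-o", vxlanIface, "-j", "ACCEPT")
    }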
We can create our own chain, but the requirement here is to prepend the rule at the beginning of the FORWARD chain so that Submariner behavior is preserved and SDN rules do not modify the behavior of the traffic.
Also, this is a single rule, as shown below, and I'm not sure if creating a new chain and adding this rule to the chain would make any difference. Agreed that if we have multiple rules, then yes, it's always a good idea to create a chain and program the rules in that chain (this is exactly what we are doing with the SUBMARINER-POSTROUTING chain).
iptables rule: iptables -I FORWARD -o vxlan100 -j ACCEPT
return fmt.Errorf("error listing the rules in %s chain: %v", chain, err)
}

if strings.Contains(rules[position], strings.Join(ruleSpec, " ")) {
If something external inserts another rule at this position, next time you won't see your rule there, and we will have leftover rules.
You probably also need to scan the remaining rules to make sure the rule doesn't exist at other positions, and possibly remove those copies.
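A sketch of that preventive cleanup, reworking insertUnique (ipt is assumed to be a coreos/go-iptables handle; the signature is illustrative, and Exists/Delete/Insert are real go-iptables calls):

    import (
        "fmt"
        "strings"

        "github.com/coreos/go-iptables/iptables"
    )

    // insertUnique ensures ruleSpec exists exactly once, at the given position:
    // if copies have drifted to other positions, remove them all and re-insert.
    func insertUnique(ipt *iptables.IPTables, table, chain string, position int, ruleSpec []string) error {
        rules, err := ipt.List(table, chain)
        if err != nil {
            return fmt.Errorf("error listing the rules in %s chain: %v", chain, err)
        }
        ruleStr := strings.Join(ruleSpec, " ")
        atPosition, elsewhere := false, false
        for i, rule := range rules {
            if strings.Contains(rule, ruleStr) {
                if i == position {
                    atPosition = true
                } else {
                    elsewhere = true
                }
            }
        }
        if atPosition && !elsewhere {
            return nil // already exactly where we want it, with no strays
        }
        // Delete every copy, then re-insert a single rule at the desired position.
        for {
            exists, err := ipt.Exists(table, chain, ruleSpec...)
            if err != nil {
                return err
            }
            if !exists {
                break
            }
            if err := ipt.Delete(table, chain, ruleSpec...); err != nil {
                return err
            }
        }
        return ipt.Insert(table, chain, position, ruleSpec...)
    }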
So far, based on my testing with a couple of CNIs, I've not seen that behavior, but I see your point.
We can add those checks, but it could be additional overhead if CNIs do not behave like that.
What are your thoughts on adding a "TODO/REVISIT" comment saying that if any such behavior is observed with any CNI, we should add the necessary validation?
I would do it preventively, as we won't ever have guarantees on how CNIs will behave. But not necessarily in this patch; maybe we can open an issue with medium priority so we don't forget? You could keep a link to this conversation/code.
link *net.Interface
vxlanDevice *vxLanIface
vxlanDevice is shared between the Pod and Endpoint processing and thus can be accessed by multiple threads, so it needs to be synchronized, or the same workqueue goroutine should be used for both Pod and Endpoint. I think the latter would be safest to avoid any potential concurrency issues (also wrt remoteVTEPs and isGatewayNode).
Once the vxlanDevice is created, it's only a reference to the Linux interface, and the vxlanDevice properties/values do not change over the course of time. When it's accessed via vxlanDevice.AddFdb/DelFdb, we are only using the index of the interface (as a read-only parameter) to program some bridge fdb entries on the host. There is no issue with this.
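For context, that fdb programming boils down to something like the following sketch, assuming the vishvananda/netlink package; the all-zeros MAC makes the kernel replicate unknown traffic to the given VTEP, as with "bridge fdb append 00:00:00:00:00:00 dev vx-submariner dst <ip>":

    import (
        "net"

        "github.com/vishvananda/netlink"
        "golang.org/x/sys/unix"
    )

    // addFDB appends a bridge fdb entry on the vxlan device; only the
    // interface index is read from the device, as noted above.
    func addFDB(linkIndex int, ipAddress net.IP, hwAddr string) error {
        mac, err := net.ParseMAC(hwAddr)
        if err != nil {
            return err
        }
        return netlink.NeighAppend(&netlink.Neigh{
            LinkIndex:    linkIndex,
            Family:       unix.AF_BRIDGE,
            Flags:        netlink.NTF_SELF,
            State:        netlink.NUD_PERMANENT | netlink.NUD_NOARP,
            Type:         netlink.NDA_DST,
            IP:           ipAddress,
            HardwareAddr: mac,
        })
    }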
One potential issue is that the vxlanDevice variable is not synchronized, so it's possible, albeit unlikely, that the thread processing pods never sees the value for vxlanDevice written by the thread processing endpoints, due to CPU caching. To be completely safe, all accesses of vxlanDevice need to be protected by a mutex, or vxlanDevice needs to be an atomic reference. Similarly with isGatewayNode.
Another possible issue is that you set isGatewayNode before vxlanDevice; however, without synchronization, it's possible that if the pod thread ran concurrently, it may observe the update to vxlanDevice but not isGatewayNode, which would lead to incorrect state (i.e., it may not call AddFDB when it should). There may be other concurrency issues lurking. It may be that things would eventually converge properly due to the 30 sec resync period, but it's very difficult/impossible to think of and test all such scenarios and ensure correctness. So I think it's safest to take concurrency out of the equation altogether by using the same workqueue for both pods and endpoints rather than using shared memory.
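A sketch of the mutex alternative, updating and reading both fields under one lock so the isGatewayNode/vxlanDevice pair is always observed consistently (the type and field names follow the discussion; the rest is assumed):

    import "sync"

    type Controller struct {
        mutex         sync.Mutex
        isGatewayNode bool
        vxlanDevice   *vxLanIface
    }

    // setGatewayState updates both fields under one lock so a reader can
    // never see isGatewayNode and vxlanDevice out of sync with each other.
    func (r *Controller) setGatewayState(isGateway bool, device *vxLanIface) {
        r.mutex.Lock()
        defer r.mutex.Unlock()
        r.isGatewayNode = isGateway
        r.vxlanDevice = device
    }

    // gatewayState takes a consistent snapshot for the pod-processing thread.
    func (r *Controller) gatewayState() (bool, *vxLanIface) {
        r.mutex.Lock()
        defer r.mutex.Unlock()
        return r.isGatewayNode, r.vxlanDevice
    }

The pod thread then works on the snapshot (isGW, device := r.gatewayState()) instead of reading the fields directly.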
Given that we can't enqueue removes, as mentioned in a prior comment, it wouldn't buy us anything to share the same workqueue, as we need shared memory anyway. However, I still think we should synchronize access to vxlanDevice and isGatewayNode atomically to be safe.
I'm not sure why this conversation was marked as resolved. We still need to address synchronization of vxlanDevice and isGatewayNode as discussed above.
Oops, it was accidentally marked as resolved.
Yeah @sridhargaddam, remember the thread we had about memory synchronization between CPU cores. If we don't use a mutex, we have no guarantee that vxlanDevice will be seen by threads other than the one which wrote it. Although caches are small, so it's very likely to work, what happens when it hits a processor with a bigger cache?
pod := obj.(*k8sv1.Pod)

// Add the POD event to the workqueue only if the sm-route-agent podIP does not exist in the local cache.
if !r.remoteVTEPs.Contains(pod.Status.HostIP) {
I'd suggest doing this check in processNextPod for safety.
The UpdateFunc handler would invoke the registered callback every sync duration (currently configured as 1 min) even if there is no change.
If we simply add it to the workqueue, in a large cluster I feel we would be adding unnecessary load.
The main purpose of this event is just to figure out the HOST/POD-IP; if the IP already exists in remoteVTEPs, we can avoid adding it to the workqueue and reading/processing it.
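Roughly, the shortcut under discussion looks like this in the handler (the workqueue name and wiring are assumptions based on the snippet above; MetaNamespaceKeyFunc is the standard client-go key helper):

    import (
        k8sv1 "k8s.io/api/core/v1"
        "k8s.io/client-go/tools/cache"
    )

    // enqueuePod adds the pod to the workqueue only when its host IP is not
    // already a known VTEP, so the periodic resync doesn't generate busywork.
    func (r *Controller) enqueuePod(obj interface{}) {
        pod, ok := obj.(*k8sv1.Pod)
        if !ok {
            return
        }
        if r.remoteVTEPs.Contains(pod.Status.HostIP) {
            return // host IP already cached; nothing to do
        }
        key, err := cache.MetaNamespaceKeyFunc(obj)
        if err != nil {
            return
        }
        r.podWorkqueue.Add(key)
    }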
Yeah, there can be hundreds of pods, so shortcutting should be fine. However, I'm wondering if we should only do this for an update...
IMHO I don't see any major advantage in that. I feel it's currently good, and we can revisit this code if required.
The current code looks OK to me too in this sense.
@sridhargaddam this fails before e2e is run (now we have the right Helm charts) because of this: https://travis-ci.com/submariner-io/submariner/builds/126219311#L1244

pkg/routeagent/controllers/route/route.go:313:19: S1005: should omit value from range; this loop is equivalent to `for fdbAddress := range ...` (gosimple)
    for fdbAddress, _ := range r.remoteVTEPs.Set {
                    ^
pkg/routeagent/controllers/route/route.go:621:17: S1005: should omit value from range; this loop is equivalent to `for cidrBlock := range ...` (gosimple)
    for cidrBlock, _ := range r.remoteSubnets.Set {
                   ^
[submariner]$ ./scripts/ci keep 1.14.2 false false
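The gosimple fix is simply to drop the unused blank identifier from both range loops:

    // Before (flagged by S1005):
    for fdbAddress, _ := range r.remoteVTEPs.Set {

    // After:
    for fdbAddress := range r.remoteVTEPs.Set {

and likewise for cidrBlock at route.go:621.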
Thank you @mangelajo. I'll update the code.
Force-pushed from e9510fa to ef95464.
// On the GatewayDevice, update the vxlan fdb entry (i.e., remote Vtep) for the newly added node.
if r.isGatewayNode {
    if r.vxlanDevice != nil {
        err := r.vxlanDevice.AddFDB(net.ParseIP(pod.Status.PodIP), "00:00:00:00:00:00")
It's possible r.vxlanDevice could be nil right after the check on line 383. To avoid this, you should first store r.vxlanDevice in a local var. Or eliminate concurrency as I suggested in another comment.
Why will vxlanDevice become nil? Once we create a VxLAN interface on the host, we will not delete it. Am I missing something?
In handleRemovedEndpoint, you set r.vxlanDevice to nil. Let's say thread 1 executes line 394 here and observes r.vxlanDevice non-nil. Thread 2 then interleaves and executes line 566 in handleRemovedEndpoint to set it to nil. Thread 1 resumes and executes line 395, but now r.vxlanDevice is nil. Doing the following would alleviate that potential issue:

localVxlanDevice := r.vxlanDevice
if localVxlanDevice != nil {
    localVxlanDevice.AddFDB(...)
}
However, there's still the issue of synchronizing updates to r.vxlanDevice and r.isGatewayNode across threads; for that you need a sync.Mutex.
Thanks for the explanation. The chances of an endpoint being removed, AFAIU, are very low (in general). Anyway, let's protect it with a Mutex, to be safe. Please take a look at the updated code.
Adds another E2E test to ensure that source IP address is preserved on connections. Depends-On: submariner-io#140
Force-pushed from b76f078 to 188dabb.
Use VxLAN overlay tunnels for inter-cluster traffic
1. Modified the vxlan interface to vx-submariner. 2. Error handling in iptables.go and route.go files. 3. Endpoint delete event.
Mostly formatting fixes, passing args and error handling.
Thanks for reviewing @tpantelis
Force-pushed from 59160c3 to 3a2ab8e.
Thanks very much @tpantelis for reviewing the patch and approving it. I had to update the patch to fix golangci-lint errors. @mangelajo, can you please take a look when you find time? Thanks.
Oh I see you already handle the mutex concerns, awesome, merging.
As part of supporting Network policies and for ease of debugging, this
patch implements the following.
1. Creates VxLAN tunnels in the local cluster between the worker nodes and the Cluster Gateway Node.
2. Programs the necessary iptables rules on the cluster nodes to allow inter-cluster traffic.
3. Avoids SNAT/MASQ for inter-cluster traffic, thereby preserving the original source IP of the POD all the way to the destination POD.
4. Programs the routing rules on the worker nodes to forward remote-cluster traffic over the VxLAN interface that is created between the worker node and the Cluster Gateway Node.
This patch depends on the following other patches:
Depends-On: #135
Depends-On: submariner-io/submariner-charts#3
Depends-On: submariner-io/submariner-charts#4
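For reference, creating the local VxLAN tunnel device (item 1 above) with the vishvananda/netlink package looks roughly like this sketch; the vx-submariner name comes from a later commit, while the VNI, port, and function signature are illustrative assumptions:

    import (
        "net"

        "github.com/vishvananda/netlink"
    )

    // newVxlanIface creates and brings up the overlay device used between the
    // worker nodes and the cluster gateway node.
    func newVxlanIface(vtepIP net.IP, underlayIndex int) (netlink.Link, error) {
        vxlan := &netlink.Vxlan{
            LinkAttrs:    netlink.LinkAttrs{Name: "vx-submariner"},
            VxlanId:      100,           // illustrative VNI
            SrcAddr:      vtepIP,        // local VTEP address
            VtepDevIndex: underlayIndex, // underlay interface carrying the tunnel
            Port:         4789,          // standard VxLAN UDP port
        }
        if err := netlink.LinkAdd(vxlan); err != nil {
            return nil, err
        }
        if err := netlink.LinkSetUp(vxlan); err != nil {
            return nil, err
        }
        return vxlan, nil
    }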