
Use VxLAN overlay tunnels for inter-cluster traffic #140

Merged (9 commits) on Sep 16, 2019

Conversation

@sridhargaddam (Member) commented on Sep 2, 2019:

As part of supporting network policies, and for ease of debugging, this patch implements the following:

  1. Creates VxLAN tunnels in the local cluster between the worker nodes and
     the cluster Gateway Node.
  2. Programs the necessary iptables rules on the cluster nodes to allow
     inter-cluster traffic.
  3. Avoids SNAT/MASQUERADE for inter-cluster traffic, thereby preserving the
     original source IP of the pod all the way to the destination pod.
  4. Programs the routing rules on the worker nodes to forward remote-cluster
     traffic over the VxLAN interface created between the worker node and the
     cluster Gateway Node (see the sketch after the dependency list below).

This patch depends on the following other patches:

Depends-On: #135
Depends-On: submariner-io/submariner-charts#3
Depends-On: submariner-io/submariner-charts#4
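For illustration, a minimal sketch of what steps 1 and 4 amount to on a worker node, using the vishvananda/netlink package; the interface name, VNI, UDP port, and the VTEP/CIDR addresses below are assumptions for the example, not the values hard-coded in this patch:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// Create the VXLAN tunnel interface towards the cluster Gateway Node.
	// Name, VNI and UDP port here are illustrative assumptions.
	vxlan := &netlink.Vxlan{
		LinkAttrs: netlink.LinkAttrs{Name: "vx-submariner"},
		VxlanId:   100,
		Port:      4800,
		SrcAddr:   net.ParseIP("10.0.0.2"), // this worker node's underlay IP
	}
	if err := netlink.LinkAdd(vxlan); err != nil {
		log.Fatalf("failed to create VXLAN interface: %v", err)
	}
	if err := netlink.LinkSetUp(vxlan); err != nil {
		log.Fatalf("failed to bring up VXLAN interface: %v", err)
	}

	// Forward remote-cluster traffic over the tunnel: route an assumed
	// remote cluster CIDR via the Gateway Node's (assumed) VTEP address.
	_, remoteCIDR, _ := net.ParseCIDR("10.245.0.0/16")
	route := &netlink.Route{
		LinkIndex: vxlan.Attrs().Index,
		Dst:       remoteCIDR,
		Gw:        net.ParseIP("240.0.0.1"), // assumed gateway VTEP
	}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatalf("failed to add route: %v", err)
	}
}
```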

@sridhargaddam (Member, Author) commented:

@mangelajo, @skitt, @tpantelis, @mpeterson, can you please review the patch? Thank you.

@skitt (Member) left a comment:

I haven't finished my review yet; I'm pushing my initial comments.

(Please also shorten the commit message; since this is Submariner, “Use VxLAN overlay tunnels for inter-cluster traffic” works fine.)

@sridhargaddam changed the title from “Enhance Submariner to use VxLAN Overlay Tunnels for inter-cluster tra…” to “Use VxLAN overlay tunnels for inter-cluster traffic” on Sep 3, 2019
@mangelajo (Contributor) left a comment:

Still reviewing; some basic comments while I finish.

@sridhargaddam (Member, Author) left a comment:

Thanks for reviewing

@mangelajo (Contributor) left a comment:

Still haven't finished, but more comments.


klog.V(4).Infof("Insert rule to allow traffic over %s interface in FORWARDing Chain", VXLAN_IFACE)
ruleSpec = []string{"-o", VXLAN_IFACE, "-j", "ACCEPT"}
if err = r.insertUnique(ipt, "filter", "FORWARD", 1, ruleSpec); err != nil {
@mangelajo (Contributor):

I suspect inserting/expecting rules at specific positions could be problematic when CNIs also interact with those rules. Do we have alternatives to this? Maybe we could create our own chain and handle the ordering inside it; then we only need to ensure a single rule in FORWARD/INPUT/etc. that calls our chain.

@sridhargaddam (Member, Author):

We can create our own chain, but the requirement here is to prepend the rule at the beginning of the FORWARD chain so that Submariner's behavior is preserved and SDN rules do not modify the behavior of the traffic.

Also, this is a single rule, as shown below, and I'm not sure creating a new chain and adding this rule to it would make any difference. Agreed that if we have multiple rules, it's always a good idea to create a chain and program the rules there (this is exactly what we are doing with the SUBMARINER-POSTROUTING chain).

iptables rule: iptables -I FORWARD -o vxlan100 -j ACCEPT
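For context, a rough sketch of the dedicated-chain alternative discussed above, using the github.com/coreos/go-iptables package; whether to adopt this was left open in the thread, and the chain name SUBMARINER-FORWARD is hypothetical:

```go
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical Submariner-owned chain; the ordering of rules inside
	// it is fully under our control, regardless of what CNIs do.
	if err := ipt.NewChain("filter", "SUBMARINER-FORWARD"); err != nil {
		log.Printf("chain may already exist: %v", err)
	}
	if err := ipt.AppendUnique("filter", "SUBMARINER-FORWARD",
		"-o", "vx-submariner", "-j", "ACCEPT"); err != nil {
		log.Fatal(err)
	}

	// A single jump rule at the top of FORWARD is then all that needs
	// to survive interference from other agents.
	if err := ipt.Insert("filter", "FORWARD", 1,
		"-j", "SUBMARINER-FORWARD"); err != nil {
		log.Fatal(err)
	}
}
```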

return fmt.Errorf("error listing the rules in %s chain: %v", chain, err)
}

if strings.Contains(rules[position], strings.Join(ruleSpec, " ")) {
@mangelajo (Contributor):

If something external inserts another rule at this position, next time you won't see it, and we will have leftover rules.

You probably also need to scan the remaining rules to make sure the rule does not exist at other positions, and possibly remove those copies.

@sridhargaddam (Member, Author):

So far, based on my testing with a couple of CNIs, I've not seen that behavior, but I see your point. We can add those checks, but it could be additional overhead if CNIs do not behave like that.

What are your thoughts on adding a "TODO/REVISIT" comment saying that if any such behavior is observed with any CNI, we should add the necessary validation?

@mangelajo (Contributor):

I would do it preventively, as we won't ever have guarantees on how CNIs will behave. But not necessarily in this patch; maybe we can open an issue with medium priority so we don't forget? You could keep a link to this conversation/code.
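A possible shape for the preventive check suggested above; this insertUnique is a sketch under assumed semantics (substring matching against the listed rules, as in the PR's code), not the PR's actual implementation:

```go
package routeagent

import (
	"fmt"
	"strings"

	"github.com/coreos/go-iptables/iptables"
)

// insertUnique ensures ruleSpec exists exactly once, at the given position
// in the chain: any copies that drifted to other positions are removed
// first, then the rule is re-inserted where we want it.
func insertUnique(ipt *iptables.IPTables, table, chain string, position int, ruleSpec []string) error {
	rules, err := ipt.List(table, chain)
	if err != nil {
		return fmt.Errorf("error listing the rules in %s chain: %v", chain, err)
	}

	rule := strings.Join(ruleSpec, " ")

	// Fast path: present exactly once, at the desired position.
	count := 0
	for _, r := range rules {
		if strings.Contains(r, rule) {
			count++
		}
	}
	if count == 1 && len(rules) > position && strings.Contains(rules[position], rule) {
		return nil
	}

	// Remove every existing copy, then insert at the desired position.
	for i := 0; i < count; i++ {
		if err := ipt.Delete(table, chain, ruleSpec...); err != nil {
			return err
		}
	}
	return ipt.Insert(table, chain, position, ruleSpec...)
}
```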


link *net.Interface
vxlanDevice *vxLanIface
Contributor:

vxlanDevice is shared between the Pod and Endpoint processing and can thus be accessed by multiple threads, so it needs to be synchronized, or the same workqueue goroutine should be used for both Pods and Endpoints. I think the latter would be safest to avoid any potential concurrency issues (also w.r.t. remoteVTEPs and isGatewayNode).

Member (Author):

Once the vxlanDevice is created, it's only a reference to the Linux interface, and the vxlanDevice properties/values do not change over time. When it's accessed via vxlanDevice.AddFdb/DelFdb, we only use the index of the interface (as a read-only parameter) to program some bridge FDB entries on the host. There is no issue with this.

Contributor:

One potential issue is that the vxlanDevice variable is not synchronized, so it's possible, albeit unlikely, that the thread processing pods never sees the value for vxlanDevice written by the thread processing endpoints, due to CPU caching. To be completely safe, all accesses of vxlanDevice need to be protected by a mutex, or vxlanDevice needs to be an atomic reference. Similarly with isGatewayNode.

Another possible issue is that you set isGatewayNode before vxlanDevice; without synchronization, a concurrently running pod thread may observe the update to vxlanDevice but not isGatewayNode, which would lead to incorrect state (i.e., it may not call AddFDB when it should). There may be other concurrency issues lurking. Things might eventually converge due to the 30-second resync period, but it's very difficult, if not impossible, to think of and test all such scenarios and ensure correctness. So I think it's safest to take concurrency out of the equation altogether by using the same workqueue for both pods and endpoints rather than using shared memory.

Contributor:

Given that we can't enqueue removals, as mentioned in a prior comment, it wouldn't buy us anything to share the same workqueue, since we need shared memory anyway. However, I still think we should synchronize access to vxlanDevice and isGatewayNode atomically, to be safe.

Contributor:

I'm not sure why this conversation was marked as resolved. We still need to address synchronization of vxlanDevice and isGatewayNode as discussed above.

Member (Author):

Oops, it was accidentally marked as resolved.

Contributor:

Yeah @sridhargaddam, remember the thread we had about memory synchronization between CPU cores. If we don't use a mutex, we have no guarantee that vxlanDevice will be seen by threads other than the one that wrote it. Caches are small, so in practice it's very likely to work, but what happens when it hits a processor with a bigger cache?
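A minimal sketch of the synchronization being discussed; the type and method names here are assumptions, not the PR's exact code. Both fields are read and written under one mutex, so a reader can never observe one field updated without the other:

```go
package routeagent

import "sync"

// vxLanIface is elided here; in the PR it wraps the created VXLAN link.
type vxLanIface struct{}

type controller struct {
	mutex         sync.Mutex
	vxlanDevice   *vxLanIface
	isGatewayNode bool
}

// setEndpointState updates both fields under one lock so readers never
// see isGatewayNode and vxlanDevice out of sync with each other.
func (r *controller) setEndpointState(device *vxLanIface, isGateway bool) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	r.vxlanDevice = device
	r.isGatewayNode = isGateway
}

// endpointState returns a consistent snapshot for the pod-processing path.
func (r *controller) endpointState() (*vxLanIface, bool) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	return r.vxlanDevice, r.isGatewayNode
}
```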

pod := obj.(*k8sv1.Pod)

// Add the POD event to the workqueue only if the sm-route-agent podIP does not exist in the local cache.
if !r.remoteVTEPs.Contains(pod.Status.HostIP) {
Contributor:

I'd suggest doing this check in processNextPod for safety.

Member (Author):

The UpdateFunc handler invokes the registered callback every sync period (currently configured as 1 min) even if there is no change. If we simply add every event to the workqueue, in a large cluster I feel we will be adding unnecessary load. The main purpose of this event is just to figure out the host/pod IP; if the IP already exists in remoteVTEPs, we can avoid adding the event to the workqueue and reading/processing it.

Contributor:

Yeah, there can be hundreds of pods, so shortcutting should be fine. However, I'm wondering if we should only do this for an update...

Member (Author):

IMHO, I don't see any major advantage to that. I feel it's currently fine, and we can revisit this code if required.

Contributor:

The current code looks OK to me too in this sense.
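For reference, a sketch of the shortcut as discussed, with assumed wiring (the function, set type, and queue names are illustrative): an event is enqueued only when the pod's host IP is not already a known VTEP, so the periodic resync of unchanged pods does not load the workqueue.

```go
package routeagent

import (
	k8sv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// stringSet is a stand-in for the project's set type (remoteVTEPs).
type stringSet interface{ Contains(string) bool }

// addPodEventHandlers wires the shortcut: a pod event is enqueued only if
// its host IP is not already cached as a remote VTEP.
func addPodEventHandlers(informer cache.SharedIndexInformer, vteps stringSet, queue workqueue.Interface) {
	enqueue := func(obj interface{}) {
		if pod, ok := obj.(*k8sv1.Pod); ok && !vteps.Contains(pod.Status.HostIP) {
			queue.Add(pod)
		}
	}
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) },
	})
}
```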

@mangelajo (Contributor) commented:

@sridhargaddam, this fails before e2e is run (now that we have the right helm charts), because of this: https://travis-ci.com/submariner-io/submariner/builds/126219311#L1244

pkg/routeagent/controllers/route/route.go:313:19: S1005: should omit value from range; this loop is equivalent to for fdbAddress := range ... (gosimple)

	for fdbAddress, _ := range r.remoteVTEPs.Set {

	                ^

pkg/routeagent/controllers/route/route.go:621:17: S1005: should omit value from range; this loop is equivalent to for cidrBlock := range ... (gosimple)

for cidrBlock, _ := range r.remoteSubnets.Set {

               ^

[submariner]$ ./scripts/ci keep 1.14.2 false false

@sridhargaddam (Member, Author) replied:
Thank you, @mangelajo. I'll update the code.
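For reference, the gosimple S1005 fix is just dropping the ignored value from each range clause:

```go
// Before (flagged by gosimple S1005):
for fdbAddress, _ := range r.remoteVTEPs.Set {
	// ...
}

// After: the blank identifier is redundant when only keys are needed.
for fdbAddress := range r.remoteVTEPs.Set {
	// ...
}
```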







// On the GatewayDevice, update the vxlan fdb entry (i.e., remote Vtep) for the newly added node.
if r.isGatewayNode {
if r.vxlanDevice != nil {
err := r.vxlanDevice.AddFDB(net.ParseIP(pod.Status.PodIP), "00:00:00:00:00:00")
Contributor:

It's possible r.vxlanDevice could be nil right after the check on line 383. To avoid this, you should first store r.vxlanDevice in a local var, or eliminate concurrency as I suggested in another comment.

Member (Author):

Why would vxlanDevice become nil? Once we create a VxLAN interface on the host, we will not delete it. Am I missing something?

Contributor:

In handleRemovedEndpoint, you set r.vxlanDevice to nil. Let's say thread 1 executes line 394 here and observes r.vxlanDevice non-nil. Thread 2 then interleaves and executes line 566 in handleRemovedEndpoint, setting it to nil. Thread 1 resumes and executes line 395, but now r.vxlanDevice is nil. Doing the following would alleviate that potential issue:

localVxlanDevice := r.vxlanDevice
if localVxlanDevice != nil {
    localVxlanDevice.AddFDB(...)
}

Contributor:

However, there's still the issue of synchronizing updates to r.vxlanDevice and r.isGatewayNode across threads; for that you need a sync.Mutex.

Member (Author):

Thanks for the explanation. AFAIU, the chances of the endpoint being removed are very low (in general). Anyway, let's protect it with a mutex, to be safe. Please take a look at the updated code.
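Combining the two suggestions, a sketch of the guarded pod-processing path; endpointState is the hypothetical snapshot accessor from the mutex sketch earlier, and the error handling is illustrative:

```go
// Snapshot both fields atomically, then operate on the local copy so a
// concurrent handleRemovedEndpoint cannot nil out the device mid-use.
localVxlanDevice, isGateway := r.endpointState()
if isGateway && localVxlanDevice != nil {
	if err := localVxlanDevice.AddFDB(net.ParseIP(pod.Status.PodIP), "00:00:00:00:00:00"); err != nil {
		klog.Errorf("failed to add FDB entry for pod %s: %v", pod.Name, err)
	}
}
```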

mangelajo added a commit to mangelajo/submariner that referenced this pull request Sep 9, 2019
Adds another E2E test to ensure that source IP address is preserved
on connections.

Depends-On: submariner-io#140

Commit: “Enhance Submariner to use VxLAN Overlay Tunnels for inter-cluster traffic” (the commit message matches the PR description above).
Follow-up commits:
1. Modified the vxlan interface to vx-submariner
2. Error handling in iptables.go and route.go files
3. Endpoint delete event
Mostly formatting fixes, passing args and error handling.
@sridhargaddam (Member, Author) left a comment:

Thanks for reviewing, @tpantelis.



@tpantelis previously approved these changes Sep 13, 2019
@sridhargaddam (Member, Author) commented:

Thanks very much @tpantelis for reviewing the patch and approving it. I had to update the patch to fix golangci-lint errors.

@mangelajo, can you please take a look when you find time. Thanks.

@mangelajo (Contributor) left a comment:

Oh, I see you already handled the mutex concerns. Awesome, merging.

@mangelajo merged commit 4b7def5 into submariner-io:master on Sep 16, 2019