
Use VxLAN overlay tunnels for inter-cluster traffic #140

Merged (9 commits) on Sep 16, 2019

Conversation

@sridhargaddam (Member) commented on Sep 2, 2019:

As part of supporting network policies, and for ease of debugging, this patch implements the following:

  1. Creates VxLAN tunnels in the local cluster between the worker nodes and
     the cluster Gateway Node.
  2. Programs the necessary iptables rules on the cluster nodes to allow
     inter-cluster traffic.
  3. Avoids SNAT/MASQUERADE for inter-cluster traffic, thereby preserving the
     original source IP of the pod all the way to the destination pod.
  4. Programs the routing rules on the worker nodes to forward remote-cluster
     traffic over the VxLAN interface created between the worker node and the
     cluster Gateway Node (see the sketch after the dependency list below).

This patch depends on the following other patches:

Depends-On: #135
Depends-On: submariner-io/submariner-charts#3
Depends-On: submariner-io/submariner-charts#4
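For illustration, a minimal sketch of what steps 1 and 4 amount to on a worker node, using the vishvananda/netlink package; the interface name, VNI, UDP port, and the VTEP/CIDR addresses below are assumptions for the example, not the values hard-coded in this patch:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// Create the VXLAN tunnel interface towards the cluster Gateway Node.
	// Name, VNI and UDP port here are illustrative assumptions.
	vxlan := &netlink.Vxlan{
		LinkAttrs: netlink.LinkAttrs{Name: "vx-submariner"},
		VxlanId:   100,
		Port:      4800,
		SrcAddr:   net.ParseIP("10.0.0.2"), // this worker node's underlay IP
	}
	if err := netlink.LinkAdd(vxlan); err != nil {
		log.Fatalf("failed to create VXLAN interface: %v", err)
	}
	if err := netlink.LinkSetUp(vxlan); err != nil {
		log.Fatalf("failed to bring up VXLAN interface: %v", err)
	}

	// Forward remote-cluster traffic over the tunnel: route an assumed
	// remote cluster CIDR via the Gateway Node's (assumed) VTEP address.
	_, remoteCIDR, _ := net.ParseCIDR("10.245.0.0/16")
	route := &netlink.Route{
		LinkIndex: vxlan.Attrs().Index,
		Dst:       remoteCIDR,
		Gw:        net.ParseIP("240.0.0.1"), // assumed gateway VTEP
	}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatalf("failed to add route: %v", err)
	}
}
```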

@sridhargaddam (Member, Author) commented:

@mangelajo, @skitt, @tpantelis, @mpeterson, can you please review the patch? Thank you.

@skitt (Member) left a comment:

I haven't finished my review yet; I'm pushing my initial comments.

(Please also shorten the commit message; since this is Submariner, “Use VxLAN overlay tunnels for inter-cluster traffic” works fine.)

@sridhargaddam changed the title from “Enhance Submariner to use VxLAN Overlay Tunnels for inter-cluster tra…” to “Use VxLAN overlay tunnels for inter-cluster traffic” on Sep 3, 2019
@mangelajo (Contributor) left a comment:

Still reviewing; some basic comments while I finish.

@sridhargaddam (Member, Author) left a comment:

Thanks for reviewing

@mangelajo (Contributor) left a comment:

Still haven't finished, but more comments.


klog.V(4).Infof("Insert rule to allow traffic over %s interface in FORWARDing Chain", VXLAN_IFACE)
ruleSpec = []string{"-o", VXLAN_IFACE, "-j", "ACCEPT"}
if err = r.insertUnique(ipt, "filter", "FORWARD", 1, ruleSpec); err != nil {
@mangelajo (Contributor):

I suspect inserting/expecting rules at specific positions could be problematic when CNIs also interact with those rules. Do we have alternatives to this? Maybe we could create our own chain and handle the ordering inside it; then we only need to ensure a single rule in FORWARD/INPUT/etc. that calls our chain.

@sridhargaddam (Member, Author):

We can create our own chain, but the requirement here is to prepend the rule at the beginning of the FORWARD chain so that Submariner's behavior is preserved and SDN rules do not modify the behavior of the traffic.

Also, this is a single rule, as shown below, and I'm not sure creating a new chain and adding this rule to it would make any difference. Agreed that if we have multiple rules, it's always a good idea to create a chain and program the rules there (this is exactly what we are doing with the SUBMARINER-POSTROUTING chain).

iptables rule: iptables -I FORWARD -o vxlan100 -j ACCEPT
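For context, a rough sketch of the dedicated-chain alternative discussed above, using the github.com/coreos/go-iptables package; whether to adopt this was left open in the thread, and the chain name SUBMARINER-FORWARD is hypothetical:

```go
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical Submariner-owned chain; the ordering of rules inside
	// it is fully under our control, regardless of what CNIs do.
	if err := ipt.NewChain("filter", "SUBMARINER-FORWARD"); err != nil {
		log.Printf("chain may already exist: %v", err)
	}
	if err := ipt.AppendUnique("filter", "SUBMARINER-FORWARD",
		"-o", "vx-submariner", "-j", "ACCEPT"); err != nil {
		log.Fatal(err)
	}

	// A single jump rule at the top of FORWARD is then all that needs
	// to survive interference from other agents.
	if err := ipt.Insert("filter", "FORWARD", 1,
		"-j", "SUBMARINER-FORWARD"); err != nil {
		log.Fatal(err)
	}
}
```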

return fmt.Errorf("error listing the rules in %s chain: %v", chain, err)
}

if strings.Contains(rules[position], strings.Join(ruleSpec, " ")) {
@mangelajo (Contributor):

If something external inserts another rule at this position, next time you won't see it, and we will have leftover rules.

You probably also need to scan the remaining rules to make sure the rule does not exist at other positions, and possibly remove those copies.

@sridhargaddam (Member, Author):

So far, based on my testing with a couple of CNIs, I've not seen that behavior, but I see your point. We can add those checks, but it could be additional overhead if CNIs do not behave like that.

What are your thoughts on adding a "TODO/REVISIT" comment saying that if any such behavior is observed with any CNI, we should add the necessary validation?

@mangelajo (Contributor):

I would do it preventively, as we won't ever have guarantees on how CNIs will behave. But not necessarily in this patch; maybe we can open an issue with medium priority so we don't forget? You could keep a link to this conversation/code.
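A possible shape for the preventive check suggested above; this insertUnique is a sketch under assumed semantics (substring matching against the listed rules, as in the PR's code), not the PR's actual implementation:

```go
package routeagent

import (
	"fmt"
	"strings"

	"github.com/coreos/go-iptables/iptables"
)

// insertUnique ensures ruleSpec exists exactly once, at the given position
// in the chain: any copies that drifted to other positions are removed
// first, then the rule is re-inserted where we want it.
func insertUnique(ipt *iptables.IPTables, table, chain string, position int, ruleSpec []string) error {
	rules, err := ipt.List(table, chain)
	if err != nil {
		return fmt.Errorf("error listing the rules in %s chain: %v", chain, err)
	}

	rule := strings.Join(ruleSpec, " ")

	// Fast path: present exactly once, at the desired position.
	count := 0
	for _, r := range rules {
		if strings.Contains(r, rule) {
			count++
		}
	}
	if count == 1 && len(rules) > position && strings.Contains(rules[position], rule) {
		return nil
	}

	// Remove every existing copy, then insert at the desired position.
	for i := 0; i < count; i++ {
		if err := ipt.Delete(table, chain, ruleSpec...); err != nil {
			return err
		}
	}
	return ipt.Insert(table, chain, position, ruleSpec...)
}
```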


link *net.Interface
vxlanDevice *vxLanIface
Contributor:

vxlanDevice is shared between the Pod and Endpoint processing and can thus be accessed by multiple threads, so it needs to be synchronized, or the same workqueue goroutine should be used for both Pods and Endpoints. I think the latter would be safest to avoid any potential concurrency issues (also w.r.t. remoteVTEPs and isGatewayNode).

Member (Author):

Once the vxlanDevice is created, it's only a reference to the Linux interface, and the vxlanDevice properties/values do not change over time. When it's accessed via vxlanDevice.AddFdb/DelFdb, we only use the index of the interface (as a read-only parameter) to program some bridge FDB entries on the host. There is no issue with this.

Contributor:

One potential issue is that the vxlanDevice variable is not synchronized, so it's possible, albeit unlikely, that the thread processing pods never sees the value for vxlanDevice written by the thread processing endpoints, due to CPU caching. To be completely safe, all accesses of vxlanDevice need to be protected by a mutex, or vxlanDevice needs to be an atomic reference. Similarly with isGatewayNode.

Another possible issue is that you set isGatewayNode before vxlanDevice; without synchronization, a concurrently running pod thread may observe the update to vxlanDevice but not isGatewayNode, which would lead to incorrect state (i.e., it may not call AddFDB when it should). There may be other concurrency issues lurking. Things might eventually converge due to the 30-second resync period, but it's very difficult, if not impossible, to think of and test all such scenarios and ensure correctness. So I think it's safest to take concurrency out of the equation altogether by using the same workqueue for both pods and endpoints rather than using shared memory.

Contributor:

Given that we can't enqueue removals, as mentioned in a prior comment, it wouldn't buy us anything to share the same workqueue, since we need shared memory anyway. However, I still think we should synchronize access to vxlanDevice and isGatewayNode atomically, to be safe.

Contributor:

I'm not sure why this conversation was marked as resolved. We still need to address synchronization of vxlanDevice and isGatewayNode as discussed above.

Member (Author):

Oops, it was accidentally marked as resolved.

Contributor:

Yeah @sridhargaddam, remember the thread we had about memory synchronization between CPU cores. If we don't use a mutex, we have no guarantee that vxlanDevice will be seen by threads other than the one that wrote it. Caches are small, so in practice it's very likely to work, but what happens when it hits a processor with a bigger cache?
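A minimal sketch of the synchronization being discussed; the type and method names here are assumptions, not the PR's exact code. Both fields are read and written under one mutex, so a reader can never observe one field updated without the other:

```go
package routeagent

import "sync"

// vxLanIface is elided here; in the PR it wraps the created VXLAN link.
type vxLanIface struct{}

type controller struct {
	mutex         sync.Mutex
	vxlanDevice   *vxLanIface
	isGatewayNode bool
}

// setEndpointState updates both fields under one lock so readers never
// see isGatewayNode and vxlanDevice out of sync with each other.
func (r *controller) setEndpointState(device *vxLanIface, isGateway bool) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	r.vxlanDevice = device
	r.isGatewayNode = isGateway
}

// endpointState returns a consistent snapshot for the pod-processing path.
func (r *controller) endpointState() (*vxLanIface, bool) {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	return r.vxlanDevice, r.isGatewayNode
}
```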

pod := obj.(*k8sv1.Pod)

// Add the POD event to the workqueue only if the sm-route-agent podIP does not exist in the local cache.
if !r.remoteVTEPs.Contains(pod.Status.HostIP) {
Contributor:

I'd suggest doing this check in processNextPod for safety.

Member (Author):

The UpdateFunc handler invokes the registered callback every sync period (currently configured as 1 min) even if there is no change. If we simply add every event to the workqueue, in a large cluster I feel we will be adding unnecessary load. The main purpose of this event is just to figure out the host/pod IP; if the IP already exists in remoteVTEPs, we can avoid adding the event to the workqueue and reading/processing it.

Contributor:

Yeah, there can be hundreds of pods, so shortcutting should be fine. However, I'm wondering if we should only do this for an update...

Member (Author):

IMHO, I don't see any major advantage to that. I feel it's currently fine, and we can revisit this code if required.

Contributor:

The current code looks OK to me too in this sense.
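For reference, a sketch of the shortcut as discussed, with assumed wiring (the function, set type, and queue names are illustrative): an event is enqueued only when the pod's host IP is not already a known VTEP, so the periodic resync of unchanged pods does not load the workqueue.

```go
package routeagent

import (
	k8sv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// stringSet is a stand-in for the project's set type (remoteVTEPs).
type stringSet interface{ Contains(string) bool }

// addPodEventHandlers wires the shortcut: a pod event is enqueued only if
// its host IP is not already cached as a remote VTEP.
func addPodEventHandlers(informer cache.SharedIndexInformer, vteps stringSet, queue workqueue.Interface) {
	enqueue := func(obj interface{}) {
		if pod, ok := obj.(*k8sv1.Pod); ok && !vteps.Contains(pod.Status.HostIP) {
			queue.Add(pod)
		}
	}
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) },
	})
}
```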

@mangelajo (Contributor) commented:

@sridhargaddam, this fails before e2e is run (now that we have the right helm charts), because of this: https://travis-ci.com/submariner-io/submariner/builds/126219311#L1244

pkg/routeagent/controllers/route/route.go:313:19: S1005: should omit value from range; this loop is equivalent to for fdbAddress := range ... (gosimple)

	for fdbAddress, _ := range r.remoteVTEPs.Set {

	                ^

pkg/routeagent/controllers/route/route.go:621:17: S1005: should omit value from range; this loop is equivalent to for cidrBlock := range ... (gosimple)

for cidrBlock, _ := range r.remoteSubnets.Set {

               ^

[submariner]$ ./scripts/ci keep 1.14.2 false false

@sridhargaddam (Member, Author) replied:
Thank you, @mangelajo. I'll update the code.
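For reference, the gosimple S1005 fix is just dropping the ignored value from each range clause:

```go
// Before (flagged by gosimple S1005):
for fdbAddress, _ := range r.remoteVTEPs.Set {
	// ...
}

// After: the blank identifier is redundant when only keys are needed.
for fdbAddress := range r.remoteVTEPs.Set {
	// ...
}
```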







// On the GatewayDevice, update the vxlan fdb entry (i.e., remote Vtep) for the newly added node.
if r.isGatewayNode {
if r.vxlanDevice != nil {
err := r.vxlanDevice.AddFDB(net.ParseIP(pod.Status.PodIP), "00:00:00:00:00:00")
Contributor:

It's possible r.vxlanDevice could be nil right after the check on line 383. To avoid this, you should first store r.vxlanDevice in a local var, or eliminate concurrency as I suggested in another comment.

Member (Author):

Why would vxlanDevice become nil? Once we create a VxLAN interface on the host, we will not delete it. Am I missing something?

Contributor:

In handleRemovedEndpoint, you set r.vxlanDevice to nil. Let's say thread 1 executes line 394 here and observes r.vxlanDevice non-nil. Thread 2 then interleaves and executes line 566 in handleRemovedEndpoint, setting it to nil. Thread 1 resumes and executes line 395, but now r.vxlanDevice is nil. Doing the following would alleviate that potential issue:

localVxlanDevice := r.vxlanDevice
if localVxlanDevice != nil {
    localVxlanDevice.AddFDB(...)
}

Contributor:

However, there's still the issue of synchronizing updates to r.vxlanDevice and r.isGatewayNode across threads; for that you need a sync.Mutex.

Member (Author):

Thanks for the explanation. AFAIU, the chances of the endpoint being removed are very low (in general). Anyway, let's protect it with a mutex, to be safe. Please take a look at the updated code.
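Combining the two suggestions, a sketch of the guarded pod-processing path; endpointState is the hypothetical snapshot accessor from the mutex sketch earlier, and the error handling is illustrative:

```go
// Snapshot both fields atomically, then operate on the local copy so a
// concurrent handleRemovedEndpoint cannot nil out the device mid-use.
localVxlanDevice, isGateway := r.endpointState()
if isGateway && localVxlanDevice != nil {
	if err := localVxlanDevice.AddFDB(net.ParseIP(pod.Status.PodIP), "00:00:00:00:00:00"); err != nil {
		klog.Errorf("failed to add FDB entry for pod %s: %v", pod.Name, err)
	}
}
```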

mangelajo added a commit to mangelajo/submariner that referenced this pull request Sep 9, 2019
Adds another E2E test to ensure that source IP address is preserved
on connections.

Depends-On: submariner-io#140

Commit: “Enhance Submariner to use VxLAN Overlay Tunnels for inter-cluster traffic” (the commit message matches the PR description above).
Follow-up commits:
1. Modified the vxlan interface to vx-submariner
2. Error handling in iptables.go and route.go files
3. Endpoint delete event
Mostly formatting fixes, passing args and error handling.
@sridhargaddam (Member, Author) left a comment:

Thanks for reviewing, @tpantelis.



@tpantelis previously approved these changes Sep 13, 2019
@sridhargaddam (Member, Author) commented:

Thanks very much @tpantelis for reviewing the patch and approving it. I had to update the patch to fix golangci-lint errors.

@mangelajo, can you please take a look when you find time. Thanks.

@mangelajo (Contributor) left a comment:

Oh, I see you already handled the mutex concerns. Awesome, merging.

@mangelajo merged commit 4b7def5 into submariner-io:master on Sep 16, 2019