Investigate why forwarder could restart too long #664

Open
denis-tingaikin opened this issue Jun 28, 2022 · 50 comments

@denis-tingaikin
Member

Steps to reproduce

Repeat steps from https://drive.google.com/file/d/1Pwd7T6iYPItXEqjx_Yd5NfPE-T0zgUy5

@denis-tingaikin denis-tingaikin added enhancement New feature or request performance The problem related to system effectivity labels Jun 28, 2022
@denis-tingaikin denis-tingaikin added this to the v1.5.0 milestone Jun 28, 2022
@glazychev-art
Contributor

Most likely, this behavior can be explained, and there are actually no outliers here.
As I understand it, the concern is page 11 of the document, "Upgrade from NSM v1.2.0 to v1.4.0", and specifically forwarder-vpp-6t7x7.

The forwarder is in a ready state only when it can serve gRPC clients. The forwarder has this readinessProbe:
command: ["/bin/grpc-health-probe", "-spiffe", "-addr=unix:///listen.on.sock"]
But before that, it needs to get certificates from the spire-agent, which usually takes about 3-5 seconds.

On page 11 we can see that the NSM upgrade happens gradually, including the spire-agents.
But what if the forwarder was upgraded before the spire-agent? Let's take a look:
Screenshot from 2022-07-04 14-49-24_2

spire-agent-gqgr2 and the forwarder-vpp-6t7x7 may be on the same node.
I think the difference between the readiness of pods (about 3 seconds) is understandable (besides, we use periodSeconds: 1)

@szvincze

szvincze commented Oct 4, 2023

Coming back to this topic again after a long time...

Since the presentation, several changes in our configuration have made the upgrade faster. But the question about the forwarder startup time, or rather the traffic outage period, is still valid.

In the worker node reboot and forwarder-vpp pod restart scenarios, we observe quite large differences in how long the traffic is interrupted.
I checked the startup time of the forwarder-vpp pods after a worker node reboot and after a forwarder-vpp pod restart. The forwarder's startup phases completed within 4-5 seconds, which seems normal.
In a small and simple configuration it takes a few additional seconds before traffic is handled again. But sometimes it takes just 10 seconds from the pod restart, sometimes ~20 seconds, and sometimes even more than 30 seconds.

We also measured it in a bigger environment where we got various figures between 60 and 120 seconds or even more.

Is there a theory that could explain these huge differences?

@LionelJouin LionelJouin moved this to 🏗 In progress in Meridio Oct 4, 2023
@glazychev-art glazychev-art moved this from Todo to Blocked in Release v1.12.0 Oct 24, 2023
@szvincze

I managed to reproduce the issue: when we restart a forwarder, the traffic recovers only a few minutes later. After some investigation it turned out that when the forwarder restarted, only one of the NSM interfaces was re-created in the NSC pod, so the traffic failed for the affected network service.

The setup is based on the NSM v1.11.1 basic kernel to ethernet to kernel example. I used a kind cluster with 4 worker nodes.
I deployed 4 NSCs, one on each node, and 2 NSEs (one for IPv4 and one for IPv6); their location does not matter.
I added a container that includes the ctraffic tool to the NSC and NSE pods; for this purpose I used the registry.nordix.org/cloud-native/nsm/rvm-tester:ctraffic image.

Then I started two ctraffic servers in the NSC pods.
For example nsc-kernel-5f485bf7c-8gz6c:

ctraffic -server -address 172.16.1.1:5003&
ctraffic -server -address [100:100::1]:5003&

On the NSEs I ran ctraffic as clients for the matching servers on the NSCs, like below:

kubectl exec pods/nse-kernel4-5dd67f7f46-pjrn8 -n ns-kernel2ethernet2kernel -c ctraffic -- ctraffic -address 172.16.1.1:5003 -monitor -nconn 100 -rate 50 -timeout 600s

These clients send traffic to the addressed server over 100 connections at a rate of 50 KB/s for 10 minutes. (Sometimes the ctraffic client reaches the reconnect limit [1000] and needs to be restarted.)

When all the instances are running and connected properly, and the traffic is steady (no failing or connecting state, and no drops), the tool shows output similar to this:

Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25519/24819/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25567/24867/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25619/24919/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25667/24967/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25719/25019/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25767/25067/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25819/25119/0
Conn act/fail/connecting: 100/0/0, Packets send/rec/dropped: 25867/25167/0

Then I started deleting the forwarder pods one by one in a cycle. I always waited until the traffic was steady again after the new forwarder started up, then deleted the next one and checked whether the traffic recovered, like below:

Conn act/fail/connecting: 59/541/41, Packets send/rec/dropped: 8405/7806/0
Conn act/fail/connecting: 1/599/99, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 0/600/100, Packets send/rec/dropped: 8406/7806/0
Conn act/fail/connecting: 100/600/0, Packets send/rec/dropped: 8426/7826/0
Conn act/fail/connecting: 100/600/0, Packets send/rec/dropped: 8472/7872/0
Conn act/fail/connecting: 100/600/0, Packets send/rec/dropped: 8526/7926/0
Conn act/fail/connecting: 100/600/0, Packets send/rec/dropped: 8572/7972/0
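
For anyone repeating this measurement, the outage length can be read off the monitor output mechanically. The following is a minimal Python sketch, not part of ctraffic or NSM: the function names are made up for illustration, and `interval_s` (the time between two monitor lines) is an assumption that has to match how the client was started.

```python
import re

# Matches one line of the ctraffic monitor output shown above.
LINE_RE = re.compile(
    r"Conn act/fail/connecting: (\d+)/(\d+)/(\d+), "
    r"Packets send/rec/dropped: (\d+)/(\d+)/(\d+)"
)


def parse_monitor(text):
    """Return one (active, failed, connecting, sent, received, dropped) tuple per line."""
    return [tuple(int(g) for g in m.groups()) for m in LINE_RE.finditer(text)]


def longest_outage(samples, interval_s=1.0):
    """Longest run of consecutive samples with zero active connections, in seconds.

    interval_s is an assumption: the time between two monitor lines.
    """
    longest = current = 0
    for active, *_ in samples:
        current = current + 1 if active == 0 else 0
        longest = max(longest, current)
    return longest * interval_s
```

Feeding the captured client output to `parse_monitor` and then to `longest_outage` gives a comparable number for every forwarder restart in the cycle.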

After some iterations, the traffic to one of the NSCs reconnected only after a few minutes.
When I checked the NSM interfaces in the NSC, I realized that the one for IPv6 traffic was missing. In another successful reproduction attempt the same happened, but with the interface for the IPv4 service. After a few minutes the interface was recreated and the traffic connected again.

I tried to reproduce it with fewer nodes, fewer NSEs, and with ping instead of ctraffic, without success; however, there are most probably other ways to catch the same issue.

I collected some logs, but unfortunately the others had already been rotated.

@denis-tingaikin: Can you please check if there could be a fault in the healing mechanism that causes this behavior?

@denis-tingaikin denis-tingaikin added the bug Something isn't working label Nov 22, 2023
@denis-tingaikin denis-tingaikin moved this from Blocked to In Progress in Release v1.12.0 Nov 22, 2023
@glazychev-art
Contributor

@szvincze
I looked at the logs.
Question: is NSM_LIVENESSCHECK enabled for NSC?

@szvincze

@szvincze I looked at the logs. Question: is NSM_LIVENESSCHECK enabled for NSC?

No, it is disabled in this case.

@glazychev-art
Contributor

@szvincze
We found and fixed one potential problem. To test it you should use the latest cmd-nsc image:
ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4

Most likely for the problem you described this will be enough.
If the problem still exists, then you can also try updating NSMgr and Forwarder-vpp
(ghcr.io/networkservicemesh/ci/cmd-nsmgr:976d9e6
ghcr.io/networkservicemesh/ci/cmd-forwarder-vpp:fe97eed)

@szvincze

@glazychev-art: Our test results show that the fix works fine.
Do you think the additional NSMgr and forwarder-vpp fixes are also needed for some reason? What do they improve?

@glazychev-art
Contributor

Cool, thanks!
I think yes. They improve the transmission of the initial monitor event from component to component.

@szvincze

szvincze commented Dec 3, 2023

Could there be a backward compatibility issue if the forwarder-vpp and nsmgr contain this update but the NSC does not?

@denis-tingaikin
Member Author

I quickly checked and found that we missed the commit with the fix in the release candidate v1.11.2-rc.1 networkservicemesh/sdk@release/v1.11.1...release/v1.11.2-rc.1

The good news is that it was just a release candidate. 

I'll create v1.11.2-rc.2 in a few hours with the missed fix.

@denis-tingaikin
Member Author

@szvincze

szvincze commented Dec 3, 2023

It was really quick, thanks!

My original question is still valid.
Does it cause any problem if the NSC is upgraded after the NSM system components?

@glazychev-art
Contributor

There will be no backward compatibility issues, but the original problem will not be fixed: the connection will still take time to recover.

@szvincze

szvincze commented Dec 4, 2023

Thanks. It was clear that to fix the original problem we need the new NSC but it is important to keep backward compatibility.

@szvincze

szvincze commented Dec 4, 2023

Hi @glazychev-art & @denis-tingaikin,

It seems that the problem still occurs with rc.2. However the previous test image (ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4) solved the issue.

Can you please double-check what could cause this behavior with rc.2?

@glazychev-art
Contributor

Hi @szvincze,
It would be cool if you could attach the logs, because rc2 has the required fix, as far as I can see.

@denis-tingaikin
Member Author

denis-tingaikin commented Dec 5, 2023

Hello @szvincze 

If ghcr.io/networkservicemesh/ci/cmd-nsc:a99d3e4 solves the problem, then rc.1 contains a needed fix.
The problem on our side is that we can't test it.

Do you have results for v1.11.2-rc.1?

@denis-tingaikin
Member Author

denis-tingaikin commented Dec 17, 2023

As I can see from the logs, we have a dial-related problem with the registry:

 failed to dial nsm-registry-svc:5002: context deadline exceeded

I'd like to suggest re-testing rc.3 with two options:

  1. Increase NSM_DIAL_TIMEOUT from 100ms to 300ms for nsmgr.
  2. Increase the upper bound of the nsmgr CPU limits by 1.25x.
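
To get a rough feeling for whether a 100ms budget is tight in a given environment, one can time a plain connect against the registry address. The sketch below is only an illustration in Python: it uses a bare TCP connect rather than the gRPC dial that nsmgr performs, and the function names are hypothetical.

```python
import socket


def can_dial(host, port, timeout_s):
    """Return True if a TCP connection is established within timeout_s seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers timeouts, refused connections and DNS failures
        return False


def first_working_timeout(host, port, candidates_s=(0.1, 0.3)):
    """Try the candidate timeouts in order and return the first one that works.

    The defaults mirror the 100ms and 300ms values discussed above.
    Returns None if no candidate is long enough.
    """
    for timeout_s in candidates_s:
        if can_dial(host, port, timeout_s):
            return timeout_s
    return None
```

Run from inside the nsmgr pod's network namespace, `first_working_timeout("nsm-registry-svc", 5002)` would hint at whether the smaller value is already too short, although TLS and gRPC setup add time that this sketch does not measure.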

In parallel, I'll try to reproduce the problem and make sure that we don't have other issues that may produce the problem with dial.

@denis-tingaikin
Member Author

FYI: Also, if you are restarting the node that contains spire-server, please use this patch to avoid problems with dialling to the registry: networkservicemesh/deployments-k8s#10287

@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.12.0 Dec 21, 2023
@denis-tingaikin denis-tingaikin moved this from Blocked to Moved to next release in Release v1.12.0 Jan 30, 2024
@denis-tingaikin
Member Author

Current status:

  1. Based on the latest testing results, we considered adding a retry for failed refresh requests: [R&D] Improve the recovery of NSM clients sdk#1565.
  2. We're waiting for logs from running v1.12.0-rc.1 and v1.12.0-rc.2.
  3. As soon as we complete item 1, we will schedule a new release, v1.12.1-rc.1.
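
The retry idea from item 1 can be pictured roughly as follows. This is a generic Python sketch of retry with exponential backoff, not the change tracked in sdk#1565: `refresh_with_retry`, its parameters and the backoff policy are all assumptions made for illustration.

```python
import time


def refresh_with_retry(refresh, attempts=5, base_delay_s=0.2, sleep=time.sleep):
    """Call refresh() until it succeeds or the attempts are exhausted.

    The delay doubles after every failure. The last error is re-raised.
    All names and defaults here are hypothetical.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay_s
    for attempt in range(1, attempts + 1):
        try:
            return refresh()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

The `sleep` parameter is injectable so that the backoff schedule can be checked without actually waiting.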

/cc @edwarnicke , @szvincze

@szvincze

Hi @denis-tingaikin,

  1. We're waiting for logs from running v1.12.0-rc.1 and v1.12.0-rc.2.

Please find the logs for v1.12.0-rc.1.

Some background about what happened in this test.

Traffic started on the problematic connection at 2024-01-11T09:26:37.541, then forwarder-vpp-cpqz7 was deleted at 2024-01-11T09:26:40.600. The new forwarder was ready at 2024-01-11T09:26:49.

After the forwarder was deleted, this connection did not get packets for almost 10 minutes.
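
For reference, the gaps between these timestamps can be computed as below. This is a small Python sketch using only the three timestamps quoted above; the helper name is made up.

```python
from datetime import datetime


def seconds_between(start, end):
    """Difference between two ISO 8601 timestamps, in seconds."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds()


traffic_started = "2024-01-11T09:26:37.541"
forwarder_deleted = "2024-01-11T09:26:40.600"
forwarder_ready = "2024-01-11T09:26:49"

# Time during which no forwarder was available on the node.
forwarder_gap_s = seconds_between(forwarder_deleted, forwarder_ready)
print(f"forwarder replaced after {forwarder_gap_s:.1f} s")  # 8.4 s
```

So the forwarder itself was back after roughly 8.4 seconds, which is small compared with the almost 10 minutes during which the connection carried no packets.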

Additional information on the problematic NSE.

[2024-01-11T09:26:34.518Z] DEBUG: Interfaces in nse-ipv4-c47b5df69-hv56b:
[2024-01-11T09:26:34.518Z] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2024-01-11T09:26:34.518Z]     inet 127.0.0.1/8 scope host lo
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 ::1/128 scope host 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 3: eth0@if518: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default 
[2024-01-11T09:26:34.518Z]     link/ether 12:0d:9b:ef:79:cb brd ff:ff:ff:ff:ff:ff link-netnsid 0
[2024-01-11T09:26:34.518Z]     inet 192.168.127.208/32 scope global eth0
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fd00:eccd:18:ffff:a7ff:3fda:6f8a:7fd0/128 scope global 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::100d:9bff:feef:79cb/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 4: icmp-respo-58d2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/ether 02:fe:6a:2d:70:b4 brd ff:ff:ff:ff:ff:ff
[2024-01-11T09:26:34.518Z]     inet 172.16.1.96/32 scope global icmp-respo-58d2
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::fe:6aff:fe2d:70b4/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 5: icmp-respo-8db4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/ether 02:fe:e2:42:cf:42 brd ff:ff:ff:ff:ff:ff
[2024-01-11T09:26:34.518Z]     inet 172.16.1.98/32 scope global icmp-respo-8db4
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::fe:e2ff:fe42:cf42/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 6: icmp-respo-71f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/ether 02:fe:d0:e0:7a:9d brd ff:ff:ff:ff:ff:ff
[2024-01-11T09:26:34.518Z]     inet 172.16.1.100/32 scope global icmp-respo-71f0
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::fe:d0ff:fee0:7a9d/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z] 7: icmp-respo-47c5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
[2024-01-11T09:26:34.518Z]     link/ether 02:fe:12:32:a5:af brd ff:ff:ff:ff:ff:ff
[2024-01-11T09:26:34.518Z]     inet 172.16.1.102/32 scope global icmp-respo-47c5
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever
[2024-01-11T09:26:34.518Z]     inet6 fe80::fe:12ff:fe32:a5af/64 scope link 
[2024-01-11T09:26:34.518Z]        valid_lft forever preferred_lft forever

The problematic connection is the one to the traffic server address [100:100::1]:5003 in the NSC.

@glazychev-art
Contributor

Hi @szvincze
Could you please clarify whether the problem was with IPv4 or IPv6?
I'm asking because you attached the output from nse-ipv4-c47b5df69-hv56b but then mentioned 100:100::1.

@szvincze

szvincze commented Feb 2, 2024

Could you please clarify whether the problem was with IPv4 or IPv6? I'm asking because you attached the output from nse-ipv4-c47b5df69-hv56b but then mentioned 100:100::1.

The problem in this case was with IPv6, and most probably I attached the IPv4 output by mistake.
However, it is not specific to IPv4 or IPv6; it happens with both in different reproduction attempts.

@zolug
Contributor

zolug commented Feb 13, 2024

Hi,
Here are some details about the network setup in use, where the default NSM datapath healing cannot be used.
Our problem is twofold: on the one hand it is related to a linux bridge, and on the other hand to capability restrictions in certain NSCs.

1.,
Consider the following setup:
NSC --- NSE/NSC --- NSE

In the NSE/NSC POD there's a linux bridge (with an IPv4/IPv6 address) to which the nsm interface(s) will be attached.
(IPs are assigned from the same IP network for both the NSM interfaces and the bridge.)

Ping between the NSM endpoint IPs belonging to the same NSM connection might not work, or rather won't work in our case:

IPv4:
Due to hardening, arp_ignore is set to 1 by default in PODs, which prevents ARP resolution from working with the bridge setup. This can be easily resolved by setting arp_ignore=0 in the NSC/NSE POD.

IPv6:
https://www.linuxquestions.org/questions/linux-networking-3/how-do-i-enable-proxy-ndp-proxy-arp-works-933174/

Requires proxy_ndp to be enabled, but is more complicated than IPv4, since proxy NDP must be handled on a per-address basis (for example by using an additional chain element to add/remove NDP entries to/from the ND table upon NSM Request/Close).
For example:

sysctl -w net.ipv6.conf.bridge0.proxy_ndp=1
ip -6 neigh add proxy fd00::5 dev bridge0

These "adjustments" for IPv4 and IPv6 could be skipped, for example, by using the bridge IP instead for the datapath monitoring (either as src or dst, depending on which link we consider).
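
The per-address handling could be sketched like this. It is a Python illustration that only builds the `sysctl` and `ip -6 neigh` command lines shown above for a set of connection addresses. The function names are hypothetical, and wiring this into an NSM chain element on Request/Close is not shown.

```python
import ipaddress


def enable_proxy_ndp_command(bridge):
    """Command that enables proxy NDP on the bridge interface."""
    return ["sysctl", "-w", f"net.ipv6.conf.{bridge}.proxy_ndp=1"]


def proxy_ndp_commands(bridge, addresses, action="add"):
    """Build one `ip -6 neigh <action> proxy` command per IPv6 address.

    addresses may carry a prefix length (as in an IP context); it is dropped.
    IPv4 addresses are skipped. Use action="add" on Request and "del" on Close.
    """
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    commands = []
    for address in addresses:
        ip = ipaddress.ip_interface(address).ip
        if ip.version != 6:
            continue
        commands.append(["ip", "-6", "neigh", action, "proxy", str(ip), "dev", bridge])
    return commands
```

The returned lists can be passed to `subprocess.run`, which needs sufficient privileges in the POD's network namespace.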

2.,
Sending ping as part of datapath monitoring is not feasible from the leftmost NSC in our case as it belongs to the application.

In order to send an ICMP echo request, you either need a RAW socket or a datagram socket (UDP ping, often referred to as unprivileged ping). To create a RAW socket, the NET_RAW capability is required.
For the UDP ping, the GID of the process must be authorized to create such special datagram sockets. This is achieved through the ping_group_range sysctl.

ping_group_range - 2 INTEGERS
Restrict ICMP_PROTO datagram sockets to users in the group range. The default is "1 0", meaning, that nobody (not even root) may create ping sockets. Setting it to "100 100" would grant permissions to the single group. "0 4294967294" would enable it for the world, "100 4294967294" would enable it for the users, but not daemons.
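
The range semantics quoted above can be expressed in a few lines. This Python sketch only interprets the two integers; the function names are made up, and the `/proc` path is the usual location of this sysctl on Linux.

```python
def read_ping_group_range(path="/proc/sys/net/ipv4/ping_group_range"):
    """Read the raw sysctl value, for example "1\t0"."""
    with open(path) as handle:
        return handle.read().strip()


def ping_allowed(ping_group_range, gids):
    """Return True if any of the given group IDs falls inside the range.

    With the default "1 0" the range is empty, so nobody is allowed.
    """
    low, high = (int(value) for value in ping_group_range.split())
    return any(low <= gid <= high for gid in gids)
```

For example, `ping_allowed("1 0", [0])` is False even for root's group, while `ping_allowed("100 100", [100])` is True.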

Unfortunately, we are not allowed to impose requirements involving capabilities or the ping_group_range sysctl on the application POD.
I had an idea to basically mimic port scanners: replace the datapath monitor with a UDP sender and rely on port unreachable to check peer availability. Of course, the port in use would have to be coordinated. (But it would scale well as the number of connected NSCs increases, without imposing additional user space load on the NSE.)
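
The UDP idea might look roughly like the following. It is a Python sketch under stated assumptions, not an NSM datapath checker: it relies on the Linux behavior that a connected UDP socket reports an ICMP port unreachable as a "connection refused" error, and `udp_probe` and its parameters are hypothetical.

```python
import socket


def udp_probe(host, port, timeout_s=1.0):
    """Return True if the peer's network stack answered the probe.

    A datagram is sent to a port that is expected to be closed on the peer.
    An ICMP port unreachable reply surfaces as ConnectionRefusedError on a
    connected UDP socket, which proves that the peer is reachable.
    No capabilities or ping_group_range settings are needed for this.
    """
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.connect((host, port))
            sock.send(b"probe")
            sock.recv(1)  # normally no payload comes back; we wait for the ICMP error
        except ConnectionRefusedError:
            return True  # port unreachable: the peer is alive
        except OSError:
            return False  # timeout or network error: no sign of the peer
        return True  # something answered on the port, so the peer is alive too
```

This also shows why the port has to be coordinated: if something on the peer listens on that port and stays silent, no ICMP error is generated and the probe times out.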

@glazychev-art
Contributor

Thanks @zolug ,

  1. Do I understand the topology correctly? Please correct me if I'm wrong
    bridge
    Question: Is the connection between left NSC and right NSE one NSM connection?
    Or is it split into 2: NSC --- NSE/ + /NSC --- NSE?

  2. Actually, we already ping in unprivileged mode:
    https://github.com/networkservicemesh/sdk-kernel/blob/main/pkg/kernel/tools/heal/liveness_check.go#L154-L162
    And you are right, we need to set ping_group_range for that. Forwarder uses this chain element to set it on NSCs:
    https://github.com/networkservicemesh/sdk-kernel/tree/main/pkg/kernel/networkservice/connectioncontextkernel/pinggrouprange

@zolug
Contributor

zolug commented Feb 14, 2024

@glazychev-art

1., They are 2 separate connections.
(In fact there could be multiple NSCs on the left and multiple NSEs on the right interconnected by the same central NSE/NSC POD.)

I mostly highlighted the setup to emphasize that in our case there could be 2 "types" of datapath monitoring use cases: when the bridge is in the NSE and when the bridge is in the NSC.
(So, for example currently if I wanted to make datapath pings work, one way would be to replace either the src or dst of the pings with the bridge IPs (IPv4 and IPv6) depending on the type of the link.)

2., Hmm, I wasn't aware of this chain element enabling unprivileged ping for all group ids.

@glazychev-art
Contributor

@zolug
Got it, thanks
The picture I sent is about right, right?

If you have multiple NSC and NSE, do you use the same bridge, or is a unique one created for each pair?

@zolug
Contributor

zolug commented Feb 15, 2024

@zolug Got it, thanks The picture I sent is about right, right?

If you have multiple NSC and NSE, do you use the same bridge, or is a unique one created for each pair?

@glazychev-art
Yes, your picture is about right (although the connections and the bridge have IPv6 addresses as well, again from the same subnet).
There's one common bridge for all.

@glazychev-art
Contributor

@zolug
How is an IP address assigned to the bridge? Are you doing this manually?
Could you please tell me why the bridge has an address?

@glazychev-art
Contributor

@zolug
Another question:
Do the logs attached here represent this setup or something else?

@zolug
Contributor

zolug commented Feb 16, 2024

@zolug How is an IP address assigned to the bridge? Are you doing this manually? Could you please tell me why bridge has an address?

@glazychev-art
Yes, the bridge IP is configured manually. (But the address is assigned by a custom IPAM functionality, just like for the NSM connections.)

The bridge has an IP address mostly for historical reasons; it is related to the "traffic" issues affecting L2 address resolution explained in the 1st point of this comment: #664 (comment).
Some more background: in the left NSC, the NSM interface(s) will have additional virtual IPs configured (/32 IPv4 and /128 IPv6) that are used by application traffic. Thus, the left NSC applies Policy Based Routes to point traffic from such virtual IPs to the proper next hop IP (the bridge IP) of the NSC/NSE POD.

@zolug
Contributor

zolug commented Feb 16, 2024

@zolug Another question: Do the logs attached here represent this setup or something else?

I don't think they are related.

@glazychev-art
Contributor

Thanks @zolug
It seems to me that your case is similar to the vl3 network. Have you considered using vl3 endpoints?
https://github.com/networkservicemesh/deployments-k8s/tree/v1.12.1-rc.1/examples/features/vl3-basic

@glazychev-art
Contributor

@zolug
Actually, I still don’t fully understand the setup. You mentioned that NSC has VIP addresses. Does the diagram end up looking like this? (IPv6 was omitted):
topology2
It would be great if you could correct me, or provide logs illustrating the case.
Thanks

@zolug
Contributor

zolug commented Feb 20, 2024

@glazychev-art
Hi,
The NSM interfaces and the bridge are supposed to have an IP from the same /24 subnet.
The NSM interface in the left NSC POD additionally has (at least one) /32 VIP address (not part of the above subnet).

The NSC --- NSE/NSC --- NSE setup is a simplified version of our architecture. Here's a (slightly outdated) picture better depicting our architecture.
For simplicity we could assume that the right NSE also has a VIP address through which it communicates with the left NSC's VIP.
(So, for traffic from the left NSC, the NSE/NSC would act as a router selecting the right NSE. And for traffic from the right NSE, the right NSE would be aware of the left NSC address and would employ a route with that IP as next hop, while the NSE/NSC would act as a bridge.)

Also, IMHO all the delicate details are not that important; my goal was mainly to illustrate that a bridge could cause pings involving its slave port IPs to fail without proper arp_ignore and proxy NDP tweaks.
Probably, a simple NSC --- NSE setup, where either of the two PODs could host a bridge to which the NSM interface would be attached, would have been sufficient to present our concerns with the default datapath ping.

Btw, I'm not expecting any changes in NSM to address this use case, just figured it could be worth sharing it with you.
(A custom chain element could cover the proxy ndp entries. Although in the past NSM connection close wasn't really reliable, so I was instead thinking about a custom datapath ping using the bridge IP. But using the bridge IP in the left NSC (if we look at the picture) would be cumbersome, since the IP would not be known from start, instead would be learnt via the NSM connection's IP context eventually.)

@glazychev-art
Contributor

@zolug
Got it, thanks
Is it possible for you to send logs from the left NSC, excluding all private information?

@zolug
Contributor

zolug commented Feb 27, 2024

@zolug Got it, thanks Is it possible for you to send logs from the left NSC, excluding all private information?

I'm a bit puzzled about which logs you might be after. The details around the bridge are not (or maybe only loosely) connected to the original issue. I merely added this information here as I've been asked to do so.
We do not enable datapath monitoring because by default the bridge network setup breaks any communication in the cases I described. If enabled, heal would constantly run into failed datapath.

@glazychev-art
Contributor

glazychev-art commented Feb 27, 2024

@zolug
Let me explain why I asked this. Perhaps I don't fully understand your configuration.
We would just like to know what information the NSC has after the Request. For example, what is in its src/dst ips/routes.

Based on this picture, there are still a few questions.
For example, consider the connection between Target-1 -- Proxy-A (that is, nsc486 and nse475). Both of these interfaces have IP addresses (172.16.0.2/24 and 172.16.0.3/24).
Why can't we use these IPs to check the datapath (without reaching the bridge)? Or are these addresses also assigned manually, just like the address for the bridge, and we cannot start checking the datapath immediately after the Request?

Also the question is at what point the client receives IP 20.0.0.1. Does this mean that immediately after the Request the connection is unavailable because 172.16.0.3/24 is not manually configured?

Perhaps I misunderstood you.
But it seems to me that it would be wrong to check the entire connection at once (for example Target-1 -- load-balancer-A).
Because in fact these are 2 NSM connections that need to be checked separately (Target-1 (nsc486) -- Proxy-A (nse475) and Proxy-A (nsc486) -- load-balancer-A (nse475)).
And as far as I understand, in this case the bridge is not involved. Am I missing something?

Enabling datapath healing would allow us to more reliably solve the healing problem, I think

@zolug
Contributor

zolug commented Feb 27, 2024

@glazychev-art

Why can't we use these IPs to check the datapath (without reaching the bridge)? Or are these addresses also assigned manually, just like the address for the bridge, and we cannot start checking the datapath immediately after the Request?

The reason is described in #664 (comment):
The nse475 interface is moved into the linux bridge right after its creation (by us). Assuming net.ipv4.conf.default.arp_ignore is 1, the bridge will inherit the same arp_ignore value of 1 when created, and thus ARP resolution will break for slave port IPs:
The bridge will consider all incoming packets as if it were itself the receiving interface, and thus, due to arp_ignore=1, will only answer if the IP is configured on the bridge interface (it will also forward the request through its other slave ports, but that won't help our case).
In the case of IPv6, L2 address resolution fails by default (irrespective of any sysctl values AFAIK), and proxy NDP entries have to be set up to work around it.

The IPs in the example (172.16.0.2/24 and 172.16.0.3/24) are assigned via the NSE logic of the Proxy similar to what an example NSE in NSM repo would do. So, they are available once the initial Request returns on the NSC side.

But it seems to me that it would be wrong to check the entire connection at once (for example Target-1 -- load-balancer-A).

That's correct. (There can be and normally should be multiple paths available in each Proxy POD to send a packet towards a load-balancer POD.)

And as far as I understand, in this case the bridge is not involved. Am I missing something?

The bridge is still involved, but multi-path routing will decide through which slave interface the packet is sent out towards a load-balancer.

Also the question is at what point the client receives IP 20.0.0.1.

It's received separately by our left NSC (Target) through an independent "communication channel", upon which the NSC will update the already established connection by sending out an updated Request with a modified IPContext (20.0.0.1 will be appended as src_ip_addrs). (The Request will also contain a matching networkservice.PolicyRoute to point traffic sent from 20.0.0.1 to the bridge IP as next hop IP.)
Such IPs might come and go on the fly. Also, in our case these IPs are of no interest when it comes to datapath monitoring. (They are used in end-to-end communication.)

Example logs:
Here's an example of what is returned as the connection on an initial Request (in a Target-X). (The bridge IPs are already passed as extra_prefixes by the Proxy. So, we could create a custom datapath checker that relies on these IPs. Or alternatively play with the arp_ignore sysctl and proxy NDP entries as described before.)

"connection":"id:\"target-a-646b8df765-2p5dh-proxy.load-balancer-a1.trench-a.default-0\"  network_service:\"proxy.load-balancer-a1.trench-a.default\"  mechanism:{cls:\"LOCAL\"  type:\"KERNEL\"  parameters:{key:\"inodeURL\"  value:\"file:///proc/thread-self/ns/net\"}  parameters:{key:\"name\"  value:\"nsm-0\"}}  context:{ip_context:{src_ip_addrs:\"172.16.3.4/24\"  src_ip_addrs:\"fd00:0:0:3::4/64\"  dst_ip_addrs:\"172.16.3.5/24\"  dst_ip_addrs:\"fd00:0:0:3::5/64\"  excluded_prefixes:\"10.96.0.0/16\"  excluded_prefixes:\"10.244.0.0/16\"  extra_prefixes:\"172.16.3.1/24\"  extra_prefixes:\"fd00:0:0:3::1/64\"}  MTU:1500}  labels:{key:\"nodeName\"  value:\"kind-worker4\"}  path:{path_segments:{name:\"target-a-646b8df765-2p5dh\"  id:\"target-a-646b8df765-2p5dh-proxy.load-balancer-a1.trench-a.default-0\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTcwOTA0MjM4OH0.HYz16vOYG95a942lhw4cTRiiGyFalG0cxJdjJ7Sw3tR6fxE9kxGi6gBbKFxkglDeB97JjFc1P1l46ZcYSrjlZw\"  expires:{seconds:1709040767}}  path_segments:{name:\"nsmgr-c6gdx\"  id:\"721c438c-974f-486a-ad96-c96ccafad63a\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkwfQ.8ALaT7H7TJyDFRenwiuhwdqKT092lqgCWgWEoaKetfjbp6Tv0MXLBDSnZQpX0bN0KRDnB7yyDytxqNuZhLNaTg\"  expires:{seconds:1709039390  nanos:287079298}}  path_segments:{name:\"forwarder-vpp-7sg7s\"  id:\"c7d58de5-0980-49aa-9c89-0e4da18e4510\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkwfQ.zQwfh1wEpOFT3zK3LCKNz9R7tSxQRWnrb7vlHoVhlOkLB2ozTDL8tA3TMrnJ2Kme5K17331FSUMi6hcdyQscvw\"  expires:{seconds:1709039390  nanos:475315614}  
metrics:{key:\"client_drops\"  value:\"1\"}  metrics:{key:\"client_interface\"  value:\"VIRTIO/tap0\"}  metrics:{key:\"client_rx_bytes\"  value:\"110\"}  metrics:{key:\"client_rx_packets\"  value:\"1\"}  metrics:{key:\"client_tx_bytes\"  value:\"0\"}  metrics:{key:\"client_tx_packets\"  value:\"0\"}  metrics:{key:\"server_drops\"  value:\"0\"}  metrics:{key:\"server_interface\"  value:\"VIRTIO/tap7\"}  metrics:{key:\"server_rx_bytes\"  value:\"0\"}  metrics:{key:\"server_rx_packets\"  value:\"0\"}  metrics:{key:\"server_tx_bytes\"  value:\"0\"}  metrics:{key:\"server_tx_packets\"  value:\"0\"}}  path_segments:{name:\"proxy-load-balancer-a1-4rlss\"  id:\"eb70bcf6-70d9-4fbd-8eb1-fca7ae12ae15\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkwfQ.Id73fQJp_ZhQH0iFvaj8ULWSSfz_Ad4jN6v7h4p5XKqOa68xe0I2aBzzVFQuFJ-39imUFd2AQaD7OL6cuOOPLw\"  expires:{seconds:1709039390  nanos:476074482}}}  network_service_endpoint_name:\"proxy-load-balancer-a1-4rlss\"  payload:\"ETHERNET\""
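
As an illustration of the custom datapath checker idea, the bridge IPs could be pulled out of such a logged connection as shown below. This Python sketch works on the text form seen in the log (with or without backslash-escaped quotes). The function names are hypothetical, and the assumption that extra_prefixes carries the bridge IPs holds only for this particular setup.

```python
import re
from collections import defaultdict

# Matches key:"value" and key:\"value\" for the IP context fields of interest.
FIELD_RE = re.compile(
    r'\b(src_ip_addrs|dst_ip_addrs|extra_prefixes):\\?"([^"\\]+)\\?"'
)


def ip_context(connection_text):
    """Collect the repeated IP context fields from a logged connection."""
    fields = defaultdict(list)
    for key, value in FIELD_RE.findall(connection_text):
        fields[key].append(value)
    return dict(fields)


def datapath_targets(connection_text):
    """Addresses from extra_prefixes with the prefix length removed.

    In the setup described here these are the bridge IPs (an assumption
    specific to this Proxy, not general NSM behavior).
    """
    prefixes = ip_context(connection_text).get("extra_prefixes", [])
    return [prefix.split("/")[0] for prefix in prefixes]
```

For the connection above, this would yield 172.16.3.1 and fd00:0:0:3::1 as candidate ping targets instead of the dst_ip_addrs.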

And here's what it looks like once an established NSM connection has been updated with the VIP addresses (20.0.0.1/32, 10.0.0.1/32, 2000::1/128) and the associated policy routes:

{"severity":"debug","timestamp":"2024-02-27T12:59:54.275+00:00","service_id":"Meridio-tapa","message":"received eventIn: connections:{key:\"721c438c-974f-486a-ad96-c96ccafad63a\"  value:{id:\"721c438c-974f-486a-ad96-c96ccafad63a\"  network_service:\"proxy.load-balancer-a1.trench-a.default\"  mechanism:{cls:\"LOCAL\"  type:\"KERNEL\"  parameters:{key:\"inodeURL\"  value:\"inode://4/4026537114\"}  parameters:{key:\"name\"  value:\"nsm-0\"}}  context:{ip_context:{src_ip_addrs:\"172.16.3.4/24\"  src_ip_addrs:\"fd00:0:0:3::4/64\"  src_ip_addrs:\"2000::1/128\"  src_ip_addrs:\"20.0.0.1/32\"  src_ip_addrs:\"10.0.0.1/32\"  dst_ip_addrs:\"172.16.3.5/24\"  dst_ip_addrs:\"fd00:0:0:3::5/64\"  excluded_prefixes:\"10.96.0.0/16\"  excluded_prefixes:\"10.244.0.0/16\"  extra_prefixes:\"172.16.3.1/24\"  extra_prefixes:\"fd00:0:0:3::1/64\"  policies:{from:\"2000::1/128\"  routes:{prefix:\"::/0\"  nextHop:\"fd00:0:0:3::1\"}}  policies:{from:\"20.0.0.1/32\"  routes:{prefix:\"0.0.0.0/0\"  nextHop:\"172.16.3.1\"}}  policies:{from:\"10.0.0.1/32\"  routes:{prefix:\"0.0.0.0/0\"  nextHop:\"172.16.3.1\"}}}  MTU:1500}  labels:{key:\"nodeName\"  value:\"kind-worker4\"}  path:{index:1  path_segments:{name:\"target-a-646b8df765-2p5dh\"  id:\"target-a-646b8df765-2p5dh-proxy.load-balancer-a1.trench-a.default-0\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTcwOTA0MjM4OH0.VlhnpXkXW7AXlSCLfzTnvvL06N6_kUYFlkRWuzHVYEPnh713NmYUF8MaOV6JRDzqXmuZlZfcwjSjev7Ek6SyEQ\"  expires:{seconds:1709040767}}  path_segments:{name:\"nsmgr-c6gdx\"  id:\"721c438c-974f-486a-ad96-c96ccafad63a\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkyfQ.rj1WKymPqaIdv37FOXsCspeOlDfpDTb1fIGvRuWkDbiKgGRUe5W1tiWXWFDqN8c_Mu897QsXDrO63kmAEWvNgg\"  
expires:{seconds:1709039392  nanos:718167533}}  path_segments:{name:\"forwarder-vpp-7sg7s\"  id:\"c7d58de5-0980-49aa-9c89-0e4da18e4510\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkzfQ.UUL0BFtKxkro3kvrTwaY8QSugCIz6uSH321Blr0XJjRuiLazcKv6OgrFoN1mF5gpD-fyulIWEUt9-ENW7elZAA\"  expires:{seconds:1709039393  nanos:91726945}  metrics:{key:\"client_drops\"  value:\"6\"}  metrics:{key:\"client_interface\"  value:\"VIRTIO/tap0\"}  metrics:{key:\"client_rx_bytes\"  value:\"796\"}  metrics:{key:\"client_rx_packets\"  value:\"6\"}  metrics:{key:\"client_tx_bytes\"  value:\"486\"}  metrics:{key:\"client_tx_packets\"  value:\"5\"}  metrics:{key:\"server_drops\"  value:\"0\"}  metrics:{key:\"server_interface\"  value:\"VIRTIO/tap7\"}  metrics:{key:\"server_rx_bytes\"  value:\"596\"}  metrics:{key:\"server_rx_packets\"  value:\"6\"}  metrics:{key:\"server_tx_bytes\"  value:\"0\"}  metrics:{key:\"server_tx_packets\"  value:\"0\"}}  path_segments:{name:\"proxy-load-balancer-a1-4rlss\"  id:\"eb70bcf6-70d9-4fbd-8eb1-fca7ae12ae15\"  token:\"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzA5MDM5MzkzfQ.1XWCLyUBwVaL6-urM86WydMuHrbHk20_SnQ1QE-EzKhT2cn71_u8ZqpkqSG0QMVst-xUjB6kOWcqedm2zFwQVQ\"  expires:{seconds:1709039393  nanos:92324963}}}  network_service_endpoint_name:\"proxy-load-balancer-a1-4rlss\"  payload:\"ETHERNET\"}}, err: <nil>","version":"1.0.0","extra_data":{"subsystem":"NSM","id":"target-a-646b8df765-2p5dh-proxy.load-balancer-a1.trench-a.default-0","type":"networkService","heal":"eventLoop"}}
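
To make the `policies` entries in the event above easier to read, here is a minimal sketch that models them as plain Go structs and renders roughly equivalent iproute2 commands. The type names, the `ipCommands` helper and the table numbers (100, 101, 102) are assumptions made for illustration only; they are not taken from the NSM SDK, and the forwarder may pick table IDs and apply the rules differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Route mirrors one routes:{prefix, nextHop} entry from the logged event.
// The type is hypothetical, not the NSM API type.
type Route struct {
	Prefix  string
	NextHop string
}

// PolicyRoute mirrors one policies:{from, routes} entry from the logged event.
// The type is hypothetical, not the NSM API type.
type PolicyRoute struct {
	From   string
	Routes []Route
}

// ipCommands renders iproute2 commands with roughly the same effect as one
// policy: a source-based rule plus the routes placed in a dedicated table.
// The table number is a caller-supplied assumption.
func ipCommands(p PolicyRoute, table int) []string {
	ip := "ip"
	if strings.Contains(p.From, ":") {
		// An IPv6 source prefix needs the IPv6 rule and route tables.
		ip = "ip -6"
	}
	cmds := []string{fmt.Sprintf("%s rule add from %s table %d", ip, p.From, table)}
	for _, r := range p.Routes {
		cmds = append(cmds, fmt.Sprintf("%s route add %s via %s table %d", ip, r.Prefix, r.NextHop, table))
	}
	return cmds
}

func main() {
	// Values copied from the ip_context of the event above.
	policies := []PolicyRoute{
		{From: "2000::1/128", Routes: []Route{{Prefix: "::/0", NextHop: "fd00:0:0:3::1"}}},
		{From: "20.0.0.1/32", Routes: []Route{{Prefix: "0.0.0.0/0", NextHop: "172.16.3.1"}}},
		{From: "10.0.0.1/32", Routes: []Route{{Prefix: "0.0.0.0/0", NextHop: "172.16.3.1"}}},
	}
	for i, p := range policies {
		for _, c := range ipCommands(p, 100+i) {
			fmt.Println(c)
		}
	}
}
```

Read this way, each VIP in `src_ip_addrs` gets its own source-based rule, and traffic sent from that VIP follows a default route via the next hop listed in `extra_prefixes` (172.16.3.1 or fd00:0:0:3::1).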

nsmbot pushed a commit that referenced this issue May 18, 2024
…k-sriov@main

PR link: networkservicemesh/sdk-sriov#595

Commit: 53551c7
Author: Network Service Mesh Bot
Date: 2024-05-18 13:39:18 -0500
Message:
  - Update go.mod and go.sum to latest version from networkservicemesh/sdk-kernel@main (#595)
PR link: networkservicemesh/sdk-kernel#664
Commit: 9502001
Author: Network Service Mesh Bot
Date: 2024-05-18 13:35:42 -0500
Message:
    - Update go.mod and go.sum to latest version from networkservicemesh/sdk@main (#664)
PR link: networkservicemesh/sdk#1626
Commit: 7b51d9c
Author: Vladislav Byrgazov
Date: 2024-05-19 02:33:13 +0800
Message:
        - Fix memory leak in metrics chain element (#1626)
* Fix memory leak in metrics chain element
* Fix lint issues
* Added check is opentelemetry enabled and fixed copyrights
* Fix metrics memory leak by storing temp connection data in metadata
* Added copyright
* Address review comments
* Fixed import
---------
Signed-off-by: Vladislav Byrgazov <vladislav.byrgazov@xored.com>
Signed-off-by: NSMBot <nsmbot@networkservicmesh.io>
nsmbot pushed a commit that referenced this issue May 18, 2024
…k-vpp@main

PR link: networkservicemesh/sdk-vpp#830

Commit: 2fbe1a3
Author: Network Service Mesh Bot
Date: 2024-05-18 13:39:38 -0500
Message:
  - Update go.mod and go.sum to latest version from networkservicemesh/sdk-kernel@main (#830)
PR link: networkservicemesh/sdk-kernel#664
Commit: 9502001
Author: Network Service Mesh Bot
Date: 2024-05-18 13:35:42 -0500
Message:
    - Update go.mod and go.sum to latest version from networkservicemesh/sdk@main (#664)
PR link: networkservicemesh/sdk#1626
Commit: 7b51d9c
Author: Vladislav Byrgazov
Date: 2024-05-19 02:33:13 +0800
Message:
        - Fix memory leak in metrics chain element (#1626)
* Fix memory leak in metrics chain element
* Fix lint issues
* Added check is opentelemetry enabled and fixed copyrights
* Fix metrics memory leak by storing temp connection data in metadata
* Added copyright
* Address review comments
* Fixed import
---------
Signed-off-by: Vladislav Byrgazov <vladislav.byrgazov@xored.com>
Signed-off-by: NSMBot <nsmbot@networkservicmesh.io>
Projects
Status: Moved to next release
Status: Done