Problems with ethernet connections between routers in default configuration #1121
Comments
Maybe some race condition is happening here?
Good idea! I think the race condition has been solved by not using
Amazing, you have a great memory :D
I believe that adding some documentation about how to configure the routers in this scenario can mitigate this issue, do you agree? So, for the upcoming release, we should either tackle this properly or (simpler) add a note on the website explaining how to manage this (in lime-docs' /docs/lime-example.txt, @pony1k already added documentation in #1085).
Also, this interface-specific configuration should be exposed via lime-app, as it is veeery common for users to connect two LibreMesh devices via ethernet (I believe).
The default configuration aims to support both LibreMesh_node---Client and LibreMesh_node---LibreMesh_node ethernet connections. Should we leave the Ethernet interfaces out of Batman-adv's bat0, or also out of Babeld? What is a clean way to do so? There is already something like this in place for AP interfaces: lime-packages/packages/lime-proto-batadv/files/usr/lib/lua/lime/proto/batadv.lua Lines 53 to 57 in 396c37a
If we also add an exception there for cabled interfaces, it will disable Batman on LAN interfaces, but it would still be present on WAN interfaces (as they get configured with interface-specific configuration by lime-hwd-openwrt-wan). That should be ok, no?
In case we should also remove Babeld, we can add the same check here: lime-packages/packages/lime-proto-babeld/files/usr/lib/lua/lime/proto/babeld.lua Lines 95 to 97 in c3ef5a8
TL;DR For what it's worth, I think this is a step in the right direction. Unfortunately, I did not have time to do further testing. However, I think the reason that we are seeing this issue is that this kind of configuration is not supported by DSA. Let me explain. In the default configuration, we get something like this:
The first section adds lan1 to br-lan, putting the DSA port in bridge mode. The second section configures a soft VLAN on top of lan1, which would require the port to be in stand-alone mode. But a DSA port cannot be in stand-alone mode and in bridge mode at the same time. What we want is that ingress packets on lan1 with a VLAN tag of 207 go to lan1_207, while all others are handled by the bridge. This worked fine on swconfig devices, where the bridging is done in software. But on DSA devices, the bridging is offloaded to the internal switch.
Not automatically configuring batadv and babeld on DSA ports would avoid the situation where a DSA port is both a member of a bridge and configured as a stand-alone port.
Maybe in the future, node-to-node connections could be made possible again without manual configuration by cascading two bridges: one with all the DSA ports, and another that has the first bridge as a member, which we call br-lan and which also has bat0 and the wlan-ap interfaces. The routing protocols could then be configured on top of the first bridge. If we switch the tagging protocol on ethernet to 802.1q, we could use bridge VLANs to separate traffic between each routing protocol and user traffic.
Here is some very detailed information on DSA:
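As a rough illustration of the conflict described in this comment, here is a small Python model (not LibreMesh code). It flags ports that are both bridge members and parents of a VLAN sub-interface, and it spells out the ingress behaviour we would like to have. The names `VlanDevice`, `ports_in_both_roles` and `intended_receiver` are made up for this sketch; only `lan1`, `lan1_207`, VLAN 207 and `br-lan` come from the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlanDevice:
    """A soft VLAN interface stacked on a physical port (hypothetical model)."""
    name: str    # e.g. "lan1_207"
    parent: str  # e.g. "lan1"
    vid: int     # e.g. 207


def ports_in_both_roles(bridge_members, vlan_devices):
    """Return ports that are bridge members AND carry a VLAN sub-interface.

    On DSA hardware these are the ports that would have to be in bridge
    mode and stand-alone mode at the same time.
    """
    parents = {dev.parent for dev in vlan_devices}
    return sorted(set(bridge_members) & parents)


def intended_receiver(port, vid, vlan_devices, bridge="br-lan"):
    """Where an ingress frame SHOULD end up in the desired behaviour.

    `vid` is the VLAN id of the frame, or None if it is untagged.
    Frames matching a VLAN sub-interface go there, all others to the bridge.
    """
    for dev in vlan_devices:
        if dev.parent == port and dev.vid == vid:
            return dev.name
    return bridge


vlans = [VlanDevice("lan1_207", "lan1", 207)]
print(ports_in_both_roles(["lan1", "lan2"], vlans))
print(intended_receiver("lan1", 207, vlans))
print(intended_receiver("lan1", None, vlans))
```

This only models the intent. It says nothing about what the switch hardware actually does with such a configuration, which is exactly the open question here.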
@pony1k amazing research!! The only thing I can say (not very useful though) is that we can already use VLAN 802.1q just specifying it in the /etc/config/lime-node with something like:
As implemented here: lime-packages/packages/lime-proto-batadv/files/usr/lib/lua/lime/proto/batadv.lua Lines 63 to 64 in 396c37a
And documented here:
If I remember correctly, @G10h4ck said we are using 802.1ad specifically to avoid having the hardware switch mess with our packets. @G10h4ck pls have a look
Continuing #1118 here because I'm not sure if trying to detect switches between routers is the right way to go forward.
Let's discuss the issue described by @ilario in this mail:
https://lists.autistici.org/message/20240714.140352.58fe57b2.en.html
In the default configuration, ethernet interfaces are added to the br-lan bridge while also being configured as batadv hard interfaces. In some setups, this leads to error messages appearing in the kernel log at a high rate, and to network instability.
It was suspected that the error appears iff there is a switch between the two routers. I tried to reproduce the issue with a dumb switch, but without success (everything working fine, no errors in kernel log), as described in this mail:
https://lists.autistici.org/message/20240726.150840.dcc0e028.en.html
I then tried to reproduce it by replacing the switch with an OpenWrt router (without DSA), basically acting as a manageable switch, with no success either.
Then, when I connected the two LibreMesh routers directly, surprisingly I could observe the issue. I could see the error messages in the kernel logs, and batadv didn't mesh over ethernet. On `mr70x-v1`, `batctl n` did not list the `fritz4040` as neighbour on the lan interfaces. Also, `batctl bbt` showed no routers in the backbone table on both routers. `batctl tcpdump lan1_29` could see batman OGMs appearing on `lan1_29`, so I am not sure why the interface was not showing up in the neighbour table.
After a while, the wifi connection between my laptop and the routers became quite unusable. When I ran tcpdump on the mesh interfaces, I found that there was a lot of broadcast and that some frames were duplicated many times (I saw ICMPv6 messages with the same id and seq-no many times over long time periods). Plus, on my laptop, I saw the same echo request being received over and over at a high rate. So there was a loop, and that clogged the wifi interface. It is not the kind of loop I described in #1032.
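For anyone who wants to check a capture for this kind of duplication, here is a minimal Python sketch that counts ICMPv6 echo requests seen more than once with the same source, destination, id and seq. It assumes tcpdump-style text lines such as the made-up samples below. The exact line format differs between tcpdump versions (some do not print the id), so treat the regular expression as an assumption to adapt. `count_repeated_echo_requests` is a name invented for this example.

```python
import re
from collections import Counter

# Assumed line shape (adapt to your tcpdump version):
#   IP6 <src> > <dst>: ICMP6, echo request, id <id>, seq <seq>, length <n>
# The "id" field is optional because some versions omit it.
ECHO_RE = re.compile(
    r"(\S+) > (\S+): ICMP6, echo request,(?: id (\d+),)? seq (\d+)"
)


def count_repeated_echo_requests(lines, threshold=2):
    """Return {(src, dst, id, seq): count} for echo requests seen
    at least `threshold` times in the given text lines."""
    seen = Counter()
    for line in lines:
        match = ECHO_RE.search(line)
        if match:
            src, dst, ident, seq = match.groups()
            seen[(src, dst, ident, int(seq))] += 1
    return {key: n for key, n in seen.items() if n >= threshold}


# Made-up sample lines, not taken from the actual capture.
sample = [
    "IP6 fe80::1 > ff02::1: ICMP6, echo request, id 7, seq 3, length 64",
    "IP6 fe80::1 > ff02::1: ICMP6, echo request, id 7, seq 3, length 64",
    "IP6 fe80::1 > ff02::1: ICMP6, echo request, id 7, seq 3, length 64",
    "IP6 fe80::1 > ff02::1: ICMP6, echo request, id 7, seq 4, length 64",
]
for key, n in count_repeated_echo_requests(sample).items():
    print(key, n)
```

A request that is merely forwarded once over another interface may also show up twice, so a low count is not necessarily a loop. The same id and seq repeating many times over a long period, as described above, is the suspicious pattern.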
Later I booted the routers again to investigate the issue further. Annoyingly, everything is working fine now: no kernel log messages, meshing over ethernet works, no frames looping around. I'm not able to replicate the issue again. I also tried resetting the configuration to the firstboot state, but to no avail. So, unfortunately, it is currently not possible for me to find out exactly when this happens and why.
I find it strange that batman is also configured on `eth0` on DSA-enabled devices. I don't think we are supposed to use that directly. Next time someone observes this issue, maybe they could add to the `lime-node` file and see if it helps.