Problems with ethernet connections between routers in default configuration #1121

Open
pony1k opened this issue Jul 31, 2024 · 10 comments

@pony1k
Contributor

pony1k commented Jul 31, 2024

Continuing #1118 here because I'm not sure if trying to detect switches between routers is the right way to go forward.

Let's discuss the issue described by @ilario in this mail:
https://lists.autistici.org/message/20240714.140352.58fe57b2.en.html

In the default configuration, ethernet interfaces are added to the br-lan bridge while also being configured as batadv hard interfaces. In some setups, this leads to error messages appearing in the kernel log at a high rate and to network instability.

It was suspected that the error appears iff there is a switch between the two routers. I tried to reproduce the issue with a dumb switch, but without success (everything working fine, no errors in kernel log), as described in this mail:
https://lists.autistici.org/message/20240726.150840.dcc0e028.en.html

I then tried to reproduce it by replacing the switch with an OpenWrt router (without DSA), basically acting as a manageable switch, with no success either.

Then, when I connected the two LibreMesh routers directly, I could surprisingly observe the issue: the error messages appeared in the kernel logs and batadv didn't mesh over ethernet. On mr70x-v1, batctl n did not list the fritz4040 as a neighbour on the lan interfaces, and batctl bbt showed no routers in the backbone table on either router. batctl tcpdump lan1_29 showed batman OGMs arriving on lan1_29, so I'm not sure why the interface was not showing up in the neighbour table. After a while, the wifi connection between my laptop and the routers became quite unusable. When I ran tcpdump on the mesh interfaces, I found a lot of broadcast traffic, and some frames were duplicated many times (I saw ICMPv6 messages with the same id and seq-no many times over long time periods). Plus, on my laptop, I saw the same echo request being received over and over at a high rate. So there was a loop, and that clogged the wifi interface.
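To make the "same id and seq-no many times" observation easier to repeat, here is a minimal Lua sketch that counts repeated ICMPv6 echo requests in tcpdump text output. It assumes the lines contain `IP6 <src> > <dst>: ICMP6, echo request, id N, seq M`, which depends on the tcpdump version; the function name `count_echo_duplicates` and the sample lines are made up for illustration.

```lua
-- Count how often each ICMPv6 echo request (source, id, seq) appears.
-- Assumption: tcpdump prints "IP6 <src> > <dst>: ICMP6, echo request, id N, seq M, ..."
local function count_echo_duplicates(lines)
    local counts = {}
    for _, line in ipairs(lines) do
        local src = line:match("IP6 (%S+) > ")
        local id, seq = line:match("echo request, id (%d+), seq (%d+)")
        if src and id then
            local key = src .. " id=" .. id .. " seq=" .. seq
            counts[key] = (counts[key] or 0) + 1
        end
    end
    return counts
end

-- Made-up sample lines, in the shape of `tcpdump -n -i <iface> icmp6` output.
local sample = {
    "12:00:00.000001 IP6 fe80::1 > ff02::1: ICMP6, echo request, id 7, seq 3, length 64",
    "12:00:00.000002 IP6 fe80::1 > ff02::1: ICMP6, echo request, id 7, seq 3, length 64",
    "12:00:00.000003 IP6 fe80::1 > ff02::1: ICMP6, echo request, id 7, seq 4, length 64",
}

-- Report only the requests that were seen more than once.
for key, n in pairs(count_echo_duplicates(sample)) do
    if n > 1 then print(key .. " seen " .. n .. " times") end
end
```

A request that shows up far more often than it was sent, over a long period, would point to a forwarding loop rather than ordinary retransmission.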

It is not the kind of loop I described in #1032 .

Later I booted the routers again to further investigate the issue. Annoyingly, everything is working fine now: no errors in the kernel log, meshing over ethernet works, no frames looping around. I'm not able to replicate the issue again. I also tried resetting the configuration to firstboot state, but to no avail. So, unfortunately, it is currently not possible for me to find out when exactly this happens and why.

I find it strange that batman is also configured on eth0 on DSA-enabled devices. I don't think we are supposed to use that interface directly. Next time someone observes this issue, maybe they could add

config net
	option linux_name 'eth0'
	list protocols 'manual'

to the lime-node file and see if it helps.
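Before applying this, it may help to confirm that eth0 really is the DSA conduit on the device at hand. The following is a minimal Lua sketch, assuming the kernel exposes the tagging protocol at /sys/class/net/<conduit>/dsa/tagging as described in the kernel DSA documentation; the helper name `is_dsa_conduit` and the optional `sysfs_root` parameter are hypothetical and only there to make the function easy to try out.

```lua
-- Returns true (plus the tagging protocol name) if <sysfs_root>/<ifname>/dsa/tagging
-- can be opened, false otherwise.
-- Assumption: only a DSA conduit (the CPU-facing Ethernet device) has this attribute.
local function is_dsa_conduit(ifname, sysfs_root)
    sysfs_root = sysfs_root or "/sys/class/net"
    local f = io.open(sysfs_root .. "/" .. ifname .. "/dsa/tagging", "r")
    if not f then return false end
    local tag = f:read("*l")
    f:close()
    return true, tag
end

-- "eth0" is only a guess here; the conduit name depends on the board.
local ok, tag = is_dsa_conduit("eth0")
print("eth0 is DSA conduit:", ok, tag or "")
```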

@ilario
Member

ilario commented Aug 5, 2024

Maybe some race condition is happening here?
Some time ago, I remember, there was a problem where the wrong interface was added to the bridge first, setting its MAC address to some harmful value. I also think this was solved, maybe by adding the dummy0 interface, I cannot remember (and I found a message of mine mentioning that we could remove dummy0 altogether, as it should not be needed anymore: #189 (comment)).

@pony1k
Contributor Author

pony1k commented Aug 5, 2024

Good idea! I think the race condition has been solved by no longer using dummy0 and instead changing the MAC address of all hardifs, so that the main MAC address can never be the same as the one of br-lan (which I think was the problem here). See 4ed70e5. But maybe there is another race condition with the bridge fdb that somehow has to do with the fact that the MAC address of the other router is seen through two interfaces, both bat0 and the ethernet interface. I will try to figure this out in two weeks or so, when I have time (if no one else has figured it out by then).

@ilario
Member

ilario commented Aug 6, 2024

Amazing, you have a great memory :D

@ilario
Member

ilario commented Nov 7, 2024

I believe that adding some documentation about how to configure the routers in this scenario can mitigate this issue, do you agree?

So, for the upcoming release, we should either tackle this properly or (simpler) add comments on the website indicating how to manage this (on lime-docs' /docs/lime-example.txt, @pony1k already added documentation in #1085).

@ilario
Member

ilario commented Nov 7, 2024

Also, this interface-specific configuration should be exposed via lime-app, as it is veeery common for users to connect two LibreMesh devices via ethernet (I believe).
Opinions? @selankon @javierbrk @G10h4ck

@ilario
Member

ilario commented Nov 7, 2024

Issue #1008 is like a child of this one. I hope that when this one is fixed, #1008 will really be fixed too.

@ilario
Member

ilario commented Dec 7, 2024

The default configuration aims to support both LibreMesh_node---Client and LibreMesh_node---LibreMesh_node ethernet connections.
While this may have worked for both scenarios in the past, currently it works only for the first case, so why don't we change the default configuration?

Should we leave the Ethernet interfaces out of Batman-adv's bat0, or also out of Babeld?

What is a clean way to do so?
We should allow the users to activate it manually in interface-specific configuration.

There is already something like this in place for AP interfaces:

    if not args["specific"] then
        if ifname:match("^wlan%d+.ap") then
            utils.log( "lime.proto.batadv.setup_interface(%s, ...) ignored",
                ifname )
            return
        end
    end

If we also add an exception there for cabled interfaces, it will disable Batman for LAN interfaces, but it would still be present on WAN interfaces (as they get configured with interface-specific configuration by lime-hwd-openwrt-wan). That should be ok, no?
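As a rough illustration of what such an exception could look like, here is a self-contained Lua sketch. It is not the actual lime-packages code: the helper `should_skip` and the `^eth%d+` and `^lan%d+` patterns are assumptions about how cabled interfaces are named, and only the `^wlan%d+.ap` pattern comes from the existing check.

```lua
-- Interface name patterns that would be skipped unless configured specifically.
local AUTO_SKIP_PATTERNS = {
    "^wlan%d+.ap", -- existing exception for AP interfaces
    "^eth%d+",     -- assumption: plain Ethernet devices (or the DSA conduit)
    "^lan%d+",     -- assumption: DSA user ports named lan1, lan2, ...
}

-- Returns true when the protocol should not be set up automatically on ifname.
-- Interface-specific configuration (args.specific) always wins.
local function should_skip(ifname, args)
    if args["specific"] then return false end
    for _, pattern in ipairs(AUTO_SKIP_PATTERNS) do
        if ifname:match(pattern) then return true end
    end
    return false
end

print(should_skip("lan1", {}))                 -- true: cabled, no specific config
print(should_skip("wan", { specific = true })) -- false: specific config present
print(should_skip("wlan0-mesh", {}))           -- false: not in the skip list
```

With a check like this, users could still enable the protocol on a cabled port through an interface-specific section, since that path sets `specific` and bypasses the skip list.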

@ilario
Member

ilario commented Dec 7, 2024

In case we should remove also Babeld, we can add the same check here:

    if not args["specific"] and ifname:match("^wlan%d+.ap") then
        utils.log("lime.proto.babeld.setup_interface(%s, ...) ignored", ifname)
        return
    end

@pony1k
Contributor Author

pony1k commented Dec 8, 2024

> Should we leave the Ethernet interfaces out of Batman-adv's bat0, or also out of Babeld?

TL;DR For what it's worth, I think this is a step in the right direction.

Unfortunately, I did not have time to do further testing. However, I think the reason that we are seeing this issue is that this kind of configuration is not supported by DSA. Let me explain. In the default configuration, we get something like this:

config device
    option name 'br-lan'
    option type 'bridge'
    list ports 'bat0'
    list ports 'lan1'
    list ports 'lan2'
    list ports 'lan3'
    
[...]

config device 'lm_net_lan1_batadv_dev'
    option type '8021ad'
    option name 'lan1_207'
    option ifname 'lan1'
    option vid '207'
    option macaddr '02:4a:18:35:7e:35'
    option mtu '1532'

The first section adds lan1 to br-lan, putting the DSA port in bridge mode. The second section configures a soft VLAN on top of lan1. That would require the port to be in stand-alone mode. But a DSA port cannot be in both stand-alone and bridge mode. What we want to happen is that ingress packets on lan1 that have a VLAN tag of 207 go to lan1_207, while all others are handled by the bridge. This worked fine on swconfig devices, where the bridging is done in software. But with DSA devices, the bridging is offloaded to the internal switch.

Not automatically configuring batadv and babeld on DSA ports would avoid the situation where a DSA port is both a member of a bridge and configured as a stand-alone port.
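The conflicting combination can be spotted mechanically. Below is a minimal Lua sketch that takes the bridge ports and VLAN devices from the config above as plain tables and reports every port that is both bridged and the parent of a VLAN device. The function name `find_conflicts` and the table layout are made up for illustration, and the sketch does not check whether the port is actually a DSA port.

```lua
-- Data mirroring the two config sections above.
local bridge_ports = { "bat0", "lan1", "lan2", "lan3" }
local vlan_devices = {
    { name = "lan1_207", ifname = "lan1", vid = 207 },
}

-- Returns one message per VLAN device whose parent port is also a bridge member.
local function find_conflicts(ports, vlans)
    local bridged = {}
    for _, port in ipairs(ports) do bridged[port] = true end

    local conflicts = {}
    for _, vlan in ipairs(vlans) do
        if bridged[vlan.ifname] then
            conflicts[#conflicts + 1] =
                vlan.ifname .. " is bridged and is parent of " .. vlan.name
        end
    end
    return conflicts
end

for _, msg in ipairs(find_conflicts(bridge_ports, vlan_devices)) do
    print(msg) -- lan1 is bridged and is parent of lan1_207
end
```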

Maybe in the future, node-to-node connections could be made possible again without manual configuration by cascading two bridges: one with all the DSA ports, and another bridge that has the first bridge as a member, which we call br-lan and which also has bat0 and the wlan-ap interfaces. Then the routing protocols could be configured on top of the first bridge. If we switch the tagging protocol on ethernet to 802.1q, then we could use bridge VLANs to separate traffic between each routing protocol and user traffic.

Here is some very detailed information on DSA:
https://www.kernel.org/doc/html/latest/networking/dsa/dsa.html
https://www.kernel.org/doc/html/latest/networking/dsa/configuration.html

@ilario ilario closed this as completed Dec 9, 2024
@github-project-automation github-project-automation bot moved this to Done in Releases Dec 9, 2024
@ilario ilario reopened this Dec 9, 2024
@ilario ilario moved this from Done to Todo in Releases Dec 9, 2024
@ilario
Member

ilario commented Dec 9, 2024

@pony1k amazing research!!

The only thing I can say (not very useful though) is that we can already use VLAN 802.1q just by specifying it in /etc/config/lime-node with something like:

list protocols "batadv:%N1:8021q"

As implemented here:

local vlanId = args[2] or "%N1"
local vlanProto = args[3] or "8021ad"

And documented here:

list protocols bmx6:13 # The VLAN type can be provided as a third argument, for example bmx6:13:8021q for using VLAN 802.1q instead of the default 802.1ad
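To show how the three colon-separated fields map onto the defaults quoted above, here is a small Lua sketch. It is an illustration only, not the lime-packages parser: the function `parse_protocol_entry` and the returned table layout are hypothetical, while the `%N1` and `8021ad` defaults are taken from the snippet above.

```lua
-- Split "proto:vlanId:vlanProto" and apply the defaults from the snippet above.
-- Note: this simple split skips empty fields, so "batadv::8021q" is not handled.
local function parse_protocol_entry(entry)
    local args = {}
    for field in entry:gmatch("[^:]+") do
        args[#args + 1] = field
    end
    return {
        proto = args[1],
        vlanId = args[2] or "%N1",
        vlanProto = args[3] or "8021ad",
    }
end

local explicit = parse_protocol_entry("batadv:%N1:8021q")
print(explicit.proto, explicit.vlanId, explicit.vlanProto) -- batadv  %N1  8021q

local defaults = parse_protocol_entry("batadv")
print(defaults.proto, defaults.vlanId, defaults.vlanProto) -- batadv  %N1  8021ad
```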

If I remember correctly, @G10h4ck said we are using 802.1ad specifically for avoiding having the hardware switch messing with our packets.

@G10h4ck pls have a look
