Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BGP probably a bug with multipath + addpath capability + bestpath calculation + route server scenario #18222

Open
2 tasks done
IvayloJ opened this issue Feb 21, 2025 · 23 comments · May be fixed by #18275
Open
2 tasks done
Assignees
Labels

Comments

@IvayloJ
Copy link

IvayloJ commented Feb 21, 2025

Description

If a neighbor do not support or advertise addpath RX capability to frr, but in it's corresponding peer-group or per neighbor (ipv4 unicast/ipv6 unicast) explicitly are include options:
addpath-tx-all-paths
addpath-tx-best-selected
addpath-tx-bestpath-per-AS

Additional paths announced to the neighbor are observed.

Version

Current test was done with frr-9.1.3 but I guess it will affect all versions, even master (by looking the source code).

How to reproduce

FRR as Route Server adding addpath-tx-XXXX. FRR also can be used as a client with set disable-addpath-rx to the RS neighbor (in my case I have many different bgpd from different vendors - cisco,nokia,juniper,huawei + bgpd linux/bsd software implementations).

Connecting other bgp behind the RS clients and withdraw/announce nets will speed bug appearance.

+----------------------------+             +------------------+
| RS                         |             | AS65100          |
| AS65000 192.168.137.1      |-------------| 192.168.137.100  |--------+
|addpath-tx-bestpath-per-AS  |             |                  |        |
+----------------------------+             +------------------+        |
 |                        |                                            | 
 |                        |       +--------------------+               |
 |                        |       | AS65200            |               | prepend 65444
 |                        |-------| 192.168.137.201    |---+           |
+--------------------+    |       | disable-addpath-rx |   |   +---------------------+
| AS65600            |    |       +--------------------+   +---| AS65444             |
| 192.168.137.60     |    |                                    |                     |
|disable-addpath-rx  |    |       +--------------------+   +---| network 10.0.0.0/24 |
+--------------------+    +-------| AS65200            |   |   +---------------------+
                                  | 192.168.137.202    |---+
                                  | disable-addpath-rx |
                                  +---------------------

Expected behavior

All additional paths announces and best path calculation for the neighbor to be with respect of bgp addpath capability negotiations in the beginning of the session.

Actual behavior

Output of "show bgp ipv4 neighbors 192.168.137.20 advertised-routes"

 *> 10.80.129.0/24   192.168.137.155            0    100      0 64866 64561 65011 i
 *> 10.80.129.0/24   192.168.137.154            0    100      0 648717 65011 65011 65011 65011 65011 65011 i
 *> 10.80.129.0/24   192.168.137.155            0    100      0 64866 64561 65011 i

Output of "show bgp ipv4 neighbors 192.168.137.20"

  Neighbor capabilities:
    4 Byte AS: advertised and received
    Extended Message: advertised and received
    AddPath:
      IPv4 Unicast: TX advertised
      IPv4 Unicast: RX advertised
    Route refresh: advertised and received
    Enhanced Route Refresh: advertised and received
    Address Family IPv4 Unicast: advertised and received
    Hostname Capability: advertised (name: rs1.varnaix.net,domain name: n/a) received (name: ge1-bgpd,domain name: n/a)
    Version Capability: not advertised not received
  Graceful restart information:
    Local GR Mode: Disable

    Remote GR Mode: NotApplicable

Additional context

I look at the frr-9.1.3 code but do not see how it check if addpath flag is set during initial peer negotiation, to perform best path calculation per the neighbor. Maybe the bug appears only if one of the paths withdraw and then must announce other path on its place. In all cases no matter if we set addpath-tx-something in the config, all that functionality for the peer should not be applied to the bgp logic if addpath capability is not negotiated.
suppress-duplicate is enabled by default and with the showed topology prevent proper update messages to be send from the route server. Stopping suppress-duplicate no bgp suppress-duplicate may help, but also may make things worse sending additional paths to a neighbor we should not do it.

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
@IvayloJ IvayloJ added the triage Needs further investigation label Feb 21, 2025
@ton31337
Copy link
Member

But you have addpath enabled (both tx/rx):

    AddPath:
      IPv4 Unicast: TX advertised
      IPv4 Unicast: RX advertised

Could you give the simplest configuration to fully understand what's the issue you described?

@ton31337 ton31337 self-assigned this Feb 21, 2025
@ton31337 ton31337 added the bgp label Feb 21, 2025
@IvayloJ
Copy link
Author

IvayloJ commented Feb 22, 2025

@ton31337 That is what we advertise to the peer - We support addpath TX and RX. The peer (192.168.137.20 FRR-9.1.3 with disable-addpath-rx clause set) do not advertise to us nor addpath TX nor addpath RX. So we must NOT accept additional paths nor advertise additional paths to that peer no matter we support it and advertise it. Maybe your patch for dynamic addpath negotiation was step in the right direction as checks, but in all cases multipath logic in the code should be on/off on the session init only when both negotiators support same features (or any of the bgps explicitly is configured to not respect other side capabilities).

RS config (All peers have equal configuration)

 no bgp default ipv4-unicast
 bgp deterministic-med
 bgp long-lived-graceful-restart stale-time 1
 bgp graceful-restart stalepath-time 1
 bgp graceful-restart restart-time 1
 bgp graceful-restart-disable
 neighbor PEER-A-V4 peer-group
 neighbor PEER-A-V4 remote-as 64866
 neighbor PEER-A-V4 description PEER-A
 neighbor PEER-A-V4 passive
 neighbor PEER-A-V4 enforce-first-as
 neighbor PEER-A-V4 timers 5 15
 neighbor PEER-A-V4 graceful-restart-disable
 neighbor 192.168.137.155 peer-group PEER-A-V4
 neighbor 192.168.137.155 description PEER-A-1
 address-family ipv4 unicast
  neighbor PEER-A-V4 activate
  neighbor PEER-A-V4 addpath-tx-bestpath-per-AS
  neighbor PEER-A-V4 soft-reconfiguration inbound
  neighbor PEER-A-V4 maximum-prefix 20
  neighbor PEER-A-V4 route-server-client
  neighbor PEER-A-V4 prefix-list v4_ban in
  neighbor PEER-A-V4 prefix-list v4_ban out
  neighbor PEER-A-V4 route-map PEER-A-IN in
  neighbor PEER-A-V4 route-map PEER-A-OUT out

The bug could be in the bestpath calculation or in the way sending bgp withdraw/updates. Also with the time I observe growing number of PfxSnt with "show bgp ipv4 summary" to the neighbors that do not support addpath at all. And when I dump advertised-routes (show bgp ipv4 neighbors x.x.x.xadvertised-routes) I see some same prefixes learned on the RS from different peers over different paths announced to a peer that do not support addpath. When I remove addpath-tx-bestpath-per-AS clause (same bug i see with addpath-tx-all-paths and addpath-tx-best-selected) for a peer which do not support addpath all works fine (on the router server).

So maybe the simplest fix should be if addpath rx is not advertised from a peer to us, but we put addpath-tx-XXX as a config clause, to turn off for that peer all multipath logic and functions in the code which is applied when we have addpath-tx-XXX and to completely ignore what addpath-tx-XXX do. But still should announce addpath TX capability (we support it at all).

With more simple words I think the logic and functionality behind multipath should be:

  1. no addpath-tx in config - do what we do now
  2. addpath-tx in config - turn on advertising addpath TX capability only, but still do what we do without addpath-tx in the config.
  3. addpath-tx in config + the neighbor advertise addpath RX capability - turn on the current code functions for multipath
  4. Always hard reset of the bgp session is needed when set addpath-tx in the config, to can renegotiate the supported capabilities from both sides.

UPDATE: Forgot to mention when the bug is present the RS really send wrong routes to the affected peers. In one of my test scenarios I have 2 neighbors in one peer group (for redundancy). Both of them announcing specific prefix with same path, same prefix is announcing and from third neighbor in other PG. If the prefix from the both neighbors in the same PG stop be announced anymore the affected peers receive a bad route from the RS for that prefix (next hop point to one of the neighbors that do not announce it anymore), maybe because withdraw do not come in right the same time from both neighbors in the PG, or maybe other reason I dont know. With "show bgp ipv4 the.affected.prefix" all looks good (the prefix is visible only from the single peer that announce it in the moment). "clear bgp ipv4 * soft" fix the things and the right path (from the third neighbor) is send to all. The PfxSnt counter is increased (only the first time it happens, never decreased), and advertised-routes show all 3 paths was announced to the affected peers.

@ton31337
Copy link
Member

So is this happening with peer-groups only?

@IvayloJ
Copy link
Author

IvayloJ commented Feb 24, 2025

Not only. Peer-group setup is my latest config scenario. I observe same when set neighbors in the RS config as individuals (with and without solo clause).

It is happening, when we put addpath-tx- in the config for a peer, and this peer do not support addpath at all, or do not want to accept additional paths from us (do not advertise addpath RX to the RS). In the scenario where I observe the bug I have 43 neighbors, and 6k prefixes (could be because the scale).

@ton31337
Copy link
Member

Ok, thanks for the details, I will test.

@IvayloJ
Copy link
Author

IvayloJ commented Feb 24, 2025

Thank you !!! If you have time and you want I may think how to give you access to one RS where you can see the bug. and I can simulate it. Just can not debug deeply there (not testing env).

By the way now when I think more deeply about the PGs scenario with the current handling of addpath there could be a real mess... addpath-tx- is set on the peer-group, but what if one neighbor in the PG is addpath RX capable and the other one not ?

@ton31337
Copy link
Member

I tested without peer-groups (not tested yet with peer-groups), and can't see anything wrong in sending/not sending addpaths according to RX flag. I will test with PGs to be sure.

In the meantime, could you please test it with the master branch?

And also, please show us the following logs: "debug bgp neighbor", and "debug bgp updates".

@IvayloJ
Copy link
Author

IvayloJ commented Feb 25, 2025

Think I understand what exactly going on (thanks to you for the debug pointing)... You will need bit more complicated test topology to simulate (will build such to be 100% sure, and will put it in the description of the bug, but all what I observe by now is with such topology):

+----------------------------+             +------------------+
| RS                         |             | AS65100          |
| AS65000                    |-------------| 192.168.137.100  |--------+
|addpath-tx-bestpath-per-AS  |             |                  |        |
+----------------------------+             +------------------+        |
 |                        |                                            | 
 |                        |       +--------------------+               |
 |                        |       | AS65200            |               | prepend 65444
 |                        |-------| 192.168.137.201    |---+           |
+--------------------+    |       | disable-addpath-rx |   |   +---------------------+
| AS65600            |    |       +--------------------+   +---| AS65444             |
|                    |    |                                    |                     |
|disable-addpath-rx  |    |       +--------------------+   +---| network 10.0.0.0/24 |
+--------------------+    +-------| AS65200            |   |   +---------------------+
                                  | 192.168.137.202    |---+
                                  | disable-addpath-rx |
                                  +---------------------

When 65444 stop the announce in debug log of the RS I see withdraw from 192.168.137.201 is come. Update path to all is send (next hop 192.168.137.202), then withdraw from 192.168.137.202 come update message is suppressed (suppress-duplicate is enable per default). Also checked what we really send over the network (tcpdump -Xvvv to AS65600) - only one bgp update message with next hop .202. That happens to peers which do not support addpath RX (not advertise such capability), but we enable addpath-tx for them in the RS config. This way the route in AS65600 is stuck to invalid next hop (192.168.137.202 from AS65200). When I stop suppressing duplicates (set "no bgp suppress-duplicates"), the RS start sending 2 update, and next hop in AS65600 come from AS65100 (192.168.137.100) as it should to be. But "show bgp ipv4 summary" and "show bgp ipv4 neighbor 192.168.137.100 advertised-routes" are messy PfxSnt is wrong, and I see all paths are advertised (did not catch with tcpdump what exactly we send, was too tired and too late in the night, but will do later).

I can send you the full debug log, the real bgp msgs we send (tcpdump), and traceroutes I observe also. I done the test with PGs scenario, but it will be same per individual neighbor (PGs only can make things worse I guess).

Unfortunately I can not test with nothing newer than 9.1.3 maybe max 10.0. In 10.2.1 I spot a lot of bad bugs just for 30 minutes of play (worst one bgpd crash at start if frr.conf is missing - package from deb.frrouting.org). After 9.1.3 there are changes in default config (what is on what is off by default). Also change from per daemon config to single file, is not auto done, I have to rewrite everything on hand. BTW that is the worse idea ever ! On some places I will have > 150kb config file when I put everything in one which is absolutely insane to read and maintain. CLI and vty are are awful with single file and dont work properly(saving changes). But this are subjects for other issues. At all I personally (and almost every network admin I know and use frr) will be stuck to 9.1.3 for long time, and if you drop support of 9.1.3 will be a bad.

@ton31337
Copy link
Member

Please show show ip bgp update-group when this is happening (advertising addpath routes even when RX flag is not received from 65200 routers) on RS router.

@IvayloJ
Copy link
Author

IvayloJ commented Feb 25, 2025

100% confirmed with the topology above the defect is visible and can be simulated...

RS65000# show ip bgp update-group
Update-group 15:
  Created: Tue Feb 25 19:47:49 2025
  Outgoing route map: AS65200-OUT
  MRAI value (seconds): 0

  Update-subgroup 17:
    Created: Tue Feb 25 19:47:49 2025
    Join events: 2
    Prune events: 0
    Merge events: 0
    Split events: 0
    Update group switch events: 0
    Peer refreshes combined: 0
    Merge checks triggered: 0
    Coalesce Time: 1400
    Version: 176
    Packet queue length: 0
    Total packets enqueued: 3
    Packet queue high watermark: 1
    Adj-out list count: 3
    Advertise list: empty
    Flags: 
    Max packet size: 4096
    Peers:
      - 192.168.137.202
      - 192.168.137.201
Update-group 16:
  Created: Tue Feb 25 19:47:49 2025
  Outgoing route map: AS65600-OUT
  MRAI value (seconds): 0

  Update-subgroup 18:
    Created: Tue Feb 25 19:47:49 2025
    Join events: 1
    Prune events: 0
    Merge events: 0
    Split events: 0
    Update group switch events: 0
    Peer refreshes combined: 0
    Merge checks triggered: 0
    Coalesce Time: 1400
    Version: 176
    Packet queue length: 0
    Total packets enqueued: 12
    Packet queue high watermark: 1
    Adj-out list count: 6
    Advertise list: empty
    Flags: 
    Max packet size: 4096
    Peers:
      - 192.168.137.60
Update-group 14:
  Created: Tue Feb 25 19:47:49 2025
  Outgoing route map: AS65100-OUT
  MRAI value (seconds): 0

  Update-subgroup 16:
    Created: Tue Feb 25 19:47:49 2025
    Join events: 1
    Prune events: 0
    Merge events: 0
    Split events: 0
    Update group switch events: 0
    Peer refreshes combined: 0
    Merge checks triggered: 0
    Coalesce Time: 1400
    Version: 176
    Packet queue length: 0
    Total packets enqueued: 9
    Packet queue high watermark: 1
    Adj-out list count: 3
    Advertise list: empty
    Flags: 
    Max packet size: 4096
    Peers:
      - 192.168.137.100

I will cleanup all the bgpd.confs in the scheme, and will updated this post with download links for them. For the test used debian versions from 9 to 12, and frr from 8.5 to 9.1.3 (all on packages from deb.frrouting.org). The route server is 9.1.3 again package from deb.frrouting.org.

UPD:
config files for all bgpds in the scheme: http://i.bglans.net/frr-bug-test-18222.tar.gz
to simulate on AS65444: no ip prefix-list v4_our_to65200 seq 2 permit 10.0.0.0/24 / ip prefix-list v4_our_to65200 seq 2 permit 10.0.0.0/24
observe result on AS65600: show bgp ipv4 10.0.0.0/24
debug the RS (AS65000)

@ton31337
Copy link
Member

ton31337 commented Feb 26, 2025

I tested now with the master branch + your topology and configs, and I can't replicate it. I can't see 10.0.0.0/24 received (only best path) on 65600 AS node.

I will test with other 10.x branches yet.

Please correct me what should I see on 65600 node?

@IvayloJ
Copy link
Author

IvayloJ commented Feb 26, 2025

You have to see 10.0.0.0/24 on 65600 node. Please check if on 65444 you have ip prefix-list v4_our_to65200 seq 2 permit 10.0.0.0/24

Just a note: In frr 10+ enforce-first-as is enable by default, which automatically breaks all router server scenarios if you using 10+ as a client of the router server (node 65600, 65100, 65200-1, 65200-2 for the sessions to the RS-65000). You will have to disable it.

@ton31337
Copy link
Member

You will have to disable it.

Yes, I already disabled it.

You have to see 10.0.0.0/24 on 65600 node. Please check if on 65444 you have ip prefix-list v4_our_to65200 seq 2 permit 10.0.0.0/24

I see 10.0.0.0/24 as a best path on 65600 which is fine, because AS65100 has it and announces it towards AS65000 -> AS65600. It's not an addpath (you can verify it by looking at TX ID, which does not exist).

@IvayloJ
Copy link
Author

IvayloJ commented Feb 26, 2025

In fact on 65600 bestpath for 10.0.0.0/24 should be from 65200-1/2 (65444 -> 65100 have + one prepend). when you stop annouce 10.0.0.0/24 from 65444 to 65200-1 and 65200-2, then best path in 65600 should come from 65100. When you start advertise 10.0.0.0/24 from 65444 to 65200-1/2 best path must come from there (65200-1/2).

To simulate the defect, stop/start announce 10.0.0.0/24 from 65444 to 65200(1,2) couple of times (at least 2). That can be done with ip prefix-list v4_our_to65200 seq 2 permit 10.0.0.0/24 , no ip prefix-list v4_our_to65200 seq 2 permit 10.0.0.0/24 executed on 654444. And you can check on the route server 65000 what is advertised to 65600 (show bgp ipv4 neighbor 192.168.137.60 advertised-routes). Also check on 65600 which is bestpath for 10.0.0.0/24 you will see how it stuck to one of the 65200 when it have to move to 65100.

@ton31337
Copy link
Member

Ok, seems now I can see the issue.

@IvayloJ
Copy link
Author

IvayloJ commented Feb 26, 2025

Thank you very much !!!!!

Thats not all...

Try to set "no bgp suppress-duplicate" on the route server, do the tests again and watch show bgp ipv4 sum also advertised-routes to 65600.

Also try to remove addpath-tx for 65600 and test again (with and without suppress-duplicates).

All of these are related somehow somewhere in the code.

On 65000 node can catch with tcpdump -Xvvvni [ethX/enpX] host 192.168.137.60 and port 179 the real update/withdrawn messages send to 65600 (eventually no neighbor AS65600-V4 timers 5 15 executed on 65000 because with timers too noisy from the keep alive frames).

Keep in mind every time you restart frr/bgpd on 65000 it may work in the first test, so few of start/stop announce of 10.0.0.0/24 from 64444 -> 65200 may be needed to trigger the issue. Also same maybe applied on session hard clear. clear bgp ipv4 * soft executed on 65000 will trigger sending of proper update to 65600.

@ton31337
Copy link
Member

ton31337 commented Feb 27, 2025

Could you compile and test this patch? (Do not look into advertised-routes output yet)

@IvayloJ
Copy link
Author

IvayloJ commented Feb 27, 2025

Backported the patch to 9.1 (current branch), in all test cases the issue is fixed (you rock !!! Thank you !) and all works as it should.
Unfortunately I discover two new bugs :))) (it do NOT come from the patch in any way)

on 65600 node enable multipath:
no neighbor 192.168.137.1 disable-addpath-rx

on 65000 (the RS) enable best-selected strategy for 65600 and enable suppress duplicates
neighbor AS65600-V4 addpath-tx-best-selected 6
bgp suppress-duplicates

bug 1: When stop announce 10.0.0.0/24 65444 -> 65200 , all paths for 10.0.0.0/24 disappear in 65600, even the one from 65100.

bug 2: when do no neighbor AS65600-V4 addpath-tx-best-selected the session do not go thru hard reset, as it goes when remove addpath-tx-all-paths or addpath-tx-bestpath-per-AS. Maybe that will leave peers out of sync for the addpath capable ? And will not stop all MP in the code ?

Can you test if that same happens with your RS (guess you still test with master branch version RS).

Do you want me to make a patch for 9.1 and send you to push it in the git or all in one if/when you fix and the stats (show bgp sum / advertised routes) ?

UPDATE: in 9.1 the patch seems also FIXED the stats: show bgp ipv4 sum and show bgp ipv4 neighbor 192.168.137.60 advertised-routes Just to know if you see different results with the master branch.

@ton31337
Copy link
Member

Updated #18275, could you test this also?

@IvayloJ
Copy link
Author

IvayloJ commented Feb 27, 2025

What for is the second commit ? How you spot such situation ? And with the second commit it also looks that it works how it should.

I just compile master branch for testing the tree new issues I spot are present and there. Should I open new issues for them (same topology) or ?

@ton31337 ton31337 removed the triage Needs further investigation label Feb 28, 2025
@IvayloJ
Copy link
Author

IvayloJ commented Feb 28, 2025

Dont push these two commits to the master yet. There are other problems that are created now. Looking to understands what is going on.

@ton31337
Copy link
Member

Should I open new issues for them (same topology) or ?

If it's a separate issue, then yes - open a new issue.

@IvayloJ
Copy link
Author

IvayloJ commented Mar 1, 2025

Ok I have more clear view and will try to explain you why I told you to hold push the commits ...

There are HUGE problem with the workflow paths logic process (which after which should be applied) and it affects all NOT multipath peers - update groups (by now tested from 9.1 to master, will check older vers too, with 7.1 - 8.2 was not that). The current workflow I observe in the code and from tests and debugs is:

(did not look back, yet) -> best-path-selection -> Adj-RIB-Out( group_announce_route_walkcb where you put the fix) -> subgroup_announce_check( OUTPUT route-map apply ) -> send

And it should be:

(did not look back, yet) -> OUTPUT-route-map-check/apply( for the peer/group ) -> best-path-selection( among the allowed paths left) -> Adj-RIB-Out( group_announce_route_walkcb where you put the fix) -> subgroup_announce_check() -> send

The Current workflow breaks many features in frr (route map filtering) and invalidate big part of the documentation ! Because we may have many paths for a prefix but, filtering best-path in the output RM will stop send any other path (for NON MP peer/group).

If you push the commits for this issue, may create other bug (loops where you search the best_tx_id), because for a specific peer/group , we may not send always best path we have (MUST depend on the output route map).

Please, if you can give me a hand and help to figure out if what I try to explain is really true, and fix it I will be very very grateful ! I observe > 80% of the peers (vendors equipment) are not multi path capable or not enable by default, if we keep keep going current way as now, i am not very sure how usable frr bgpd can be (maybe only frr to frr peering).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants