RFC5549 next-hop is unusable/wrong after restart #11108

Closed
1 of 2 tasks
sebastianw opened this issue Apr 27, 2022 · 0 comments
Labels
bgp triage Needs further investigation

Describe the bug

  • Did you check if this is a duplicate issue?
  • Did you test it on the latest FRRouting/frr master branch?

FRR is set up to peer with Arista switches to exchange IPv4 routes over IPv6 Link-Local BGP sessions (RFC5549).

This sometimes stops working, apparently mostly after FRR is restarted. When it happens, I see the following:

The BGP neighbor output shows the extended next-hop capability as received but no longer as advertised:

client1rt# show bgp neighbors fabric0
BGP neighbor on fabric0: fe80::d6af:f7ff:fe91:46db, remote AS 4209900005, local AS 4209901001, external link
 Member of peer-group EVPN-FABRIC for session parameters
  BGP version 4, remote router ID 10.60.196.15, local router ID 10.60.197.1
  BGP state = Established, up for 00:00:15
  Last read 00:00:15, Last write 00:00:13
  Hold time is 180, keepalive interval is 60 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    Extended Message: advertised
    AddPath:
      IPv4 Unicast: RX advertised and received
    Extended nexthop: received         <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      Address families by peer:          
                   IPv4 Unicast                   
    Long-lived Graceful Restart: advertised
    Route refresh: advertised and received(new)
    Enhanced Route Refresh: advertised and received
    Address Family IPv4 Unicast: advertised and received
    Hostname Capability: advertised (name: client1rt,domain name: n/a) not received
    Graceful Restart Capability: advertised and received
      Remote Restart timer is 300 seconds
      Address families by peer:
        none
  Graceful restart information:
    End-of-RIB send: IPv4 Unicast
    End-of-RIB received: IPv4 Unicast
    Local GR Mode: Helper*
    Remote GR Mode: Helper
    R bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 300
    IPv4 Unicast:
      F bit: False
      End-of-RIB sent: Yes
      End-of-RIB sent after update: Yes
      End-of-RIB received: Yes
      Timers:
        Configured Stale Path Time(sec): 360
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                         Sent       Rcvd
    Opens:                  2          2
    Notifications:          2          0
    Updates:                6         20
    Keepalives:             4          7
    Route Refresh:          0          0
    Capability:             0          0
    Total:                 14         29
  Minimum time between advertisement runs is 0 seconds

 For address family: IPv4 Unicast
  EVPN-FABRIC peer-group member
  Update group 1, subgroup 1
  Packet Queue length 0
  Inbound soft reconfiguration allowed
  Community attribute sent to this neighbor(all)
  Inbound path policy configured
  Outbound path policy configured
  Route map for incoming advertisements is *PERMIT-ANY
  Route map for outgoing advertisements is *LOCAL-LOOPBACKS
  0 accepted prefixes
  Maximum prefixes allowed 10000
  Threshold for warning message 75%

  Connections established 2; dropped 1
  Last reset 00:00:16,  No AFI/SAFI activated for peer
Local host: fe80::a236:9fff:fe3e:509a, Local port: 179
Foreign host: fe80::d6af:f7ff:fe91:46db, Foreign port: 46323
Nexthop: 10.60.197.1
Nexthop global: fe80::a236:9fff:fe3e:509a
Nexthop local: fe80::a236:9fff:fe3e:509a
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Peer Authentication Enabled
Read thread: on  Write thread: on  FD used: 27

Also, in the BGP router config the neighbor suddenly has the line "no neighbor fabric0 capability extended-nexthop", even though the capability is enabled through the peer-group and I did not configure that line. It turns up for all neighbors in the peer-group.

router bgp 4209901001
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 bgp bestpath as-path multipath-relax
 neighbor EVPN-FABRIC peer-group
 neighbor EVPN-FABRIC password XXX
 neighbor EVPN-FABRIC capability extended-nexthop
 neighbor EVPN-OVERLAY-PEERS peer-group
 neighbor EVPN-OVERLAY-PEERS bfd
 neighbor EVPN-OVERLAY-PEERS bfd profile EVPN-FABRIC
 neighbor EVPN-OVERLAY-PEERS password XXX
 neighbor EVPN-OVERLAY-PEERS ebgp-multihop 3
 neighbor EVPN-OVERLAY-PEERS update-source evpn0
 neighbor fabric0 interface peer-group EVPN-FABRIC
 neighbor fabric0 remote-as 4209900005
 no neighbor fabric0 capability extended-nexthop                   <<<<<<<<<<<<<<<<<<<<<<<<<<
 neighbor fabric2 interface peer-group EVPN-FABRIC
 neighbor fabric2 remote-as 4209900006
 no neighbor fabric2 capability extended-nexthop                   <<<<<<<<<<<<<<<<<<<<<<<<<<
 neighbor 10.60.196.15 remote-as 4209900005
 neighbor 10.60.196.15 peer-group EVPN-OVERLAY-PEERS
 !
 address-family ipv4 unicast
  redistribute connected route-map LOOPBACK-HOSTIPS
  neighbor EVPN-FABRIC activate
  neighbor EVPN-FABRIC soft-reconfiguration inbound
  neighbor EVPN-FABRIC maximum-prefix 10000
  neighbor EVPN-FABRIC route-map PERMIT-ANY in
  neighbor EVPN-FABRIC route-map LOCAL-LOOPBACKS out
  maximum-paths 2
 exit-address-family

Even when I reactivate the capability with "neighbor fabric0 capability extended-nexthop", the problem persists after BGP is reset. Some combination of restarting FRR and changing the configuration eventually fixes it, but I can't make out a pattern.
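The symptom in the config above can be checked mechanically. The following is a minimal sketch, not part of FRR: it assumes the running configuration has been captured as text (for example from `vtysh -c "show running-config"`), and the function name `find_capability_overrides` is hypothetical. It reports neighbors that carry an explicit "no ... capability extended-nexthop" line although their peer-group enables the capability.

```python
import re


def find_capability_overrides(config_text):
    """Return neighbors with an explicit 'no neighbor X capability
    extended-nexthop' line whose peer-group enables that capability."""
    enabled = set()    # names (peer-groups or peers) that enable the capability
    member_of = {}     # neighbor -> peer-group it belongs to
    negated = set()    # neighbors with an explicit "no" line

    for raw in config_text.splitlines():
        line = raw.strip()
        m = re.fullmatch(r"neighbor (\S+) capability extended-nexthop", line)
        if m:
            enabled.add(m.group(1))
            continue
        m = re.fullmatch(r"no neighbor (\S+) capability extended-nexthop", line)
        if m:
            negated.add(m.group(1))
            continue
        # Matches both "neighbor X peer-group G" and
        # "neighbor X interface peer-group G"; the bare peer-group
        # definition "neighbor G peer-group" does not match.
        m = re.fullmatch(r"neighbor (\S+) (?:interface )?peer-group (\S+)", line)
        if m:
            member_of[m.group(1)] = m.group(2)

    return sorted(n for n in negated if member_of.get(n) in enabled)


# Excerpt of the configuration from this report (annotation markers removed).
SAMPLE = """\
router bgp 4209901001
 neighbor EVPN-FABRIC peer-group
 neighbor EVPN-FABRIC capability extended-nexthop
 neighbor fabric0 interface peer-group EVPN-FABRIC
 neighbor fabric0 remote-as 4209900005
 no neighbor fabric0 capability extended-nexthop
 neighbor fabric2 interface peer-group EVPN-FABRIC
 neighbor fabric2 remote-as 4209900006
 no neighbor fabric2 capability extended-nexthop
"""

print(find_capability_overrides(SAMPLE))
```

Run against the excerpt, this lists fabric0 and fabric2, the two interface peers affected here.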

On the Arista side, an error is logged when the BGP session is established, and the received routes have the wrong next hop:

Apr 27 16:26:26 leaf1 Bgp: %BGP-3-DROP_TXUPDATE: Dropped updates for peer fe80::a236:9fff:fe3e:509a%Et2 (VRF default AS 4209901001) because a local Nexthop was not configured for AFI/SAFI IPv4/Unicast (message repeated 2 times in 78.1729 secs)

#show bgp neighbors fe80::a236:9fff:fe3e:509a%Et2 ipv4 unicast received-routes
BGP routing table information for VRF default
Router identifier 10.60.196.15, local AS number 4209900005
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
                    S - Stale, c - Contributing to ECMP, b - backup, L - labeled-unicast
                    % - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI Origin Validation codes: V - valid, I - invalid, U - unknown
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop

          Network                Next Hop              Metric  AIGP       LocPref Weight  Path
          10.60.197.1/32         10.60.197.1           0       -          -       -       4209901001 ?
          10.61.197.1/32         10.60.197.1           0       -          -       -       4209901001 ?

To Reproduce

I am not sure how to reproduce this. The problem occurs often, mostly after restarting FRR.

Expected behavior

When it works, the next hop is IPv6 on the Arista side, as expected:

#show bgp neighbors fe80::a236:9fff:fe3e:509a%Et2 ipv4 unicast received-routes
BGP routing table information for VRF default
Router identifier 10.60.196.15, local AS number 4209900005
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
                    S - Stale, c - Contributing to ECMP, b - backup, L - labeled-unicast
                    % - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI Origin Validation codes: V - valid, I - invalid, U - unknown
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop

          Network                Next Hop              Metric  AIGP       LocPref Weight  Path
 * >      10.60.197.1/32         fe80::a236:9fff:fe3e:509a%Et2 0       -          -       -       4209901001 ?
 * >      10.61.197.1/32         fe80::a236:9fff:fe3e:509a%Et2 0       -          -       -       4209901001 ?

Also, FRR then shows the extended next-hop capability as advertised:

client1rt# show bgp neighbors fabric0
BGP neighbor on fabric0: fe80::d6af:f7ff:fe91:46db, remote AS 4209900005, local AS 4209901001, external link
 Member of peer-group EVPN-FABRIC for session parameters
  BGP version 4, remote router ID 10.60.196.15, local router ID 10.60.197.1
  BGP state = Established, up for 00:14:44
  Last read 00:00:05, Last write 00:00:44
  Hold time is 180, keepalive interval is 60 seconds
  Neighbor capabilities:
    4 Byte AS: advertised and received
    Extended Message: advertised
    AddPath:
      IPv4 Unicast: RX advertised and received
    Extended nexthop: advertised and received       <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      Address families by peer:
                   IPv4 Unicast
    Long-lived Graceful Restart: advertised
    Route refresh: advertised and received(new)
    Enhanced Route Refresh: advertised and received
    Address Family IPv4 Unicast: advertised and received
    Hostname Capability: advertised (name: client1rt,domain name: n/a) not received
    Graceful Restart Capability: advertised and received
      Remote Restart timer is 300 seconds
      Address families by peer:
        none
  Graceful restart information:
    End-of-RIB send: IPv4 Unicast
    End-of-RIB received: IPv4 Unicast
    Local GR Mode: Helper*
    Remote GR Mode: Helper
    R bit: False
    Timers:
      Configured Restart Time(sec): 120
      Received Restart Time(sec): 300
    IPv4 Unicast:
      F bit: False
      End-of-RIB sent: Yes
      End-of-RIB sent after update: Yes
      End-of-RIB received: Yes
      Timers:
        Configured Stale Path Time(sec): 360
  Message statistics:
    Inq depth is 0
    Outq depth is 0
                         Sent       Rcvd
    Opens:                  1          1
    Notifications:          0          0
    Updates:                3         12
    Keepalives:            15         19
    Route Refresh:          0          0
    Capability:             0          0
    Total:                 19         32
  Minimum time between advertisement runs is 0 seconds

 For address family: IPv4 Unicast
  EVPN-FABRIC peer-group member
  Update group 1, subgroup 1
  Packet Queue length 0
  Inbound soft reconfiguration allowed
  Community attribute sent to this neighbor(all)
  Inbound path policy configured
  Outbound path policy configured
  Route map for incoming advertisements is *PERMIT-ANY
  Route map for outgoing advertisements is *LOCAL-LOOPBACKS
  17 accepted prefixes
  Maximum prefixes allowed 10000
  Threshold for warning message 75%

  Connections established 1; dropped 0
  Last reset 00:15:53,  Waiting for peer OPEN
Local host: fe80::a236:9fff:fe3e:509a, Local port: 41550
Foreign host: fe80::d6af:f7ff:fe91:46db, Foreign port: 179
Nexthop: 10.60.197.1
Nexthop global: fe80::a236:9fff:fe3e:509a
Nexthop local: fe80::a236:9fff:fe3e:509a
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 120
Estimated round trip time: 1 ms
Peer Authentication Enabled
Read thread: on  Write thread: on  FD used: 29
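The failing and working sessions differ in the "Extended nexthop:" line of the neighbor output. A small sketch for checking that line in captured `show bgp neighbors` text is given here. It assumes the line format shown in this report, and the function name `extended_nexthop_state` is hypothetical.

```python
import re


def extended_nexthop_state(neighbor_output):
    """Return (advertised, received) taken from the 'Extended nexthop:'
    line, or None if the output contains no such line."""
    for line in neighbor_output.splitlines():
        m = re.match(r"\s*Extended nexthop:\s*(.*)", line)
        if m:
            status = m.group(1)
            return ("advertised" in status, "received" in status)
    return None
```

For the failing session this returns `(False, True)`, and for the working session `(True, True)`. A result of `None` means the capability line is missing from the output altogether.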

Versions

  • OS Version: Debian Bullseye (11.3)
  • FRR Version: 8.2.2-0~deb11u1 (from FRR deb repository)
@sebastianw sebastianw added the triage Needs further investigation label Apr 27, 2022
@ton31337 ton31337 added the bgp label Apr 29, 2022
liat-grozovik pushed a commit to sonic-net/sonic-buildimage that referenced this issue Oct 20, 2022
…12453)

Fixing issue FRRouting/frr#11108
For interface-based peers in peer-groups, "no neighbor capability extended-nexthop" is added by default. As a result, IPv4 routes do not get IPv6 next hops.

- How I did it
Ported commit FRRouting/frr@8e89adc, which fixes the issue, to FRR 8.2.2.

- How to verify it
Load FRR and verify that "no neighbor capability extended-nexthop" is not added for interfaces associated with peer-groups.
yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this issue Oct 25, 2022