-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Post Fast reboot static route configured is missing #6225
Post Fast reboot static route configured is missing #6225
Conversation
855e573
to
7ba63c0
Compare
@pavel-shirshov , @lguohan - This one has been inactive for a week - how do we move towards merge pls? |
7ba63c0
to
bb30c50
Compare
Hi all, |
retest vsimage please |
does it also happen on the normal reboot? from the description it seems it can happen on normal reboot as well. |
@shi-su , we have solve a race condition between zebra and staticd in #6478 , @sudhanshukumar22, can you look at that pr and see if we can the same approach to solve the problem. |
@lguohan , yes, it is a basic testcase and will happen with normal reboot as well. here, we have to create the static route, save the FRR config and reboot. After reboot, the static route present in FRR config will not be replayed. |
bb30c50
to
0925d7b
Compare
can you look the other people and use the same approach |
-t /usr/share/sonic/templates/unisolate.j2,/usr/sbin/bgp-unisolate \ | ||
" | ||
|
||
CFGGEN_PARAMS_UNIFIED=" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unrelated change, should be move to a different pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will move it to another PR. This change was done to support split mode along with separate mode. The default is still separate mode.
rm -f /etc/frr/bgpd.conf /etc/frr/zebra.conf /etc/frr/staticd.conf | ||
elif [ "$CONFIG_TYPE" == "split" ]; then | ||
echo "service integrated-vtysh-config" > /etc/frr/vtysh.conf | ||
sonic-cfggen $CFGGEN_PARAMS_SPLIT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated changes. can you pull to a different pr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -39,6 +39,17 @@ stderr_logfile=syslog | |||
dependent_startup=true | |||
dependent_startup_wait_for=rsyslogd:running | |||
|
|||
[program:zsocket] | |||
command=/usr/bin/zsocket.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shi-su , i think the problem you discovered is a generic problem. i think it is better that you pull you script as a delicated script and wait for zebra socket available, and then both bgpd, staticd or other daemons shoud wait for that script to exit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I will rewrite my script as a separate one to wait for zebra to be available and then let the daemons wait for it to exit similar to the code changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I drafted a PR to pull the script #6519. It took some ideas in this PR, thanks to @sudhanshukumar22. Maybe we can stick to that script. There are some other changes in this PR, maybe they can be rebased onto #6519.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ben-gale to comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will review that code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since @shi-su has done the changes for wait script, I will remove this. I will put my changes on top of them.
@sudhanshukumar22, can you take a look at #6519 ? |
dockers/docker-fpm-frr/zsocket.sh
Outdated
@@ -0,0 +1,13 @@ | |||
#!/usr/bin/env bash | |||
secs=30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need to wait this long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had put max wait. It will not matter since the script will exit as soon as zebra socket is available.
0925d7b
to
938c035
Compare
dockers/docker-fpm-frr/Dockerfile.j2
Outdated
@@ -18,7 +18,8 @@ RUN apt-get update && \ | |||
iproute2 \ | |||
libjson-c3 \ | |||
logrotate \ | |||
libunwind8 | |||
libunwind8 \ | |||
net-tools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is net-tools
still necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one doubt. I am deleting the /etc/sonic/config_db.json and rebooting the system. After reboot, I expect that config db and vtysh config should be empty. But I see that we still have BGP entries in redis DB. e.g.",
router bgp 65100
neighbor 10.0.0.1 remote-as 65200"
Can you suggest where from these entries are coming after reboot. It is expected that when we delete config_db.json and rebooting, the redis DB should also become empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check if /etc/sonic/minigraph.xml exists? I think if config_db.json does not exist, it will try to recover the configuration from minigraph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have checked this file. It is not present in /etc/sonic, but at some other place. But even after deleting the file and deleting /etc/sonic/config_db.json and reboot, I still see that all BGP configurations are retained. I don't know from where this config is getting replayed. I am using default mode (separated).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything (e.g., config_db.json, minigraph.xml) in /host/old_config?
dockers/docker-fpm-frr/Dockerfile.j2
Outdated
@@ -49,6 +50,7 @@ RUN apt-get clean -y && \ | |||
|
|||
COPY ["frr", "/usr/share/sonic/templates"] | |||
COPY ["docker_init.sh", "/usr/bin/"] | |||
COPY ["zsocket.sh", "/usr/bin/"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is already added a few lines later.
@@ -29,7 +29,7 @@ stderr_logfile=syslog | |||
dependent_startup=true | |||
|
|||
[program:zebra] | |||
command=/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M fpm -M snmp | |||
command=/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -K 300 -r -M fpm -M snmp --v6-rr-semantics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please help clarify why these changes are necessary on this topic? What will happen without these changes?
src/sonic-bgpcfgd/bgpcfgd/main.py
Outdated
@@ -25,7 +25,7 @@ | |||
def do_work(): | |||
""" Main function """ | |||
frr = FRR(["bgpd", "zebra", "staticd"]) | |||
frr.wait_for_daemons(seconds=20) | |||
frr.wait_for_daemons(seconds=60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to wait for this long? Is there a scenario that the 20-second is not enough?
938c035
to
07fc2d8
Compare
07fc2d8
to
5cd97d0
Compare
- Why I did it
Post Fast reboot static route configured is missing.
Log:
Prior To fast reboot:
2020-11-03 07:48:28,726 T0000: INFO [D1-H3] FCMD: show ip route
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] Codes: K - kernel route, C - connected, S - static, R - RIP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] F - PBR, f - OpenFabric,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] > - selected route, * - FIB route, q - queued route, r - rejected route
2020-11-03 07:48:28,829 T0000: INFO [D1-H3]
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] K>* 0.0.0.0/0 [0/202] via 10.59.128.1, eth0, 00:02:04
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 10.10.10.0/24 is directly connected, Ethernet120, 00:01:34
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 10.59.128.0/20 is directly connected, eth0, 00:02:04
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] S>* 20.20.20.0/24 [1/0] via 10.10.10.2, Ethernet120, 00:01:23
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 192.168.12.0/24 is directly connected, Ethernet121, 00:00:29
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] sonic#
Post Fast Reboot:
2020-11-03 07:51:40,228 T0000: INFO [D1-H3] FCMD: show ip route
2020-11-03 07:51:40,329 T0000: INFO [D1-H3] Codes: K - kernel route, C - connected, S - static, R - RIP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] F - PBR, f - OpenFabric,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] > - selected route, * - FIB route, q - queued route, r - rejected route
2020-11-03 07:51:40,330 T0000: INFO [D1-H3]
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] K>* 0.0.0.0/0 [0/202] via 10.59.128.1, eth0, 00:02:08
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 10.10.10.0/24 is directly connected, Ethernet120, 00:00:58
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 10.59.128.0/20 is directly connected, eth0, 00:02:08
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 192.168.12.0/24 is directly connected, Ethernet121, 00:00:58
2020-11-03 07:51:40,331 T0000: INFO [D1-H3] sonic#
- How I did it
Allow zebra socket to come up first before static daemon can start. This is because if static daemon comes up before or at the same time as zebra, zebra misses the static route add message from static daemon. This causes static route to be not present in zebra. This is easily seen during reboot (after config save in FRR).
- How to verify it
- Which release branch to backport (provide reason below if selected)
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)