Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Routeorch] Fix next hop group reference count in bulk operation #1501

Merged
merged 4 commits into from
Nov 16, 2020

Conversation

shi-su
Copy link
Contributor

@shi-su shi-su commented Nov 11, 2020

What I did
Remove next-hop groups after updating the reference counter for a bulk of routes instead of removing next-hop groups in the loop of updating the reference counter.
Fix sonic-net/sonic-buildimage#5813

Why I did it
The bulk route API has two loops of updating the reference counter:

  1. update the sairedis reference counter
  2. update the orchagent reference counter

Currently, the removal of next-hop groups is triggered in the second loop of updating the orchagent reference counter when the reference counter decreases to zero. This may result in a reference counter mismatch between orchagent and sairedis since the sairedis reference counter has already included the operation of the whole bulk but the orchagent reference counter has not. Therefore, the removal of next-hop group may fail due to the mismatch in reference counter (e.g., there are some other routes point to the next-hop group but has not been counted in orchagent yet).

To fix this problem, the next-hop group removal operation should be done after updating the reference counter of the whole bulk to make sure the reference counters sairedis and orchagent matches.

How I verified it

Details if related

@shi-su shi-su requested a review from qiluo-msft November 11, 2020 19:41
@@ -1392,11 +1411,6 @@ bool RouteOrch::addRoutePost(const RouteBulkContext& ctx, const NextHopGroupKey
increaseNextHopRefCount(nextHops);

decreaseNextHopRefCount(it_route->second);
if (it_route->second.getSize() > 1
&& m_syncdNextHopGroups[it_route->second].ref_count == 0)
Copy link
Contributor

@qiluo-msft qiluo-msft Nov 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it easier to insert bulkNhgReducedRefCnt here? #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep old if-statement, I think it will keep the bulkNhgReducedRefCnt in small size.


In reply to: 521673770 [](ancestors = 521673770)

*/
decreaseNextHopRefCount(it_route->second);
if (it_route->second.getSize() > 1
Copy link
Contributor

@qiluo-msft qiluo-msft Nov 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same #Closed

{
if (m_syncdNextHopGroups[*it_nhg].ref_count == 0)
{
removeNextHopGroup(*it_nhg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removeNextHopGroup [](start = 16, length = 18)

Could you write a vs testcase to fail old code and pass new code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add a vs test case.

Copy link
Contributor Author

@shi-su shi-su Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A vs test case for this particular issue could be challenging form two aspects:

  1. It requires a specific bulk pattern to happen, but we do not have control over how the operations are partition into bulks. Constantly reproducing the issue in vs test could be a problem.
  2. The issue is transient, it will get resolved itself and the only trace would be the log messages of failed attempts of nexthop group removals. Basically, there is a limited sign of the issue ever happened.

I think it needs some thinking in developing this test case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with you. I think maybe we need to look into the googletest framework for such sequencing testing. the tests under tests/mock_tests

@@ -22,6 +22,8 @@ extern CrmOrch *gCrmOrch;

const int routeorch_pri = 5;

std::set<NextHopGroupKey> bulkNhgReducedRefCnt;
Copy link
Contributor

@qiluo-msft qiluo-msft Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bulkNhgReducedRefCnt [](start = 26, length = 20)

Change to a RouteOrch member variable? #Closed

@lguohan
Copy link
Contributor

lguohan commented Nov 13, 2020

is this affecting 201911 as well?

@shi-su
Copy link
Contributor Author

shi-su commented Nov 13, 2020

is this affecting 201911 as well?

This should not affect 201911 since the bulk mode is not in 201911

Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@qiluo-msft qiluo-msft changed the title [Routeorch] Fix next hop group reference count [Routeorch] Fix next hop group reference count in bulk operation Nov 13, 2020
@shi-su shi-su merged commit 7a92100 into sonic-net:master Nov 16, 2020
daall pushed a commit to daall/sonic-swss that referenced this pull request Dec 7, 2020
…ic-net#1501)

What I did
Remove next-hop groups after updating the reference counter for a bulk of routes instead of removing next-hop groups in the loop of updating the reference counter.
Fix sonic-net/sonic-buildimage#5813

Why I did it
The bulk route API has two loops of updating the reference counter:

1. update the sairedis reference counter
2. update the orchagent reference counter

Before this commit, the removal of next-hop groups is triggered in the second loop of updating the orchagent reference counter when the reference counter decreases to zero. This may result in a reference counter mismatch between orchagent and sairedis since the sairedis reference counter has already included the operation of the whole bulk but the orchagent reference counter has not. Therefore, the removal of next-hop group may fail due to the mismatch in reference counter (e.g., there are some other routes point to the next-hop group but has not been counted in orchagent yet).

To fix this problem, the next-hop group removal operation should be done after updating the reference counter of the whole bulk to make sure the reference counters sairedis and orchagent matches.
shi-su pushed a commit that referenced this pull request May 4, 2021
…route at same time. (#1698)

reference #1501: remove overlay nexthop after updating the reference counter for a bulk of routes
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this pull request Oct 5, 2021
…route at same time. (sonic-net#1698)

reference sonic-net#1501: remove overlay nexthop after updating the reference counter for a bulk of routes
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
… on Y cable (sonic-net#1501)

Summary
There is also support for new commands for muxcable/Y cable
show muxcable hwmode switchmode
sudo config muxcable hwmode setswitchmode manual/auto all
which can basically set/display the switching modes on the cable to auto or manual
using the cli

What I did
Also added these new commands

admin@STR43-0101-0101-01LT0:~$ show muxcable hwmode switchmode 
Port       Switching
---------  -----------
Ethernet0  manual
Ethernet4  manual
admin@STR43-0101-0101-01LT0:/usr$ sudo config muxcable hwmode setswitchmode manual all
Muxcable at port all will be changed to manual switching mode. Continue? [y/N]: y
Success in switching mode on port Ethernet0 to manual
Success in switching mode on port Ethernet4 to manual
admin@STR43-0101-0101-01LT0:/usr$ sudo config muxcable hwmode setswitchmode auto all
Muxcable at port all will be changed to auto switching mode. Continue? [y/N]: y
Success in switching mode on port Ethernet0 to auto
Success in switching mode on port Ethernet4 to auto

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

can't remove next hop groups due to reference counting
3 participants