Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WRR] fix bugs that caused us to re-enter blackout period upon updates #33694

Merged

Conversation

markdroth
Copy link
Member

@markdroth markdroth commented Jul 13, 2023

As per gRFC A58, when WRR sees a subchannel report READY, it reset the non_empty_since value, thus restarting the blackout period. However, there were two cases where we were incorrectly triggering this code:

  1. When WRR got an updated address list that contained addresses that were already present on the old list and whose subchannels were already in READY state, the initial notification for those subchannels on the new list was READY, which incorrectly triggered resetting the non_empty_since value.
  2. Due to a bug in the outlier_detection policy, whenever an update was propagated down through the OD policy without actually enabling OD, it would incorrectly send a duplicate connectivity state notification for the subchannels. This meant that a subchannel that was already in state READY would report READY again, which would also incorrectly trigger resetting the non_empty_since value.

This PR makes two changes:

  1. Fix the bug in outlier_detection that caused it to generate the spurious duplicate READY updates.
  2. Fix WRR to reset the non_empty_since value when a subchannel goes READY only if the subchannel has seen a previous state update and only if that previous state was not READY. (The duplicate READY notifications should not actually happen anymore now that the OD policy has been fixed, but better to be defensive.)

Fixes b/290983884.

@markdroth markdroth added the release notes: no Indicates if PR should not be in release notes label Jul 13, 2023
@markdroth markdroth requested a review from yousukseung July 13, 2023 20:29
@markdroth markdroth merged commit ec39600 into grpc:master Jul 13, 2023
markdroth added a commit to markdroth/grpc that referenced this pull request Jul 13, 2023
grpc#33694)

As per gRFC A58, when WRR sees a subchannel report READY, it reset the
non_empty_since value, thus restarting the blackout period. However,
there were two cases where we were incorrectly triggering this code:
1. When WRR got an updated address list that contained addresses that
were already present on the old list and whose subchannels were already
in READY state, the initial notification for those subchannels on the
new list was READY, which incorrectly triggered resetting the
non_empty_since value.
2. Due to a bug in the outlier_detection policy, whenever an update was
propagated down through the OD policy without actually enabling OD, it
would incorrectly send a duplicate connectivity state notification for
the subchannels. This meant that a subchannel that was already in state
READY would report READY again, which would also incorrectly trigger
resetting the non_empty_since value.

This PR makes two changes:
1. Fix the bug in outlier_detection that caused it to generate the
spurious duplicate READY updates.
2. Fix WRR to reset the non_empty_since value when a subchannel goes
READY only if the subchannel has seen a previous state update and only
if that previous state was not READY. (The duplicate READY notifications
should not actually happen anymore now that the OD policy has been
fixed, but better to be defensive.)

Fixes b/290983884.
@copybara-service copybara-service bot added the imported Specifies if the PR has been imported to the internal repository label Jul 14, 2023
apolcyn pushed a commit that referenced this pull request Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bloat/none imported Specifies if the PR has been imported to the internal repository lang/c++ lang/core per-call-memory/neutral per-channel-memory/neutral release notes: no Indicates if PR should not be in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants