Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ReadTimeoutException between requests #640

Merged

Conversation

AndreKurait
Copy link
Member

@AndreKurait AndreKurait commented May 8, 2024

Description

Currently with the traffic replayer, in the NettyPacketToHttpConsumer, the ReadTimeoutHandler is coupled with the ConnectionClosedListenerHandler.

This means that we count the time in between request towards the ReadTimeout. This causes the channel to throw an exception while it is waiting to be re-used and may cause race conditions if the channel is reused exactly at the timeout interval.

This change places the ReadTimeoutHandler behind the handlers which remain between requests.

This change also fixes an issue where the BacksideHttpWatcherHandler was propagating exceptions through even though it shouldn't since it handles them through the callback finalization.

  • Category: Bug Fix
  • Why these changes are required? Properly handle read timeouts
  • What is the old behavior before changes and new behavior after changes? ReadTimeoutException would occur between requests on the same connection/channel, now the exception only occurs if there is a read timeout during the request and is not sent to the bottom of the pipeline.

Issues Resolved

MIGRATIONS-1723

Is this a backport? If so, please add backport PR # and/or commits #

Testing

Ran existing and added new unit tests. Also ran E2E tests from #637 and verified no ReadTimeoutException in logs (they were occurring before this change)

Check List

  • [ x] New functionality includes testing
    • [ x] All tests pass, including unit test, integration test and doctest
  • [ x] New functionality has been documented
  • [x ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@AndreKurait AndreKurait marked this pull request as ready for review May 8, 2024 19:55
Copy link
Collaborator

@gregschohn gregschohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have some minor improvements that would be nice additions for minimal cost. Thanks for spotting this and fixing it. I'm not sure why I doubled up on the connection close handler.

There is a side note that we might want to put a timeout on how stale a connection could be before we recycle it. We're using a guava CacheBuilder so adding a timeout isn't too challenging. We SHOULD already get some of that because of the timeouts that we do on source connections... however, if we're playing back at a slower or variable speed, or the target has some different parameters, that might be an issue.

I'd rather invest time into managing retries well and find out how prevalent that issue is before worrying about it.


@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
socketContext.close();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add a line beforehand for socketContext.addTraceException(cause, true)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the socketContext NOT being closed when there was an exception? Does an exception skip unregistration?
I think it's safe to call close() twice, but can you check? If it isn't, we should have a check in this class (like nulling out the socketContext)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Socket context would have been closed twice with this in cases an exception reaches ConnectionClosedListenerHandler. It doesn't occur in the timeout case since that handler is after this one.

It doesn't look like .close should be called multiple times, it seems that would double count metrics. I'll do a couple changes to fix this.

  1. remove the socketContext.close from exception caught
  2. change channelUnregistered -> channelInactive. Technically, unregistered can be called multiple times, but inactive is only called once on channel close

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our use, channel is only unregistered when it's closed, but there can be implementations where channel is unregistered from the event loop and re-registered again so inactive follows the contracts better

@AndreKurait AndreKurait force-pushed the ReadTimeoutHandlerFix branch 2 times, most recently from 8e37c54 to 201a7af Compare May 9, 2024 05:32
Copy link

codecov bot commented May 9, 2024

Codecov Report

Attention: Patch coverage is 55.00000% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 75.75%. Comparing base (7b270fa) to head (201a7af).
Report is 13 commits behind head on main.

Files Patch % Lines
...replay/datahandlers/NettyPacketToHttpConsumer.java 58.33% 5 Missing ⚠️
.../opensearch/migrations/replay/TrafficReplayer.java 0.00% 3 Missing ⚠️
...rch/migrations/replay/TrafficReplayerTopLevel.java 75.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #640      +/-   ##
============================================
- Coverage     75.79%   75.75%   -0.05%     
  Complexity     1543     1543              
============================================
  Files           168      168              
  Lines          6437     6446       +9     
  Branches        575      574       -1     
============================================
+ Hits           4879     4883       +4     
- Misses         1179     1183       +4     
- Partials        379      380       +1     
Flag Coverage Δ
unittests 75.75% <55.00%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member Author

@AndreKurait AndreKurait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed all PR feedback. Thanks @gregschohn

Signed-off-by: Andre Kurait <akurait@amazon.com>
Signed-off-by: Andre Kurait <akurait@amazon.com>
Signed-off-by: Andre Kurait <akurait@amazon.com>
Signed-off-by: Andre Kurait <akurait@amazon.com>
@AndreKurait AndreKurait force-pushed the ReadTimeoutHandlerFix branch from 201a7af to 2ae2959 Compare May 9, 2024 13:49
@AndreKurait AndreKurait merged commit e80b6c2 into opensearch-project:main May 9, 2024
5 checks passed
@AndreKurait AndreKurait deleted the ReadTimeoutHandlerFix branch May 9, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants