Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid that the reporter blocks server shutdown #554

Merged

Conversation

felixbarny
Copy link
Member

@felixbarny felixbarny commented Apr 2, 2019

This only happens after the APM Server is not available for a while. In that case, the reporter queue fills up and can never really drain because the reporter blocks its thread to throttle APM Server connection retries. On shutdown, the reporter tries to flush by registering a flush event. The registration blocks until a new slot in the ring buffer becomes available. But due to the throttling only once every 36 seconds, an event is picked up from the ring buffer. So a shutdown takes around 512 * 36 seconds.

@felixbarny felixbarny requested a review from eyalkoren April 2, 2019 08:58
@felixbarny felixbarny self-assigned this Apr 2, 2019
@felixbarny
Copy link
Member Author

felixbarny commented Apr 2, 2019

The bug likely was introduced here: #410, in combination with #397

@codecov-io
Copy link

codecov-io commented Apr 2, 2019

Codecov Report

Merging #554 into master will decrease coverage by 2.14%.
The diff coverage is 40%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #554      +/-   ##
============================================
- Coverage     65.25%   63.11%   -2.15%     
  Complexity       68       68              
============================================
  Files           180      180              
  Lines          7209     6815     -394     
  Branches        863      780      -83     
============================================
- Hits           4704     4301     -403     
+ Misses         2251     2249       -2     
- Partials        254      265      +11
Impacted Files Coverage Δ Complexity Δ
...pm/agent/report/IntakeV2ReportingEventHandler.java 76.23% <ø> (-4.79%) 0 <0> (ø)
...co/elastic/apm/agent/report/ApmServerReporter.java 57.84% <40%> (+3.84%) 0 <0> (ø) ⬇️
.../apm/agent/report/serialize/DslJsonSerializer.java 81.84% <0%> (-6.41%) 0% <0%> (ø)
...tic/apm/agent/configuration/CoreConfiguration.java 96.12% <0%> (-2.14%) 0% <0%> (ø)
...a/co/elastic/apm/agent/report/ReporterFactory.java 78.26% <0%> (-0.91%) 0% <0%> (ø)
.../main/java/co/elastic/apm/agent/impl/MetaData.java 100% <0%> (ø) 0% <0%> (ø) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6482904...7e1e4d7. Read the comment docs.

@eyalkoren
Copy link
Contributor

eyalkoren commented Apr 3, 2019

On shutdown, the reporter tries to publish a shutdown event and then calls the Disruptor.shutdown which does a busy spin to see if there are still events to publish and times out after 5 seconds:

Then it wakes up the Handler thread, which may try to flush current buffered serialized data, but not flush all events in buffer:

This also changes the state of the Handler so that it shouldn't continue trying to send events, but in any case, this thread is a daemon thread, so it shouldn't prevent shutdown.

The only blocking thing here seems to be the publish, which I assume would be fixed by changing to tryPublish as you did. So I see how the shutdown is delayed by the publish call (which doesn't guarantee success after 36 seconds if other publishers are trying to send events as well) plus the 5 seconds wait, but not longer.

What am I missing here?

@felixbarny
Copy link
Member Author

The only blocking thing here seems to be the publish, which I assume would be fixed by changing to tryPublish as you did.

That's correct. I think with this change we're all good.

@felixbarny felixbarny merged commit c9149db into elastic:master Apr 3, 2019
@felixbarny felixbarny deleted the avoid-reporter-blocking-shutdown branch April 3, 2019 08:45
@SylvainJuge SylvainJuge added bug Bugs and removed type: bug labels Feb 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants