-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FALLBACK_REJECTION not really working due to Schedulers.immediate #685
Comments
Thanks for the report, @cohadar, I'll take a look |
@cohadar Can you review the test I added in #687? In this test, I set up 3 commands which share a circuit breaker and fallback semaphore of count 2. I queue up the first 2, wait 100 millis, then queue up the third. The third should fast fail with a fallback rejection, and the Futures for the first 2 commands should still be incomplete. Once I block on the first 2, they should complete successfully. All of those assertions check out. Is there anything I did not capture about the scenario you're describing above? |
Problem with tests with low command count is that they don't discover concurrency bugs. package com.netflix.hystrix; import java.util.concurrent.ExecutionException; import org.junit.Test; /**
|
Btw I am inclined to blame the custom implementation of Semaphore in Hystrix. |
Ok I think I figured out where the problem is. If you remove the runnable wrapper around commands the test runs forever. When commands are executed serially by one background thread that thread gets hogged because hystrix throws rejection exception only after previous command finishes fallback execution. Ideally a single thread that pumps hystrix commands should not get hogged but get fast rejections. |
There are 2 types of rejections, which I think you may be conflating. What documentation and code refers to as a threadpool rejection or semaphore rejection is bulkheading around concurrent executions of the The rejection tested in your code and the test I added is a fallback rejection. This only occurs after a failure condition on the This is not a fail-fast path, because it only happens after the
|
And just to clarify, your larger point is correct. Calls to |
OK, I think I understand this better now. The command set up in this unit test uses the defaults of a thread pool size of 10 and a fallback semaphore count of 10. The This sets up a race. The hang you are seeing comes when the In either a threadpool-rejection or short-circuited case, the threadpool never gets invoked, and the fallback runs on the calling thread. Given that this may happen, this is why it is wise to minimize the amount of work done in a fallback. In the specific case where the number of threads in the pool and the fallback semaphore are the same number, this race will occur. In the case where fallback semaphore size > number of threads, you will consistently see the main thread hang, for the reasons outlined above. Here's some log output where the first 10 commands enter the threadpool and each of the fallbacks claim one of the fallback semaphores.
Here's some log output where the first 10 commands enter the threadpool, but only 9 have claimed a fallback semaphore so far. The eleventh command enters, gets thread-pool rejected, so the fallback runs on the calling (main) thread. This fallback gets scheduled earlier than the other command which entered the threadpool, so the 30 second latency in
|
…x#685 Also added a missing execution event for FALLBACK_REJECTION
Your test with some extra logging is in #697. Note that I explicitly avoid the race by setting the fallback semaphore lower than the threadpool size. |
So the race happens because the 11th hits THREAD_POOL_REJECTION while one of the 10 in the thread pool has not yet entered getFallback(). That makes sense, since Thanks for walking through this ... it behaves as expected. |
Thanks for your persistence Matt. |
Great, glad we both understand it better now! |
Because HystrixCommand.queue is using Schedulers.immediate when you schedule multiple threads with slow fallback the thread that queues commands gets blocked instead of rejecting them.
HystrixCommandTest.testFallbackSemaphore does not cover real use case, write a better test.
The text was updated successfully, but these errors were encountered: