FALLBACK_REJECTION not really working due to Schedulers.immediate #685

cohadar · 2015-02-18T19:15:22Z

Because HystrixCommand.queue is using Schedulers.immediate when you schedule multiple threads with slow fallback the thread that queues commands gets blocked instead of rejecting them.

HystrixCommandTest.testFallbackSemaphore does not cover real use case, write a better test.

mattrjacobs · 2015-02-18T19:23:12Z

Thanks for the report, @cohadar, I'll take a look

mattrjacobs · 2015-02-19T08:05:48Z

@cohadar Can you review the test I added in #687?

In this test, I set up 3 commands which share a circuit breaker and fallback semaphore of count 2.
They are all configured to fail fast, then take 1000 millis in the fallback implementation before succeeding.

I queue up the first 2, wait 100 millis, then queue up the third. The third should fast fail with a fallback rejection, and the Futures for the first 2 commands should still be incomplete. Once I block on the first 2, they should complete successfully.

All of those assertions check out. Is there anything I did not capture about the scenario you're describing above?

cohadar · 2015-02-19T12:49:09Z

Problem with tests with low command count is that they don't discover concurrency bugs.
My test queues 1000 commands and randomly blocks about 50% of time.
Here is some code:

package com.netflix.hystrix;

import java.util.concurrent.ExecutionException;

import org.junit.Test;

/**

Recreates ALL success and failure conditions that can happen to a HystrixCommand.
*/
public class EventTest2 {

/***/
static class EventCommand extends HystrixCommand {
public EventCommand() {
super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("eventGroup")));
}

@OverRide
protected String run() throws Exception {
throw new RuntimeException("run_exception");
}

@OverRide
public String getFallback() {
try {
Thread.sleep(30000L);
} catch (InterruptedException e) {
System.out.println("AAAAAAAAAAAAA");
}
return "fallback";
}
}

// This does not work as specified, it blocks instead of rejecting command.
@test
public void testFallbackRejection() throws InterruptedException, ExecutionException {
for (int i = 0; i < 1000; i++) {
try {
if (i == 500) {
Thread.sleep(1000L);
}
new EventCommand().queue();
System.out.println("ququeed: " + i);
} catch (Exception e) {
System.out.println("rejected: " + i + " " + e.getClass());
}
}
}
}

cohadar · 2015-02-19T12:51:09Z

Btw I am inclined to blame the custom implementation of Semaphore in Hystrix.
Why not just use java.util.concurrent.Semaphore ? it also has tryAcquire.

cohadar · 2015-02-23T10:15:37Z

Ok I think I figured out where the problem is.
Your test is "cheating" by executing every command in a separate thread.

If you remove the runnable wrapper around commands the test runs forever.

When commands are executed serially by one background thread that thread gets hogged because hystrix throws rejection exception only after previous command finishes fallback execution.

Ideally a single thread that pumps hystrix commands should not get hogged but get fast rejections.

mattrjacobs · 2015-02-23T21:19:50Z

There are 2 types of rejections, which I think you may be conflating.

What documentation and code refers to as a threadpool rejection or semaphore rejection is bulkheading around concurrent executions of the run() or construct() method. If the command execution cannot claim a thread or semaphore, then the commands fails-fast and goes into the fallback path.

The rejection tested in your code and the test I added is a fallback rejection. This only occurs after a failure condition on the run()/construct() method and is a concurrency restriction on the fallback path.

This is not a fail-fast path, because it only happens after the run()/construct() method, so is always preceded by that execution/failure. In practice, we do not see fallback rejections at any material volume in production at Netflix, because the mechanism only exists as a safeguard against commands which do not obey best practices, which are:

Fallbacks should be fast
-OR-
If a fallback is slow/does IO, it should get wrapped in its own command. That command will have it's own fail-fast mechanism.

mattrjacobs · 2015-02-23T22:57:21Z

And just to clarify, your larger point is correct. Calls to queue() should not hang, ever. I'm looking into it.

mattrjacobs · 2015-02-24T00:52:13Z

OK, I think I understand this better now. The command set up in this unit test uses the defaults of a thread pool size of 10 and a fallback semaphore count of 10.

The main thread is queuing commands as fast as it can. The first 10 commands to hit the run() method claim one of the eventGroup threads. Each of these fail with an immediate runtime exception and transition into the fallback code. As the getFallback() method gets entered, each command claims a semaphore, if it is available.

This sets up a race. The hang you are seeing comes when the main thread queues a command and sees a saturated eventGroup threadpool. This command is rejected immediately, and the fallback runs on the calling thread (in this case main). If not all of the previous 10 commands in the threadpool have claimed a fallback semaphore, the main thread will, and this causes the sleep to occur on the main thread.

In either a threadpool-rejection or short-circuited case, the threadpool never gets invoked, and the fallback runs on the calling thread. Given that this may happen, this is why it is wise to minimize the amount of work done in a fallback.

In the specific case where the number of threads in the pool and the fallback semaphore are the same number, this race will occur. In the case where fallback semaphore size > number of threads, you will consistently see the main thread hang, for the reasons outlined above.

Here's some log output where the first 10 commands enter the threadpool and each of the fallbacks claim one of the fallback semaphores.

queued: 0
queued: 1
queued: 2
queued: 3
queued: 4
hystrix-eventGroup-2 : In run()
hystrix-eventGroup-4 : In run()
hystrix-eventGroup-5 : In run()
hystrix-eventGroup-3 : In run()
hystrix-eventGroup-1 : In run()
queued: 5
hystrix-eventGroup-6 : In run()
queued: 6
hystrix-eventGroup-7 : In run()
queued: 7
hystrix-eventGroup-8 : In run()
queued: 8
hystrix-eventGroup-9 : In run()
queued: 9
hystrix-eventGroup-10 : In run()
hystrix-eventGroup-5 : In fallback => [FAILURE]
hystrix-eventGroup-2 : In fallback => [FAILURE]
hystrix-eventGroup-8 : In fallback => [FAILURE]
hystrix-eventGroup-9 : In fallback => [FAILURE]
hystrix-eventGroup-6 : In fallback => [FAILURE]
hystrix-eventGroup-7 : In fallback => [FAILURE]
hystrix-eventGroup-1 : In fallback => [FAILURE]
hystrix-eventGroup-4 : In fallback => [FAILURE]
hystrix-eventGroup-3 : In fallback => [FAILURE]
hystrix-eventGroup-10 : In fallback => [FAILURE]
Fail Fast on queue() : [THREAD_POOL_REJECTED, FALLBACK_REJECTION]
Fail Fast on queue() : [THREAD_POOL_REJECTED, FALLBACK_REJECTION]
...

Here's some log output where the first 10 commands enter the threadpool, but only 9 have claimed a fallback semaphore so far. The eleventh command enters, gets thread-pool rejected, so the fallback runs on the calling (main) thread. This fallback gets scheduled earlier than the other command which entered the threadpool, so the 30 second latency in getFallback() is seen on the main thread.

queued: 0
queued: 1
queued: 2
queued: 3
queued: 4
queued: 5
queued: 6
queued: 7
hystrix-eventGroup-7 : In run()
hystrix-eventGroup-4 : In run()
hystrix-eventGroup-3 : In run()
hystrix-eventGroup-6 : In run()
hystrix-eventGroup-1 : In run()
hystrix-eventGroup-9 : In run()
queued: 8
hystrix-eventGroup-8 : In run()
hystrix-eventGroup-2 : In run()
hystrix-eventGroup-5 : In run()
hystrix-eventGroup-10 : In run()
queued: 9
hystrix-eventGroup-7 : In fallback => [FAILURE]
hystrix-eventGroup-3 : In fallback => [FAILURE]
hystrix-eventGroup-1 : In fallback => [FAILURE]
hystrix-eventGroup-9 : In fallback => [FAILURE]
hystrix-eventGroup-4 : In fallback => [FAILURE]
hystrix-eventGroup-5 : In fallback => [FAILURE]
hystrix-eventGroup-2 : In fallback => [FAILURE]
main : In fallback => [THREAD_POOL_REJECTED]
hystrix-eventGroup-10 : In fallback => [FAILURE]
hystrix-eventGroup-8 : In fallback => [FAILURE]
<HANG>
...

…x#685 Also added a missing execution event for FALLBACK_REJECTION

mattrjacobs · 2015-02-24T00:58:12Z

Your test with some extra logging is in #697. Note that I explicitly avoid the race by setting the fallback semaphore lower than the threadpool size.

benjchristensen · 2015-02-24T01:00:54Z

So the race happens because the 11th hits THREAD_POOL_REJECTION while one of the 10 in the thread pool has not yet entered getFallback().

That makes sense, since queue() is immediately returning while work is spun up on the threads asynchronously.

Thanks for walking through this ... it behaves as expected.

cohadar · 2015-02-24T09:17:01Z

Thanks for your persistence Matt.
I consider this resolved.

mattrjacobs · 2015-02-24T13:32:45Z

Great, glad we both understand it better now!

mattrjacobs added this to the 1.4.0-RC8 milestone Feb 18, 2015

mattrjacobs mentioned this issue Feb 19, 2015

Added unit test for fallback rejection #687

Merged

mattrjacobs mentioned this issue Feb 20, 2015

Added a test with higher semaphore counts to detect raciness #691

Merged

mattrjacobs pushed a commit to mattrjacobs/Hystrix that referenced this issue Feb 24, 2015

Added unit test to help with understanding the test written in Netfli…

d00e9e2

…x#685 Also added a missing execution event for FALLBACK_REJECTION

mattrjacobs mentioned this issue Feb 24, 2015

Missing execution event for FALLBACK_REJECTION #697

Merged

mattrjacobs closed this as completed Feb 24, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FALLBACK_REJECTION not really working due to Schedulers.immediate #685

FALLBACK_REJECTION not really working due to Schedulers.immediate #685

cohadar commented Feb 18, 2015

mattrjacobs commented Feb 18, 2015

mattrjacobs commented Feb 19, 2015

cohadar commented Feb 19, 2015

cohadar commented Feb 19, 2015

cohadar commented Feb 23, 2015

mattrjacobs commented Feb 23, 2015

mattrjacobs commented Feb 23, 2015

mattrjacobs commented Feb 24, 2015

mattrjacobs commented Feb 24, 2015

benjchristensen commented Feb 24, 2015

cohadar commented Feb 24, 2015

mattrjacobs commented Feb 24, 2015

FALLBACK_REJECTION not really working due to Schedulers.immediate #685

FALLBACK_REJECTION not really working due to Schedulers.immediate #685

Comments

cohadar commented Feb 18, 2015

mattrjacobs commented Feb 18, 2015

mattrjacobs commented Feb 19, 2015

cohadar commented Feb 19, 2015

cohadar commented Feb 19, 2015

cohadar commented Feb 23, 2015

mattrjacobs commented Feb 23, 2015

mattrjacobs commented Feb 23, 2015

mattrjacobs commented Feb 24, 2015

mattrjacobs commented Feb 24, 2015

benjchristensen commented Feb 24, 2015

cohadar commented Feb 24, 2015

mattrjacobs commented Feb 24, 2015