Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FALLBACK_REJECTION not really working due to Schedulers.immediate #685

Closed
cohadar opened this issue Feb 18, 2015 · 12 comments
Closed

FALLBACK_REJECTION not really working due to Schedulers.immediate #685

cohadar opened this issue Feb 18, 2015 · 12 comments
Milestone

Comments

@cohadar
Copy link

cohadar commented Feb 18, 2015

Because HystrixCommand.queue is using Schedulers.immediate when you schedule multiple threads with slow fallback the thread that queues commands gets blocked instead of rejecting them.

HystrixCommandTest.testFallbackSemaphore does not cover real use case, write a better test.

@mattrjacobs
Copy link
Contributor

Thanks for the report, @cohadar, I'll take a look

@mattrjacobs
Copy link
Contributor

@cohadar Can you review the test I added in #687?

In this test, I set up 3 commands which share a circuit breaker and fallback semaphore of count 2.
They are all configured to fail fast, then take 1000 millis in the fallback implementation before succeeding.

I queue up the first 2, wait 100 millis, then queue up the third. The third should fast fail with a fallback rejection, and the Futures for the first 2 commands should still be incomplete. Once I block on the first 2, they should complete successfully.

All of those assertions check out. Is there anything I did not capture about the scenario you're describing above?

@cohadar
Copy link
Author

cohadar commented Feb 19, 2015

Problem with tests with low command count is that they don't discover concurrency bugs.
My test queues 1000 commands and randomly blocks about 50% of time.
Here is some code:

package com.netflix.hystrix;

import java.util.concurrent.ExecutionException;

import org.junit.Test;

/**

  • Recreates ALL success and failure conditions that can happen to a HystrixCommand.

  • */
    public class EventTest2 {

    /***/
    static class EventCommand extends HystrixCommand {
    public EventCommand() {
    super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("eventGroup")));
    }

    @OverRide
    protected String run() throws Exception {
    throw new RuntimeException("run_exception");
    }

    @OverRide
    public String getFallback() {
    try {
    Thread.sleep(30000L);
    } catch (InterruptedException e) {
    System.out.println("AAAAAAAAAAAAA");
    }
    return "fallback";
    }
    }

    // This does not work as specified, it blocks instead of rejecting command.
    @test
    public void testFallbackRejection() throws InterruptedException, ExecutionException {
    for (int i = 0; i < 1000; i++) {
    try {
    if (i == 500) {
    Thread.sleep(1000L);
    }
    new EventCommand().queue();
    System.out.println("ququeed: " + i);
    } catch (Exception e) {
    System.out.println("rejected: " + i + " " + e.getClass());
    }
    }
    }
    }

@cohadar
Copy link
Author

cohadar commented Feb 19, 2015

Btw I am inclined to blame the custom implementation of Semaphore in Hystrix.
Why not just use java.util.concurrent.Semaphore ? it also has tryAcquire.

@cohadar
Copy link
Author

cohadar commented Feb 23, 2015

Ok I think I figured out where the problem is.
Your test is "cheating" by executing every command in a separate thread.

If you remove the runnable wrapper around commands the test runs forever.

When commands are executed serially by one background thread that thread gets hogged because hystrix throws rejection exception only after previous command finishes fallback execution.

Ideally a single thread that pumps hystrix commands should not get hogged but get fast rejections.

@mattrjacobs
Copy link
Contributor

There are 2 types of rejections, which I think you may be conflating.

What documentation and code refers to as a threadpool rejection or semaphore rejection is bulkheading around concurrent executions of the run() or construct() method. If the command execution cannot claim a thread or semaphore, then the commands fails-fast and goes into the fallback path.

The rejection tested in your code and the test I added is a fallback rejection. This only occurs after a failure condition on the run()/construct() method and is a concurrency restriction on the fallback path.

This is not a fail-fast path, because it only happens after the run()/construct() method, so is always preceded by that execution/failure. In practice, we do not see fallback rejections at any material volume in production at Netflix, because the mechanism only exists as a safeguard against commands which do not obey best practices, which are:

  1. Fallbacks should be fast
    -OR-
  2. If a fallback is slow/does IO, it should get wrapped in its own command. That command will have it's own fail-fast mechanism.

@mattrjacobs
Copy link
Contributor

And just to clarify, your larger point is correct. Calls to queue() should not hang, ever. I'm looking into it.

@mattrjacobs
Copy link
Contributor

OK, I think I understand this better now. The command set up in this unit test uses the defaults of a thread pool size of 10 and a fallback semaphore count of 10.

The main thread is queuing commands as fast as it can. The first 10 commands to hit the run() method claim one of the eventGroup threads. Each of these fail with an immediate runtime exception and transition into the fallback code. As the getFallback() method gets entered, each command claims a semaphore, if it is available.

This sets up a race. The hang you are seeing comes when the main thread queues a command and sees a saturated eventGroup threadpool. This command is rejected immediately, and the fallback runs on the calling thread (in this case main). If not all of the previous 10 commands in the threadpool have claimed a fallback semaphore, the main thread will, and this causes the sleep to occur on the main thread.

In either a threadpool-rejection or short-circuited case, the threadpool never gets invoked, and the fallback runs on the calling thread. Given that this may happen, this is why it is wise to minimize the amount of work done in a fallback.

In the specific case where the number of threads in the pool and the fallback semaphore are the same number, this race will occur. In the case where fallback semaphore size > number of threads, you will consistently see the main thread hang, for the reasons outlined above.

Here's some log output where the first 10 commands enter the threadpool and each of the fallbacks claim one of the fallback semaphores.

queued: 0
queued: 1
queued: 2
queued: 3
queued: 4
hystrix-eventGroup-2 : In run()
hystrix-eventGroup-4 : In run()
hystrix-eventGroup-5 : In run()
hystrix-eventGroup-3 : In run()
hystrix-eventGroup-1 : In run()
queued: 5
hystrix-eventGroup-6 : In run()
queued: 6
hystrix-eventGroup-7 : In run()
queued: 7
hystrix-eventGroup-8 : In run()
queued: 8
hystrix-eventGroup-9 : In run()
queued: 9
hystrix-eventGroup-10 : In run()
hystrix-eventGroup-5 : In fallback => [FAILURE]
hystrix-eventGroup-2 : In fallback => [FAILURE]
hystrix-eventGroup-8 : In fallback => [FAILURE]
hystrix-eventGroup-9 : In fallback => [FAILURE]
hystrix-eventGroup-6 : In fallback => [FAILURE]
hystrix-eventGroup-7 : In fallback => [FAILURE]
hystrix-eventGroup-1 : In fallback => [FAILURE]
hystrix-eventGroup-4 : In fallback => [FAILURE]
hystrix-eventGroup-3 : In fallback => [FAILURE]
hystrix-eventGroup-10 : In fallback => [FAILURE]
Fail Fast on queue() : [THREAD_POOL_REJECTED, FALLBACK_REJECTION]
Fail Fast on queue() : [THREAD_POOL_REJECTED, FALLBACK_REJECTION]
...

Here's some log output where the first 10 commands enter the threadpool, but only 9 have claimed a fallback semaphore so far. The eleventh command enters, gets thread-pool rejected, so the fallback runs on the calling (main) thread. This fallback gets scheduled earlier than the other command which entered the threadpool, so the 30 second latency in getFallback() is seen on the main thread.

queued: 0
queued: 1
queued: 2
queued: 3
queued: 4
queued: 5
queued: 6
queued: 7
hystrix-eventGroup-7 : In run()
hystrix-eventGroup-4 : In run()
hystrix-eventGroup-3 : In run()
hystrix-eventGroup-6 : In run()
hystrix-eventGroup-1 : In run()
hystrix-eventGroup-9 : In run()
queued: 8
hystrix-eventGroup-8 : In run()
hystrix-eventGroup-2 : In run()
hystrix-eventGroup-5 : In run()
hystrix-eventGroup-10 : In run()
queued: 9
hystrix-eventGroup-7 : In fallback => [FAILURE]
hystrix-eventGroup-3 : In fallback => [FAILURE]
hystrix-eventGroup-1 : In fallback => [FAILURE]
hystrix-eventGroup-9 : In fallback => [FAILURE]
hystrix-eventGroup-4 : In fallback => [FAILURE]
hystrix-eventGroup-5 : In fallback => [FAILURE]
hystrix-eventGroup-2 : In fallback => [FAILURE]
main : In fallback => [THREAD_POOL_REJECTED]
hystrix-eventGroup-10 : In fallback => [FAILURE]
hystrix-eventGroup-8 : In fallback => [FAILURE]
<HANG>
...

mattrjacobs pushed a commit to mattrjacobs/Hystrix that referenced this issue Feb 24, 2015
…x#685

      Also added a missing execution event for FALLBACK_REJECTION
@mattrjacobs
Copy link
Contributor

Your test with some extra logging is in #697. Note that I explicitly avoid the race by setting the fallback semaphore lower than the threadpool size.

@benjchristensen
Copy link
Contributor

So the race happens because the 11th hits THREAD_POOL_REJECTION while one of the 10 in the thread pool has not yet entered getFallback().

That makes sense, since queue() is immediately returning while work is spun up on the threads asynchronously.

Thanks for walking through this ... it behaves as expected.

@cohadar
Copy link
Author

cohadar commented Feb 24, 2015

Thanks for your persistence Matt.
I consider this resolved.

@mattrjacobs
Copy link
Contributor

Great, glad we both understand it better now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants