
Circuit Breaker will never be closed again if HystrixBadRequestException thrown! #674

Closed
reevik opened this issue Feb 12, 2015 · 19 comments

@reevik

reevik commented Feb 12, 2015

Hi All,
I am using hystrix-core version 1.2.16 for a high-traffic portal, and I think we have not really understood how the Circuit Breaker's health checks work.

To verify the Circuit Breaker (CB) behaviour, I wrote a test that prints the CB's internal state. The first 20 requests to the target service return HTTP 500 and are mapped to HystrixRuntimeException, which marks the target service as unhealthy, in the expectation that the CB will trip soon. The subsequent requests (500 requests fired every ~100 ms) return HTTP 404 Not Found, which is perfectly valid for us and simply means that the user resource does not exist on the server; we want to keep the circuit closed in that case, so we map this status code to HystrixBadRequestException.

The problem, according to the log output, is that HystrixBadRequestException does not let the circuit close again once the CB has already opened! If a malicious user noticed that the backend service was currently unavailable, he could keep sending requests for non-existent files, which would keep the CB open and cause a service outage.
Is that really how it works, or is there a configuration I have missed?

Moreover, if the target service returns e.g. 201, which does not trigger any exception, the CB closes again after ~5 seconds, as expected.
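For reference, a simplified sketch of the mapping described above (the class, method, and messages are illustrative placeholders, not our real code):

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.exception.HystrixBadRequestException;

    // Illustrative only: maps HTTP 404 to HystrixBadRequestException so it is
    // not counted against the circuit's health, while 5xx surfaces as a
    // command failure (wrapped in HystrixRuntimeException by Hystrix).
    public class GetUserResourceCommand extends HystrixCommand<String> {

        private final String resourceId;

        public GetUserResourceCommand(String resourceId) {
            super(HystrixCommandGroupKey.Factory.asKey("TargetService"));
            this.resourceId = resourceId;
        }

        @Override
        protected String run() {
            int status = callTargetService(resourceId); // placeholder for the real HTTP call
            if (status == 404) {
                throw new HystrixBadRequestException("File could not be found.");
            }
            if (status >= 500) {
                throw new RuntimeException("Target service returned " + status);
            }
            return "OK";
        }

        private int callTargetService(String id) {
            return 200; // stubbed out in this sketch
        }
    }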

... first 19 attempts return 500
Open Circuit: false, Total : 15, Error : 15, Perc. : 100
20th attempt returns 500
Fallback! Open false, Total : 15, Error : 15, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 66, Error : 66, Perc. : 100
Fallback! Open true, Total : 66, Error : 66, Perc. : 100
Fallback! Open true, Total : 66, Error : 66, Perc. : 100
Fallback! Open true, Total : 66, Error : 66, Perc. : 100
Open true, Total : 66, Error : 66, Perc. : 100
21st attempt returns 404
com.netflix.hystrix.exception.HystrixBadRequestException: File could not be found.
at com.foo.sc.sync.core.hystrix.HystrixCallableExecutor$HystrixCommandDelegate.run(HystrixCallableExecutor.java:176)
at com.netflix.hystrix.HystrixCommand.executeCommand(HystrixCommand.java:786)
at com.netflix.hystrix.HystrixCommand.access$1400(HystrixCommand.java:81)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:721)
at com.netflix.hystrix.strategy.concurrency.HystrixContextCallable.call(HystrixContextCallable.java:45)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: com.foo.sc.sync.core.exceptions.ResourceNotFoundException: Parent collection could not be found.
at com.foo.sc.sync.connector.internal.sc.commands.ResponseHandler.handleResponse(ResponseHandler.java:197)
at com.foo.sc.sync.connector.internal.sc.commands.CreateAssetCommand.performCall(CreateAssetCommand.java:156)
at com.foo.sc.sync.connector.internal.sc.commands.CreateAssetCommand.performCall(CreateAssetCommand.java:1)
at com.foo.sc.sync.core.hystrix.jersey.JerseyClientHystrixCallable.call(JerseyClientHystrixCallable.java:81)
at com.foo.sc.sync.core.hystrix.HystrixCallableExecutor$HystrixCommandDelegate.run(HystrixCallableExecutor.java:166)
... 8 more
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
Fallback! Open true, Total : 70, Error : 70, Perc. : 100

1000 requests produce log output in the pattern above, and the CB is never closed again.

@reevik reevik changed the title Short circuit/Fallback increases the error count Circuit Breaker will never be closed again if HystrixBadRequestException thrown! Feb 12, 2015
@mattrjacobs
Contributor

In master, I've confirmed that the computed error percentage does not count a BadRequest as a success or failure, as expected. See #676

@mattrjacobs
Contributor

I've also confirmed this is working properly in 1.3.x.

Can you try upgrading to 1.3.20 or 1.4.0-RC6 and see if that resolves your problem? I believe it will.

@reevik
Author

reevik commented Feb 13, 2015

I think the issue is not related to the exception handling.
I was digging into the source code and found this interesting method, getHealthCounts():
(https://github.com/Netflix/Hystrix/blob/master/hystrix-core/src/main/java/com/netflix/hystrix/HystrixCommandMetrics.java#L406)

long success = counter.getRollingSum(HystrixRollingNumberEvent.SUCCESS);
long failure = counter.getRollingSum(HystrixRollingNumberEvent.FAILURE); // fallbacks occur on this
long timeout = counter.getRollingSum(HystrixRollingNumberEvent.TIMEOUT); // fallbacks occur on this
long threadPoolRejected = counter.getRollingSum(HystrixRollingNumberEvent.THREAD_POOL_REJECTED); // fallbacks occur on this
long semaphoreRejected = counter.getRollingSum(HystrixRollingNumberEvent.SEMAPHORE_REJECTED); // fallbacks occur on this
long shortCircuited = counter.getRollingSum(HystrixRollingNumberEvent.SHORT_CIRCUITED); // fallbacks occur on this
long totalCount = failure + success + timeout + threadPoolRejected + shortCircuited + semaphoreRejected;
long errorCount = failure + timeout + threadPoolRejected + shortCircuited + semaphoreRejected;
int errorPercentage = 0;

if (totalCount > 0) {
    errorPercentage = (int) ((double) errorCount / totalCount * 100);
}

Say the CB is open and we keep sending requests to the backend service. For every request, totalCount is incremented by 1 (via the shortCircuited counter), and so is errorCount, because shortCircuited is also included in the errorCount calculation. As a result, errorCount always has the same value as totalCount, which yields a 100% error percentage and leads to a permanently open circuit breaker. The only counter that could break this equality is success, but it stays at 0 as long as the requests are short-circuited, at least in our scenario. (I have just upgraded to 1.3.20, but the problem remains.)
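Concretely, with the counts at the end of the log above: errorCount = 66 and totalCount = 66, so errorPercentage = (int) ((double) 66 / 66 * 100) = 100. Every additional short-circuited request adds one to both counters, so the percentage stays pinned at 100 and can never drop back below the configured threshold.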

@mattrjacobs
Contributor

The precise way that circuit opening and closing occurs is as follows (see the configuration sketch after the list):

  1. Assume the volume across a circuit meets a certain threshold: HystrixCommandProperties.circuitBreakerRequestVolumeThreshold().
  2. Assume that the error percentage, as defined above, exceeds the threshold defined in HystrixCommandProperties.circuitBreakerErrorThresholdPercentage().
  3. Then the circuit-breaker transitions from CLOSED to OPEN.
  4. While it is open, it short-circuits all requests made against that circuit-breaker.
  5. After some amount of time (HystrixCommandProperties.circuitBreakerSleepWindowInMilliseconds()), the next request is let through. If it fails, the command stays OPEN for another sleep window. If it succeeds, it transitions to CLOSED and the logic in 1) takes over again.
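For concreteness, here is a minimal sketch of where those three properties are set on a command (the group name and the values are illustrative only, not recommendations):

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;

    public class BackendCommand extends HystrixCommand<String> {

        public BackendCommand() {
            super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("Backend"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    // step 1: minimum number of requests in the rolling window
                    // before the error percentage is evaluated at all
                    .withCircuitBreakerRequestVolumeThreshold(20)
                    // step 2: error percentage at which the circuit trips OPEN
                    .withCircuitBreakerErrorThresholdPercentage(50)
                    // step 5: how long the circuit stays OPEN before a single
                    // trial request is let through
                    .withCircuitBreakerSleepWindowInMilliseconds(5000)));
        }

        @Override
        protected String run() throws Exception {
            return "OK"; // the real remote call would go here
        }
    }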

If you include your unit test, I can look at your specific sequence of requests and understand better what you'd like to see changed.

@reevik
Author

reevik commented Feb 13, 2015

First, thank you for your support and the quick response. I'd say your answers clarify the issue. You said: "I've confirmed that the computed error percentage does not count a BadRequest as a success or failure, as expected". I take this to mean that a BadRequest is counted neither as a success nor as an error, and according to step 5 in your last message the circuit breaker will only be (re)closed if one of the subsequent requests succeeds. But the subsequent requests in our case end in a BadRequest (mapped from HTTP 404), which presumably keeps the CB open. Could you confirm this? If that is the case, then Hystrix's circuit breaker error handling and ours do not fit together.

@mattrjacobs
Contributor

Yeah, that sounds like the case then. When the circuit-breaker lets the single request through (this is the HALF-OPEN state, which I forgot to name above), that request must be successful for the circuit to transition to CLOSED.

@reevik reevik closed this as completed Feb 17, 2015
@mattnelson

Sorry to be digging up this old issue, but I've also just run into it and was surprised by the current implementation. Why does a bad request not count toward the error percentage while the circuit is closed, but then count as an error once the circuit is open?

@mattrjacobs
Contributor

@mattnelson No problem, thanks for bringing it to my attention.

A BAD_REQUEST counts as neither a success nor error for the purposes of transitioning a circuit from CLOSED->OPEN. So it's basically discarded from the event counts when checking to see if a circuit should get opened.

The logic to close a circuit only fires when a SUCCESS occurs (no other event types may trigger this, including BAD_REQUEST). I would argue that this is consistent with the above description, where BAD_REQUESTs have no effect on circuit health. I'm willing to be convinced otherwise.

Here's a concrete example:
Say your circuit is under stress from a DoS attack where an attacker is giving you bad input data. Say that 90% of command invocations result in BAD_REQUEST and 10% in FAILURE (maybe the underlying system is under stress). Assuming the traffic that results in FAILURE is enough traffic to trip a circuit, then the 100% failure rate of health-check-eligible traffic opens the circuit.

I think that closing the circuit on a BAD_REQUEST is not what we want; we want to know that the underlying system is healthy before we close it.

WDYT?

@mattnelson

Your explanation makes sense.

@sirait77

sirait77 commented Jul 29, 2016

Hi there,

Sorry if I have to open this discussion again. I am new to Hystrix, so please bear with me. My code currently uses version 1.2.8; is there any test suite that demonstrates this issue occurring, and that it is fixed in a later version? My pom file currently contains the following:

    <dependency>
        <groupId>com.netflix.hystrix</groupId>
        <artifactId>hystrix-core</artifactId>
        <version>1.2.8</version>
    </dependency>

    <dependency>
        <groupId>com.netflix.hystrix</groupId>
        <artifactId>hystrix-yammer-metrics-publisher</artifactId>
        <version>1.2.8</version>
    </dependency>

I have been experiencing an issue with a Future call when calling Redis, which is why I brought this issue back to the table. The problem is that calls to Redis sometimes succeed and sometimes do not. I'd like to know whether a high-volume traffic test could prove that this is the root cause.

Your response would be greatly appreciated.

Thank you

@mattrjacobs
Contributor

Yes, see the referenced PR above: #676.

Do you have any reason to suspect that HystrixBadRequestExceptions are involved? That is an exception generated by user-code, not Hystrix internals, so it would be something that you're creating in response to bad input.

If you need more assistance, please include more details - it's hard to tell what your issue is.

@sirait77

sirait77 commented Aug 2, 2016

I have code that needs to be fixed as follows:

import com.google.common.base.Preconditions;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.exceptions.JedisConnectionException;

public RedisDeleteObjectCommand(HystrixCommandProperties.Setter setter, JedisPool jedisPool, byte[] namespace, byte[] key) {
    super(HystrixCommand.Setter
        .withGroupKey(HystrixCommandGroupKey.Factory.asKey(commandGroup))
        .andCommandPropertiesDefaults(setter));

    this.jedisPool = Preconditions.checkNotNull(jedisPool);
    this.key = Preconditions.checkNotNull(key);
    this.namespace = Preconditions.checkNotNull(namespace);
}

@Override
protected Long run() {
    Jedis jedis = jedisPool.getResource();
    try {
        return jedis.hdel(namespace, key);
    } catch (JedisConnectionException ex) {
        // broken connection: do not return it to the pool for reuse
        jedisPool.returnBrokenResource(jedis);
        jedis = null;
        throw ex;
    } finally {
        if (jedis != null) {
            jedisPool.returnResource(jedis);
        }
    }
}

The code above is called when the logic needs to push a delete instruction to Redis:

public boolean hdel(byte[] ns, byte[] key) {
    RedisDeleteObjectCommand delete = new RedisDeleteObjectCommand(circuitBreakerProperties, jedisPool, ns, key);
    return delete.execute() > 0;
}

private boolean deleteFromRedis(final String id) {
    try {
        List<Callable<Boolean>> deletions =
                Lists.transform(redisCollection, new Function<RedisFacade, Callable<Boolean>>() {
                    public Callable<Boolean> apply(final RedisFacade redis) {
                        return new Callable<Boolean>() {
                            public Boolean call() {
                                return redis.hdel(redisNameSpace.getBytes(), id.getBytes());
                            }
                        };
                    }
                });

        Optional<Boolean> result = Callables2.findFirstAndExecuteRemaining(executorService, deletions);
        if (result.isPresent()) {
            return true;
        }
    } catch (InterruptedException e) {
        log.error("Error deleting object from Redis", e);
    }

    return false;
}

The issue is that in a low-traffic environment (Dev or QA) the delete is executed properly, but in production this code only runs intermittently.
Because of this inconsistency and my inability to reproduce the issue, I am still researching whether it is caused by my code or by the environment.
I am open to any suggestions.

@mattrjacobs
Contributor

Hystrix provides a fairly rich set of metrics to help understand what your wrapped code is doing.

One way to visualize these metrics is to set up the Hystrix Dashboard (https://github.com/Netflix/Hystrix/wiki/Dashboard). This is a standalone war which receives data from your application (once you've installed a metrics-stream).

Once you have that working, you will be able to see what your command is doing in real time. The most common outcomes are SUCCESS/FAILURE/TIMEOUT/THREAD_POOL_REJECTED/SHORT_CIRCUITED. Each of those has a different fix, so understanding which one you are seeing is critical.

Internally at Netflix, we also persist our Hystrix metrics to Atlas for historical trends.
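If setting up the dashboard is not practical right away, a rough programmatic alternative (a sketch, assuming a reasonably recent hystrix-core and the default command key, which is the command class's simple name) is to read the counters directly from HystrixCommandMetrics and log them periodically:

    import com.netflix.hystrix.HystrixCommandKey;
    import com.netflix.hystrix.HystrixCommandMetrics;
    import com.netflix.hystrix.util.HystrixRollingNumberEvent;

    public final class CommandMetricsLogger {

        // Logs the cumulative event counts for one command; returns silently if
        // the command has not been executed yet (metrics are created lazily).
        public static void log(String commandName) {
            HystrixCommandMetrics metrics =
                    HystrixCommandMetrics.getInstance(HystrixCommandKey.Factory.asKey(commandName));
            if (metrics == null) {
                return;
            }
            System.out.println(commandName
                    + " success=" + metrics.getCumulativeCount(HystrixRollingNumberEvent.SUCCESS)
                    + " failure=" + metrics.getCumulativeCount(HystrixRollingNumberEvent.FAILURE)
                    + " timeout=" + metrics.getCumulativeCount(HystrixRollingNumberEvent.TIMEOUT)
                    + " shortCircuited=" + metrics.getCumulativeCount(HystrixRollingNumberEvent.SHORT_CIRCUITED)
                    + " threadPoolRejected=" + metrics.getCumulativeCount(HystrixRollingNumberEvent.THREAD_POOL_REJECTED));
        }
    }

For example, calling CommandMetricsLogger.log("RedisDeleteObjectCommand") after a batch of deletes would show whether the failed executions are failures, timeouts, rejections, or short-circuits.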

@milosevicd

milosevicd commented Dec 1, 2017

Sorry for reopening this again, but I feel there's some room for improvement. I think in this last step

  1. After some amount of time (HystrixCommandProperties.circuitBreakerSleepWindowInMilliseconds()), the next request is let through. If it fails, the command stays OPEN for the sleep window. If it succeeds, it transitions to CLOSED and the logic in 1) takes over again.

the problem is that a BadRequest will trigger the sleep window again.

We have a use case where we wrap DB access in Hystrix and throw BadRequest for conditional failures on INSERT (e.g. duplicate key). The code reacts to that BadRequest and does an UPDATE for the same key. However, if the circuit trips OPEN because of several real DB errors, and we continue to get plenty of those duplicate keys, the chances of the circuit closing again are very slim, even though the system is healthy: each duplicate key triggers a BadRequest, which in turn re-triggers the sleep window, so it is purely down to luck whether the next request after the sleep window is a duplicate key or not.
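Roughly, the pattern looks like this (a simplified sketch; the table, names, and SQL are illustrative, not our actual code):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLIntegrityConstraintViolationException;

    import javax.sql.DataSource;

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.exception.HystrixBadRequestException;

    // Illustrative only: a duplicate key becomes a HystrixBadRequestException
    // (not counted against circuit health); any other SQLException propagates
    // from run() and counts as a FAILURE.
    public class InsertRecordCommand extends HystrixCommand<Void> {

        private final DataSource dataSource;
        private final String key;
        private final String value;

        public InsertRecordCommand(DataSource dataSource, String key, String value) {
            super(HystrixCommandGroupKey.Factory.asKey("Database"));
            this.dataSource = dataSource;
            this.key = key;
            this.value = value;
        }

        @Override
        protected Void run() throws Exception {
            try (Connection connection = dataSource.getConnection();
                 PreparedStatement insert =
                         connection.prepareStatement("INSERT INTO records (k, v) VALUES (?, ?)")) {
                insert.setString(1, key);
                insert.setString(2, value);
                insert.executeUpdate();
                return null;
            } catch (SQLIntegrityConstraintViolationException duplicateKey) {
                // the caller reacts to this by issuing an UPDATE for the same key
                throw new HystrixBadRequestException("duplicate key: " + key, duplicateKey);
            }
        }
    }

The caller catches the HystrixBadRequestException and retries with an UPDATE command for the same key.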

I'd argue that a BadRequest should not close the circuit (as is the case now), but it shouldn't re-trigger the sleep window either; it should simply be discarded and the next request should be let through, and based on that next request we decide whether to close the circuit or not (if it's a BadRequest again, we let the next one through, and so on).

@mukeshgarg

mukeshgarg commented Feb 2, 2018

I also ran into the same issue and fully agree with @milosevicd: if a BAD_REQUEST is not allowed to close the circuit, then at the very least it should not re-trigger the sleep window, so that there is a better chance of the circuit closing when a later request is successful.

Is there a workaround for this? I would like to close the circuit even if the request results in a BAD_REQUEST.

@karthikrajakandasamy

I have to use the Steeltoe circuit breaker in my .NET application (an API). I could not find any sample showing how to call the Steeltoe circuit breaker from .NET. Can anyone share a sample app that uses the Steeltoe circuit breaker from a .NET app?

@horalstvo

I also agree with @milosevicd. Our use case illustrates the problem even better. We have a job that processes records in a fixed order. Some records may return BadRequest, which should be transient (but in some cases isn't), while the rest usually succeed. We then run into the issue that the circuit trips open and the next request to be tried always returns BadRequest, so the circuit never closes unless we fix that one problematic request's validation issue. It is a 4xx response, and yet it keeps the circuit open forever. Also, if there are a couple of such broken entries, we only learn about them one by one.

I would expect a bad-request response not to re-trigger the sleep window but to let the next request pass. If a sequence of bad requests is followed by a FAILURE, the circuit stays open; ending the sequence with a SUCCESS closes it.

@doraeimo

doraeimo commented May 18, 2020

I ran into a similar problem, related to the ignoreExceptions defined in @HystrixCommand.
Initially, ignored exceptions are not counted; but once the circuit is open, it can never close again if only ignored exceptions are thrown.

I can see @mattrjacobs's point that this could be considered a feature; however, could we have a config option to make the behaviour more intuitive?

Using Hystrix 1.5.12, Spring Cloud Netflix Dalston.SR1, and the Javanica Hystrix annotations.
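For context, the kind of setup I mean looks roughly like this (a sketch; the service, URL, and exception type are illustrative, not my actual code):

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import org.springframework.stereotype.Service;
    import org.springframework.web.client.HttpClientErrorException;
    import org.springframework.web.client.RestTemplate;

    @Service
    public class UserClient {

        private final RestTemplate restTemplate = new RestTemplate();

        // Javanica treats exceptions listed in ignoreExceptions like
        // HystrixBadRequestException: they don't count as failures while the
        // circuit is CLOSED, but they also never close an OPEN circuit.
        @HystrixCommand(ignoreExceptions = HttpClientErrorException.class)
        public String getUser(String id) {
            return restTemplate.getForObject("http://user-service/users/{id}", String.class, id);
        }
    }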

@adaias

adaias commented Aug 25, 2022

Are there any solutions for this? IMHO there should only be OPEN and CLOSED states: a success or an ignored exception leads to CLOSED, all other exceptions lead to OPEN.
