Circuit Breaker will never be closed again if HystrixBadRequestException thrown! #674
Comments
In master, I've confirmed that the computed error percentage does not count a BadRequest as a success or failure, as expected. See #676
I've also confirmed this is working properly in 1.3.x. Can you try to upgrade to 1.3.20 or 1.4.0-RC6 and see if this resolves your problem? I believe it will.
I think the issue is not related to the exception handling.
Say the CB is open and we send continuous requests to the backend service. For every request, the total count is incremented by 1 (through the shortCircuited parameter) as well as the errorCount, because the shortCircuited count is also included in the errorCount calculation. As a result, the errorCount always has the same value as the totalCount, which yields a 100% error percentage and leads to a permanently open circuit breaker. The only parameter that could break this equality is success; however, it stays at 0 as long as the requests are short-circuited, at least in our scenario. (I have just upgraded to version 1.3.20, but the problem remains.)
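For reference, here is a simplified sketch of the bookkeeping described above (an illustration of the claim, not the actual Hystrix source): while the circuit is open, every short-circuited request increments both the total and the error count, so the error percentage stays pinned at 100%.

```java
// Simplified illustration of circuit-breaker health bookkeeping (not the real Hystrix code).
// While the circuit is open, each rejected request bumps shortCircuited, which feeds both
// the total and the error count, keeping the error percentage at 100%.
class HealthCountsSketch {
    long success;
    long failure;
    long timeout;
    long shortCircuited;

    long totalCount() {
        return success + failure + timeout + shortCircuited;
    }

    long errorCount() {
        // short-circuited requests are counted as errors as well
        return failure + timeout + shortCircuited;
    }

    int errorPercentage() {
        long total = totalCount();
        return total == 0 ? 0 : (int) (errorCount() * 100 / total);
    }
}
```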
The precise way that the circuit opening and closing occurs is as follows:
If you include your unit test, I can look at your specific sequence of requests and understand better what you'd like to see changed.
First, thank you for your support and quick reaction. I'd say your answers clarify the issue. You said: "I've confirmed that the computed error percentage does not count a BadRequest as a success or failure, as expected". I interpret this as: the BadRequest is considered neither a success nor an error, and, per the 5th step in your last message, the circuit breaker will only be (re)closed if one of the subsequent requests succeeds. But the subsequent requests in our case end up with a BadRequest (mapped from HTTP 404), which probably keeps the CB open. Could you confirm this? If that is the case, then Hystrix's circuit-breaker error handling mechanism and ours do not fit together.
Yeah, that sounds like the case then. When the circuit breaker lets the single request through (this is the HALF-OPEN state; I forgot to name it above), that request must be successful for the circuit to transition to CLOSED.
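For readers following along, here is a minimal sketch of that OPEN → HALF-OPEN → CLOSED behavior (illustrative only; the class and field names are made up and this is not the Hystrix implementation):

```java
// Illustrative sketch of the OPEN -> HALF-OPEN -> CLOSED transition described above.
// Names (SleepWindowCircuitBreaker, sleepWindowMs) are invented for this example.
class SleepWindowCircuitBreaker {
    private volatile boolean open = false;
    private volatile long openedOrLastTestedAt = 0L;
    private final long sleepWindowMs = 5000;

    boolean allowRequest() {
        if (!open) {
            return true;
        }
        // While OPEN, allow exactly one trial request per sleep window (HALF-OPEN).
        long now = System.currentTimeMillis();
        if (now - openedOrLastTestedAt > sleepWindowMs) {
            openedOrLastTestedAt = now; // restart the sleep window for the trial request
            return true;
        }
        return false;
    }

    void markSuccess() {
        // Only a SUCCESS on the trial request closes the circuit again.
        open = false;
    }

    void markFailure() {
        open = true;
        openedOrLastTestedAt = System.currentTimeMillis();
    }
}
```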
Sorry to be digging up this old issue, but I've also just run into it and was surprised by the current implementation. Why would a bad request not count toward the error percentage when the circuit is closed, but then count as an error when the circuit is open?
@mattnelson No problem, thanks for bringing it to my attention. A BAD_REQUEST counts as neither a success nor an error for the purposes of transitioning a circuit from CLOSED to OPEN, so it's basically discarded from the event counts when checking whether a circuit should be opened. The logic to close a circuit only fires when a SUCCESS occurs (no other event type may trigger this, including BAD_REQUEST). I would argue that this is consistent with the above description, where BAD_REQUESTs have no effect on circuit health. I'm willing to be convinced otherwise. Here's a concrete example: I think that closing this circuit on a BAD_REQUEST is not what we want; we want to know that the underlying system is healthy before we close it. WDYT?
Your explanation makes sense. |
Hi there, sorry if I have to open this forum again. I am new to Hystrix, so please bear with me. The code I have is currently using version 1.2.8; is there any test suite to prove or show that this issue occurs, and that it is fixed by a later version? My pom file currently shows the following:
I have been experiencing an issue with a Future call when calling Redis, which is why I brought this issue back to the table. The problem is that sometimes the call to Redis is successful and sometimes it is not. So I'd like to know whether a high-volume traffic test could prove that this is the main issue. Your response would be greatly appreciated. Thank you.
Yes, see the referenced PR above: #676. Do you have any reason to suspect that HystrixBadRequestExceptions are involved? That is an exception generated by user-code, not Hystrix internals, so it would be something that you're creating in response to bad input. If you need more assistance, please include more details - it's hard to tell what your issue is. |
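As an illustration of that point, here is a minimal, hypothetical command that maps invalid input to HystrixBadRequestException from user code inside run() (the command and backend names are made up):

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.exception.HystrixBadRequestException;

// Hypothetical example: user code maps invalid input to HystrixBadRequestException,
// which Hystrix excludes from failure metrics and fallback handling.
public class LookupUserCommand extends HystrixCommand<String> {
    private final String userId;

    public LookupUserCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        if (userId == null || userId.isEmpty()) {
            // Caller error, not a backend failure: does not affect the circuit.
            throw new HystrixBadRequestException("userId must not be empty");
        }
        return callBackend(userId); // assumed remote call
    }

    private String callBackend(String id) {
        return "user-" + id;
    }
}
```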
I have code that needs to be fixed as follows:
The code above is called when the logic needs to push a delete instruction to Redis.
The issue is that when running in a low-traffic environment (Dev or QA), the delete is executed properly. However, in production, this code is only executed intermittently.
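Since the snippet itself isn't shown above, here is a purely hypothetical sketch of a Future-based Redis delete wrapped in a HystrixCommand; under high traffic, timeouts, rejections, or short-circuits only become visible to the caller when Future.get() is called (or in the metrics):

```java
import java.util.concurrent.Future;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical sketch: the actual code from the comment above is not shown.
public class DeleteFromRedisCommand extends HystrixCommand<Long> {
    private final String key;

    public DeleteFromRedisCommand(String key) {
        super(HystrixCommandGroupKey.Factory.asKey("Redis"));
        this.key = key;
    }

    @Override
    protected Long run() {
        // Placeholder for the real Redis client call, e.g. jedis.del(key).
        return deleteFromRedis(key);
    }

    private Long deleteFromRedis(String k) {
        return 1L;
    }

    public static void main(String[] args) throws Exception {
        Future<Long> future = new DeleteFromRedisCommand("some-key").queue();
        // If get() is never called, timeouts, rejections, and short-circuits
        // can go unnoticed by the caller even though the delete did not happen.
        Long deleted = future.get();
        System.out.println("deleted: " + deleted);
    }
}
```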
Hystrix provides a fairly rich set of metrics to help understand what your wrapped code is doing. One way to visualize these metrics is to set up the Hystrix Dashboard (https://github.com/Netflix/Hystrix/wiki/Dashboard). This is a standalone WAR which receives data from your application (once you've installed a metrics stream). Once you have that working, you will be able to see what your command is doing in real time. The most common possibilities are: SUCCESS/FAILURE/TIMEOUT/THREADPOOL_REJECTED/SHORT_CIRCUITED. Each of those has a different fix, so understanding which one you're seeing is critical. Internally at Netflix, we also persist our Hystrix metrics to Atlas for historical trends.
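As an example, the metrics stream mentioned above can be exposed with the hystrix-metrics-event-stream module; here is a sketch of a programmatic Servlet 3.0 registration, assuming that module is on the classpath (/hystrix.stream is the conventional path the Dashboard consumes):

```java
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

import com.netflix.hystrix.contrib.metrics.eventstream.HystrixMetricsStreamServlet;

// Registers the Hystrix metrics stream so the Hystrix Dashboard can consume it.
@WebListener
public class HystrixStreamRegistration implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // Exposes the SSE metrics stream at /hystrix.stream.
        sce.getServletContext()
           .addServlet("HystrixMetricsStreamServlet", new HystrixMetricsStreamServlet())
           .addMapping("/hystrix.stream");
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // nothing to clean up
    }
}
```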
Sorry for reopening this again, but I feel there's some room for improvement. I think in this last step
the problem is that a BadRequest will trigger the sleep window again. We have a use case where we wrap DB access in Hystrix and throw BadRequest for conditional failures on INSERT (e.g. duplicate key); the code reacts to that BadRequest and does an UPDATE for the same key. However, if the circuit trips OPEN because of several real DB errors and we continue to have plenty of those duplicate keys, the chances of the circuit closing again are very slim, even though the system is healthy (each duplicate key triggers a BadRequest, which in turn restarts the sleep window, so it is only luck whether the next request after the sleep window is a duplicate key or not). I'd argue that a BadRequest should not close the circuit (as it is now), but it shouldn't restart the sleep window either; it should rather be discarded, and the next request should be let through (and based on that next request we decide whether to close the circuit or not - if it's a BadRequest again, we let the next one through, and so on).
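A minimal sketch of that proposal (this is the suggested behavior, not what Hystrix currently does; the names are made up):

```java
// Sketch of the PROPOSED behavior: a BAD_REQUEST on the half-open trial request
// neither closes the circuit nor restarts the sleep window, so the next request
// after the (unchanged) sleep window gets another chance to close it.
class ProposedHalfOpenHandling {

    enum Outcome { SUCCESS, FAILURE, BAD_REQUEST }

    private boolean open = true;
    private long sleepWindowStartedAt;

    void onTrialRequestCompleted(Outcome outcome) {
        switch (outcome) {
            case SUCCESS:
                open = false;                                       // close the circuit
                break;
            case FAILURE:
                sleepWindowStartedAt = System.currentTimeMillis();  // restart the sleep window
                break;
            case BAD_REQUEST:
                // Discard: keep the circuit open, but do NOT restart the sleep window,
                // so another trial request is allowed right away.
                break;
        }
    }
}
```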
I also ran into the same issue and fully agree with @milosevicd: in the case of a BAD_REQUEST, if the circuit is not closed, then at least the sleep window should not be restarted, so that there is a better chance of the circuit closing when a request is successful. Is there a workaround for this? I mean, I would like to close the circuit even if the request results in a BAD_REQUEST.
I have to use the Steeltoe circuit breaker in my application (API) using .NET. I could not find any sample for calling the Steeltoe circuit breaker from .NET. Can anyone share a sample app that uses the Steeltoe circuit breaker from a .NET app?
I also agree with @milosevicd. Our use case illustrates this even better. We have a job processing records in the same order. Some records may return a BadRequest, which should be transient (but in some cases it isn't); the rest usually return correctly. We then run into the issue that the circuit trips open and the next request to be tried always returns a BadRequest, so the circuit will never close - unless we fix the validation issue of that one problematic request. It is a 4xx response and yet it keeps the circuit open forever. Also, if there are a couple of those broken entries, we only learn about them one by one. I would expect a bad-request response not to restart the sleep window but to let the next request pass. If a sequence of bad requests is followed by a FAILURE, the circuit stays open; ending the sequence with a SUCCESS closes it.
I ran into a similar problem. I can understand what @mattrjacobs has mentioned - that this could be a feature - however, shall we have a config option to make it more intuitive? Using Hystrix 1.5.12, Spring-Cloud Netflix Dalston.SR1 and Javanica Hystrix annotations.
Any solutions for this? IMHO there should only be OPEN and CLOSED states: a success or an ignored exception leads to CLOSED, all other exceptions to OPEN.
Hi All,
I am using hystrix-core version 1.2.16 for a high-traffic portal, and I guess we did not really understand how the sanity checks done by the Circuit Breaker work.
To verify the Circuit Breaker (CB) behaviour, I wrote a test which prints out the internal state of the CB. The first 20 requests to a target service result in HTTP 500 and are mapped to HystrixRuntimeException, indicating that the target service is unhealthy, in the expectation that the CB will be tripped soon. The subsequent requests (500 requests fired every ~100 ms) return HTTP 404 Not Found, which is totally valid for us and indicates that the user resource is not found on the server - we want to keep the circuit closed in this case, and therefore map this status code to HystrixBadRequestException.
The problem, however, is that according to the log output, the HystrixBadRequestException does not let the circuit be closed again once the CB has been opened! If a malevolent user noticed that the backend service was currently unavailable, he could send further requests for non-existent files, which would keep the CB open and cause a service outage!?
Is this really how it works? Is it possible that I missed a configuration?
Moreover, if the target service returns e.g. 201, which does not trigger any exception, then after ~5 seconds the CB is closed again - as expected.
... first 19 attempts return HTTP 500
Open Circuit: false, Total : 15, Error : 15, Perc. : 100
20th attempt returns 500
Fallback! Open false, Total : 15, Error : 15, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 20, Error : 20, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 26, Error : 26, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 31, Error : 31, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 36, Error : 36, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 41, Error : 41, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 46, Error : 46, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 51, Error : 51, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 56, Error : 56, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 61, Error : 61, Perc. : 100
Fallback! Open true, Total : 66, Error : 66, Perc. : 100
Fallback! Open true, Total : 66, Error : 66, Perc. : 100
Fallback! Open true, Total : 66, Error : 66, Perc. : 100
Fallback! Open true, Total : 66, Error : 66, Perc. : 100
Open true, Total : 66, Error : 66, Perc. : 100
21st attempt returns 404
com.netflix.hystrix.exception.HystrixBadRequestException: File could not be found.
at com.foo.sc.sync.core.hystrix.HystrixCallableExecutor$HystrixCommandDelegate.run(HystrixCallableExecutor.java:176)
at com.netflix.hystrix.HystrixCommand.executeCommand(HystrixCommand.java:786)
at com.netflix.hystrix.HystrixCommand.access$1400(HystrixCommand.java:81)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:721)
at com.netflix.hystrix.strategy.concurrency.HystrixContextCallable.call(HystrixContextCallable.java:45)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: com.foo.sc.sync.core.exceptions.ResourceNotFoundException: Parent collection could not be found.
at com.foo.sc.sync.connector.internal.sc.commands.ResponseHandler.handleResponse(ResponseHandler.java:197)
at com.foo.sc.sync.connector.internal.sc.commands.CreateAssetCommand.performCall(CreateAssetCommand.java:156)
at com.foo.sc.sync.connector.internal.sc.commands.CreateAssetCommand.performCall(CreateAssetCommand.java:1)
at com.foo.sc.sync.core.hystrix.jersey.JerseyClientHystrixCallable.call(JerseyClientHystrixCallable.java:81)
at com.foo.sc.sync.core.hystrix.HystrixCallableExecutor$HystrixCommandDelegate.run(HystrixCallableExecutor.java:166)
... 8 more
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
Fallback! Open true, Total : 70, Error : 70, Perc. : 100
and 1,000 requests produce the log in the pattern above - the CB will never be closed!
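For anyone reproducing this, here is a hypothetical sketch of a test harness in the spirit of the one described above, printing the circuit state and health counts after each call (HystrixCommandMetrics.getHealthCounts() and isCircuitBreakerOpen() are standard Hystrix APIs; the command itself is made up):

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandMetrics;
import com.netflix.hystrix.exception.HystrixBadRequestException;

// Hypothetical test command: the first calls fail with a runtime exception (HTTP 500),
// later calls throw HystrixBadRequestException (HTTP 404), as in the scenario above.
public class FetchResourceCommand extends HystrixCommand<String> {
    private final int statusCode;

    public FetchResourceCommand(int statusCode) {
        super(HystrixCommandGroupKey.Factory.asKey("TargetService"));
        this.statusCode = statusCode;
    }

    @Override
    protected String run() {
        if (statusCode == 500) {
            throw new RuntimeException("backend error");                       // counts as FAILURE
        }
        if (statusCode == 404) {
            throw new HystrixBadRequestException("File could not be found.");  // neither success nor failure
        }
        return "ok";
    }

    @Override
    protected String getFallback() {
        return "fallback";
    }

    public static void main(String[] args) {
        for (int i = 0; i < 40; i++) {
            FetchResourceCommand cmd = new FetchResourceCommand(i < 20 ? 500 : 404);
            try {
                cmd.execute();
            } catch (HystrixBadRequestException ignored) {
                // propagated to the caller, not counted in circuit health
            }
            HystrixCommandMetrics.HealthCounts health =
                    HystrixCommandMetrics.getInstance(cmd.getCommandKey()).getHealthCounts();
            System.out.println("Open " + cmd.isCircuitBreakerOpen()
                    + ", Total : " + health.getTotalRequests()
                    + ", Error : " + health.getErrorCount()
                    + ", Perc. : " + health.getErrorPercentage());
        }
    }
}
```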