
Node level fault isolation #862

Closed
esrishi opened this issue Aug 14, 2015 · 6 comments

@esrishi

esrishi commented Aug 14, 2015

Hi,
I am new to hystrix and trying to read up on the documentation and examples.
I have a back-end dependency with requests routed to ~200 hosts (~10 JVMs per host).
Each JVM holds unique data, so using a lookup table my service may have to connect to any of the ~2,000 JVMs to respond to a client request.
I am looking for fault isolation at the backend-JVM level, and if I understand the documentation I have read so far, that would require my application to have ~2,000 thread pools using HystrixCommand.
A single host has multiple instances running on it, so host-level isolation is not the right solution.
Maybe I am missing something; any suggestions would be great.

Thanks,
Rishi.

@mattrjacobs
Contributor

@esrishi This is a great question. If you want fault isolation at that fine a granularity, Hystrix currently forces you to have a HystrixCommand per JVM (otherwise faults would not be isolated). This imposes a large overhead in terms of monitoring.

For thread pools, you could group many commands into the same thread pool to reduce the number of thread pools. For instance, you could have 200 thread pools, each with 10 commands, 1 per JVM. Commands would fail independently, but could, in this case, get blocked by running out of thread pool space.

A related open issue is #26, which would allow a more reasonable model for a case like this. I will say that one reason it hasn't been done yet is that, for internal Netflix usage, we try very hard to avoid cases like this. Instead, we attempt to build stateless services such that routing to any node is equivalent.

Of course, cases like yours are completely valid and worth solving, we just haven't put time into them as we don't see them in daily usage.

@esrishi
Author

esrishi commented Aug 17, 2015

Hi @mattrjacobs, thanks for your response. The open issue #26 describes exactly the situation I am working to solve. We have multiple clusters with sharded data; as much as we want the service to be stateless, eventually we are bound to the endpoint where the end-user data lives (e.g. Solr or MySQL).

Since I am new to Hystrix and still learning the ropes, can you elaborate on what you mean by "you could have 200 thread pools, each with 10 commands, 1 per JVM"?

Here is a sample HystrixCommand initialization I have; this will lead to ~2,000 PerNodeHystrixCommand instances:

class PerNodeHystrixCommand extends HystrixCommand<String> {
    public PerNodeHystrixCommand(String host, String port) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey(host + "_" + port)));
    }

    @Override
    protected String run() throws Exception {
        return null; // call the backend JVM here
    }
}

If I understand you right, is the shared thread pool approach something like this?

class PerNodeHystrixCommand extends HystrixCommand<String> {
    public PerNodeHystrixCommand(String host, String port) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey(host + "_" + port))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey(host + "_Pool")));
    }
    // run() as above
}

For our case we have 10 JVM instances per host, so that's ~200 thread pools.

@mattrjacobs
Contributor

Yes, so the principle is to share thread pools between commands.

There are 3 levels of naming:

  • Group
  • Thread Pool
  • Command

Generally, these are hierarchical, so in your case I would have:

  • A unique command name (host+port)
  • A per-host thread pool name (host)
  • A per-host group name (host)

Note that, if left unset, the thread pool key defaults to the group key, so you can safely omit the thread pool key.

If you use those settings, the 10 commands that talk to the same host will use the same thread pool (and have the same group name).
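
Here is a minimal sketch of that wiring, assuming a String result type; the run() body is only a placeholder for the actual backend call:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;

class PerNodeHystrixCommand extends HystrixCommand<String> {
    private final String host;
    private final String port;

    public PerNodeHystrixCommand(String host, String port) {
        super(Setter
                // per-host group key; the thread pool key defaults to this
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey(host))
                // unique command key per backend JVM
                .andCommandKey(HystrixCommandKey.Factory.asKey(host + "_" + port)));
        this.host = host;
        this.port = port;
    }

    @Override
    protected String run() throws Exception {
        // placeholder: a real implementation would call the backend JVM at host:port
        return "response from " + host + ":" + port;
    }
}

With that naming, the ~10 commands that target one host share a single pool, so a saturated or failing host exhausts only its own pool while commands against other hosts keep running.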

@esrishi
Author

esrishi commented Aug 19, 2015

Thanks Matt for clarifying the naming and hierarchy.
I see a section about measuring the cost of Hystrix overhead, and examples citing 40+ thread pools in use. Is there a linear/non-linear relationship between overhead and the number of thread pools? Any insights from your experience using Hystrix with a large number of thread pools would be appreciated.

@mattrjacobs
Contributor

Unfortunately, there's not really an analytic way to arrive at an optimal sizing of thread pools. It's based upon the flow of traffic through commands and their distributions of interarrival times and execution times. Moreover, the calculations change for different machines/OSes in terms of number of cores and thread scheduling policies.

In practice, we set our thread pool sizes by:

  1. In steady state, rejecting 0% / 0.01% / 0.1% / 1% of commands (depending on the criticality of the function)
  2. Under latency, ensuring that the system still operates as expected. In this case, we expect additional latency to directly trigger more thread pool rejections/timeouts/short-circuits, and so the system should not become overloaded.

There's more context on the problem here: #131, if you're interested.

I will also say that on the high-volume system I work on at Netflix, we've never tuned a thread pool above 28 threads, so that might give you a sense of absolute numbers (though of course our domains/system characteristics are likely very different).
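
To make those knobs concrete, here is a sketch of setting a per-host pool size explicitly; the PerHostConfig helper and the coreSize/queue numbers are illustrative assumptions, not tuned recommendations:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

final class PerHostConfig {
    // Builds a Setter that routes a command into its per-host pool with an explicit size;
    // pass the result to the command's super(...) constructor.
    static HystrixCommand.Setter forHost(String host, String port) {
        return HystrixCommand.Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey(host))
                .andCommandKey(HystrixCommandKey.Factory.asKey(host + "_" + port))
                .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                        .withCoreSize(10)                      // threads per host pool (illustrative)
                        .withMaxQueueSize(5)                   // bounded queue; -1 disables queueing
                        .withQueueSizeRejectionThreshold(5));  // reject above this instead of queueing further
    }
}

Starting small and raising the core size only when steady-state rejections exceed the budget chosen in step 1 keeps the tuning grounded in observed traffic.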

@esrishi
Author

esrishi commented Aug 26, 2015

Thanks @mattrjacobs, I will try it out with a higher number of thread pools and see if that option pans out well for our deployment.
