Support TTL health checks at server #2089

Closed
tgross opened this issue Jun 3, 2016 · 3 comments

@tgross
Member

tgross commented Jun 3, 2016

In the Google group we discussed a problem with TTL health checks in PaaS-like or "serverless" environments where there are only servers and no agents. In these environments, Consul isn't collocated with the application client so there's no good way to assign a client to a particular server. This means Consul is making topology assumptions that work well for “machines,” but are in conflict with other uses. We don't have VMs, so we don't need Consul to be aware of the underlying infrastructure and impose assumptions about it, but still want to be able to use Consul for service discovery in the application.

@misterbisson and I suggested that health checks (for TTLs) could be defined at the catalog level rather than the agent level.

@slackpad responded with the following:

This definitely would not make sense for any other health check type but I could see TTL checks being useful when used this way. There are two architecture questions we'd need to think through:

  1. Having Consul servers manage TTL expiration for these checks would be a significant new feature. This is currently managed completely on the agent side and they send edge-triggered updates to the Consul servers when a TTL expires and the state changes, so Consul servers have no concept of what kinds of checks are present, and they don't know anything about check TTLs. We have some precedent for servers handling TTL expirations via sessions, and I have a design sketched out to make managing TTL expirations much more efficient, so it seems like we could work something out here to add potentially lots more TTL-expiring things to be managed by servers.
  2. Agents currently provide a buffer between the load from refreshing TTLs (which the Consul servers never see) and service state changes (which the Consul servers see but happen much less often). Having many, many processes posting TTL refreshes directly to the servers could put a lot of extra load on them in a way that may not scale well. I don't think all the TTL refreshes should need to go through Raft, but we'd need to do some careful planning to make sure that's true so we don't create a bottleneck.

We'd want to have a solid plan for #2 before jumping into code - that's probably best worked out via a new Consul GitHub issue. I'm happy to help figure this out!
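
For anyone following along, here is a minimal sketch of the buffering described in point 2: an agent-side TTL check that absorbs every refresh locally and only reports edge-triggered state changes upstream. This is a toy model, not Consul's implementation; `AgentTTLCheck`, `report`, and `simulate` are hypothetical names, time is passed in as plain integers, and the check is assumed to start out critical.

```python
class AgentTTLCheck:
    """Toy agent-side TTL check: absorbs refreshes, reports only state changes."""

    def __init__(self, ttl, report):
        self.ttl = ttl            # seconds a refresh stays valid
        self.report = report      # hypothetical callback standing in for an agent-to-server update
        self.status = "critical"  # assumption: the check is critical until its first refresh
        self.deadline = None      # no refresh seen yet

    def refresh(self, now):
        """Called by the application on every heartbeat; stays local to the agent."""
        self.deadline = now + self.ttl
        self._set("passing")

    def tick(self, now):
        """Called periodically by the agent to detect an expired TTL."""
        if self.deadline is not None and now >= self.deadline:
            self._set("critical")

    def _set(self, status):
        # Edge-triggered: only a change of state is sent upstream.
        if status != self.status:
            self.status = status
            self.report(status)


def simulate():
    """Refresh every 5s against a 10s TTL, then go silent until t=120."""
    updates = []
    check = AgentTTLCheck(ttl=10, report=updates.append)
    refreshes = 0
    for now in range(0, 100, 5):
        check.tick(now)
        check.refresh(now)
        refreshes += 1
    check.tick(120)  # the last refresh was at t=95, so the TTL has lapsed
    return refreshes, updates


if __name__ == "__main__":
    refreshes, updates = simulate()
    print(refreshes, "refreshes seen by the agent;", len(updates), "updates sent upstream:", updates)
```

In this run the agent handles 20 refreshes but only two updates ("passing", then "critical") would reach the servers. Without an agent in between, each of those refreshes would land on a server directly, which is the load concern raised above.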

I'm happy to help contribute to the design discussion as well as a PR for this work when it comes to it.

@tgross
Member Author

tgross commented Jun 7, 2016

I've taken a dive into the codebase and this really is a large project. That being said, I'm going to flag #259 (comment) as a possible workaround: building this into the application (or in our use case, ContainerPilot) rather than trying to radically change Consul's model.

@tgross
Member Author

tgross commented Jun 9, 2016

@slackpad I've spent a lot of this week getting to understand SWIM/Serf and how Consul agents gossip updates to the servers, and I've come around to thinking this would be a bad change to make in Consul. It's really clear that having agents gossip but not participate in Raft is key to Consul's scalability (particularly compared to etcd, where all TTLs get sent up to the servers). I'd hate to introduce an architectural change that breaks this core advantage of Consul.

The specific use case we were running into with TritonDataCenter/containerpilot#162 we're going to solve with TritonDataCenter/containerpilot#175 and running a Consul agent as a co-process in the container. I may at some point explore the idea of a Consul agent library (probably in C or Rust so it can be embedded in arbitrary applications), but that's certainly out of scope for my current project or for Consul itself.
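
As a rough illustration of the co-process approach, the application would talk only to the agent in its own container: register the service with a TTL check, then periodically mark that check as passing. The sketch below only builds the HTTP requests against Consul's agent API (`/v1/agent/service/register` and `/v1/agent/check/pass/<check ID>`); the agent address, service name, port, and helper function names are assumptions for illustration, and this is not how ContainerPilot itself is implemented.

```python
import json
import urllib.request

# Assumption: the co-process agent listens on Consul's default local HTTP address.
AGENT = "http://127.0.0.1:8500"


def register_request(service_id, name, port, ttl):
    """Build the request that registers a service with a TTL check on the local agent."""
    payload = {"ID": service_id, "Name": name, "Port": port, "Check": {"TTL": ttl}}
    return urllib.request.Request(
        f"{AGENT}/v1/agent/service/register",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def heartbeat_request(service_id):
    """Build the request that marks the service's TTL check as passing."""
    # Assumption: the check uses the default ID "service:<service ID>",
    # since no explicit check ID was given at registration.
    return urllib.request.Request(
        f"{AGENT}/v1/agent/check/pass/service:{service_id}",
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical service; only prints what would be sent.
    for req in (register_request("web-1", "web", 8080, "10s"), heartbeat_request("web-1")):
        print(req.get_method(), req.full_url)
```

Sending either request is a matter of passing it to `urllib.request.urlopen` with a running agent; the heartbeat would need to repeat more often than the TTL (10s in this example).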

I'd be happy to close this issue if you're in agreement with the above, @slackpad.

@slackpad
Contributor

Thank you for the update, @tgross!
