Support TTL health checks at server #2089

tgross · 2016-06-03T18:26:27Z

In the Google group we discussed a problem with TTL health checks in PaaS-like or "serverless" environments where there are only servers and no agents. In these environments, Consul isn't collocated with the application client so there's no good way to assign a client to a particular server. This means Consul is making topology assumptions that work well for “machines,” but are in conflict with other uses. We don't have VMs, so we don't need Consul to be aware of the underlying infrastructure and impose assumptions about it, but still want to be able to use Consul for service discovery in the application.

@misterbisson and I suggested that health checks (for TTLs) could be defined at the catalog level rather than the agent level.

@slackpad responded with the following:

This definitely would not make sense for any other health check type but I could see TTL checks being useful when used this way. There are two architecture questions we'd need to think through:

1. Having Consul servers manage TTL expiration for these checks would be a significant new feature. This is currently managed completely on the agent side and they send edge-triggered updates to the Consul servers when a TTL expires and the state changes, so Consul servers have no concept of what kinds of checks are present, and they don't know anything about check TTLs. We have some precedent for servers handling TTL expirations via sessions, and I have a design sketched out to make managing TTL expirations much more efficient, so it seems like we could work something out here to add potentially lots more TTL-expiring things to be managed by servers.

2. Agents currently provide a buffer between the load from refreshing TTLs (which the Consul servers never see) and service state changes (which the Consul servers see but happen much less often). Having many, many processes posting TTL refreshes directly to the servers could put a lot of extra load on them in a way that may not scale well. I don't think all the TTL refreshes should need to go through Raft, but we'd need to do some careful planning to make sure that's true so we don't create a bottleneck.

We'd want to have a solid plan for #2 before jumping into code - that's probably best worked out via a new Consul GitHub issue. I'm happy to help figure this out!

I'm happy to help contribute to the design discussion as well as a PR for this work when it comes to it.
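To make the buffering described in the quoted reply concrete, here is a minimal Python sketch of an agent that absorbs TTL refreshes locally and forwards only state changes. It is an illustration, not Consul's implementation: `AgentSketch`, `TTLCheck`, and the `sent_to_servers` list (a stand-in for agent-to-server updates) are hypothetical names, and the initial `critical` status is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TTLCheck:
    ttl: float                 # seconds a refresh stays valid
    status: str = "critical"   # assumed initial state until the first refresh
    deadline: float = 0.0      # time at which the check expires


class AgentSketch:
    """Hypothetical agent: tracks TTLs locally, forwards only state changes."""

    def __init__(self):
        self.checks = {}
        self.sent_to_servers = []  # stand-in for updates sent to Consul servers

    def register(self, check_id, ttl):
        self.checks[check_id] = TTLCheck(ttl=ttl)

    def refresh(self, check_id, now):
        # A refresh only moves the local deadline; servers are not contacted
        # unless the status actually changes.
        check = self.checks[check_id]
        check.deadline = now + check.ttl
        self._set_status(check_id, "passing")

    def tick(self, now):
        # Expire any passing check whose deadline has elapsed.
        for check_id, check in self.checks.items():
            if check.status == "passing" and now >= check.deadline:
                self._set_status(check_id, "critical")

    def _set_status(self, check_id, status):
        check = self.checks[check_id]
        if check.status != status:  # edge-triggered: report transitions only
            check.status = status
            self.sent_to_servers.append((check_id, status))


agent = AgentSketch()
agent.register("web", ttl=10.0)
for t in range(0, 50, 5):       # ten refreshes, at t = 0, 5, ..., 45
    agent.refresh("web", now=float(t))
agent.tick(now=60.0)            # last deadline was 55, so the check expires
print(agent.sent_to_servers)
```

In this sketch, ten refreshes and one expiry produce only two server-bound updates. Moving TTL tracking to the servers would mean every refresh reaches them, which is the load concern raised in point 2.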

tgross · 2016-06-07T12:04:11Z

I've had a dive into the code base and this really is a large project. That being said, I'm going to flag this issue #259 (comment) as a possible workaround for the problem in terms of building this into the application (or in our use case, ContainerPilot) rather than trying to radically change Consul's model.

tgross · 2016-06-09T11:58:44Z

@slackpad I've spent a lot of this week getting to understand SWIM/Serf and how Consul agents gossip updates to the servers, and I think I've come around to this being a bad idea to change in Consul. It's really clear that having agents gossip but not participate in Raft is key to Consul scalability (particularly compared to etcd where all TTLs get sent up to the servers). I'd hate to introduce an architectural change that breaks this core advantage of Consul.

The specific use case we were running into with TritonDataCenter/containerpilot#162 we're going to solve with TritonDataCenter/containerpilot#175 and running a Consul agent as a co-process in the container. I may at some point explore the idea of a Consul agent library (probably in C or Rust so it can be embedded in arbitrary applications), but that's certainly out of scope for my current project or for Consul itself.

I'd be happy to close this issue if you're in agreement with the above @slackpad
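For readers following the co-process approach, here is a rough Python sketch of how an application might register and refresh a TTL check against a Consul agent running in the same container. It only builds the HTTP requests and does not send them. The endpoint paths follow the Consul agent HTTP API as I understand it, so verify them against the documentation for your Consul version. The agent address, check ID, check name, and helper function names are all assumptions made for illustration.

```python
import json
import urllib.request

# Assumed address of the co-process agent's HTTP API.
AGENT_ADDR = "http://127.0.0.1:8500"


def register_ttl_check_request(check_id, name, ttl, agent_addr=AGENT_ADDR):
    """Build (but do not send) a request registering a TTL check with the agent."""
    body = json.dumps({"ID": check_id, "Name": name, "TTL": ttl}).encode("utf-8")
    return urllib.request.Request(
        f"{agent_addr}/v1/agent/check/register",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def pass_ttl_check_request(check_id, agent_addr=AGENT_ADDR):
    """Build (but do not send) a request that marks the check passing and resets its TTL."""
    return urllib.request.Request(
        f"{agent_addr}/v1/agent/check/pass/{check_id}",
        method="PUT",
    )


# Hypothetical check ID and name.
register_req = register_ttl_check_request("app-ttl", "application heartbeat", "15s")
heartbeat_req = pass_ttl_check_request("app-ttl")
print(register_req.get_method(), register_req.full_url)
print(heartbeat_req.get_method(), heartbeat_req.full_url)
```

An application would send the registration once with `urllib.request.urlopen(register_req)` and then send the heartbeat on an interval shorter than the TTL. Because the heartbeats stop at the local agent, the servers only hear about the check when its state changes.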

slackpad · 2016-08-11T00:28:58Z

Thank you for the update, @tgross!

tgross mentioned this issue Jun 3, 2016

Registering containers as separate nodes in consul TritonDataCenter/containerpilot#162

Closed

slackpad closed this as completed Aug 11, 2016
