grpc tls cert is not reloaded #7400

Closed
cbarbara-okta opened this issue Mar 6, 2020 · 3 comments

cbarbara-okta commented Mar 6, 2020

Overview of the Issue

When we update the Consul agent's TLS cert (cert_file) and run consul reload, the new cert is used correctly for the https API, but the grpc API continues to serve the previous cert. The only workaround appears to be a full restart of the Consul agent.

We just updated our certificate, and we can see the new cert used on the https port:

$ openssl s_client -connect localhost:8501 2>/dev/null | openssl x509 -noout -dates
notBefore=Feb 25 01:31:12 2020 GMT
notAfter=Feb 24 01:31:12 2025 GMT

However, the grpc port is still serving the old cert:

$ openssl s_client -connect localhost:8502 2>/dev/null | openssl x509 -noout -dates
notBefore=Mar  6 00:40:08 2019 GMT
notAfter=Mar  5 00:40:08 2020 GMT

At the time of writing this issue it is Fri Mar 6 01:30:51 UTC 2020, so the cert served on the grpc port has already expired.
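
To make the gap between the two timestamps above explicit, here is a small sketch that parses the notAfter value printed for port 8502 and compares it with the time quoted in this issue. The variable names are illustrative only; `ssl.cert_time_to_seconds` is the standard-library helper for this date format.

```python
import ssl
from datetime import datetime, timezone

# notAfter as printed by openssl for the grpc port (8502) above.
not_after = ssl.cert_time_to_seconds("Mar  5 00:40:08 2020 GMT")

# The time quoted in the issue: Fri Mar 6 01:30:51 UTC 2020.
observed_at = datetime(2020, 3, 6, 1, 30, 51, tzinfo=timezone.utc).timestamp()

# Positive means the certificate had already expired when it was observed.
seconds_past_expiry = int(observed_at - not_after)
is_expired = seconds_past_expiry > 0
```

With these inputs the certificate on the grpc port was a little under 25 hours past its expiry when the output was captured.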

Reproduction Steps

Steps to reproduce this issue, eg:

  1. Create a cluster with 3 client nodes and 3 server nodes.
  2. Run openssl s_client -connect localhost:8501 2>/dev/null | openssl x509 -noout -dates and record the validity dates of the cert served by the https API.
  3. Run openssl s_client -connect localhost:8502 2>/dev/null | openssl x509 -noout -dates and record the validity dates of the cert served by the grpc API.
  4. Create a new certificate with a different expiration date.
  5. Deploy the new certificate to any of the hosts.
  6. Run consul reload on that host to bring in the latest cert.
  7. Run openssl s_client -connect localhost:8501 2>/dev/null | openssl x509 -noout -dates to confirm that the https API now serves the new cert.
  8. Run openssl s_client -connect localhost:8502 2>/dev/null | openssl x509 -noout -dates to check whether the grpc API serves the new cert. It does not: the old cert is still served.
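
Steps 7 and 8 can also be checked in one go by comparing the fingerprint of the cert served on each port. The following is a minimal sketch, assuming the agent accepts TLS connections without a client certificate (as the openssl commands above do). The function names are made up for this example; only the `ssl` and `hashlib` calls come from the Python standard library.

```python
import hashlib
import ssl


def fingerprint(pem_cert: str) -> str:
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest()


def served_fingerprint(host: str, port: int) -> str:
    """Fetch the certificate served on host:port and fingerprint it.

    The certificate is not validated; it is only retrieved for comparison.
    """
    pem = ssl.get_server_certificate((host, port))
    return fingerprint(pem)


def ports_serve_same_cert(host: str, first_port: int, second_port: int) -> bool:
    """True if both ports present the same certificate."""
    return served_fingerprint(host, first_port) == served_fingerprint(host, second_port)
```

On an agent affected by this issue, `ports_serve_same_cert("localhost", 8501, 8502)` would be expected to return False after consul reload and True again after a full restart.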

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 2
build:
	prerelease = 
	revision = a42ded47
	version = 1.5.3
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 73
	max_procs = 4
	os = linux
	version = go1.12.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 32
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 1
	member_time = 3652
	members = 43
	query_queue = 0
	query_time = 77

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = a42ded47
	version = 1.5.3
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.3.199.186:8300
	server = true
raft:
	applied_index = 10512172
	commit_index = 10512172
	fsm_pending = 0
	last_contact = 11.665232ms
	last_log_index = 10512172
	last_log_term = 57
	last_snapshot_index = 10511392
	last_snapshot_term = 57
	latest_configuration = [{Suffrage:Voter ID:891a2fb9-ba35-b8a2-a24f-16a369a6fb15 Address:10.3.193.229:8300} {Suffrage:Voter ID:aad59619-5884-9654-0ce4-3d2a00b50245 Address:10.3.199.186:8300} {Suffrage:Voter ID:25e3819a-9bc0-c532-8e64-08246942a6dd Address:10.3.202.220:8300}]
	latest_configuration_index = 9994951
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 57
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 383
	max_procs = 2
	os = linux
	version = go1.12.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 32
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 3652
	members = 42
	query_queue = 0
	query_time = 77
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

Amazon linux 1 - 4.14.128-87.105.amzn1.x86_64

Our sample config file
{
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "tokens": {
      "agent": "xzy-xzy-xzy-xzy-xzy"
    }
  },
  "advertise_addr": "10.3.198.171",
  "bind_addr": "10.3.198.171",
  "ca_file": "/etc/consul/certs/ca.pem",
  "cert_file": "/etc/consul/certs/agent_cert.pem",
  "data_dir": "/data/consul",
  "datacenter": "consul-dc",
  "disable_host_node_id": false,
  "disable_remote_exec": true,
  "disable_update_check": true,
  "enable_script_checks": false,
  "key_file": "/etc/consul/certs/agent_key.pem",
  "node_name": "some-node-name",
  "ports": {
    "serf_lan": 8301,
    "http": -1,
    "https": 8501,
    "grpc": 8502,
    "dns": 8600
  },
  "rejoin_after_leave": true,
  "retry_join": [
    "provider=aws tag_key=ConsulCluster tag_value=consul"
  ],
  "server": false,
  "verify_outgoing": true,
  "verify_server_hostname": true,
  "node_id": "cccccccc-cccc-cccc-cccc-cccccccccccc",
  "encrypt": "xyz===="
}

Log Fragments

We are using consul connect and envoy. Once the grpc certificate expired, a newly launched envoy sidecar could not be configured correctly, and the agent would only log lines like this:

2020/03/06 00:34:27 [WARN] grpc: Server.Serve failed to complete security handshake from "127.0.0.1:37486": remote error: tls: expired certificate

hanshasselberg (Member) commented Mar 9, 2020

Thank you for reporting! The fix for this was introduced in Consul 1.7.1: 11a571de9 #7086.

cbarbara-okta (Author) commented

@i0rek a few questions:

  1. The PR only mentions auto_encrypt, but does that change also cover reloading certs from disk?
  2. Is there any chance this will be backported to 1.5 or 1.6? We see this as a big consul connect bug, and having to restart all clients is currently causing a lot of operational overhead.

hanshasselberg (Member) commented Mar 9, 2020

@cbarbara-okta yes, the grpc server now benefits from the same TLS config reloading mechanism that is used across consul. I will get back to you regarding your second question.
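
For readers unfamiliar with the general idea behind a shared TLS reloading mechanism, here is a generic sketch of the pattern. It is not Consul's code (Consul is written in Go), and every name in it is hypothetical. It only shows why a listener that asks a shared holder for the current material sees a reload, while one that copied the material at startup keeps the old cert.

```python
import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class ReloadableConfig(Generic[T]):
    """Holds the current TLS material and swaps it atomically on reload."""

    def __init__(self, load: Callable[[], T]) -> None:
        self._load = load
        self._lock = threading.Lock()
        self._current = load()

    def reload(self) -> None:
        # Load first, so a failed reload leaves the old value in place.
        fresh = self._load()
        with self._lock:
            self._current = fresh

    def current(self) -> T:
        with self._lock:
            return self._current


# Stand-in for the cert file on disk; a real loader would read cert_file
# and key_file and build an ssl.SSLContext with load_cert_chain().
disk = {"cert": "old-cert"}
tls = ReloadableConfig(lambda: disk["cert"])

# A listener that copies the material once at startup (the buggy shape).
startup_snapshot = tls.current()

# The operator deploys a new cert and triggers a reload.
disk["cert"] = "new-cert"
tls.reload()

cert_seen_by_shared_listener = tls.current()    # asks the holder each time
cert_seen_by_snapshot_listener = startup_snapshot  # never changes
```

In a real server, each accepted connection would be wrapped with whatever `tls.current()` returns at that moment, so every listener that goes through the holder picks up the new cert on the next handshake.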
