panic when dns query on non-server node #3407
Comments
I'll have a look.
This is code we've changed recently. Can you post
Ok, no problem.
So here is the output from curl.
non-server:
OK, I can reproduce it and found the root cause. I should have a fix by tomorrow. Thx for reporting this!
Looking over this with @magiconair it comes down to a difference in behavior from
Thinking about this, I think a better approach might be to perform an RPC to get the Consul servers. This would be a query similar to this: https://github.com/hashicorp/consul/blob/v0.9.2/agent/dns.go#L684-L712
Using this constant to get the https://github.com/hashicorp/consul/blob/v0.9.2/agent/structs/catalog.go#L20
The benefit of this is that all the code moves into dns.go and we end up with identical behavior on the client and server. We are also using the consistent information in the state store to answer the request, which is behavior I think folks would expect for this. We could remove the
This patch replaces the code that determines the list of servers in the current cluster with an RPC call that returns the list of active consul service instances, which only run on servers. This replaces the previous implementation, which was more complex and relied on serf messages, which can provide a different view than the consistent response from the raft log. As a side effect, the implementation becomes independent of whether it runs on a server or a client agent, so it behaves consistently across both. Different behavior for server and agent was the root cause of the bug in http://github.com/hashicorp/consul/issue/3047. Fixes #3407
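For readers following along, here is a minimal Go sketch of the idea described above: derive the server list from a catalog lookup of the consul service instead of from serf membership, so the same code path serves both servers and client agents. This is not the actual patch. `ServiceNode`, `Catalog`, `fakeCatalog`, and `serverAddrs` are hypothetical names made up for illustration, and the service name `"consul"` is assumed to be what the linked constant refers to.

```go
package main

import (
	"fmt"
	"sort"
)

// ServiceNode is a hypothetical, trimmed-down stand-in for a catalog entry.
type ServiceNode struct {
	Node        string
	Address     string
	ServiceName string
}

// Catalog is a hypothetical interface standing in for the RPC endpoint
// that returns the instances of a service from the consistent state store.
type Catalog interface {
	ServiceNodes(service string) ([]ServiceNode, error)
}

// consulServiceName is assumed to be the name under which servers register.
const consulServiceName = "consul"

// serverAddrs returns the sorted addresses of all nodes running the consul
// service. It only talks to the Catalog interface, so it behaves the same
// whether the caller is a server or a client agent.
func serverAddrs(c Catalog) ([]string, error) {
	nodes, err := c.ServiceNodes(consulServiceName)
	if err != nil {
		return nil, fmt.Errorf("catalog lookup failed: %w", err)
	}
	addrs := make([]string, 0, len(nodes))
	for _, n := range nodes {
		addrs = append(addrs, n.Address)
	}
	sort.Strings(addrs)
	return addrs, nil
}

// fakeCatalog is an in-memory Catalog used only to make the sketch runnable.
type fakeCatalog []ServiceNode

func (f fakeCatalog) ServiceNodes(service string) ([]ServiceNode, error) {
	var out []ServiceNode
	for _, n := range f {
		if n.ServiceName == service {
			out = append(out, n)
		}
	}
	return out, nil
}

func main() {
	cat := fakeCatalog{
		{Node: "server1", Address: "10.0.0.1", ServiceName: consulServiceName},
		{Node: "client1", Address: "10.0.0.2", ServiceName: "web"},
	}
	addrs, err := serverAddrs(cat)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs)
}
```

Because the lookup goes through one interface, there is no separate server-only branch that a client agent could fall into, which is the kind of divergence this issue ran into.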
I've implemented a patch based on @slackpad's comments in #3408 but I'm still a bit puzzled about how you triggered the panic. To get into that codepath you have to explicitly query for either an
I was only able to trigger the panic with these two queries:
Maybe I'm missing something obvious or non-obvious like
@magiconair you are right, that was a mistake I made...
No worries. I'm glad we found that out, since it makes the difference between DNS being completely broken and a new feature being broken. :)
consul version for both Client and Server
Client:
Consul v0.9.2
Server:
Consul v0.9.2
consul info for both Client and Server
Client:
Server:
Operating system and Environment details
Using pre-compiled binaries from releases.hashicorp.com on Arch Linux
Description of the Issue (and unexpected/desired result)
I've built a test environment of 1 consul server node ("bootstrap_expect": 1) and 2 non-server nodes. If I query DNS on the leader there is no problem, but if I try the same on a non-server node, consul crashes. This can be reproduced every time on amd64 and arch = arm (my second non-server node).
Reproduction steps
on server:
dig -p 8600 @localhost
-> no problem
on non-server:
dig -p 8600 @localhost
-> consul crashes with panic
Log Fragments or Link to gist
log from the non-server after entering
dig -p 8600 @localhost webserver.service.test.consul