[ERR] serf: Rejected coordinate from HOST: round trip time not in valid range, duration -99.611868ms is not a positive value less than 10s #3704

Closed
mnuic opened this issue Nov 21, 2017 · 7 comments

@mnuic

mnuic commented Nov 21, 2017

consul version for both Client and Server

Client: consul 1.0.1
Server: consul 1.0.1

consul info for both Client and Server

Client:

same as server

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 32
	services = 45
build:
	prerelease =
	revision = 9564c29
	version = 1.0.1
consul:
	bootstrap = true
	known_datacenters = 7
	leader = true
	leader_addr = 10.0.66.150:8300
	server = true
raft:
	applied_index = 16074526
	commit_index = 16074526
	fsm_pending = 0
	last_contact = 0
	last_log_index = 16074526
	last_log_term = 15
	last_snapshot_index = 16070464
	last_snapshot_term = 15
	latest_configuration = [{Suffrage:Voter ID:386b24e2-c793-cd40-49dd-4116232b96bd Address:10.0.66.150:8300}]
	latest_configuration_index = 1
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 15
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 472
	max_procs = 8
	os = linux
	version = go1.9.2
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 15
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 886
	members = 11
	query_queue = 0
	query_time = 1

Operating system and Environment details

Ubuntu 16.04.3 LTS, Docker 17.09

Description of the Issue (and unexpected/desired result)

After upgrading Consul to version 1.0.1, the logs started to fill with messages like these:

a.b.c.d     2017/11/21 09:01:38 [ERR] serf: Rejected coordinate from HOST1: round trip time not in valid range, duration -206.486µs is not a positive value less than 10s
a.b.c.d     2017/11/21 09:02:14 [ERR] serf: Rejected coordinate from HOST2: round trip time not in valid range, duration -99.611868ms is not a positive value less than 10s
a.b.c.d     2017/11/21 09:04:28 [ERR] serf: Rejected coordinate from HOST3: round trip time not in valid range, duration -765.777µs is not a positive value less than 10s

Logs

  • No logs other than the messages shown above.
@slackpad slackpad added the type/bug Feature does not function as expected label Nov 21, 2017
@slackpad slackpad added this to the 1.0.2 milestone Nov 21, 2017
@slackpad
Contributor

Hi @mnuic, we tracked that down, but the fix didn't make it into this release cycle. We will pick it up in the next minor release of Consul via hashicorp/memberlist#139. Sorry for the log noise; these messages can be safely ignored.

@mnuic
Author

mnuic commented Nov 21, 2017

@slackpad thank you for the info! I will wait for the next release before using it in production.

@slackpad slackpad self-assigned this Nov 21, 2017
@sofax

sofax commented Dec 12, 2017

I'm afraid this is more than just log noise. Consul 1.0.1 breaks our test environment, whereas v0.9.3 works flawlessly. The error messages quoted above are the only ones we see.

@slackpad
Contributor

@sofax, can you provide more details about what is broken for you?

@sofax

sofax commented Dec 12, 2017

@slackpad:
It may or may not be related to this issue - all I can say is that we don't see any other error messages.

Here is the scenario:
We have some integration tests for service health checks, e.g. one with two instances of service A, where initially both instances return an unhealthy state. Then service instance #2 is set to "healthy" (i.e. its health check resource returns a healthy state), which - as expected - makes it available via Consul. However, service instance #1 is suddenly available too, even though its health check resource still returns "unhealthy".

This does not happen with Consul 0.9.3.

@slackpad
Contributor

@sofax thanks, that's definitely not related to this error. Can you please open a new issue with more details about how your test works, and we will take a look?

@sofax

sofax commented Dec 15, 2017

@slackpad:
Thanks - it turned out that the problem lies in our configuration (and in a misinterpretation of the documentation, or in a configuration example we found on the Internet that was based on Consul > 0.9.3). We had added the field id to the check definition in both instances, with the same value. v0.9.3 apparently did not interpret that property at all: it simply ignored it and assigned an automatic ID to each check instead. v1.0.1 does interpret it, but instead of treating the ID as local to the service instance (which IMO makes more sense), it seems to give it global scope, so assigning the same ID to health checks for different service instances (of the same service) won't work.
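
To illustrate the collision described above, here is a small sketch that assumes checks are stored in a single table keyed by check ID alone, as the comment suggests. The `check` and `registry` types, the service IDs, and the `uniqueCheckID` naming scheme are all hypothetical and are not Consul's actual data structures or API.

```go
package main

import "fmt"

// check is a simplified stand-in for a health check definition.
type check struct {
	ID        string
	ServiceID string
	Status    string
}

// registry models a table that keys checks by ID only (assumption).
type registry map[string]check

// register stores the check, silently replacing any earlier check
// that used the same ID.
func (r registry) register(c check) {
	r[c.ID] = c
}

// uniqueCheckID derives a per-instance check ID from the service ID.
// The naming scheme is hypothetical and used for illustration only.
func uniqueCheckID(serviceID string) string {
	return "service:" + serviceID
}

func main() {
	// Both instances reuse the same explicit ID, so the second
	// registration overwrites the first.
	shared := registry{}
	shared.register(check{ID: "health", ServiceID: "service-a-1", Status: "critical"})
	shared.register(check{ID: "health", ServiceID: "service-a-2", Status: "passing"})
	fmt.Println("checks with shared ID:", len(shared))

	// Deriving the ID from the service instance keeps the checks apart.
	distinct := registry{}
	for _, c := range []check{
		{ServiceID: "service-a-1", Status: "critical"},
		{ServiceID: "service-a-2", Status: "passing"},
	} {
		c.ID = uniqueCheckID(c.ServiceID)
		distinct.register(c)
	}
	fmt.Println("checks with distinct IDs:", len(distinct))
}
```

In this model the unhealthy check of instance #1 is lost once instance #2 registers under the same ID, which would be consistent with instance #1 unexpectedly appearing healthy. Giving each check an ID that is unique across all instances avoids the overlap.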

@bitmask777

I've upgraded to 1.0.2 (on Windows) and am still seeing these messages, even though, per the changelog, this issue (GH-3704) is fixed in 1.0.2.

Snipped from log after upgrade to 1.0.2:

2018/01/02 22:07:16 [ERR] serf: Rejected coordinate from host1: round trip time not in valid range, duration 0s is not a positive value less than 10s
2018/01/02 22:07:17 [ERR] serf: Rejected coordinate from host2: round trip time not in valid range, duration 0s is not a positive value less than 10s
2018/01/02 22:07:18 [ERR] serf: Rejected coordinate from host3: round trip time not in valid range, duration 0s is not a positive value less than 10s

I do see that the description of the fix seems specific to a negative value. Perhaps a value of 0 is an uncovered edge case?
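
One way to confirm whether the remaining messages are all the zero case rather than the negative case is to tally the rejected durations from the log. The sketch below does that with Go's standard library only. The regular expression is derived from the message text quoted in this issue, and `classify` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// durationRe extracts the duration token from a "Rejected coordinate" line.
var durationRe = regexp.MustCompile(`duration (\S+) is not a positive value`)

// classify reports whether the rejected duration in a log line is
// negative, zero, or positive. Lines that do not match the pattern
// or carry an unparsable duration are labeled accordingly.
func classify(line string) string {
	m := durationRe.FindStringSubmatch(line)
	if m == nil {
		return "no match"
	}
	d, err := time.ParseDuration(m[1])
	if err != nil {
		return "unparsable"
	}
	switch {
	case d < 0:
		return "negative"
	case d == 0:
		return "zero"
	default:
		return "positive"
	}
}

func main() {
	// Sample lines taken from this issue.
	lines := []string{
		"2017/11/21 09:02:14 [ERR] serf: Rejected coordinate from HOST2: round trip time not in valid range, duration -99.611868ms is not a positive value less than 10s",
		"2018/01/02 22:07:16 [ERR] serf: Rejected coordinate from host1: round trip time not in valid range, duration 0s is not a positive value less than 10s",
	}
	counts := map[string]int{}
	for _, l := range lines {
		counts[classify(l)]++
	}
	fmt.Println(counts)
}
```

If every line after the upgrade falls into the "zero" bucket, that supports the reading that only negative values were addressed and a round trip time of exactly 0 is still rejected.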

I performed my upgrade in a test cluster. Unfortunately, I can't proceed with a production upgrade until this is resolved, since it creates so much noise in the logs.

4 participants