Ingester refusing grpc connections during hand-over #1447

Closed
bboreham opened this issue Jun 6, 2019 · 1 comment · Fixed by #1463
Comments

@bboreham (Contributor) commented Jun 6, 2019

Logs from an ingester at address 10.244.251.100:

June 6th 2019, 08:28:33.577    level=info ts=2019-06-06T07:28:33.577526677Z caller=cortex.go:239 msg=stopping module=ingester
June 6th 2019, 08:28:37.858    level=info ts=2019-06-06T07:28:37.858651585Z caller=transfer.go:218 msg="sending chunks" to_ingester=10.244.227.137:9095
June 6th 2019, 08:32:35.502    level=info ts=2019-06-06T07:32:35.502129301Z caller=transfer.go:272 msg="successfully sent chunks" to_ingester=10.244.227.137:9095
June 6th 2019, 08:33:05.661    level=info ts=2019-06-06T07:33:05.661630134Z caller=lifecycler.go:291 msg="member.loop() exited gracefully"
June 6th 2019, 08:33:05.661    level=info ts=2019-06-06T07:33:05.661879021Z caller=cortex.go:239 msg=stopping module=server
June 6th 2019, 08:33:05.661    level=info ts=2019-06-06T07:33:05.661586078Z caller=lifecycler.go:367 msg="ingester removed from consul"

Logs from a querier trying to talk to it:

June 6th 2019, 08:28:34.164 level=warn ts=2019-06-06T07:28:34.163957075Z caller=logging.go:49 traceID=15482fd69ecc5603 msg="GET /api/prom/user_stats (500) 158.178µs Response: \"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.244.251.100:9095: connect: connection refused\\\"\\n\" ws: false; Accept-Encoding: gzip; Uber-Trace-Id: 15482fd69ecc5603:80a1b145fb19898:25aeaac212bdf087:0; User-Agent: Go-http-client/1.1; X-Scope-Orgid: 13085; "
June 6th 2019, 08:28:34.244 level=warn ts=2019-06-06T07:28:34.24475902Z caller=logging.go:49 traceID=355b3f418d9be500 msg="GET /api/prom/user_stats (500) 133.39µs Response: \"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.244.251.100:9095: connect: connection refused\\\"\\n\" ws: false; Accept-Encoding: gzip; Uber-Trace-Id: 355b3f418d9be500:e27e2986d2d8e49:663334830d61cd92:0; User-Agent: Go-http-client/1.1; X-Scope-Orgid: 14537; "
...
June 6th 2019, 08:33:04.412 level=warn ts=2019-06-06T07:33:04.412779057Z caller=logging.go:49 traceID=479e862751234885 msg="GET /api/prom/user_stats (500) 251.2µs Response: \"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.244.251.100:9095: connect: connection refused\\\"\\n\" ws: false; Accept-Encoding: gzip; Uber-Trace-Id: 479e862751234885:19881cc5841beafa:3a5ac60bd24b386d:0; User-Agent: Go-http-client/1.1; X-Scope-Orgid: 15178; "
June 6th 2019, 08:33:05.281 level=warn ts=2019-06-06T07:33:05.281099669Z caller=logging.go:49 traceID=710dc5ccf13e9ac9 msg="GET /api/prom/user_stats (500) 187.317µs Response: \"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp 10.244.251.100:9095: connect: connection refused\\\"\\n\" ws: false; Accept-Encoding: gzip; Uber-Trace-Id: 710dc5ccf13e9ac9:1eda228a3d8ab2d7:1a7b3f1b5019d2dd:0; User-Agent: Go-http-client/1.1; X-Scope-Orgid: 2; "

This didn't happen previously. My hypothesis is that the behaviour changed in the single-binary refactor.
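For reference, this is roughly the ordering the logs suggest should hold, sketched with google.golang.org/grpc. It is a minimal illustration, not Cortex's actual module/shutdown code; `transferOut` is a hypothetical stand-in for the hand-over done in transfer.go:

```go
// Minimal sketch, NOT Cortex code: the gRPC listener has to keep accepting
// connections until the hand-over has finished; only then is it safe to
// stop serving on :9095.
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

// transferOut is a hypothetical stand-in for the chunk hand-over that
// transfer.go performs ("sending chunks" ... "successfully sent chunks").
func transferOut() error { return nil }

func main() {
	lis, err := net.Listen("tcp", ":9095")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer()
	// ...register the ingester gRPC service here...

	go func() {
		if err := srv.Serve(lis); err != nil {
			log.Printf("grpc server exited: %v", err)
		}
	}()

	// On shutdown, finish the hand-over first. Stopping the gRPC server (or
	// closing its listener) before transferOut returns produces exactly the
	// "connection refused" errors shown in the querier logs above.
	if err := transferOut(); err != nil {
		log.Printf("hand-over failed: %v", err)
	}
	srv.GracefulStop() // only now refuse new connections
}
```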

@bboreham (Contributor, Author)

This can also break QueryStream(): the distributor increases the required quorum when it finds ingesters joining and leaving (as mentioned in #1290), so a single failure cancels the query.
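To illustrate the quorum arithmetic, here is a toy model only; the constants and the way joining/leaving ingesters are counted are assumptions, not the distributor's actual replication strategy:

```go
// Toy model of the quorum arithmetic, NOT the distributor's real logic.
package main

import "fmt"

// quorum returns how many replicas must answer and how many failures are
// tolerated, assuming minSuccess = n/2 + 1 over the (possibly extended)
// replica set and assuming joining/leaving members count against the
// error budget.
func quorum(setSize, joiningOrLeaving int) (minSuccess, maxErrors int) {
	minSuccess = setSize/2 + 1
	maxErrors = setSize - minSuccess - joiningOrLeaving
	return minSuccess, maxErrors
}

func main() {
	// Steady state with replication factor 3: need 2 of 3, tolerate 1 failure.
	fmt.Println(quorum(3, 0)) // 2 1

	// During a hand-over the replica set is extended for the leaving ingester;
	// under these assumptions the error budget drops to 0, so one failed RPC
	// (e.g. the "connection refused" above) cancels the whole query.
	fmt.Println(quorum(4, 1)) // 3 0
}
```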
