
Mass joining instances results in high resource usage #25154

Closed
rosstimothy opened this issue Apr 25, 2023 · 2 comments
Labels
bug, scale (Changes required to achieve 100K nodes per cluster)

Comments

@rosstimothy
Contributor

Joining thousands (10k) of instances at the same time results in high CPU and RAM utilization, which can cause one or both instances to be killed due to resource exhaustion.


Logs from an instance joining the cluster show the following:

User Message: Post "https://proxy:3080/v1/webapi/host/credentials": http: server closed idle connection] auth/register.go:271
2023-04-25T14:38:30Z ERRO [PROC:1]    Node failed to establish connection to cluster: Post "https://proxy:3080/v1/webapi/host/credentials": http: server closed idle connection. pid:1.1 service/connect.go:119
2023-04-25T14:40:03Z DEBU [CLIENT]    Attempting https://proxy:3080/v1/webapi/host/credentials client/https_client.go:86
2023-04-25T14:40:09Z DEBU [AUTH]      Registration via proxy server failed. error:[
ERROR REPORT:
Original Error: *url.Error Post "https://proxy:3080/v1/webapi/host/credentials": EOF
Stack Trace:
	github.com/gravitational/teleport/lib/client/https_client.go:96 github.com/gravitational/teleport/lib/client.(*WebClient).PostJSONWithFallback
	github.com/gravitational/teleport/lib/client/weblogin.go:621 github.com/gravitational/teleport/lib/client.HostCredentials
	github.com/gravitational/teleport/lib/auth/register.go:323 github.com/gravitational/teleport/lib/auth.registerThroughProxy
	github.com/gravitational/teleport/lib/auth/register.go:268 github.com/gravitational/teleport/lib/auth.Register
	github.com/gravitational/teleport/lib/service/connect.go:626 github.com/gravitational/teleport/lib/service.(*TeleportProcess).firstTimeConnect
	github.com/gravitational/teleport/lib/service/connect.go:237 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connect
	github.com/gravitational/teleport/lib/service/connect.go:200 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connectToAuthService
	github.com/gravitational/teleport/lib/service/connect.go:72 github.com/gravitational/teleport/lib/service.(*TeleportProcess).reconnectToAuthService
	github.com/gravitational/teleport/lib/service/service.go:2534 github.com/gravitational/teleport/lib/service.(*TeleportProcess).RegisterWithAuthServer.func1
	github.com/gravitational/teleport/lib/service/supervisor.go:539 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
	github.com/gravitational/teleport/lib/service/supervisor.go:276 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
	runtime/asm_amd64.s:1598 runtime.goexit
User Message: Post "https://proxy:3080/v1/webapi/host/credentials": EOF] auth/register.go:271
2023-04-25T14:40:09Z ERRO [PROC:1]    Instance failed to establish connection to cluster: Post "https://proxy:3080/v1/webapi/host/credentials": EOF. pid:1.1 service/connect.go:119

It's possible that the 30s http.Server.IdleTimeout from #23943 does not give the Auth server enough time to process join requests and reply with generated credentials. The low timeout may artificially terminate in-flight join requests, causing the instances to retry joining and resulting in a thundering herd.
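For reference, Go's net/http exposes this knob directly on the server struct. The sketch below is only an illustration of where IdleTimeout lives and what it governs (idle keep-alive connections between requests); the address, handler, and 60s value are placeholders, not Teleport's actual proxy wiring:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":3080", // placeholder listen address
		// IdleTimeout bounds how long a keep-alive connection may sit idle
		// between requests before the server closes it. The hypothesis above
		// is that 30s is too aggressive under a mass-join burst; raising it
		// (e.g. to 60s) is the straightforward experiment to try.
		IdleTimeout: 60 * time.Second,
		Handler:     http.DefaultServeMux, // placeholder handler
	}
	_ = srv.ListenAndServe()
}
```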

@zmb3 added the scale label on Apr 25, 2023
@rosstimothy
Contributor Author

Tested with a 60s http.Server.IdleTimeout and it had no effect on time to join.

@rosstimothy
Contributor Author

Tested again by deploying via the Helm chart a few times and did not see the same issues.

