
Mass joining instances results in high resource usage #25154

Closed
rosstimothy opened this issue Apr 25, 2023 · 2 comments
Labels
bug, scale (Changes required to achieve 100K nodes per cluster)

Comments

@rosstimothy
Contributor

Joining thousands (10k) of instances at the same time results in high CPU and RAM utilization, which can cause one or both instances to be killed due to resource exhaustion.


Logs from an instance joining the cluster show the following:

User Message: Post "https://proxy:3080/v1/webapi/host/credentials": http: server closed idle connection] auth/register.go:271
2023-04-25T14:38:30Z ERRO [PROC:1]    Node failed to establish connection to cluster: Post "https://proxy:3080/v1/webapi/host/credentials": http: server closed idle connection. pid:1.1 service/connect.go:119
2023-04-25T14:40:03Z DEBU [CLIENT]    Attempting https://proxy:3080/v1/webapi/host/credentials client/https_client.go:86
2023-04-25T14:40:09Z DEBU [AUTH]      Registration via proxy server failed. error:[
ERROR REPORT:
Original Error: *url.Error Post "https://proxy:3080/v1/webapi/host/credentials": EOF
Stack Trace:
	github.com/gravitational/teleport/lib/client/https_client.go:96 github.com/gravitational/teleport/lib/client.(*WebClient).PostJSONWithFallback
	github.com/gravitational/teleport/lib/client/weblogin.go:621 github.com/gravitational/teleport/lib/client.HostCredentials
	github.com/gravitational/teleport/lib/auth/register.go:323 github.com/gravitational/teleport/lib/auth.registerThroughProxy
	github.com/gravitational/teleport/lib/auth/register.go:268 github.com/gravitational/teleport/lib/auth.Register
	github.com/gravitational/teleport/lib/service/connect.go:626 github.com/gravitational/teleport/lib/service.(*TeleportProcess).firstTimeConnect
	github.com/gravitational/teleport/lib/service/connect.go:237 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connect
	github.com/gravitational/teleport/lib/service/connect.go:200 github.com/gravitational/teleport/lib/service.(*TeleportProcess).connectToAuthService
	github.com/gravitational/teleport/lib/service/connect.go:72 github.com/gravitational/teleport/lib/service.(*TeleportProcess).reconnectToAuthService
	github.com/gravitational/teleport/lib/service/service.go:2534 github.com/gravitational/teleport/lib/service.(*TeleportProcess).RegisterWithAuthServer.func1
	github.com/gravitational/teleport/lib/service/supervisor.go:539 github.com/gravitational/teleport/lib/service.(*LocalService).Serve
	github.com/gravitational/teleport/lib/service/supervisor.go:276 github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1
	runtime/asm_amd64.s:1598 runtime.goexit
User Message: Post "https://proxy:3080/v1/webapi/host/credentials": EOF] auth/register.go:271
2023-04-25T14:40:09Z ERRO [PROC:1]    Instance failed to establish connection to cluster: Post "https://proxy:3080/v1/webapi/host/credentials": EOF. pid:1.1 service/connect.go:119

It's possible that the 30s http.Server.IdleTimeout from #23943 does not give the Auth server enough time to process join requests and reply with generated credentials. The low timeout may artificially terminate in-flight join requests, causing the instances to retry joining and resulting in a thundering herd.
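For reference, Go's net/http exposes this knob directly on the server struct. The sketch below is only an illustration of where IdleTimeout lives and what it governs (idle keep-alive connections between requests); the address, handler, and 60s value are placeholders, not Teleport's actual proxy wiring:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":3080", // placeholder listen address
		// IdleTimeout bounds how long a keep-alive connection may sit idle
		// between requests before the server closes it. The hypothesis above
		// is that 30s is too aggressive under a mass-join burst; raising it
		// (e.g. to 60s) is the straightforward experiment to try.
		IdleTimeout: 60 * time.Second,
		Handler:     http.DefaultServeMux, // placeholder handler
	}
	_ = srv.ListenAndServe()
}
```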

@zmb3 added the scale label on Apr 25, 2023
@rosstimothy
Contributor Author

Tested with a 60s http.Server.IdleTimeout and it had no effect on time to join.

@rosstimothy
Contributor Author

Tested again by deploying via the Helm chart a few times and did not see the same issues.

