[dashboard] Increase GCS health check failure threshold. (#31939)
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

When GCS is overloaded, the dashboard's health check can fail easily, causing the dashboard to exit. Currently there is no way for users to restart the dashboard, so when it exits the cluster loses all the functionality implemented there, such as the status API, logs, jobs, and so on.

To mitigate this, the failure threshold is increased from 10 to 40, which extends the maximum time before the dashboard exits to about 10 minutes.
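
The 10-minute figure can be reproduced from the constants in dashboard/consts.py. The sketch below is an illustration only: the helper `worst_case_seconds` is hypothetical, and it assumes that every failed check blocks for the full RPC timeout and is then followed by one full check interval. The actual health-check loop may behave differently.

```python
# Values taken from dashboard/consts.py after this commit.
GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR = 40
GCS_CHECK_ALIVE_INTERVAL_SECONDS = 5
GCS_CHECK_ALIVE_RPC_TIMEOUT = 10


def worst_case_seconds(max_errors: int, interval_s: int, rpc_timeout_s: int) -> int:
    """Upper bound on the time needed to reach the error threshold.

    Assumption (not confirmed by the commit): each failed check waits for
    the full RPC timeout and is followed by one full sleep interval.
    """
    return max_errors * (interval_s + rpc_timeout_s)


# Old threshold (10) versus new threshold (40), same interval and timeout.
before = worst_case_seconds(
    10, GCS_CHECK_ALIVE_INTERVAL_SECONDS, GCS_CHECK_ALIVE_RPC_TIMEOUT
)
after = worst_case_seconds(
    GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR,
    GCS_CHECK_ALIVE_INTERVAL_SECONDS,
    GCS_CHECK_ALIVE_RPC_TIMEOUT,
)
print(before, after)  # 150 600
```

Under that assumption the old default tolerated about 2.5 minutes of GCS unavailability and the new one tolerates about 10 minutes.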

A better solution is needed to take care of the dashboard failure.
fishbone authored Jan 26, 2023
1 parent da79ae9 commit a703a91
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion dashboard/consts.py
@@ -28,7 +28,7 @@
 GCS_SERVER_ADDRESS = "GcsServerAddress"
 # GCS check alive
 GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR = env_integer(
-    "GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR", 10
+    "GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR", 40
 )
 GCS_CHECK_ALIVE_INTERVAL_SECONDS = env_integer("GCS_CHECK_ALIVE_INTERVAL_SECONDS", 5)
 GCS_CHECK_ALIVE_RPC_TIMEOUT = env_integer("GCS_CHECK_ALIVE_RPC_TIMEOUT", 10)
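
Because each constant is wrapped in `env_integer`, the value in the diff is only a default, and it appears an operator could raise the threshold further through the environment variable of the same name. The sketch below illustrates that pattern. `env_integer_sketch` is a hypothetical stand-in written for this example, not Ray's actual `env_integer`, and the value 120 is arbitrary.

```python
import os


def env_integer_sketch(key: str, default: int) -> int:
    """Hypothetical stand-in for env_integer (not Ray's implementation).

    Returns the integer value of the environment variable `key`, or
    `default` when the variable is unset or is not a valid integer.
    """
    raw = os.environ.get(key)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


KEY = "GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR"

# With the variable unset, the default from this commit applies.
os.environ.pop(KEY, None)
default_threshold = env_integer_sketch(KEY, 40)

# With the variable set, the override wins over the default.
os.environ[KEY] = "120"
overridden_threshold = env_integer_sketch(KEY, 40)

print(default_threshold, overridden_threshold)  # 40 120
```

For an override like this to take effect, the variable would need to be set in the environment of the dashboard process before it starts.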