[dashboard] Increase GCS health check failure threshold. (#31939)

When GCS is overloaded, the dashboard's GCS health check fails easily, and the dashboard then exits. Right now there is no way for users to restart the dashboard, so once it exits the cluster loses all the functions implemented there, like the status API, logs, jobs, ...

To mitigate this, the failure threshold is increased from 10 to 40, so the maximum time before the dashboard gives up is about 10 min (40 failed checks, each taking up to a 5 s interval plus a 10 s RPC timeout). A better solution is still needed to handle dashboard failure.

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
ray-project · Jan 26, 2023 · a703a91
1 parent da79ae9
commit a703a91
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/dashboard/consts.py b/dashboard/consts.py
@@ -28,7 +28,7 @@
 GCS_SERVER_ADDRESS = "GcsServerAddress"
 # GCS check alive
 GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR = env_integer(
-    "GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR", 10
+    "GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR", 40
 )
 GCS_CHECK_ALIVE_INTERVAL_SECONDS = env_integer("GCS_CHECK_ALIVE_INTERVAL_SECONDS", 5)
 GCS_CHECK_ALIVE_RPC_TIMEOUT = env_integer("GCS_CHECK_ALIVE_RPC_TIMEOUT", 10)
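
To make the "10 min" figure in the commit message concrete, here is a minimal sketch of the arithmetic and of a consecutive-failure counter. It is not the dashboard's actual code: `env_integer` below is a stand-in with assumed behavior, `worst_case_seconds` and `should_exit` are hypothetical helpers, and the sketch assumes each failed check waits out the full RPC timeout plus one interval and that a successful check resets the counter.

```python
import os


def env_integer(key: str, default: int) -> int:
    # Stand-in for the helper used in dashboard/consts.py (assumed behavior):
    # read an integer from the environment, falling back to the default.
    value = os.environ.get(key)
    return int(value) if value is not None else default


def worst_case_seconds(max_errors: int, interval_s: int, rpc_timeout_s: int) -> int:
    # Assumption: every failed check takes the full RPC timeout and is
    # followed by one full sleep interval before the next attempt.
    return max_errors * (interval_s + rpc_timeout_s)


def should_exit(results, max_errors: int) -> bool:
    # Hypothetical loop logic: give up once `max_errors` consecutive checks
    # have failed; any successful check resets the counter.
    consecutive = 0
    for ok in results:
        if ok:
            consecutive = 0
            continue
        consecutive += 1
        if consecutive >= max_errors:
            return True
    return False


if __name__ == "__main__":
    max_errors = env_integer("GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR", 40)
    interval_s = env_integer("GCS_CHECK_ALIVE_INTERVAL_SECONDS", 5)
    rpc_timeout_s = env_integer("GCS_CHECK_ALIVE_RPC_TIMEOUT", 10)
    print(worst_case_seconds(max_errors, interval_s, rpc_timeout_s))
```

Under these assumptions the old default of 10 gives 10 × (5 + 10) = 150 s, and the new default of 40 gives 600 s. Because the constant is read through `env_integer`, the threshold can presumably still be overridden by setting `GCS_CHECK_ALIVE_MAX_COUNT_OF_RPC_ERROR` in the dashboard's environment.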