[CI] [Hackathon] Add dockerfiles for decoupled bootstrapping/Library tests (#28535)

* [core/ci] Disallow protobuf 3.19.5 (#28504)

This leads to hangs in the Ray client (e.g., test_dataclient_disconnect).

Signed-off-by: Kai Fricke <kai@anyscale.com>
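
A hypothetical sketch of what the corresponding pin could look like in setup.py (the exact constraint string in Ray's setup.py may differ):

    # Hypothetical dependency pin excluding the release that hangs Ray client.
    install_requires = [
        "protobuf != 3.19.5",
    ]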

* [tune] Fix trial checkpoint syncing after recovery from other node (#28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated yet. Instead, the syncer callback should keep its own map of trials to IPs, and act only on that map.

Signed-off-by: Kai Fricke <kai@anyscale.com>
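
A rough sketch of the approach described above (names are illustrative, not Ray's actual implementation):

    class SyncerCallback:
        """Illustrative sketch: track trial -> node IP independently."""

        def __init__(self):
            self._trial_ips = {}  # trial_id -> last known node IP

        def on_trial_result(self, iteration, trials, trial, result, **info):
            # Refresh the IP from each incoming result instead of trusting
            # trial.last_result, which may still point at the pre-recovery node.
            ip = result.get("node_ip")
            if ip:
                self._trial_ips[trial.trial_id] = ip

        def _remote_ip(self, trial):
            return self._trial_ips.get(trial.trial_id)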

* [air] minor example fix. (#28379)

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [cleanup] Remove memory unit conversion (#28396)

The internal memory unit was switched back to bytes years ago; there's no point in keeping confusing conversion code around anymore.

Recommendation: Review #28394 first, since this is stacked on top of it.

Co-authored-by: Alex <alex@anyscale.com>
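
Since the internal unit is plain bytes, memory requests pass through unchanged; for example (standard Ray API, amount illustrative):

    import ray

    # Memory resources are specified directly in bytes; no conversion happens.
    @ray.remote(memory=500 * 1024 * 1024)  # request 500 MiB for this task
    def f():
        return "ok"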

* [RLlib] Sync policy specs from local_worker_for_synching while recovering rollout/eval workers. (#28422)

* Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)

* Cast rewards as tf.float32 to fix error in DQN in tf2

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

* Add test case for DQN with integer rewards

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>
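
A minimal illustration of the kind of cast described (assuming `rewards` arrives as an integer tensor):

    import tensorflow as tf

    rewards = tf.constant([0, 1, 1])        # integer rewards from the env
    rewards = tf.cast(rewards, tf.float32)  # cast before the TD-error math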

* [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)

This addresses some of the issues brought up in #28484.

* [ci] Increase timeout on test_metrics (#28508)

10 milliseconds is ambitious for the CI to do anything.

Co-authored-by: Alex <alex@anyscale.com>

* [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)

* Add imports to object-spilling.rst Python code (#28507)

* Add imports to object-spilling.rst Python code

Also adjust a couple descriptions, retaining the same general information

Signed-off-by: Jake <DevJake@users.noreply.github.com>

* fix doc build / keep note formatting

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

* another tiny fix

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

Signed-off-by: Jake <DevJake@users.noreply.github.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>

* [AIR] Make PathPartitionScheme a dataclass (#28390)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
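
A hedged sketch of what the dataclass conversion means in practice (field names are illustrative, not necessarily Ray's exact signature):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class PathPartitionScheme:
        style: str = "hive"
        base_dir: Optional[str] = None
        field_names: Optional[List[str]] = None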

* [Telemetry][Kubernetes] Distinguish Kubernetes deployment stacks (#28490)

Right now, Ray telemetry indicates that the majority of Ray's CPU-hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on which method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

- KubeRay >= 0.4.0
- Legacy Ray Operator with Ray >= 2.1.0
- All other methods

The strategy is to have the operators inject an env variable into the Ray container's environment. The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
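
A simplified sketch of the detection order this implies (the env variable names are the ones added to usage_constants.py in the diff below):

    import os

    if "KUBERNETES_SERVICE_HOST" in os.environ:
        if "RAY_USAGE_STATS_KUBERAY_IN_USE" in os.environ:
            cloud_provider = "kuberay"              # KubeRay >= 0.4.0
        elif "RAY_USAGE_STATS_LEGACY_OPERATOR_IN_USE" in os.environ:
            cloud_provider = "legacy_ray_operator"  # legacy operator, Ray >= 2.1.0
        else:
            cloud_provider = "kubernetes"           # all other methods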

* [autoscaler][observability] Experimental verbose mode (#28392)

This PR introduces a super secret hidden verbose mode for `ray status`, which we can keep hidden while collecting feedback before going through the process of officially declaring it part of the public API.

Example output

======== Autoscaler status: 2020-12-28 01:02:03 ========
GCS request time: 3.141500s
Node Provider non_terminated_nodes time: 1.618000s

Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Total Usage:
 1/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00/8.000 GiB memory
 3.14/16.000 GiB object_store_memory

Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()

Node: 192.168.1.1
 Usage:
  0.1/1 AcceleratorType:V100
  5.0/20.0 CPU
  0.7/1 GPU
  1.00/4.000 GiB memory
  3.14/4.000 GiB object_store_memory

Node: 192.168.1.2
 Usage:
  0.9/1 AcceleratorType:V100
  15.0/20.0 CPU
  0.3/1 GPU
  1.00/12.000 GiB memory
  0.00/4.000 GiB object_store_memory

Co-authored-by: Alex <alex@anyscale.com>
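
For reference, the "{'CPU': 16}: 100+ from request_resources()" line in the demands section corresponds to explicit requests made through the autoscaler SDK, e.g.:

    from ray.autoscaler.sdk import request_resources

    # Ask the autoscaler to hold capacity for at least 16 CPUs.
    request_resources(num_cpus=16)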

* [doc/tune] fix tune stopper attribute name (#28517)

* [doc] Fix tune stopper doctests (#28531)

* [air] Use self-hosted mirror for CIFAR10 dataset (#28480)

The CIFAR10 website host has been unreliable in the past. This PR injects our own mirror into our CI packages for testing.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* draft

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>
Signed-off-by: Jake <DevJake@users.noreply.github.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: mgerstgrasser <matthias@gerstgrasser.net>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Jake <DevJake@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Co-authored-by: Árpád Rózsás <rozsasarpi@gmail.com>
12 people authored Sep 15, 2022
1 parent b787f03 commit 06250b0
Showing 48 changed files with 1,213 additions and 464 deletions.
57 changes: 57 additions & 0 deletions ci/docker/Dockerfile.base
@@ -0,0 +1,57 @@
FROM ubuntu:focal

ARG REMOTE_CACHE_URL
ARG BUILDKITE_PULL_REQUEST
ARG BUILDKITE_COMMIT
ARG BUILDKITE_PULL_REQUEST_BASE_BRANCH
ARG PYTHON=3.6
ARG INSTALL_DEPENDENCIES

ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=America/Los_Angeles

ENV BUILDKITE=true
ENV CI=true
ENV PYTHON=$PYTHON
ENV RAY_USE_RANDOM_PORTS=1
ENV RAY_DEFAULT_BUILD=1
ENV RAY_INSTALL_JAVA=1
ENV BUILDKITE_PULL_REQUEST=${BUILDKITE_PULL_REQUEST}
ENV BUILDKITE_COMMIT=${BUILDKITE_COMMIT}
ENV BUILDKITE_PULL_REQUEST_BASE_BRANCH=${BUILDKITE_PULL_REQUEST_BASE_BRANCH}
# For wheel build
# https://github.com/docker-library/docker/blob/master/20.10/docker-entrypoint.sh
ENV DOCKER_TLS_CERTDIR=/certs
ENV DOCKER_HOST=tcp://docker:2376
ENV DOCKER_TLS_VERIFY=1
ENV DOCKER_CERT_PATH=/certs/client
ENV TRAVIS_COMMIT=${BUILDKITE_COMMIT}
ENV BUILDKITE_BAZEL_CACHE_URL=${REMOTE_CACHE_URL}

RUN apt-get update -qq && apt-get upgrade -qq
RUN apt-get install -y -qq \
curl python-is-python3 git build-essential \
sudo unzip unrar apt-utils dialog tzdata wget rsync \
language-pack-en tmux cmake gdb vim htop \
libgtk2.0-dev zlib1g-dev libgl1-mesa-dev maven \
openjdk-8-jre openjdk-8-jdk clang-format-12 jq \
clang-tidy-12 clang-12
# Make using GCC 9 explicit.
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90 --slave /usr/bin/g++ g++ /usr/bin/g++-9 \
--slave /usr/bin/gcov gcov /usr/bin/gcov-9
RUN ln -s /usr/bin/clang-format-12 /usr/bin/clang-format && \
ln -s /usr/bin/clang-tidy-12 /usr/bin/clang-tidy && \
ln -s /usr/bin/clang-12 /usr/bin/clang

RUN curl -o- https://get.docker.com | sh

# System conf for tests
RUN locale -a
ENV LC_ALL=en_US.utf8
ENV LANG=en_US.utf8
RUN echo "ulimit -c 0" >> /root/.bashrc

# Set up Bazel caches
RUN (echo "build --remote_cache=${REMOTE_CACHE_URL}" >> /root/.bazelrc); \
(if [ "${BUILDKITE_PULL_REQUEST}" != "false" ]; then (echo "build --remote_upload_local_results=false" >> /root/.bazelrc); fi); \
cat /root/.bazelrc
15 changes: 15 additions & 0 deletions ci/docker/Dockerfile.build
@@ -0,0 +1,15 @@
FROM [Dockerfile.base image]

RUN mkdir /ray
WORKDIR /ray

# The steps below should be re-run each time
COPY . .
RUN ./ci/ci.sh init
RUN bash --login -i ./ci/ci.sh build

RUN (if [ "${INSTALL_DEPENDENCIES}" = "ML" ]; then RLLIB_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 bash --login -i ./ci/env/install-dependencies.sh; fi)

# Determine which tests to run
RUN bash --login -i -c "python ./ci/pipeline/determine_tests_to_run.py --output=json > affected_set.json"
RUN cat affected_set.json
2 changes: 2 additions & 0 deletions ci/docker/Dockerfile.gpu
@@ -0,0 +1,2 @@
FROM nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04

3 changes: 3 additions & 0 deletions ci/docker/Dockerfile.ml
@@ -0,0 +1,3 @@
FROM [Dockerfile.test image]

RUN RLLIB_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 bash --login -i ./ci/env/install-dependencies.sh
1 change: 1 addition & 0 deletions ci/docker/Dockerfile.test
@@ -0,0 +1 @@
FROM ubuntu:focal
12 changes: 12 additions & 0 deletions ci/env/install-dependencies.sh
@@ -421,6 +421,18 @@ install_dependencies() {
pip install --upgrade tensorflow-probability=="${TFP_VERSION}" tensorflow=="${TF_VERSION}"
fi

# Inject our own mirror for the CIFAR10 dataset
if [ "${TRAIN_TESTING-}" = 1 ] || [ "${TUNE_TESTING-}" = 1 ] || [ "${DOC_TESTING-}" = 1 ]; then
SITE_PACKAGES=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')
TF_CIFAR="${SITE_PACKAGES}/tensorflow/python/keras/datasets/cifar10.py"
TORCH_CIFAR="${SITE_PACKAGES}/torchvision/datasets/cifar.py"

[ -f "$TF_CIFAR" ] && sed -i 's https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz https://air-example-data.s3.us-west-2.amazonaws.com/cifar-10-python.tar.gz g' \
"$TF_CIFAR"
[ -f "$TORCH_CIFAR" ] &&sed -i 's https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz https://air-example-data.s3.us-west-2.amazonaws.com/cifar-10-python.tar.gz g' \
"$TORCH_CIFAR"
fi

# Additional Tune dependency for Horovod.
# This must be run last (i.e., torch cannot be re-installed after this)
if [ "${INSTALL_HOROVOD-}" = 1 ]; then
27 changes: 23 additions & 4 deletions doc/source/ray-core/objects/object-spilling.rst
@@ -7,12 +7,15 @@ Ray 1.3+ spills objects to external storage once the object store is full. By de
Single node
-----------

- Ray uses object spilling by default. Without any setting, objects are spilled to `[temp_folder]/spill`. `temp_folder` is `/tmp` for Linux and MacOS by default.
+ Ray uses object spilling by default. Without any setting, objects are spilled to `[temp_folder]/spill`. On Linux and MacOS, the `temp_folder` is `/tmp` by default.

- To configure the directory where objects are placed, use:
+ To configure the directory where objects are spilled to, use:

.. code-block:: python
import json
import ray
ray.init(
_system_config={
"object_spilling_config": json.dumps(
@@ -26,6 +29,9 @@ usage across multiple physical devices if needed (e.g., SSD devices):

.. code-block:: python
import json
import ray
ray.init(
_system_config={
"max_io_workers": 4, # More IO workers for parallelism.
@@ -46,14 +52,18 @@ usage across multiple physical devices if needed (e.g., SSD devices):
},
)
.. note::

- To optimize the performance, it is recommended to use SSD instead of HDD when using object spilling for memory intensive workloads.
+ To optimize the performance, it is recommended to use an SSD instead of an HDD when using object spilling for memory-intensive workloads.

If you are using an HDD, it is recommended that you specify a large buffer size (> 1MB) to reduce IO requests during spilling.

.. code-block:: python
import json
import ray
ray.init(
_system_config={
"object_spilling_config": json.dumps(
@@ -74,6 +84,9 @@ The default threshold is 0.95 (95%). You can adjust the threshold by setting ``l

.. code-block:: python
import json
import ray
ray.init(
_system_config={
# Allow spilling until the local disk is 99% utilized.
@@ -94,6 +107,9 @@ To enable object spilling to remote storage (any URI supported by `smart_open <h
.. code-block:: python
import json
import ray
ray.init(
_system_config={
"max_io_workers": 4, # More IO workers for remote storage.
@@ -116,6 +132,9 @@ Spilling to multiple remote storages is also supported.
.. code-block:: python
import json
import ray
ray.init(
_system_config={
"max_io_workers": 4, # More IO workers for remote storage.
@@ -124,7 +143,7 @@ Spilling to multiple remote storages is also supported.
{
"type": "smart_open",
"params": {
"uri": ["s3://bucket/path1", "s3://bucket/path2, "s3://bucket/path3"],
"uri": ["s3://bucket/path1", "s3://bucket/path2", "s3://bucket/path3"],
},
"buffer_size": 100 * 1024 * 1024, # Use a 100MB buffer for writes
},
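For orientation, the hunks above all extend variants of the same snippet; a complete, runnable version of the basic single-node spilling configuration looks roughly like this (directory path illustrative):

    import json
    import ray

    ray.init(
        _system_config={
            "object_spilling_config": json.dumps(
                {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
            )
        },
    )
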
5 changes: 5 additions & 0 deletions doc/source/tune/api_docs/stoppers.rst
@@ -45,3 +45,8 @@ TimeoutStopper (tune.stopper.TimeoutStopper)
--------------------------------------------

.. autoclass:: ray.tune.stopper.TimeoutStopper

CombinedStopper (tune.stopper.CombinedStopper)
----------------------------------------------

.. autoclass:: ray.tune.stopper.CombinedStopper
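
A hedged usage sketch for the newly documented class, combining it with other stoppers from the same module:

    from ray.tune.stopper import (
        CombinedStopper,
        MaximumIterationStopper,
        TrialPlateauStopper,
    )

    # Stops a trial as soon as any of the wrapped stoppers fires.
    stopper = CombinedStopper(
        MaximumIterationStopper(max_iter=20),
        TrialPlateauStopper(metric="loss"),
    )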
35 changes: 0 additions & 35 deletions python/ray/_private/ray_constants.py
@@ -1,7 +1,6 @@
"""Ray constants used in the Python code."""

import logging
import math
import os

logger = logging.getLogger(__name__)
@@ -118,9 +117,6 @@ def env_bool(key, default):
# for large resource quantities due to bookkeeping of specific resource IDs.
MAX_RESOURCE_QUANTITY = 100e12

# Each memory "resource" counts as this many bytes of memory.
MEMORY_RESOURCE_UNIT_BYTES = 1

# Number of units 1 resource can be subdivided into.
MIN_RESOURCE_GRANULARITY = 0.0001

@@ -132,37 +128,6 @@ def env_bool(key, default):
RAY_OVERRIDE_DASHBOARD_URL = "RAY_OVERRIDE_DASHBOARD_URL"


def round_to_memory_units(memory_bytes, round_up):
"""Round bytes to the nearest memory unit."""
return from_memory_units(to_memory_units(memory_bytes, round_up))


def from_memory_units(memory_units):
"""Convert from memory units -> bytes."""
return memory_units * MEMORY_RESOURCE_UNIT_BYTES


def to_memory_units(memory_bytes, round_up):
"""Convert from bytes -> memory units."""
value = memory_bytes / MEMORY_RESOURCE_UNIT_BYTES
if value < 1:
raise ValueError(
"The minimum amount of memory that can be requested is {} bytes, "
"however {} bytes was asked.".format(
MEMORY_RESOURCE_UNIT_BYTES, memory_bytes
)
)
if isinstance(value, float) and not value.is_integer():
# TODO(ekl) Ray currently does not support fractional resources when
# the quantity is greater than one. We should fix memory resources to
# be allocated in units of bytes and not 100MB.
if round_up:
value = int(math.ceil(value))
else:
value = int(math.floor(value))
return int(value)


# Different types of Ray errors that can be pushed to the driver.
# TODO(rkn): These should be defined in flatbuffers and must be synced with
# the existing C++ definitions.
9 changes: 2 additions & 7 deletions python/ray/_private/resource_spec.py
@@ -90,17 +90,12 @@ def to_resource_dict(self):
"""
assert self.resolved()

- memory_units = ray_constants.to_memory_units(self.memory, round_up=False)
- object_store_memory_units = ray_constants.to_memory_units(
-     self.object_store_memory, round_up=False
- )

resources = dict(
self.resources,
CPU=self.num_cpus,
GPU=self.num_gpus,
- memory=memory_units,
- object_store_memory=object_store_memory_units,
+ memory=self.memory,
+ object_store_memory=self.object_store_memory,
)

resources = {
8 changes: 8 additions & 0 deletions python/ray/_private/usage/usage_constants.py
@@ -49,3 +49,11 @@
EXTRA_USAGE_TAG_PREFIX = "extra_usage_tag_"

USAGE_STATS_NAMESPACE = "usage_stats"

KUBERNETES_SERVICE_HOST_ENV = "KUBERNETES_SERVICE_HOST"
KUBERAY_ENV = "RAY_USAGE_STATS_KUBERAY_IN_USE"
LEGACY_RAY_OPERATOR_ENV = "RAY_USAGE_STATS_LEGACY_OPERATOR_IN_USE"

PROVIDER_KUBERNETES_GENERIC = "kubernetes"
PROVIDER_KUBERAY = "kuberay"
PROVIDER_LEGACY_RAY_OPERATOR = "legacy_ray_operator"
13 changes: 11 additions & 2 deletions python/ray/_private/usage/usage_lib.py
@@ -757,8 +757,17 @@ def get_instance_type(node_config):
except FileNotFoundError:
# It's a manually started cluster or k8s cluster
result = ClusterConfigToReport()
if "KUBERNETES_SERVICE_HOST" in os.environ:
result.cloud_provider = "kubernetes"
# Check if we're on Kubernetes
if usage_constant.KUBERNETES_SERVICE_HOST_ENV in os.environ:
# Check if we're using KubeRay >= 0.4.0.
if usage_constant.KUBERAY_ENV in os.environ:
result.cloud_provider = usage_constant.PROVIDER_KUBERAY
# Check if we're using the legacy Ray Operator with Ray >= 2.1.0.
elif usage_constant.LEGACY_RAY_OPERATOR_ENV in os.environ:
result.cloud_provider = usage_constant.PROVIDER_LEGACY_RAY_OPERATOR
# Else, we're on Kubernetes but not in either of the above categories.
else:
result.cloud_provider = usage_constant.PROVIDER_KUBERNETES_GENERIC
return result
except Exception as e:
logger.info(f"Failed to get cluster config to report {e}")
6 changes: 2 additions & 4 deletions python/ray/_private/utils.py
@@ -395,11 +395,9 @@ def resources_from_ray_options(options_dict: Dict[str, Any]) -> Dict[str, Any]:
if num_gpus is not None:
resources["GPU"] = num_gpus
if memory is not None:
resources["memory"] = ray_constants.to_memory_units(memory, round_up=True)
resources["memory"] = memory
if object_store_memory is not None:
resources["object_store_memory"] = ray_constants.to_memory_units(
object_store_memory, round_up=True
)
resources["object_store_memory"] = object_store_memory
if accelerator_type is not None:
resources[
f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
12 changes: 10 additions & 2 deletions python/ray/autoscaler/_private/autoscaler.py
@@ -113,6 +113,7 @@ class NonTerminatedNodes:
"""Class to extract and organize information on non-terminated nodes."""

def __init__(self, provider: NodeProvider):
start_time = time.time()
# All non-terminated nodes
self.all_node_ids = provider.non_terminated_nodes({})

@@ -128,8 +129,15 @@ def __init__(self, provider: NodeProvider):
elif node_kind == NODE_KIND_HEAD:
self.head_id = node

- # Note: For typical use-cases,
- # self.all_node_ids == self.worker_ids + [self.head_id]
+ # Note: For typical use-cases, self.all_node_ids == self.worker_ids +
+ # [self.head_id]. The difference being in the case of unmanaged nodes.

# Record the time of the non_terminated nodes call. This typically
# translates to a "describe" or "list" call on most cluster managers
# which can be quite expensive. Note that we include the processing
# time because on some clients, there may be pagination and the
# underlying api calls may be done lazily.
self.non_terminated_nodes_time = time.time() - start_time

def remove_terminating_nodes(self, terminating_nodes: List[NodeID]) -> None:
"""Remove nodes we're in the process of terminating from internal
11 changes: 9 additions & 2 deletions python/ray/autoscaler/_private/commands.py
@@ -125,14 +125,16 @@ def try_reload_log_state(provider_config: Dict[str, Any], log_state: dict) -> No
return reload_log_state(log_state)


- def debug_status(status, error) -> str:
+ def debug_status(status, error, verbose: bool = False) -> str:
"""Return a debug string for the autoscaler."""
if status:
status = status.decode("utf-8")
status_dict = json.loads(status)
lm_summary_dict = status_dict.get("load_metrics_report")
autoscaler_summary_dict = status_dict.get("autoscaler_report")
timestamp = status_dict.get("time")
gcs_request_time = status_dict.get("gcs_request_time")
non_terminated_nodes_time = status_dict.get("non_terminated_nodes_time")
if lm_summary_dict and autoscaler_summary_dict and timestamp:
lm_summary = LoadMetricsSummary(**lm_summary_dict)
node_availability_summary_dict = autoscaler_summary_dict.pop(
@@ -147,7 +149,12 @@ def debug_status(status, error) -> str:
)
report_time = datetime.datetime.fromtimestamp(timestamp)
status = format_info_string(
- lm_summary, autoscaler_summary, time=report_time
+ lm_summary,
+ autoscaler_summary,
+ time=report_time,
+ gcs_request_time=gcs_request_time,
+ non_terminated_nodes_time=non_terminated_nodes_time,
+ verbose=verbose,
)
else:
status = "No cluster status."
5 changes: 4 additions & 1 deletion python/ray/autoscaler/_private/constants.py
@@ -6,7 +6,6 @@
DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES,
DEFAULT_OBJECT_STORE_MEMORY_PROPORTION,
LOGGER_FORMAT,
- MEMORY_RESOURCE_UNIT_BYTES,
RESOURCES_ENVIRONMENT_VARIABLE,
)

@@ -60,6 +59,10 @@ def env_integer(key, default):
"AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S", 30 * 60
)

+ AUTOSCALER_REPORT_PER_NODE_STATUS = (
+     env_integer("AUTOSCALER_REPORT_PER_NODE_STATUS", 1) == 1
+ )

# The maximum allowed resource demand vector size to guarantee the resource
# demand scheduler bin packing algorithm takes a reasonable amount of time
# to run.
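Assuming the usual env_integer convention above, the per-node section of the verbose status output can presumably be disabled before starting the cluster (behavior inferred from the flag name, not verified):

    import os

    os.environ["AUTOSCALER_REPORT_PER_NODE_STATUS"] = "0"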