[CI] [Hackathon] Add dockerfiles for decoupled bootstrapping/Library tests (#28535)

* [core/ci] Disallow protobuf 3.19.5 (#28504)

This leads to hangs in the Ray client (e.g., test_dataclient_disconnect).

Signed-off-by: Kai Fricke <kai@anyscale.com>
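
A hypothetical sketch of what the corresponding pin could look like in setup.py (the exact constraint string in Ray's setup.py may differ):

    # Hypothetical dependency pin excluding the release that hangs Ray client.
    install_requires = [
        "protobuf != 3.19.5",
    ]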

* [tune] Fix trial checkpoint syncing after recovery from other node (#28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated yet. Instead, the syncer callback should keep its own map of trials to IPs, and act only on that map.

Signed-off-by: Kai Fricke <kai@anyscale.com>
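
A rough sketch of the approach described above (names are illustrative, not Ray's actual implementation):

    class SyncerCallback:
        """Illustrative sketch: track trial -> node IP independently."""

        def __init__(self):
            self._trial_ips = {}  # trial_id -> last known node IP

        def on_trial_result(self, iteration, trials, trial, result, **info):
            # Refresh the IP from each incoming result instead of trusting
            # trial.last_result, which may still point at the pre-recovery node.
            ip = result.get("node_ip")
            if ip:
                self._trial_ips[trial.trial_id] = ip

        def _remote_ip(self, trial):
            return self._trial_ips.get(trial.trial_id)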

* [air] minor example fix. (#28379)

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

* [cleanup] Remove memory unit conversion (#28396)

The internal memory unit was switched back to bytes years ago; there's no point in keeping confusing conversion code around anymore.

Recommendation: Review #28394 first, since this is stacked on top of it.

Co-authored-by: Alex <alex@anyscale.com>
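
Since the internal unit is plain bytes, memory requests pass through unchanged; for example (standard Ray API, amount illustrative):

    import ray

    # Memory resources are specified directly in bytes; no conversion happens.
    @ray.remote(memory=500 * 1024 * 1024)  # request 500 MiB for this task
    def f():
        return "ok"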

* [RLlib] Sync policy specs from local_worker_for_synching while recovering rollout/eval workers. (#28422)

* Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)

* Cast rewards as tf.float32 to fix error in DQN in tf2

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

* Add test case for DQN with integer rewards

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>

Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>
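
A minimal illustration of the kind of cast described (assuming `rewards` arrives as an integer tensor):

    import tensorflow as tf

    rewards = tf.constant([0, 1, 1])        # integer rewards from the env
    rewards = tf.cast(rewards, tf.float32)  # cast before the TD-error math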

* [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)

This addresses some of the issues brought up in #28484.

* [ci] Increase timeout on test_metrics (#28508)

10 milliseconds is ambitious for the CI to do anything.

Co-authored-by: Alex <alex@anyscale.com>

* [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)

* Add imports to object-spilling.rst Python code (#28507)

* Add imports to object-spilling.rst Python code

Also adjust a couple descriptions, retaining the same general information

Signed-off-by: Jake <DevJake@users.noreply.github.com>

* fix doc build / keep note formatting

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

* another tiny fix

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>

Signed-off-by: Jake <DevJake@users.noreply.github.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>

* [AIR] Make PathPartitionScheme a dataclass (#28390)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
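
A hedged sketch of what the dataclass conversion means in practice (field names are illustrative, not necessarily Ray's exact signature):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class PathPartitionScheme:
        style: str = "hive"
        base_dir: Optional[str] = None
        field_names: Optional[List[str]] = None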

* [Telemetry][Kubernetes] Distinguish Kubernetes deployment stacks (#28490)

Right now, Ray telemetry indicates that the majority of Ray's CPU-hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on which method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

- KubeRay >= 0.4.0
- Legacy Ray Operator with Ray >= 2.1.0
- All other methods

The strategy is to have the operators inject an env variable into the Ray container's environment. The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
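
A simplified sketch of the detection order this implies (the env variable names are the ones added to usage_constants.py in the diff below):

    import os

    if "KUBERNETES_SERVICE_HOST" in os.environ:
        if "RAY_USAGE_STATS_KUBERAY_IN_USE" in os.environ:
            cloud_provider = "kuberay"              # KubeRay >= 0.4.0
        elif "RAY_USAGE_STATS_LEGACY_OPERATOR_IN_USE" in os.environ:
            cloud_provider = "legacy_ray_operator"  # legacy operator, Ray >= 2.1.0
        else:
            cloud_provider = "kubernetes"           # all other methods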

* [autoscaler][observability] Experimental verbose mode (#28392)

This PR introduces a super secret hidden verbose mode for `ray status`, which we can keep hidden while collecting feedback before going through the process of officially declaring it part of the public API.

Example output

======== Autoscaler status: 2020-12-28 01:02:03 ========
GCS request time: 3.141500s
Node Provider non_terminated_nodes time: 1.618000s

Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Total Usage:
 1/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00/8.000 GiB memory
 3.14/16.000 GiB object_store_memory

Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()

Node: 192.168.1.1
 Usage:
  0.1/1 AcceleratorType:V100
  5.0/20.0 CPU
  0.7/1 GPU
  1.00/4.000 GiB memory
  3.14/4.000 GiB object_store_memory

Node: 192.168.1.2
 Usage:
  0.9/1 AcceleratorType:V100
  15.0/20.0 CPU
  0.3/1 GPU
  1.00/12.000 GiB memory
  0.00/4.000 GiB object_store_memory

Co-authored-by: Alex <alex@anyscale.com>
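
For reference, the "{'CPU': 16}: 100+ from request_resources()" line in the demands section corresponds to explicit requests made through the autoscaler SDK, e.g.:

    from ray.autoscaler.sdk import request_resources

    # Ask the autoscaler to hold capacity for at least 16 CPUs.
    request_resources(num_cpus=16)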

* [doc/tune] fix tune stopper attribute name (#28517)

* [doc] Fix tune stopper doctests (#28531)

* [air] Use self-hosted mirror for CIFAR10 dataset (#28480)

The CIFAR10 website host has been unreliable in the past. This PR injects our own mirror into our CI packages for testing.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* draft

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: mgerstgrasser <matthias@gerstgrasser.net>
Signed-off-by: Jake <DevJake@users.noreply.github.com>
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: mgerstgrasser <matthias@gerstgrasser.net>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Jake <DevJake@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Co-authored-by: Árpád Rózsás <rozsasarpi@gmail.com>
12 people authored Sep 15, 2022
1 parent b787f03 commit 06250b0
Showing 48 changed files with 1,213 additions and 464 deletions.
57 changes: 57 additions & 0 deletions ci/docker/Dockerfile.base
@@ -0,0 +1,57 @@
FROM ubuntu:focal

ARG REMOTE_CACHE_URL
ARG BUILDKITE_PULL_REQUEST
ARG BUILDKITE_COMMIT
ARG BUILDKITE_PULL_REQUEST_BASE_BRANCH
ARG PYTHON=3.6
ARG INSTALL_DEPENDENCIES

ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=America/Los_Angeles

ENV BUILDKITE=true
ENV CI=true
ENV PYTHON=$PYTHON
ENV RAY_USE_RANDOM_PORTS=1
ENV RAY_DEFAULT_BUILD=1
ENV RAY_INSTALL_JAVA=1
ENV BUILDKITE_PULL_REQUEST=${BUILDKITE_PULL_REQUEST}
ENV BUILDKITE_COMMIT=${BUILDKITE_COMMIT}
ENV BUILDKITE_PULL_REQUEST_BASE_BRANCH=${BUILDKITE_PULL_REQUEST_BASE_BRANCH}
# For wheel build
# https://github.com/docker-library/docker/blob/master/20.10/docker-entrypoint.sh
ENV DOCKER_TLS_CERTDIR=/certs
ENV DOCKER_HOST=tcp://docker:2376
ENV DOCKER_TLS_VERIFY=1
ENV DOCKER_CERT_PATH=/certs/client
ENV TRAVIS_COMMIT=${BUILDKITE_COMMIT}
ENV BUILDKITE_BAZEL_CACHE_URL=${REMOTE_CACHE_URL}

RUN apt-get update -qq && apt-get upgrade -qq
RUN apt-get install -y -qq \
curl python-is-python3 git build-essential \
sudo unzip unrar apt-utils dialog tzdata wget rsync \
language-pack-en tmux cmake gdb vim htop \
libgtk2.0-dev zlib1g-dev libgl1-mesa-dev maven \
openjdk-8-jre openjdk-8-jdk clang-format-12 jq \
clang-tidy-12 clang-12
# Make using GCC 9 explicit.
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90 --slave /usr/bin/g++ g++ /usr/bin/g++-9 \
--slave /usr/bin/gcov gcov /usr/bin/gcov-9
RUN ln -s /usr/bin/clang-format-12 /usr/bin/clang-format && \
ln -s /usr/bin/clang-tidy-12 /usr/bin/clang-tidy && \
ln -s /usr/bin/clang-12 /usr/bin/clang

RUN curl -o- https://get.docker.com | sh

# System conf for tests
RUN locale -a
ENV LC_ALL=en_US.utf8
ENV LANG=en_US.utf8
RUN echo "ulimit -c 0" >> /root/.bashrc

# Set up Bazel caches
RUN (echo "build --remote_cache=${REMOTE_CACHE_URL}" >> /root/.bazelrc); \
(if [ "${BUILDKITE_PULL_REQUEST}" != "false" ]; then (echo "build --remote_upload_local_results=false" >> /root/.bazelrc); fi); \
cat /root/.bazelrc
15 changes: 15 additions & 0 deletions ci/docker/Dockerfile.build
@@ -0,0 +1,15 @@
FROM [Dockerfile.base image]

RUN mkdir /ray
WORKDIR /ray

# The steps below should be re-run each time
COPY . .
RUN ./ci/ci.sh init
RUN bash --login -i ./ci/ci.sh build

RUN (if [ "${INSTALL_DEPENDENCIES}" = "ML" ]; then RLLIB_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 bash --login -i ./ci/env/install-dependencies.sh; fi)

# Determine which tests to run
RUN bash --login -i -c "python ./ci/pipeline/determine_tests_to_run.py --output=json > affected_set.json"
RUN cat affected_set.json
2 changes: 2 additions & 0 deletions ci/docker/Dockerfile.gpu
@@ -0,0 +1,2 @@
FROM nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04

3 changes: 3 additions & 0 deletions ci/docker/Dockerfile.ml
@@ -0,0 +1,3 @@
FROM [Dockerfile.test image]

RUN RLLIB_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 bash --login -i ./ci/env/install-dependencies.sh
1 change: 1 addition & 0 deletions ci/docker/Dockerfile.test
@@ -0,0 +1 @@
FROM ubuntu:focal
12 changes: 12 additions & 0 deletions ci/env/install-dependencies.sh
@@ -421,6 +421,18 @@ install_dependencies() {
pip install --upgrade tensorflow-probability=="${TFP_VERSION}" tensorflow=="${TF_VERSION}"
fi

# Inject our own mirror for the CIFAR10 dataset
if [ "${TRAIN_TESTING-}" = 1 ] || [ "${TUNE_TESTING-}" = 1 ] || [ "${DOC_TESTING-}" = 1 ]; then
SITE_PACKAGES=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')
TF_CIFAR="${SITE_PACKAGES}/tensorflow/python/keras/datasets/cifar10.py"
TORCH_CIFAR="${SITE_PACKAGES}/torchvision/datasets/cifar.py"

[ -f "$TF_CIFAR" ] && sed -i 's https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz https://air-example-data.s3.us-west-2.amazonaws.com/cifar-10-python.tar.gz g' \
"$TF_CIFAR"
[ -f "$TORCH_CIFAR" ] &&sed -i 's https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz https://air-example-data.s3.us-west-2.amazonaws.com/cifar-10-python.tar.gz g' \
"$TORCH_CIFAR"
fi

# Additional Tune dependency for Horovod.
# This must be run last (i.e., torch cannot be re-installed after this)
if [ "${INSTALL_HOROVOD-}" = 1 ]; then
27 changes: 23 additions & 4 deletions doc/source/ray-core/objects/object-spilling.rst
@@ -7,12 +7,15 @@ Ray 1.3+ spills objects to external storage once the object store is full. By de
Single node
-----------

- Ray uses object spilling by default. Without any setting, objects are spilled to `[temp_folder]/spill`. `temp_folder` is `/tmp` for Linux and MacOS by default.
+ Ray uses object spilling by default. Without any setting, objects are spilled to `[temp_folder]/spill`. On Linux and MacOS, the `temp_folder` is `/tmp` by default.

- To configure the directory where objects are placed, use:
+ To configure the directory where objects are spilled to, use:

.. code-block:: python
import json
import ray
ray.init(
_system_config={
"object_spilling_config": json.dumps(
@@ -26,6 +29,9 @@ usage across multiple physical devices if needed (e.g., SSD devices):

.. code-block:: python
import json
import ray
ray.init(
_system_config={
"max_io_workers": 4, # More IO workers for parallelism.
@@ -46,14 +52,18 @@ usage across multiple physical devices if needed (e.g., SSD devices):
},
)
.. note::

- To optimize the performance, it is recommended to use SSD instead of HDD when using object spilling for memory intensive workloads.
+ To optimize the performance, it is recommended to use an SSD instead of an HDD when using object spilling for memory-intensive workloads.

If you are using an HDD, it is recommended that you specify a large buffer size (> 1MB) to reduce IO requests during spilling.

.. code-block:: python
import json
import ray
ray.init(
_system_config={
"object_spilling_config": json.dumps(
@@ -74,6 +84,9 @@ The default threshold is 0.95 (95%). You can adjust the threshold by setting ``l

.. code-block:: python
import json
import ray
ray.init(
_system_config={
# Allow spilling until the local disk is 99% utilized.
@@ -94,6 +107,9 @@ To enable object spilling to remote storage (any URI supported by `smart_open <h
.. code-block:: python
import json
import ray
ray.init(
_system_config={
"max_io_workers": 4, # More IO workers for remote storage.
@@ -116,6 +132,9 @@ Spilling to multiple remote storages is also supported.
.. code-block:: python
import json
import ray
ray.init(
_system_config={
"max_io_workers": 4, # More IO workers for remote storage.
@@ -124,7 +143,7 @@ Spilling to multiple remote storages is also supported.
{
"type": "smart_open",
"params": {
"uri": ["s3://bucket/path1", "s3://bucket/path2, "s3://bucket/path3"],
"uri": ["s3://bucket/path1", "s3://bucket/path2", "s3://bucket/path3"],
},
"buffer_size": 100 * 1024 * 1024, # Use a 100MB buffer for writes
},
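For orientation, the hunks above all extend variants of the same snippet; a complete, runnable version of the basic single-node spilling configuration looks roughly like this (directory path illustrative):

    import json
    import ray

    ray.init(
        _system_config={
            "object_spilling_config": json.dumps(
                {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
            )
        },
    )
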
5 changes: 5 additions & 0 deletions doc/source/tune/api_docs/stoppers.rst
@@ -45,3 +45,8 @@ TimeoutStopper (tune.stopper.TimeoutStopper)
--------------------------------------------

.. autoclass:: ray.tune.stopper.TimeoutStopper

CombinedStopper (tune.stopper.CombinedStopper)
----------------------------------------------

.. autoclass:: ray.tune.stopper.CombinedStopper
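
A hedged usage sketch for the newly documented class, combining it with other stoppers from the same module:

    from ray.tune.stopper import (
        CombinedStopper,
        MaximumIterationStopper,
        TrialPlateauStopper,
    )

    # Stops a trial as soon as any of the wrapped stoppers fires.
    stopper = CombinedStopper(
        MaximumIterationStopper(max_iter=20),
        TrialPlateauStopper(metric="loss"),
    )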
35 changes: 0 additions & 35 deletions python/ray/_private/ray_constants.py
@@ -1,7 +1,6 @@
"""Ray constants used in the Python code."""

import logging
import math
import os

logger = logging.getLogger(__name__)
@@ -118,9 +117,6 @@ def env_bool(key, default):
# for large resource quantities due to bookkeeping of specific resource IDs.
MAX_RESOURCE_QUANTITY = 100e12

# Each memory "resource" counts as this many bytes of memory.
MEMORY_RESOURCE_UNIT_BYTES = 1

# Number of units 1 resource can be subdivided into.
MIN_RESOURCE_GRANULARITY = 0.0001

@@ -132,37 +128,6 @@ def env_bool(key, default):
RAY_OVERRIDE_DASHBOARD_URL = "RAY_OVERRIDE_DASHBOARD_URL"


def round_to_memory_units(memory_bytes, round_up):
"""Round bytes to the nearest memory unit."""
return from_memory_units(to_memory_units(memory_bytes, round_up))


def from_memory_units(memory_units):
"""Convert from memory units -> bytes."""
return memory_units * MEMORY_RESOURCE_UNIT_BYTES


def to_memory_units(memory_bytes, round_up):
"""Convert from bytes -> memory units."""
value = memory_bytes / MEMORY_RESOURCE_UNIT_BYTES
if value < 1:
raise ValueError(
"The minimum amount of memory that can be requested is {} bytes, "
"however {} bytes was asked.".format(
MEMORY_RESOURCE_UNIT_BYTES, memory_bytes
)
)
if isinstance(value, float) and not value.is_integer():
# TODO(ekl) Ray currently does not support fractional resources when
# the quantity is greater than one. We should fix memory resources to
# be allocated in units of bytes and not 100MB.
if round_up:
value = int(math.ceil(value))
else:
value = int(math.floor(value))
return int(value)


# Different types of Ray errors that can be pushed to the driver.
# TODO(rkn): These should be defined in flatbuffers and must be synced with
# the existing C++ definitions.
9 changes: 2 additions & 7 deletions python/ray/_private/resource_spec.py
@@ -90,17 +90,12 @@ def to_resource_dict(self):
"""
assert self.resolved()

- memory_units = ray_constants.to_memory_units(self.memory, round_up=False)
- object_store_memory_units = ray_constants.to_memory_units(
-     self.object_store_memory, round_up=False
- )

resources = dict(
self.resources,
CPU=self.num_cpus,
GPU=self.num_gpus,
- memory=memory_units,
- object_store_memory=object_store_memory_units,
+ memory=self.memory,
+ object_store_memory=self.object_store_memory,
)

resources = {
8 changes: 8 additions & 0 deletions python/ray/_private/usage/usage_constants.py
@@ -49,3 +49,11 @@
EXTRA_USAGE_TAG_PREFIX = "extra_usage_tag_"

USAGE_STATS_NAMESPACE = "usage_stats"

KUBERNETES_SERVICE_HOST_ENV = "KUBERNETES_SERVICE_HOST"
KUBERAY_ENV = "RAY_USAGE_STATS_KUBERAY_IN_USE"
LEGACY_RAY_OPERATOR_ENV = "RAY_USAGE_STATS_LEGACY_OPERATOR_IN_USE"

PROVIDER_KUBERNETES_GENERIC = "kubernetes"
PROVIDER_KUBERAY = "kuberay"
PROVIDER_LEGACY_RAY_OPERATOR = "legacy_ray_operator"
13 changes: 11 additions & 2 deletions python/ray/_private/usage/usage_lib.py
@@ -757,8 +757,17 @@ def get_instance_type(node_config):
except FileNotFoundError:
# It's a manually started cluster or k8s cluster
result = ClusterConfigToReport()
if "KUBERNETES_SERVICE_HOST" in os.environ:
result.cloud_provider = "kubernetes"
# Check if we're on Kubernetes
if usage_constant.KUBERNETES_SERVICE_HOST_ENV in os.environ:
# Check if we're using KubeRay >= 0.4.0.
if usage_constant.KUBERAY_ENV in os.environ:
result.cloud_provider = usage_constant.PROVIDER_KUBERAY
# Check if we're using the legacy Ray Operator with Ray >= 2.1.0.
elif usage_constant.LEGACY_RAY_OPERATOR_ENV in os.environ:
result.cloud_provider = usage_constant.PROVIDER_LEGACY_RAY_OPERATOR
# Else, we're on Kubernetes but not in either of the above categories.
else:
result.cloud_provider = usage_constant.PROVIDER_KUBERNETES_GENERIC
return result
except Exception as e:
logger.info(f"Failed to get cluster config to report {e}")
6 changes: 2 additions & 4 deletions python/ray/_private/utils.py
@@ -395,11 +395,9 @@ def resources_from_ray_options(options_dict: Dict[str, Any]) -> Dict[str, Any]:
if num_gpus is not None:
resources["GPU"] = num_gpus
if memory is not None:
resources["memory"] = ray_constants.to_memory_units(memory, round_up=True)
resources["memory"] = memory
if object_store_memory is not None:
resources["object_store_memory"] = ray_constants.to_memory_units(
object_store_memory, round_up=True
)
resources["object_store_memory"] = object_store_memory
if accelerator_type is not None:
resources[
f"{ray_constants.RESOURCE_CONSTRAINT_PREFIX}{accelerator_type}"
12 changes: 10 additions & 2 deletions python/ray/autoscaler/_private/autoscaler.py
@@ -113,6 +113,7 @@ class NonTerminatedNodes:
"""Class to extract and organize information on non-terminated nodes."""

def __init__(self, provider: NodeProvider):
start_time = time.time()
# All non-terminated nodes
self.all_node_ids = provider.non_terminated_nodes({})

@@ -128,8 +129,15 @@ def __init__(self, provider: NodeProvider):
elif node_kind == NODE_KIND_HEAD:
self.head_id = node

- # Note: For typical use-cases,
- # self.all_node_ids == self.worker_ids + [self.head_id]
+ # Note: For typical use-cases, self.all_node_ids == self.worker_ids +
+ # [self.head_id]. The difference being in the case of unmanaged nodes.

# Record the time of the non_terminated nodes call. This typically
# translates to a "describe" or "list" call on most cluster managers
# which can be quite expensive. Note that we include the processing
# time because on some clients, there may be pagination and the
# underlying api calls may be done lazily.
self.non_terminated_nodes_time = time.time() - start_time

def remove_terminating_nodes(self, terminating_nodes: List[NodeID]) -> None:
"""Remove nodes we're in the process of terminating from internal
11 changes: 9 additions & 2 deletions python/ray/autoscaler/_private/commands.py
@@ -125,14 +125,16 @@ def try_reload_log_state(provider_config: Dict[str, Any], log_state: dict) -> No
return reload_log_state(log_state)


- def debug_status(status, error) -> str:
+ def debug_status(status, error, verbose: bool = False) -> str:
"""Return a debug string for the autoscaler."""
if status:
status = status.decode("utf-8")
status_dict = json.loads(status)
lm_summary_dict = status_dict.get("load_metrics_report")
autoscaler_summary_dict = status_dict.get("autoscaler_report")
timestamp = status_dict.get("time")
gcs_request_time = status_dict.get("gcs_request_time")
non_terminated_nodes_time = status_dict.get("non_terminated_nodes_time")
if lm_summary_dict and autoscaler_summary_dict and timestamp:
lm_summary = LoadMetricsSummary(**lm_summary_dict)
node_availability_summary_dict = autoscaler_summary_dict.pop(
@@ -147,7 +149,12 @@ def debug_status(status, error) -> str:
)
report_time = datetime.datetime.fromtimestamp(timestamp)
status = format_info_string(
- lm_summary, autoscaler_summary, time=report_time
+ lm_summary,
+ autoscaler_summary,
+ time=report_time,
+ gcs_request_time=gcs_request_time,
+ non_terminated_nodes_time=non_terminated_nodes_time,
+ verbose=verbose,
)
else:
status = "No cluster status."
5 changes: 4 additions & 1 deletion python/ray/autoscaler/_private/constants.py
@@ -6,7 +6,6 @@
DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES,
DEFAULT_OBJECT_STORE_MEMORY_PROPORTION,
LOGGER_FORMAT,
- MEMORY_RESOURCE_UNIT_BYTES,
RESOURCES_ENVIRONMENT_VARIABLE,
)

@@ -60,6 +59,10 @@ def env_integer(key, default):
"AUTOSCALER_NODE_AVAILABILITY_MAX_STALENESS_S", 30 * 60
)

+ AUTOSCALER_REPORT_PER_NODE_STATUS = (
+     env_integer("AUTOSCALER_REPORT_PER_NODE_STATUS", 1) == 1
+ )

# The maximum allowed resource demand vector size to guarantee the resource
# demand scheduler bin packing algorithm takes a reasonable amount of time
# to run.
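Assuming the usual env_integer convention above, the per-node section of the verbose status output can presumably be disabled before starting the cluster (behavior inferred from the flag name, not verified):

    import os

    os.environ["AUTOSCALER_REPORT_PER_NODE_STATUS"] = "0"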