[core] memory monitor documentation (#29341)
Create docs for the memory monitor to show how it can be used and enabled, since it is disabled by default.

Also point to examples of how to resolve memory issues via num_cpus / memory-aware scheduling.
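
A minimal sketch of what "enabled" means here, assuming the monitor is toggled by the same _system_config knob the new example file below uses and that an interval of 0 leaves it off:

import ray

# Illustrative: a positive polling interval turns the memory monitor on;
# the default of 0 is what "disabled by default" refers to.
ray.init(_system_config={"memory_monitor_interval_ms": 100})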

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
4 people authored Oct 21, 2022
1 parent e742bc6 commit fe2e50f
Showing 7 changed files with 336 additions and 5 deletions.
89 changes: 89 additions & 0 deletions doc/source/ray-core/doc_code/ray_oom_prevention.py
@@ -0,0 +1,89 @@
# flake8: noqa
import ray

ray.init(
    _system_config={
        "memory_monitor_interval_ms": 100,  # how often the monitor polls memory usage
        "memory_usage_threshold_fraction": 0.4,  # kill workers once node usage exceeds 40%
        "min_memory_free_bytes": -1,  # -1 disables the absolute free-memory floor
    },
)
# fmt: off
# __oom_start__
import ray

@ray.remote
def allocate_memory():
    chunks = []
    bits_to_allocate = 8 * 100 * 1024 * 1024  # ~0.1 GiB
    while True:
        chunks.append([0] * bits_to_allocate)


try:
    ray.get(allocate_memory.remote())
except ray.exceptions.OutOfMemoryError as ex:
    print("task failed with OutOfMemoryError, which is expected")
# __oom_end__
# fmt: on
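
A task killed by the memory monitor is treated like any other worker failure, so it may be retried before the error above surfaces. A minimal variation that fails fast instead, using the standard max_retries task option (the choice of 0 is illustrative):

try:
    ray.get(allocate_memory.options(max_retries=0).remote())
except ray.exceptions.OutOfMemoryError:
    print("task failed on the first attempt since retries were disabled")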


# fmt: off
# __two_actors_start__
from math import ceil

import ray

# do not use these outside of this example as they are private methods.
from ray._private.utils import get_system_memory, get_used_memory


# Estimates the number of bytes to allocate to reach the desired memory usage percentage.
def get_additional_bytes_to_reach_memory_usage_pct(pct: float) -> int:
    used = get_used_memory()
    total = get_system_memory()
    bytes_needed = int(total * pct) - used
    assert (
        bytes_needed > 0
    ), "memory usage is already above the target. Increase the target percentage."
    return bytes_needed


@ray.remote
class MemoryHogger:
    def __init__(self):
        self.allocations = []

    def allocate(self, bytes_to_allocate: float) -> None:
        # divide by 8 as each element in the array occupies 8 bytes
        new_list = [0] * ceil(bytes_to_allocate / 8)
        self.allocations.append(new_list)


first_actor = MemoryHogger.options(
    max_restarts=1, max_task_retries=1, name="first_actor"
).remote()
second_actor = MemoryHogger.options(
    max_restarts=0, max_task_retries=0, name="second_actor"
).remote()

# Each actor allocates enough to reach 0.3 of system memory on its own; with
# the memory threshold at 0.4, the two allocations together are sure to cross it.
allocate_bytes = get_additional_bytes_to_reach_memory_usage_pct(0.3)

first_actor_task = first_actor.allocate.remote(allocate_bytes)
second_actor_task = second_actor.allocate.remote(allocate_bytes)

error_thrown = False
try:
    ray.get(first_actor_task)
except ray.exceptions.RayActorError as ex:
    error_thrown = True
    print("first actor was killed by memory monitor")
assert error_thrown

ray.get(second_actor_task)
print("finished second actor")
# __two_actors_end__
# fmt: on
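
The actor options are what make the outcome deterministic: the monitor prefers to kill workers that can be retried, so first_actor (max_restarts=1) is sacrificed while second_actor (max_restarts=0) runs to completion. The commit message also points at scheduling as the preventive fix. A hedged sketch of the num_cpus approach for tasks, where requesting more CPUs caps how many memory-heavy tasks run concurrently (the sizing is illustrative):

import ray

# Illustrative: on an 8-CPU node, num_cpus=4 lets at most two of these tasks
# run at once, halving peak application memory versus the default num_cpus=1.
@ray.remote(num_cpus=4)
def memory_heavy_task(batch):
    return sum(batch)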
1 change: 0 additions & 1 deletion doc/source/ray-core/objects.rst
@@ -180,6 +180,5 @@ More about Ray Objects
   :maxdepth: 1

   objects/serialization.rst
-  objects/memory-management.rst
   objects/object-spilling.rst
   objects/fault-tolerance.rst
11 changes: 11 additions & 0 deletions doc/source/ray-core/scheduling/index.rst
@@ -0,0 +1,11 @@
Scheduling
==========

This section provides an overview of how Ray schedules tasks and actors.

.. toctree::
   :maxdepth: 1

   placement-group
   memory-management
   ray-oom-prevention
doc/source/ray-core/{objects → scheduling}/memory-management.rst
@@ -21,7 +21,7 @@ Ray system memory: this is memory used internally by Ray

Application memory: this is memory used by your application
- **Worker heap**: memory used by your application (e.g., in Python code or TensorFlow), best measured as the *resident set size (RSS)* of your application minus its *shared memory usage (SHR)* in commands such as ``top``. The reason you need to subtract *SHR* is that object store shared memory is reported by the OS as shared with each worker. Not subtracting *SHR* will result in double counting memory usage.
-- **Object store memory**: memory used when your application creates objects in the object store via ``ray.put`` and when returning values from remote functions. Objects are reference counted and evicted when they fall out of scope. There is an object store server running on each node. In Ray 1.3+, objects will be `spilled to disk <object-spilling.html>`__ if the object store fills up.
+- **Object store memory**: memory used when your application creates objects in the object store via ``ray.put`` and when returning values from remote functions. Objects are reference counted and evicted when they fall out of scope. There is an object store server running on each node. In Ray 1.3+, objects will be :ref:`spilled to disk <object-spilling>` if the object store fills up.
- **Object store shared memory**: memory used when your application reads objects via ``ray.get``. Note that if an object is already present on the node, this does not cause additional allocations. This allows large objects to be efficiently shared among many actors and tasks.
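
The worker-heap bullet above prescribes RSS minus SHR; a minimal sketch of that measurement with psutil (psutil is an assumption of this example, not something the docs require; the shared field is Linux-only):

import psutil

def worker_heap_bytes(pid: int) -> int:
    # Worker heap ~= RSS - SHR: object-store shared memory shows up in each
    # worker's RSS, so subtracting SHR avoids double counting it.
    mem = psutil.Process(pid).memory_info()
    return mem.rss - mem.shared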

ObjectRef Reference Counting
@@ -234,6 +234,8 @@ In this example, we first create an object via ``ray.put()``, then capture its `
In the output of ``ray memory``, we see that the second object displays as a normal ``LOCAL_REFERENCE``, but the first object is listed as ``CAPTURED_IN_OBJECT``.
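
A minimal sketch of the pattern this hunk describes (names are illustrative):

import ray

inner = ray.put("payload")   # the first object
outer = ray.put([inner])     # captures inner's ObjectRef inside another object
# `ray memory` would now list `outer` as LOCAL_REFERENCE and
# `inner` as CAPTURED_IN_OBJECT.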

+.. _memory-aware-scheduling:
+
Memory Aware Scheduling
~~~~~~~~~~~~~~~~~~~~~~~
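
The section this new anchor labels covers requesting memory at submission time; a minimal illustration using the documented ``memory`` option (the amount is made up, and it is a logical reservation for scheduling, not a hard limit):

import ray

# Reserve 2 GiB of logical memory: the scheduler only places this task on a
# node with at least that much memory still unreserved.
@ray.remote(memory=2 * 1024**3)
def process_shard(shard):
    return len(shard)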

doc/source/ray-core/{ → scheduling}/placement-group.rst
@@ -198,7 +198,7 @@ Let's create a placement group. Recall that each bundle is a collection of resources

.. tabbed:: Python

-    .. literalinclude:: doc_code/original_resource_unavailable_example.py
+    .. literalinclude:: ../doc_code/original_resource_unavailable_example.py
        :language: python

.. tabbed:: Java
@@ -499,7 +499,7 @@ because they are scheduled on a placement group with the STRICT_PACK strategy.
.. tabbed:: Python

-    .. literalinclude:: doc_code/placement_group_capture_child_tasks_example.py
+    .. literalinclude:: ../doc_code/placement_group_capture_child_tasks_example.py
        :language: python

.. tabbed:: Java
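
The hunks above reference creating placement groups and the STRICT_PACK strategy; a minimal sketch of that API (the bundle shapes are illustrative):

import ray
from ray.util.placement_group import placement_group

ray.init()

# Two 1-CPU bundles that STRICT_PACK forces onto the same node.
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="STRICT_PACK")
ray.get(pg.ready())  # block until the bundles are reserved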
