[core] memory monitor documentation (#29341)
Create docs for the memory monitor to show how it can be used and enabled, since it is disabled by default.

Also point to examples of how to resolve memory issues via num_cpus / memory-aware scheduling.
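
A minimal sketch of what "enabled" means here, assuming the monitor is toggled by the same _system_config knob the new example file below uses and that an interval of 0 leaves it off:

import ray

# Illustrative: a positive polling interval turns the memory monitor on;
# the default of 0 is what "disabled by default" refers to.
ray.init(_system_config={"memory_monitor_interval_ms": 100})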

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
4 people authored Oct 21, 2022
1 parent e742bc6 commit fe2e50f
Showing 7 changed files with 336 additions and 5 deletions.
89 changes: 89 additions & 0 deletions doc/source/ray-core/doc_code/ray_oom_prevention.py
@@ -0,0 +1,89 @@
# flake8: noqa
import ray

ray.init(
    _system_config={
        "memory_monitor_interval_ms": 100,  # how often the monitor polls memory usage
        "memory_usage_threshold_fraction": 0.4,  # kill workers once node usage exceeds 40%
        "min_memory_free_bytes": -1,  # -1 disables the absolute free-memory floor
    },
)
# fmt: off
# __oom_start__
import ray

@ray.remote
def allocate_memory():
    chunks = []
    bits_to_allocate = 8 * 100 * 1024 * 1024  # ~0.1 GiB
    while True:
        chunks.append([0] * bits_to_allocate)


try:
    ray.get(allocate_memory.remote())
except ray.exceptions.OutOfMemoryError as ex:
    print("task failed with OutOfMemoryError, which is expected")
# __oom_end__
# fmt: on
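
A task killed by the memory monitor is treated like any other worker failure, so it may be retried before the error above surfaces. A minimal variation that fails fast instead, using the standard max_retries task option (the choice of 0 is illustrative):

try:
    ray.get(allocate_memory.options(max_retries=0).remote())
except ray.exceptions.OutOfMemoryError:
    print("task failed on the first attempt since retries were disabled")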


# fmt: off
# __two_actors_start__
from math import ceil

import ray

# do not use these outside of this example as they are private methods.
from ray._private.utils import get_system_memory, get_used_memory


# Estimates the number of bytes to allocate to reach the desired memory usage percentage.
def get_additional_bytes_to_reach_memory_usage_pct(pct: float) -> int:
    used = get_used_memory()
    total = get_system_memory()
    bytes_needed = int(total * pct) - used
    assert (
        bytes_needed > 0
    ), "memory usage is already above the target. Increase the target percentage."
    return bytes_needed


@ray.remote
class MemoryHogger:
    def __init__(self):
        self.allocations = []

    def allocate(self, bytes_to_allocate: float) -> None:
        # divide by 8 as each element in the array occupies 8 bytes
        new_list = [0] * ceil(bytes_to_allocate / 8)
        self.allocations.append(new_list)


first_actor = MemoryHogger.options(
    max_restarts=1, max_task_retries=1, name="first_actor"
).remote()
second_actor = MemoryHogger.options(
    max_restarts=0, max_task_retries=0, name="second_actor"
).remote()

# Each actor allocates enough to reach 0.3 of system memory on its own; with
# the memory threshold at 0.4, the two allocations together are sure to cross it.
allocate_bytes = get_additional_bytes_to_reach_memory_usage_pct(0.3)

first_actor_task = first_actor.allocate.remote(allocate_bytes)
second_actor_task = second_actor.allocate.remote(allocate_bytes)

error_thrown = False
try:
    ray.get(first_actor_task)
except ray.exceptions.RayActorError as ex:
    error_thrown = True
    print("first actor was killed by memory monitor")
assert error_thrown

ray.get(second_actor_task)
print("finished second actor")
# __two_actors_end__
# fmt: on
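
The actor options are what make the outcome deterministic: the monitor prefers to kill workers that can be retried, so first_actor (max_restarts=1) is sacrificed while second_actor (max_restarts=0) runs to completion. The commit message also points at scheduling as the preventive fix. A hedged sketch of the num_cpus approach for tasks, where requesting more CPUs caps how many memory-heavy tasks run concurrently (the sizing is illustrative):

import ray

# Illustrative: on an 8-CPU node, num_cpus=4 lets at most two of these tasks
# run at once, halving peak application memory versus the default num_cpus=1.
@ray.remote(num_cpus=4)
def memory_heavy_task(batch):
    return sum(batch)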
1 change: 0 additions & 1 deletion doc/source/ray-core/objects.rst
@@ -180,6 +180,5 @@ More about Ray Objects
   :maxdepth: 1

   objects/serialization.rst
-  objects/memory-management.rst
   objects/object-spilling.rst
   objects/fault-tolerance.rst
11 changes: 11 additions & 0 deletions doc/source/ray-core/scheduling/index.rst
@@ -0,0 +1,11 @@
Scheduling
==========

This section provides an overview of how Ray schedules tasks and actors.

.. toctree::
   :maxdepth: 1

   placement-group
   memory-management
   ray-oom-prevention
doc/source/ray-core/{objects → scheduling}/memory-management.rst
@@ -21,7 +21,7 @@ Ray system memory: this is memory used internally by Ray

Application memory: this is memory used by your application
- **Worker heap**: memory used by your application (e.g., in Python code or TensorFlow), best measured as the *resident set size (RSS)* of your application minus its *shared memory usage (SHR)* in commands such as ``top``. The reason you need to subtract *SHR* is that object store shared memory is reported by the OS as shared with each worker. Not subtracting *SHR* will result in double counting memory usage.
-- **Object store memory**: memory used when your application creates objects in the object store via ``ray.put`` and when returning values from remote functions. Objects are reference counted and evicted when they fall out of scope. There is an object store server running on each node. In Ray 1.3+, objects will be `spilled to disk <object-spilling.html>`__ if the object store fills up.
+- **Object store memory**: memory used when your application creates objects in the object store via ``ray.put`` and when returning values from remote functions. Objects are reference counted and evicted when they fall out of scope. There is an object store server running on each node. In Ray 1.3+, objects will be :ref:`spilled to disk <object-spilling>` if the object store fills up.
- **Object store shared memory**: memory used when your application reads objects via ``ray.get``. Note that if an object is already present on the node, this does not cause additional allocations. This allows large objects to be efficiently shared among many actors and tasks.
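
The worker-heap bullet above prescribes RSS minus SHR; a minimal sketch of that measurement with psutil (psutil is an assumption of this example, not something the docs require; the shared field is Linux-only):

import psutil

def worker_heap_bytes(pid: int) -> int:
    # Worker heap ~= RSS - SHR: object-store shared memory shows up in each
    # worker's RSS, so subtracting SHR avoids double counting it.
    mem = psutil.Process(pid).memory_info()
    return mem.rss - mem.shared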

ObjectRef Reference Counting
@@ -234,6 +234,8 @@ In this example, we first create an object via ``ray.put()``, then capture its `
In the output of ``ray memory``, we see that the second object displays as a normal ``LOCAL_REFERENCE``, but the first object is listed as ``CAPTURED_IN_OBJECT``.
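
A minimal sketch of the pattern this hunk describes (names are illustrative):

import ray

inner = ray.put("payload")   # the first object
outer = ray.put([inner])     # captures inner's ObjectRef inside another object
# `ray memory` would now list `outer` as LOCAL_REFERENCE and
# `inner` as CAPTURED_IN_OBJECT.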

+.. _memory-aware-scheduling:
+
Memory Aware Scheduling
~~~~~~~~~~~~~~~~~~~~~~~
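
The section this new anchor labels covers requesting memory at submission time; a minimal illustration using the documented ``memory`` option (the amount is made up, and it is a logical reservation for scheduling, not a hard limit):

import ray

# Reserve 2 GiB of logical memory: the scheduler only places this task on a
# node with at least that much memory still unreserved.
@ray.remote(memory=2 * 1024**3)
def process_shard(shard):
    return len(shard)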

doc/source/ray-core/{ → scheduling}/placement-group.rst
@@ -198,7 +198,7 @@ Let's create a placement group. Recall that each bundle is a collection of resources

.. tabbed:: Python

-    .. literalinclude:: doc_code/original_resource_unavailable_example.py
+    .. literalinclude:: ../doc_code/original_resource_unavailable_example.py
        :language: python

.. tabbed:: Java
@@ -499,7 +499,7 @@ because they are scheduled on a placement group with the STRICT_PACK strategy.
.. tabbed:: Python

-    .. literalinclude:: doc_code/placement_group_capture_child_tasks_example.py
+    .. literalinclude:: ../doc_code/placement_group_capture_child_tasks_example.py
        :language: python

.. tabbed:: Java
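
The hunks above reference creating placement groups and the STRICT_PACK strategy; a minimal sketch of that API (the bundle shapes are illustrative):

import ray
from ray.util.placement_group import placement_group

ray.init()

# Two 1-CPU bundles that STRICT_PACK forces onto the same node.
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="STRICT_PACK")
ray.get(pg.ready())  # block until the bundles are reserved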
