[Doc] Make doc code snippet testable [7/n] (ray-project#36960)
Change code snippets from .. code-block:: to .. testcode::

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
jjyao authored and arvind-chandra committed Aug 31, 2023
1 parent 1bde7da commit 6ab2809
Showing 5 changed files with 72 additions and 50 deletions.
9 changes: 7 additions & 2 deletions doc/source/ray-observability/key-concepts.rst
@@ -114,7 +114,7 @@ Ray has special support to improve the visibility of stdout and stderr produced

For the following code:

.. code-block:: python
.. testcode::

import ray
# Initiate a driver.
@@ -124,7 +124,12 @@ For the following code:
def task_foo():
print("task!")

ray.get(task.remote())
ray.get(task_foo.remote())

.. testoutput::
:options: +MOCK

(task_foo pid=12854) task!

#. Ray Task ``task_foo`` runs on a Ray Worker process. String ``task!`` is saved into the corresponding Worker ``stdout`` log file.
#. The Driver reads the Worker log file and sends it to its ``stdout`` (terminal), where you can see the string ``task!``. The same routing applies to Actor output, as sketched below.
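
The routing works the same way for Actor output; here is a minimal sketch (``Greeter`` is an illustrative name, and its print is echoed by the Driver just like the Task output above):

.. code-block:: python

    import ray

    @ray.remote
    class Greeter:
        def say(self):
            # Saved to this Worker's stdout log, then echoed by the Driver.
            print("actor!")

    greeter = Greeter.remote()
    ray.get(greeter.say.remote())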
@@ -6,31 +6,39 @@ Debugging Failures
What Kind of Failures Exist in Ray?
-----------------------------------

Ray consists of 2 major APIs. ``.remote()`` to create a task/actor and :func:`ray.get <ray.get>` to get the result.
Debugging Ray means identifying and fixing failures from remote processes that run functions and classes (task and actor) created by the ``.remote`` API.
Ray consists of two major APIs: ``.remote()`` to create a Task or Actor, and :func:`ray.get <ray.get>` to get the result.
Debugging Ray means identifying and fixing failures from remote processes that run functions and classes (Tasks and Actors) created by the ``.remote`` API.

Ray APIs are future APIs (indeed, it is :ref:`possible to convert Ray object references to standard Python future APIs <async-ref-to-futures>`),
and the error handling model is the same. When any remote tasks or actors fail, the returned object ref will contain an exception.
Ray APIs are future APIs (indeed, it is :ref:`possible to convert Ray object references to standard Python future APIs <async-ref-to-futures>`),
and the error handling model is the same. When any remote Tasks or Actors fail, the returned object ref contains an exception.
When you call the ``get`` API on the object ref, it raises the exception.

.. code-block:: python
.. testcode::

import ray
@ray.remote
def f():
raise ValueError
raise ValueError("it's an application error")

# Raises a ValueError.
ray.get(f.remote())
try:
ray.get(f.remote())
except ValueError as e:
print(e)

.. testoutput::

...
ValueError: it's an application error

In Ray, there are 3 types of failures. See exception APIs for more details.
In Ray, there are three types of failures. See exception APIs for more details.

- **Application failures**: This means the remote Task or Actor fails because of an error in the user code. In this case, the ``get`` API raises a :func:`RayTaskError <ray.exceptions.RayTaskError>` that includes the exception raised from the remote process (see the sketch after this list).
- **Intentional system failures**: This means Ray fails the Task or Actor, but the failure is intended. For example, when you call cancellation APIs like ``ray.cancel`` (for Tasks) or ``ray.kill`` (for Actors), the system fails remote Tasks and Actors, but it is intentional.
- **Unintended system failures**: This means remote Tasks and Actors fail due to unexpected system failures such as processes crashing (for example, from an out-of-memory error) or nodes failing.

1. `Linux Out of Memory killer <https://www.kernel.org/doc/gorman/html/understand/understand016.html>`_ or :ref:`Ray Memory Monitor <ray-oom-monitor>` kills processes with high memory usage to avoid running out of memory.
2. The machine shuts down (e.g., spot instance termination) or a :term:`raylet <raylet>` is crashed (e.g., by an unexpected failure).
2. The machine shuts down (e.g., spot instance termination) or a :term:`raylet <raylet>` crashes (e.g., because of an unexpected failure).
3. The system is highly overloaded or stressed (either the machine or system components like Raylet or :term:`GCS <GCS / Global Control Service>`), which makes the system unstable and causes it to fail.
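
As a rough illustration, here is a minimal sketch (assuming a running local Ray instance) of how an application failure from the first category surfaces through the ``get`` API; ``buggy_task`` is an illustrative name:

.. code-block:: python

    import ray
    from ray.exceptions import RayTaskError

    @ray.remote
    def buggy_task():
        raise ValueError("application failure")

    try:
        ray.get(buggy_task.remote())
    except RayTaskError as e:
        # RayTaskError wraps the ValueError raised in the remote process.
        print(e)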

Debugging Application Failures
@@ -42,8 +50,8 @@ Ray provides a debugging experience that's similar to debugging a single-process
print
~~~~~

``print`` debugging is one of the most common ways to debug Python programs.
:ref:`Ray's Task and Actor logs are printed to the Ray Driver <ray-worker-logs>` by default,
``print`` debugging is one of the most common ways to debug Python programs.
:ref:`Ray's Task and Actor logs are printed to the Ray Driver <ray-worker-logs>` by default,
which allows you to simply use the ``print`` function to debug application failures.
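
For example, a minimal sketch (the function and values are illustrative):

.. code-block:: python

    import ray

    @ray.remote
    def debug_me(x):
        # This print appears in the Driver's output because Worker logs
        # are forwarded to the Driver by default.
        print("intermediate value:", x * 2)
        return x * 2

    ray.get(debug_me.remote(21))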

Debugger
@@ -60,8 +68,8 @@ In a Ray cluster, arbitrary two system components can communicate with each othe
For example, some workers may need to communicate with GCS to schedule Actors (worker <-> GCS connection).
Your Driver can invoke Actor methods (worker <-> worker connection).

Ray can support 1000s of raylets and 10000s of worker processes. When a Ray cluster gets larger,
each component can have an increasing number of network connections which requires file descriptors.
Ray can support 1000s of raylets and 10000s of worker processes. When a Ray cluster gets larger,
each component can have an increasing number of network connections, which requires file descriptors.

Linux typically limits the default number of file descriptors per process to 1024. When there are
more than 1024 connections to a component, it can raise the error messages below.
@@ -76,7 +84,7 @@ we recommend you adjust the max file descriptors limit per process via the ``uli

We recommend you apply ``ulimit -n 65536`` to your host configuration. However, you can also selectively apply it to
Ray components (see the example below). Normally, each worker has 2~3 connections to GCS, and each raylet has 1~2 connections to GCS.
65536 file descriptors can handle 10000~15000 of workers and 1000~2000 of nodes.
65536 file descriptors can handle 10000~15000 workers and 1000~2000 nodes.
If you have more workers, you should consider using a higher number than 65536.

.. code-block:: bash
@@ -90,7 +98,7 @@ If you have more workers, you should consider using a higher number than 65536.
# Start a Ray driver with higher ulimit.
ulimit -n 65536 <python script>
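
As a quick, Ray-agnostic sanity check, the sketch below uses Python's standard ``resource`` module to inspect the current process's file-descriptor limits and to raise the soft limit when the hard limit already allows it:

.. code-block:: python

    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit: {soft}, hard limit: {hard}")

    # Raise the soft limit to 65536 only if the hard limit permits it;
    # otherwise the hard limit itself must be increased first.
    if soft < 65536 <= hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (65536, hard))
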
If that fails, double-check that the hard limit is sufficiently large by running ``ulimit -Hn``.
If that fails, double-check that the hard limit is sufficiently large by running ``ulimit -Hn``.
If it is too small, you can increase the hard limit as follows (these instructions work on EC2).

* Increase the hard ulimit for open file descriptors system-wide by running
@@ -63,8 +63,14 @@ Visualizing Tasks with Ray Timeline
View :ref:`how to use Ray Timeline in the Dashboard <dashboard-timeline>` for more details.

Instead of using the Dashboard UI to download the tracing file, you can also export it as a JSON file by running ``ray timeline`` from the command line or ``ray.timeline`` from the Python API.
.. code-block:: python
ray.timeline(filename="/tmp/timeline.json")

.. testcode::

import ray

ray.init()

ray.timeline(filename="timeline.json")
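
As a rough follow-up sketch (assuming the file above was written), the export is trace-event JSON that viewers such as ``chrome://tracing`` can load; here we only peek at it:

.. code-block:: python

    import json

    # Inspect the exported timeline; the structure is the trace-event
    # format understood by Chrome's tracing viewer.
    with open("timeline.json") as f:
        timeline = json.load(f)
    print(len(timeline))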


.. _dashboard-profiling:
@@ -110,7 +116,7 @@ not have root permissions, the Dashboard prompts with instructions on how to set
.. note::
If you run Ray in a Docker container, you may run into permission errors when using py-spy. Follow the `py-spy documentation`_ to resolve it.

.. _`py-spy documentation`: https://github.com/benfred/py-spy#how-do-i-run-py-spy-in-docker


@@ -135,7 +141,8 @@ changes to your application code (given that each section of the code you want
to profile is defined as its own function). To use cProfile, add an import
statement, then replace calls to the loop functions as follows:

.. code-block:: python
.. testcode::
:skipif: True

import cProfile # Added import statement
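
For orientation, here is a minimal self-contained sketch of wrapping Ray calls with cProfile; the ``slow_task`` and ``run_all`` names are illustrative, not the exact ones from the original file:

.. code-block:: python

    import cProfile
    import time

    import ray

    @ray.remote
    def slow_task():
        time.sleep(0.5)

    def run_all():
        # Launch five remote calls and block until they finish; cProfile
        # reports where the driver spends its time, including ray.get.
        ray.get([slow_task.remote() for _ in range(5)])

    ray.init()
    cProfile.run("run_all()")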

@@ -202,11 +209,11 @@ let's create a new example and loop over five calls to a remote function
**inside an actor**. Our actor's remote function again just sleeps for 0.5
seconds:

.. code-block:: python
.. testcode::

# Our actor
@ray.remote
class Sleeper(object):
class Sleeper:
def __init__(self):
self.sleepValue = 0.5

@@ -217,7 +224,7 @@ seconds:
Recalling the suboptimality of ``ex1``, let's first see what happens if we
attempt to perform all five ``actor_func()`` calls within a single actor:

.. code-block:: python
.. testcode::

def ex4():
# This is suboptimal in Ray, and should only be used for the sake of this example
@@ -232,7 +239,8 @@ attempt to perform all five ``actor_func()`` calls within a single actor:

We enable cProfile on this example as follows:

.. code-block:: python
.. testcode::
:skipif: True

def main():
ray.init()
@@ -281,7 +289,7 @@ create five ``Sleeper`` actors. That way, we are creating five workers
that can run in parallel, instead of creating a single worker that
can only handle one call to ``actor_func()`` at a time.

.. code-block:: python
.. testcode::

def ex4():
# Modified to create five separate Sleepers
@@ -25,10 +25,10 @@ Getting Started

Take the following example:

.. code-block:: python
.. testcode::
:skipif: True

import ray
ray.init()

@ray.remote
def f(x):
@@ -113,12 +113,11 @@ Stepping between Ray tasks
You can use the debugger to step between Ray tasks. Let's take the
following recursive function as an example:

.. code-block:: python
.. testcode::
:skipif: True

import ray

ray.init()
@ray.remote
def fact(n):
if n == 1:
@@ -140,7 +139,7 @@ After running the program by executing the Python file and calling
``ray debug``, you can select the breakpoint by pressing ``0`` and
enter. This will result in the following output:

.. code-block:: python
.. code-block:: shell
Enter breakpoint index or press enter to refresh: 0
> /home/ubuntu/tmp/stepping.py(16)<module>()
@@ -151,7 +150,7 @@ You can jump into the call with the ``remote`` command in Ray's debugger.
Inside the function, print the value of `n` with ``p(n)``, resulting in
the following output:

.. code-block:: python
.. code-block:: shell
-> result_ref = fact.remote(5)
(Pdb) remote
@@ -179,7 +178,7 @@ to the location where ``ray.get`` is called on the result by using the
``get`` debugger command. Use ``get`` again to jump back to the original
call site and use ``p(result)`` to print the result:
.. code-block:: python
.. code-block:: shell
Enter breakpoint index or press enter to refresh: 0
> /home/ubuntu/tmp/stepping.py(14)<module>()
@@ -231,7 +230,8 @@ We will show how this works using a Ray serve application. To get started, insta
Next, copy the following code into a file called ``serve_debugging.py``:
.. code-block:: python
.. testcode::
:skipif: True
import time
29 changes: 15 additions & 14 deletions doc/source/ray-observability/user-guides/ray-tracing.rst
@@ -2,7 +2,7 @@

Tracing
=======
To help debug and monitor Ray applications, Ray integrates with OpenTelemetry to make it easy to export traces to external tracing stacks such as Jaeger.
To help debug and monitor Ray applications, Ray integrates with OpenTelemetry to facilitate exporting traces to external tracing stacks such as Jaeger.


.. note::
@@ -24,8 +24,8 @@ Tracing startup hook
To enable tracing, you must provide a tracing startup hook with a function that sets up the :ref:`Tracer Provider <tracer-provider>`, :ref:`Remote Span Processors <remote-span-processors>`, and :ref:`Additional Instruments <additional-instruments>`. The tracing startup hook is expected to be a function that is called with no args or kwargs. This hook needs to be available in the Python environment of all the worker processes.

Below is an example tracing startup hook that sets up the default tracing provider, exports spans to files in ``/tmp/spans``, and does not have any additional instruments.
.. code-block:: python

.. testcode::

import ray
import os
Expand All @@ -35,8 +35,8 @@ Below is an example tracing startup hook that sets up the default tracing provid
ConsoleSpanExporter,
SimpleSpanProcessor,
)


def setup_tracing() -> None:
# Creates /tmp/spans folder
os.makedirs("/tmp/spans", exist_ok=True)
@@ -62,23 +62,24 @@ For open-source users who want to experiment with tracing, Ray has a default tra
$ ray start --head --tracing-startup-hook=ray.util.tracing.setup_local_tmp_tracing:setup_tracing
$ python
>>> ray.init()
>>> @ray.remote
ray.init()
@ray.remote
def my_function():
return 1
obj_ref = my_function.remote()
.. tab-item:: ray.init()

.. code-block:: python
.. testcode::

>>> ray.init(_tracing_startup_hook="ray.util.tracing.setup_local_tmp_tracing:setup_tracing")
>>> @ray.remote
def my_function():
return 1
ray.init(_tracing_startup_hook="ray.util.tracing.setup_local_tmp_tracing:setup_tracing")

obj_ref = my_function.remote()
@ray.remote
def my_function():
return 1

obj_ref = my_function.remote()

If you want to provide your own custom tracing startup hook, provide it in the format of ``module:attribute`` where the attribute is the ``setup_tracing`` function to be run.
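
For example, a hedged sketch where ``my_tracing`` is a hypothetical module (available in every worker's Python environment) that defines a no-argument ``setup_tracing`` function like the one above:

.. code-block:: python

    import ray

    # "my_tracing" is a placeholder; substitute your own module that
    # exposes a setup_tracing() callable.
    ray.init(_tracing_startup_hook="my_tracing:setup_tracing")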

@@ -108,7 +109,7 @@ Add custom tracing in your programs. Within your program, get the tracer object

See below for a simple example of adding custom tracing.

.. code-block:: python
.. testcode::

from opentelemetry import trace
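
A minimal sketch of what such custom tracing can look like (assuming a tracing startup hook such as ``setup_tracing`` above was supplied when Ray started):

.. code-block:: python

    import ray
    from opentelemetry import trace

    @ray.remote
    def traced_task():
        tracer = trace.get_tracer(__name__)
        # Spans opened here go through the span processor configured in
        # the tracing startup hook.
        with tracer.start_as_current_span("custom_work"):
            return 1

    ray.get(traced_task.remote())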

