[rfc] [air/tune/train] Improve trial/training failure error printing #27946

Merged
18 commits merged into ray-project:master on Aug 26, 2022

Conversation

@krfricke (Contributor) commented Aug 17, 2022

Why are these changes needed?

When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.

The main problems here are:

  1. Tracebacks are printed multiple times: in the remote worker and on the driver.
  2. Tracebacks include many internal wrapper frames.

The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).

The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.

Deduplicating traceback printing

The solution here is to use logger.error instead of logger.exception in function_trainable.py, so the trainable itself no longer prints a traceback.

Additionally, we introduce an environment variable TUNE_PRINT_ALL_TRIAL_ERRORS, which defaults to 1. If set to 0, trial errors will not be printed to the console at all (only the error.txt files will exist).

To be discussed: we could also default this to 0, but I think the expectation is to see at least some failure output in the console logs by default.
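
For illustration, a minimal sketch of how such an env-var gate could look. The variable name TUNE_PRINT_ALL_TRIAL_ERRORS is from this PR, but the helper and the call site below are hypothetical, not the actual implementation:

import logging
import os

logger = logging.getLogger(__name__)


def _should_print_trial_errors() -> bool:
    # Defaults to 1 (print trial errors); set TUNE_PRINT_ALL_TRIAL_ERRORS=0
    # to silence per-trial error output on the console.
    return bool(int(os.environ.get("TUNE_PRINT_ALL_TRIAL_ERRORS", "1")))


def report_trial_error(trial_name: str, error_msg: str, error_file: str) -> None:
    # Hypothetical call site: log the full error only when printing is enabled,
    # otherwise point the user at the persisted error.txt file.
    if _should_print_trial_errors():
        logger.error("Trial %s errored:\n%s", trial_name, error_msg)
    else:
        logger.error("Trial %s errored; see %s for details.", trial_name, error_file)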

Removing internal wrappers from tracebacks

The solution here is to introduce a magic local variable, _ray_start_tb. In two places, we use this magic local variable to reduce the stack trace. A utility, shorten_tb, looks for the last occurrence of _ray_start_tb in the stack trace and starts the traceback from there; this takes only linear time. If the magic variable is not present, the full traceback is returned. This means that if the error does not come up in user code, the full traceback is shown, giving visibility into possible internal bugs. Additionally, there is an env variable RAY_AIR_FULL_TRACEBACKS which disables traceback shortening.
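
A rough sketch of how a shorten_tb-style utility can do this in a single linear pass over the traceback frames; the names follow the PR description, but the body is illustrative rather than the exact implementation:

import os
import types
from typing import Optional


def shorten_tb(
    tb: Optional[types.TracebackType], attr: str = "_ray_start_tb"
) -> Optional[types.TracebackType]:
    # With RAY_AIR_FULL_TRACEBACKS set, always keep the full traceback.
    if bool(int(os.environ.get("RAY_AIR_FULL_TRACEBACKS", "0"))):
        return tb

    last_marker = None
    current = tb
    # Single linear pass: remember the last frame that defines the marker local.
    while current is not None:
        if attr in current.tb_frame.f_locals:
            last_marker = current
        current = current.tb_next

    # Start the traceback at the last marker frame; if the marker never
    # appears (e.g. a purely internal error), return the full traceback.
    return last_marker if last_marker is not None else tb

At the re-raise sites this is then used roughly as in the diff excerpt quoted in the review discussion below: set _ray_start_tb = True in the wrapping frame and re-raise with exc.with_traceback(shorten_tb(exc.__traceback__, attr="_ray_start_tb")).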

Output before
2022-08-17 09:59:02,485	ERROR trial_runner.py:980 -- Trial DataParallelTrainer_709b0_00000: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::TrainTrainable.train() (pid=56647, ip=127.0.0.1, repr=DataParallelTrainer)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 347, in train
    result = self.step()
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 417, in step
    self._report_thread_runner_error(block=True)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 589, in _report_thread_runner_error
    raise e
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 289, in run
    self._entrypoint()
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 365, in entrypoint
    self._status_reporter.get_checkpoint(),
  File "/Users/kai/coding/ray/python/ray/train/base_trainer.py", line 460, in _trainable_func
    super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 684, in _trainable_func
    output = fn()
  File "/Users/kai/coding/ray/python/ray/train/base_trainer.py", line 375, in train_func
    trainer.training_loop()
  File "/Users/kai/coding/ray/python/ray/train/data_parallel_trainer.py", line 358, in training_loop
    for results in training_iterator:
  File "/Users/kai/coding/ray/python/ray/train/trainer.py", line 762, in __next__
    self._finish_training
  File "/Users/kai/coding/ray/python/ray/train/trainer.py", line 722, in _run_with_error_handling
    return func()
  File "/Users/kai/coding/ray/python/ray/train/trainer.py", line 833, in _finish_training
    return self._backend_executor.finish_training()
  File "/Users/kai/coding/ray/python/ray/train/_internal/backend_executor.py", line 458, in finish_training
    results = self.get_with_failure_handling(futures)
  File "/Users/kai/coding/ray/python/ray/train/_internal/backend_executor.py", line 477, in get_with_failure_handling
    success = check_for_failure(remote_values)
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 52, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=56657, ip=127.0.0.1, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x10c574e10>)
  File "/Users/kai/coding/ray/python/ray/train/_internal/worker_group.py", line 26, in __execute
    return func(*args, **kwargs)
  File "/Users/kai/coding/ray/python/ray/train/_internal/backend_executor.py", line 449, in end_training
    output = session.finish()
  File "/Users/kai/coding/ray/python/ray/train/_internal/session.py", line 140, in finish
    func_output = self.training_thread.join()
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 98, in join
    raise self.exc
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 91, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 155, in <lambda>
    return lambda: wrapped_train_func(config)
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 141, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/Users/kai/coding/sandbox/scratch2.py", line 10, in failing_fn
    raise RuntimeError("I am failing")
RuntimeError: I am failing
Output after
2022-08-17 12:55:58,606	ERROR trial_runner.py:984 -- Trial DataParallelTrainer_24224_00003: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::TrainTrainable.train() (pid=67398, ip=127.0.0.1, repr=DataParallelTrainer)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 347, in train
    result = self.step()
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 419, in step
    self._report_thread_runner_error(block=True)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 591, in _report_thread_runner_error
    raise e.with_traceback(shorten_tb(e.__traceback__, attr="_ray_start_tb"))
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=67418, ip=127.0.0.1, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x11256ec50>)
  File "/Users/kai/coding/ray/python/ray/train/_internal/worker_group.py", line 30, in __execute
    raise e.with_traceback(shorten_tb(e.__traceback__, attr="_ray_start_tb"))
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 149, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/Users/kai/coding/sandbox/scratch2.py", line 10, in failing_fn
    raise RuntimeError("I am failing")
RuntimeError: I am failing

Example

Tuning script:

import time

from ray import tune
from ray.air import ScalingConfig, RunConfig, FailureConfig
from ray.train.data_parallel_trainer import DataParallelTrainer


def failing_fn(config):
    time.sleep(1)
    raise RuntimeError("I am failing")


trainer = DataParallelTrainer(
    train_loop_per_worker=failing_fn, scaling_config=ScalingConfig(num_workers=4),
    run_config=RunConfig(failure_config=FailureConfig(fail_fast=False))
)
tuner = tune.Tuner(trainer, tune_config=tune.TuneConfig(num_samples=10))
tuner.fit()

Current output

  • 1333 lines
  • Includes full tracebacks
  • Prints tracebacks twice per trial (once in the trainable, once in the driver thread)

Medium output

  • 444 lines (revised from 467; ~65% reduction)
  • Includes shortened tracebacks
  • Only prints tracebacks once per trial (in the driver thread)

Short output

  • 319 lines (~75% reduction vs. original, ~30% reduction vs. medium)
  • Does not print any tracebacks, instead refers to error.txt files

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 6 commits August 16, 2022 11:52
Signed-off-by: Kai Fricke <kai@anyscale.com>
Comment on lines 60 to 65
except Exception as exc:
    # Other (e.g. training) errors should be directly raised
    _ray_start_tb = True  # noqa: F841
    raise exc.with_traceback(
        shorten_tb(exc.__traceback__, attr="_ray_start_tb")
    )
@krfricke (Contributor, Author) commented:

cc @amogkam: if we just catch RayTaskErrors in addition to RayActorErrors, this will lead to test failures, as training is restarted and a different exception is raised.

What is the expected behavior here? IMO it looks like task errors should fail immediately (as they likely indicate a logic/syntax error) and only actor failures should be retried. If that's the case (as in the current implementation), maybe we can add better comments for this. Let me know.

@richardliaw (Contributor) commented:

Overall, huge fan of this change. Just wish we could avoid the weird 'magic' variable.

Perhaps attach an attribute to the exception instead?

Kai Fricke added 3 commits August 22, 2022 11:47
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke (Contributor, Author) commented Aug 22, 2022:

Yeah, I wasn't happy with the hackiness of the approach either. I've refactored the code to use raise StartTraceback from original_exception and a skip_exceptions utility to skip irrelevant parts of the traceback chain, so this now uses proper Python concepts. The output was reduced by another 2 lines per exception - please take another look!
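
As a minimal sketch of what this exception-chaining approach can look like (a StartTraceback marker exception plus a skip_exceptions-style helper; the exact implementation in the PR may differ in detail):

class StartTraceback(Exception):
    # Marker exception: everything above the last StartTraceback in the
    # exception chain is considered internal wrapper code.
    pass


def skip_exceptions(exc: BaseException) -> BaseException:
    # Walk the __cause__ chain and return the exception chained below the
    # last StartTraceback marker; if no marker is found, return the original
    # exception so the full traceback stays available.
    result = exc
    current = exc
    while current is not None:
        if isinstance(current, StartTraceback) and current.__cause__ is not None:
            result = current.__cause__
        current = current.__cause__
    return result


def run_user_code(fn):
    # Hypothetical internal wrapper: chain the user error below the marker so
    # the driver can strip the wrapper frames before printing.
    try:
        return fn()
    except Exception as e:
        raise StartTraceback from e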

Kai Fricke added 3 commits August 22, 2022 11:55
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke requested a review from richardliaw August 24, 2022 15:07
@richardliaw (Contributor) left a review comment:

Looks good. Would be great to add a test here on the stack trace.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Kai Fricke added 2 commits August 26, 2022 12:01
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke merged commit d0678b8 into ray-project:master Aug 26, 2022
@krfricke krfricke deleted the tune/trial-error-dedup branch August 26, 2022 22:02
amogkam pushed a commit that referenced this pull request Sep 7, 2022
Follow-up from #27946, this removes unnecessary calls to `skip_exceptions` to simplify the code and thus make it easier for contributors to understand how to use the utility.

Signed-off-by: Kai Fricke kai@anyscale.com
ilee300a pushed a commit to ilee300a/ray that referenced this pull request Sep 12, 2022
Follow-up from ray-project#27946, this removes unnecessary calls to `skip_exceptions` to simplify the code and thus make it easier for contributors to understand how to use the utility.

Signed-off-by: Kai Fricke kai@anyscale.com
Signed-off-by: ilee300a <ilee300@anyscale.com>