[rfc] [air/tune/train] Improve trial/training failure error printing #27946

Merged
18 commits merged into ray-project:master on Aug 26, 2022

Conversation

@krfricke (Contributor) commented Aug 17, 2022

Why are these changes needed?

When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.

The main problems here are:

  1. Tracebacks are printed multiple times: in the remote worker and on the driver.
  2. Tracebacks include many internal wrapper frames.

The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).

The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.

Deduplicating traceback printing

The solution here is to use logger.error instead of logger.exception in function_trainable.py, so the trainable itself no longer prints a traceback.

Additionally, we introduce an environment variable TUNE_PRINT_ALL_TRIAL_ERRORS, which defaults to 1. If set to 0, trial errors will not be printed to the console at all (only the error.txt files will exist).

To be discussed: we could also default this to 0, but I think the expectation is to see at least some failure output in the console logs by default.
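
For illustration, a minimal sketch of how such an env-var gate could look. The variable name TUNE_PRINT_ALL_TRIAL_ERRORS is from this PR, but the helper and the call site below are hypothetical, not the actual implementation:

import logging
import os

logger = logging.getLogger(__name__)


def _should_print_trial_errors() -> bool:
    # Defaults to 1 (print trial errors); set TUNE_PRINT_ALL_TRIAL_ERRORS=0
    # to silence per-trial error output on the console.
    return bool(int(os.environ.get("TUNE_PRINT_ALL_TRIAL_ERRORS", "1")))


def report_trial_error(trial_name: str, error_msg: str, error_file: str) -> None:
    # Hypothetical call site: log the full error only when printing is enabled,
    # otherwise point the user at the persisted error.txt file.
    if _should_print_trial_errors():
        logger.error("Trial %s errored:\n%s", trial_name, error_msg)
    else:
        logger.error("Trial %s errored; see %s for details.", trial_name, error_file)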

Removing internal wrappers from tracebacks

The solution here is to introduce a magic local variable, _ray_start_tb. In two places, we use this magic local variable to reduce the stack trace. A utility, shorten_tb, looks for the last occurrence of _ray_start_tb in the stack trace and starts the traceback from there; this takes only linear time. If the magic variable is not present, the full traceback is returned. This means that if the error does not come up in user code, the full traceback is shown, giving visibility into possible internal bugs. Additionally, there is an env variable RAY_AIR_FULL_TRACEBACKS which disables traceback shortening.
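
A rough sketch of how a shorten_tb-style utility can do this in a single linear pass over the traceback frames; the names follow the PR description, but the body is illustrative rather than the exact implementation:

import os
import types
from typing import Optional


def shorten_tb(
    tb: Optional[types.TracebackType], attr: str = "_ray_start_tb"
) -> Optional[types.TracebackType]:
    # With RAY_AIR_FULL_TRACEBACKS set, always keep the full traceback.
    if bool(int(os.environ.get("RAY_AIR_FULL_TRACEBACKS", "0"))):
        return tb

    last_marker = None
    current = tb
    # Single linear pass: remember the last frame that defines the marker local.
    while current is not None:
        if attr in current.tb_frame.f_locals:
            last_marker = current
        current = current.tb_next

    # Start the traceback at the last marker frame; if the marker never
    # appears (e.g. a purely internal error), return the full traceback.
    return last_marker if last_marker is not None else tb

At the re-raise sites this is then used roughly as in the diff excerpt quoted in the review discussion below: set _ray_start_tb = True in the wrapping frame and re-raise with exc.with_traceback(shorten_tb(exc.__traceback__, attr="_ray_start_tb")).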

Output before
2022-08-17 09:59:02,485	ERROR trial_runner.py:980 -- Trial DataParallelTrainer_709b0_00000: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::TrainTrainable.train() (pid=56647, ip=127.0.0.1, repr=DataParallelTrainer)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 347, in train
    result = self.step()
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 417, in step
    self._report_thread_runner_error(block=True)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 589, in _report_thread_runner_error
    raise e
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 289, in run
    self._entrypoint()
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 365, in entrypoint
    self._status_reporter.get_checkpoint(),
  File "/Users/kai/coding/ray/python/ray/train/base_trainer.py", line 460, in _trainable_func
    super()._trainable_func(self._merged_config, reporter, checkpoint_dir)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 684, in _trainable_func
    output = fn()
  File "/Users/kai/coding/ray/python/ray/train/base_trainer.py", line 375, in train_func
    trainer.training_loop()
  File "/Users/kai/coding/ray/python/ray/train/data_parallel_trainer.py", line 358, in training_loop
    for results in training_iterator:
  File "/Users/kai/coding/ray/python/ray/train/trainer.py", line 762, in __next__
    self._finish_training
  File "/Users/kai/coding/ray/python/ray/train/trainer.py", line 722, in _run_with_error_handling
    return func()
  File "/Users/kai/coding/ray/python/ray/train/trainer.py", line 833, in _finish_training
    return self._backend_executor.finish_training()
  File "/Users/kai/coding/ray/python/ray/train/_internal/backend_executor.py", line 458, in finish_training
    results = self.get_with_failure_handling(futures)
  File "/Users/kai/coding/ray/python/ray/train/_internal/backend_executor.py", line 477, in get_with_failure_handling
    success = check_for_failure(remote_values)
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 52, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=56657, ip=127.0.0.1, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x10c574e10>)
  File "/Users/kai/coding/ray/python/ray/train/_internal/worker_group.py", line 26, in __execute
    return func(*args, **kwargs)
  File "/Users/kai/coding/ray/python/ray/train/_internal/backend_executor.py", line 449, in end_training
    output = session.finish()
  File "/Users/kai/coding/ray/python/ray/train/_internal/session.py", line 140, in finish
    func_output = self.training_thread.join()
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 98, in join
    raise self.exc
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 91, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 155, in <lambda>
    return lambda: wrapped_train_func(config)
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 141, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/Users/kai/coding/sandbox/scratch2.py", line 10, in failing_fn
    raise RuntimeError("I am failing")
RuntimeError: I am failing
Output after
2022-08-17 12:55:58,606	ERROR trial_runner.py:984 -- Trial DataParallelTrainer_24224_00003: Error processing event.
ray.exceptions.RayTaskError(RuntimeError): ray::TrainTrainable.train() (pid=67398, ip=127.0.0.1, repr=DataParallelTrainer)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/trainable.py", line 347, in train
    result = self.step()
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 419, in step
    self._report_thread_runner_error(block=True)
  File "/Users/kai/coding/ray/python/ray/tune/trainable/function_trainable.py", line 591, in _report_thread_runner_error
    raise e.with_traceback(shorten_tb(e.__traceback__, attr="_ray_start_tb"))
ray.exceptions.RayTaskError(RuntimeError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=67418, ip=127.0.0.1, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x11256ec50>)
  File "/Users/kai/coding/ray/python/ray/train/_internal/worker_group.py", line 30, in __execute
    raise e.with_traceback(shorten_tb(e.__traceback__, attr="_ray_start_tb"))
  File "/Users/kai/coding/ray/python/ray/train/_internal/utils.py", line 149, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/Users/kai/coding/sandbox/scratch2.py", line 10, in failing_fn
    raise RuntimeError("I am failing")
RuntimeError: I am failing

Example

Tuning script:

import time

from ray import tune
from ray.air import ScalingConfig, RunConfig, FailureConfig
from ray.train.data_parallel_trainer import DataParallelTrainer


def failing_fn(config):
    time.sleep(1)
    raise RuntimeError("I am failing")


trainer = DataParallelTrainer(
    train_loop_per_worker=failing_fn, scaling_config=ScalingConfig(num_workers=4),
    run_config=RunConfig(failure_config=FailureConfig(fail_fast=False))
)
tuner = tune.Tuner(trainer, tune_config=tune.TuneConfig(num_samples=10))
tuner.fit()

Current output

  • 1333 lines
  • Includes full tracebacks
  • Prints tracebacks twice per trial (once in the trainable, once in the driver thread)

Medium output

  • 444 lines (revised from 467; ~65% reduction)
  • Includes shortened tracebacks
  • Only prints tracebacks once per trial (in the driver thread)

Short output

  • 319 lines (~75% reduction vs. original, ~30% reduction vs. medium)
  • Does not print any tracebacks, instead refers to error.txt files

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 6 commits August 16, 2022 11:52
Signed-off-by: Kai Fricke <kai@anyscale.com>
Comment on lines 60 to 65
except Exception as exc:
    # Other (e.g. training) errors should be directly raised
    _ray_start_tb = True  # noqa: F841
    raise exc.with_traceback(
        shorten_tb(exc.__traceback__, attr="_ray_start_tb")
    )
@krfricke (Contributor, Author) commented:

cc @amogkam: if we just catch RayTaskErrors in addition to RayActorErrors, this will lead to test failures, as training is restarted and a different exception is raised.

What is the expected behavior here? IMO it looks like task errors should fail immediately (as they likely indicate a logic/syntax error) and only actor failures should be retried. If that's the case (as in the current implementation), maybe we can add better comments for this. Let me know.

@richardliaw (Contributor) commented:

Overall, huge fan of this change. Just wish we could avoid the weird 'magic' variable.

Perhaps attach an attribute to the exception instead?

Kai Fricke added 3 commits August 22, 2022 11:47
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke (Contributor, Author) commented Aug 22, 2022:

Yeah, I wasn't happy with the hackiness of the approach either. I've refactored the code to use raise StartTraceback from original_exception and a skip_exceptions utility to skip irrelevant parts of the traceback chain, so this now uses proper Python concepts. The output was reduced by another 2 lines per exception - please take another look!
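
As a minimal sketch of what this exception-chaining approach can look like (a StartTraceback marker exception plus a skip_exceptions-style helper; the exact implementation in the PR may differ in detail):

class StartTraceback(Exception):
    # Marker exception: everything above the last StartTraceback in the
    # exception chain is considered internal wrapper code.
    pass


def skip_exceptions(exc: BaseException) -> BaseException:
    # Walk the __cause__ chain and return the exception chained below the
    # last StartTraceback marker; if no marker is found, return the original
    # exception so the full traceback stays available.
    result = exc
    current = exc
    while current is not None:
        if isinstance(current, StartTraceback) and current.__cause__ is not None:
            result = current.__cause__
        current = current.__cause__
    return result


def run_user_code(fn):
    # Hypothetical internal wrapper: chain the user error below the marker so
    # the driver can strip the wrapper frames before printing.
    try:
        return fn()
    except Exception as e:
        raise StartTraceback from e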

Kai Fricke added 3 commits August 22, 2022 11:55
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke requested a review from richardliaw August 24, 2022 15:07
@richardliaw (Contributor) left a review comment:

Looks good. Would be great to add a test here on the stack trace.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Kai Fricke added 2 commits August 26, 2022 12:01
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke merged commit d0678b8 into ray-project:master Aug 26, 2022
@krfricke krfricke deleted the tune/trial-error-dedup branch August 26, 2022 22:02
amogkam pushed a commit that referenced this pull request Sep 7, 2022
Follow-up from #27946, this removes unnecessary calls to `skip_exceptions` to simplify the code and thus make it easier for contributors to understand how to use the utility.

Signed-off-by: Kai Fricke kai@anyscale.com
ilee300a pushed a commit to ilee300a/ray that referenced this pull request Sep 12, 2022
Follow-up from ray-project#27946, this removes unnecessary calls to `skip_exceptions` to simplify the code and thus make it easier for contributors to understand how to use the utility.

Signed-off-by: Kai Fricke kai@anyscale.com
Signed-off-by: ilee300a <ilee300@anyscale.com>