Bugfix/cuda oom detection and handling #6934

Merged
11 changes: 9 additions & 2 deletions pytorch_lightning/utilities/memory.py
@@ -53,7 +53,8 @@ def is_oom_error(exception):
 def is_cuda_out_of_memory(exception):
     return isinstance(exception, RuntimeError) \
         and len(exception.args) == 1 \
-        and "CUDA out of memory." in exception.args[0]
+        and "CUDA" in exception.args[0] \
+        and "out of memory" in exception.args[0]
Comment on lines 54 to +57
Member

It would be easier to read if the result were stored in a variable and then just returned...

Contributor Author

I think so, but I was trying to be as non-invasive as possible :)

Member

good invasion is always welcome :]
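
For illustration, the refactor suggested in this thread could look roughly like the sketch below. It is not the code that was merged: the `matches` variable name is hypothetical, and the sample messages are illustrative strings chosen to show why the check was split into two substrings rather than matching the exact phrase "CUDA out of memory.".

```python
def is_cuda_out_of_memory(exception):
    """Return True if the exception looks like a CUDA out-of-memory error."""
    # Same conditions as in the diff, stored in a variable before returning.
    # `matches` is a hypothetical name, not taken from the PR.
    matches = (
        isinstance(exception, RuntimeError)
        and len(exception.args) == 1
        and "CUDA" in exception.args[0]
        and "out of memory" in exception.args[0]
    )
    return matches


# Illustrative messages (assumptions, not copied from real logs).
classic = RuntimeError("CUDA out of memory. Tried to allocate 2.00 MiB")
variant = RuntimeError("CUDA error: out of memory")
unrelated = RuntimeError("CUDA error: device-side assert triggered")

for error in (classic, variant, unrelated):
    print(repr(error.args[0]), is_cuda_out_of_memory(error))
```

The old single-substring check would only have matched the first message; the two-substring check also matches the second while still rejecting the third.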



# based on https://github.com/BlackHC/toma/blob/master/toma/torch_cuda_memory.py
@@ -76,4 +77,10 @@ def garbage_collection_cuda():
     """Garbage collection Torch (CUDA) memory."""
     gc.collect()
     if torch.cuda.is_available():
-        torch.cuda.empty_cache()
+        try:
+            # This is the last thing that should cause an OOM error, but seemingly it can.
+            torch.cuda.empty_cache()
+        except RuntimeError as exception:
+            if not is_oom_error(exception):
+                # Only handle OOM errors
+                raise
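
The behaviour of the new `try`/`except` can be checked without a GPU with a sketch like the one below. It makes several assumptions: `is_oom_error` here is a simplified CUDA-only stand-in for the helper in `memory.py`, `collect_and_empty_cache` is a hypothetical variant of `garbage_collection_cuda` that takes the cache-clearing call as an argument, and the error messages are illustrative.

```python
import gc


def is_oom_error(exception):
    # Simplified stand-in (assumption): the real helper in memory.py may
    # recognize more kinds of OOM errors than this CUDA-only check.
    return (
        isinstance(exception, RuntimeError)
        and len(exception.args) == 1
        and "CUDA" in exception.args[0]
        and "out of memory" in exception.args[0]
    )


def collect_and_empty_cache(empty_cache):
    """Hypothetical variant of garbage_collection_cuda with an injectable call."""
    gc.collect()
    try:
        empty_cache()
    except RuntimeError as exception:
        if not is_oom_error(exception):
            # Only OOM errors are swallowed; everything else propagates.
            raise


def raise_oom():
    # Stands in for torch.cuda.empty_cache() failing with an OOM error.
    raise RuntimeError("CUDA error: out of memory")


def raise_other():
    # Stands in for an unrelated CUDA failure.
    raise RuntimeError("CUDA error: device-side assert triggered")


# The OOM error is swallowed, so this call returns normally.
collect_and_empty_cache(raise_oom)

# Any other RuntimeError is re-raised to the caller.
try:
    collect_and_empty_cache(raise_other)
except RuntimeError as exception:
    print("re-raised:", exception)
```

Passing the cache-clearing call in as an argument is only a testing convenience for this sketch; the PR itself calls `torch.cuda.empty_cache()` directly.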