[Tune] pyarrow.fs hangs indefinitely while writing checkpoint file #26802
Comments
cc @ericl
Since AIR is using pyarrow.fs separately, this is not related to e3bd598. If we are seeing writes hang, then we should either add timeouts around it, or try to figure out the root cause of the fs hang (could be a bug in pyarrow).
The relevant change was introduced here: 1465eaa#diff-e1d889098f6b27e0d88ba206b0689d77c1a320d58697d98933decde97fd3cac8R486
How large are the checkpoints? And how many files? I agree we should probably add a timeout around it, but I'm wondering how we should configure this. Maybe with an env variable.
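A minimal sketch of what "a timeout around it, configured with an env variable" could look like. This is not the fix that was merged, just an illustration under assumptions: the helper `run_with_timeout`, the variable name `CHECKPOINT_UPLOAD_TIMEOUT_S`, and the default value are all hypothetical and are not Ray or pyarrow settings.

```python
import os
import threading

# Hypothetical env variable name and default; neither is an actual Ray setting.
TIMEOUT_ENV_VAR = "CHECKPOINT_UPLOAD_TIMEOUT_S"
DEFAULT_TIMEOUT_S = 1800.0


def run_with_timeout(fn, *args, timeout_s=None, **kwargs):
    """Run a blocking call in a daemon thread and stop waiting after timeout_s.

    The worker thread cannot be killed from Python. If the call hangs, the
    thread is abandoned and the caller gets a TimeoutError instead of blocking.
    """
    if timeout_s is None:
        # Fall back to the env variable, then to the hard-coded default.
        timeout_s = float(os.environ.get(TIMEOUT_ENV_VAR, DEFAULT_TIMEOUT_S))

    outcome = {}

    def target():
        try:
            outcome["value"] = fn(*args, **kwargs)
        except BaseException as exc:  # forwarded to the caller below
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        name = getattr(fn, "__name__", repr(fn))
        raise TimeoutError(f"{name} did not finish within {timeout_s} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome.get("value")
```

The blocking upload call (whatever function performs the pyarrow.fs write) would be passed as `fn`. Note the trade-off: this only unblocks the caller. A hung write keeps its thread and any open connection until the process exits, so it does not replace finding the root cause.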
@gjoliver could we get a reproduction? I think we need to figure out why pyarrow is not working.
In July, the latest pyarrow release was 8.0.0, which didn't have native GCS support yet. Thus this used GCS via gcsfs and a [...] In any case, I think adding a timeout is a good way to go forward.
The checkpoints are on the order of 100 MB.
Closed by #28155
What happened + What you expected to happen
Can be reliably reproduced when writing checkpoints to GCS. Not sure about S3.
The problem usually appears after we have continuously written about 200 RLlib checkpoints, over 12 to 24 hours.
In one case the job unblocked itself after 48 hours without any intervention.
I was able to attach to the Ray node and grab a stack trace:
I'm pretty sure the cause is the PR e3bd598.
Per the Tune team, there is a workaround: manually construct a SyncCfg to fall back to the old gsutil sync.
Versions / Dependencies
Since Ray 1.13.0
Reproduction script
Any RLlib job (>1.13.0) that syncs to GCS will do.
Issue Severity
High: It blocks me from completing my task.