"unlinkat ([path]) (Directory not empty)" errors with experimental_worker_for_repo_fetching #22680
Comments
I am presently confirming that |
Still on 7.2.0, haven't seen this again with |
I managed to reproduce this via #22748 (comment):
|
@bazel-io fork 7.2.1 |
`StarlarkBaseExternalContext` now implements `AutoCloseable` and, in `close()`:

1. Cancels all pending async tasks.
2. Awaits their termination.
3. Cleans up the working directory (always for module extensions, on failure for repo rules).
4. Fails if there were pending async tasks in an otherwise successful evaluation.

Previously, module extensions didn't do any of those. Repo rules did 1 and 4 and sometimes 3, but not in all cases.

This change required replacing the fixed-size thread pool in `DownloadManager` with virtual threads, thereby resolving a TODO about not using a fixed-size thread pool for the `GrpcRemoteDownloader`.

Work towards #22680
Work towards #22748
Closes #22772.

PiperOrigin-RevId: 644669599
Change-Id: Ib71e5bf346830b92277ac2bd473e11c834cb2624
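The four `close()` steps listed in this commit message can be sketched roughly as follows. This is only an illustration in Python, not Bazel's Java implementation: the class `ExternalContext`, its `clean_on_success` flag (standing in for the module extension vs. repo rule distinction), and the thread pool are all hypothetical names chosen for the example.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class ExternalContext:
    """Hypothetical stand-in for a fetch context; not Bazel's actual class."""

    def __init__(self, clean_on_success: bool):
        self.working_dir = Path(tempfile.mkdtemp())
        # True models "always clean up" (module extensions in the commit text),
        # False models "clean up only on failure" (repo rules).
        self.clean_on_success = clean_on_success
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._tasks = []

    def submit(self, fn, *args):
        future = self._pool.submit(fn, *args)
        self._tasks.append(future)
        return future

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        pending = [t for t in self._tasks if not t.done()]
        # Steps 1 and 2: cancel tasks that have not started yet, then block
        # until tasks that are already running have finished.
        self._pool.shutdown(wait=True, cancel_futures=True)
        # Step 3: only now is it safe to delete the directory, because no
        # task can still be writing into it.
        if exc_type is not None or self.clean_on_success:
            shutil.rmtree(self.working_dir)
        # Step 4: leftover async work in an otherwise successful run is an error.
        if exc_type is None and pending:
            raise RuntimeError(f"{len(pending)} async task(s) still pending at close")
        return False
```

The ordering is the point of the sketch: deleting the working directory before step 2 completes would let a still-running task recreate files underneath the deletion, which is one way to end up with a "Directory not empty" failure.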
…22814) `StarlarkBaseExternalContext` now implements `AutoCloseable` and, in `close()`: 1. Cancels all pending async tasks. 2. Awaits their termination. 3. Cleans up the working directory (always for module extensions, on failure for repo rules). 4. Fails if there were pending async tasks in an otherwise successful evaluation. Previously, module extensions didn't do any of those. Repo rules did 1 and 4 and sometimes 3, but not in all cases. This change required replacing the fixed-size thread pool in `DownloadManager` with virtual threads, thereby resolving a TODO about not using a fixed-size thread pool for the `GrpcRemoteDownloader`. Work towards #22680 Work towards #22748 Closes #22772 PiperOrigin-RevId: 644669599 Change-Id: Ib71e5bf346830b92277ac2bd473e11c834cb2624 Closes #22775
…22883) `StarlarkBaseExternalContext` now implements `AutoCloseable` and, in `close()`: 1. Cancels all pending async tasks. 2. Awaits their termination. 3. Cleans up the working directory (always for module extensions, on failure for repo rules). 4. Fails if there were pending async tasks in an otherwise successful evaluation. Previously, module extensions didn't do any of those. Repo rules did 1 and 4 and sometimes 3, but not in all cases. This change required replacing the fixed-size thread pool in `DownloadManager` with virtual threads, thereby resolving a TODO about not using a fixed-size thread pool for the `GrpcRemoteDownloader`. Work towards #22680 Work towards #22748 Closes #22772. PiperOrigin-RevId: 644669599 Change-Id: Ib71e5bf346830b92277ac2bd473e11c834cb2624 Closes #22776
I have a repro of this bug here: https://github.com/jjmaestro/bazel_issue_22680_repro It always fails when I run it inside that Docker. @fmeum and @iancha1992 & @moroten (from #22748) could you check it out and see if this also fails for you? Thanks!! PS: also, sorry for the commit noise, I did a few |
@jjmaestro Did you test with 7.2.1? According to #23029 (comment), the fix may have been included in 7.2.1. |
Yup, that's exactly what I did. The repro repo that I linked is set to It's literally one I hope this will help debug what's causing this! |
Ah, ok, sorry, my brain just caught on to what you are saying 😅 I've updated the repo and re-run it with Can you check it out and see if that helps with identifying what's going on? Thanks! |
@jjmaestro Does it work for Bazel built from |
@meteorcloudy not sure, haven't tested it yet with a custom built version. I was now beginning to clone the repo and read about how to build bazel from source, so I'll play with this a bit more and will get back to you! |
I ran the reproducer, but all three builds passed for me. |
@fmeum wow, that's surprising! It's a 100% repro for me, it always fails when I run it. And I had a friend run the repro and it also failed for them, that's why I thought it could be something with Bazel. But maybe it's with Docker and/or some incompatibility or configuration? I'm running Docker 26.1.4, build 5650f9b on a Mac Studio (MacOS 13.6.6 on M1 Max), so Docker runs on linux/arm64. What's your setup? Also, any ideas as to how I could debug this further? I'd love if I could get some help and see if I can dig into this further :) Thanks!! |
I ran this in a Debian VM on a Mac M3 host. Could you share the stack traces and build failures you got? |
Yeah, I just ran it on a Debian VM (debian12 running in UTM) on my Mac and it does work... so my guess is that there's something going on with the virtualization that Docker uses on Mac.
This is the error I get running build with
Re. the stack traces, I'm a Bazel n00b so I don't know if there's any other flag or way to get more verbose output or logs. If there is, please let me know and I'll get them. Thanks! |
It seems the offending line is L255 So, Bazel fails when trying to delete a file with a "Directory not empty" error or something like that :-? I'll try to dig deeper and see if I can get more info on the Bazel side |
Since the stack trace is produced by an |
No, I haven't yet. It could definitely be either but I was much more suspecting that the error could be somewhere in Bazel because that I'll try to dig a bit more into this and see where it goes! |
Small update, IMHO there's definitely something weird going on... to me, this definitely looks like some sort of race condition / non-deterministic behavior. When I add When I just compile
But when I run the repro in Docker using a compiled
This seems to me like the In any case, whether this is what's happening or something else, it definitely seems quite weird to me that these errors were caused by I'll keep checking further, hopefully I'll be able to isolate / get closer to what's actually failing. |
CC @rickeylev @aignas in case you know of any logic in the rules_python repo rules that could result in concurrent access to the files it deletes. |
@fmeum BTW, just found this while looking for the implementation of bazel/src/main/native/unix_jni.cc Lines 1075 to 1089 in d915b98
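Part of the comment was added by @oquenchil in d6c79db and in the commit it was mentioned: "Asynchronous deletion may make it more likely for the problem to surface but the likelihood is still very low."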
Could this be the issue? Maybe running the deletes in a Docker volume running in MacOS makes triggering the bug more likely :-? |
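To make the suspected race concrete, here is a minimal sketch of how a recursive delete can hit "Directory not empty" when something creates a file after the directory was listed but before the final `rmdir`. It is written in Python rather than against Bazel's native `unix_jni.cc` code, and the function names and the `after_listing` hook (a deterministic stand-in for a concurrent writer) are hypothetical. The retry wrapper is a generic mitigation for illustration, not something this thread says Bazel does.

```python
import errno
import os
import tempfile


def delete_tree_once(path, after_listing=None):
    """Delete `path` recursively: list entries, remove them, then rmdir."""
    with os.scandir(path) as it:
        entries = list(it)
    if after_listing is not None:
        # Hypothetical hook standing in for another thread or process that
        # writes into the directory while the delete is in progress.
        after_listing(path)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            delete_tree_once(entry.path)
        else:
            os.unlink(entry.path)
    # Raises OSError(ENOTEMPTY) if anything appeared after the listing.
    os.rmdir(path)


def delete_tree_with_retry(path, attempts=3, after_listing=None):
    """Retry the delete when the directory was repopulated underneath us.

    Returns the number of attempts that were needed.
    """
    for attempt in range(attempts):
        try:
            delete_tree_once(path, after_listing)
            return attempt + 1
        except OSError as e:
            if e.errno != errno.ENOTEMPTY or attempt == attempts - 1:
                raise
```

Retrying only hides the symptom. If a writer keeps running, the cleaner approach is the one described in the commit message in this thread: wait for pending tasks to terminate before deleting.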
We used to see concurrent write issues on Windows because of pyc creation
happening, but fixed it. This would happen during the build phase, not repo
phase.
Part of how the PYC creation process works is Python writes foo.pyc.NNNNN,
where N is a timestamp. Then it does mv(foo.pyc.NNN, foo.pyc) to perform an
atomic rename. This would happen after it created the __pycache__ directory
and within that directory (assuming it could; if it can't, it doesn't try
to precompile)
To fix this, we made two changes:
1. The globs ignore pyc files and __pycache__ directories
2. The repo files/dirs are all made read-only
On Windows, if a user is an admin, (2) doesn't actually work. Hence (1) is
the defense in that case.
On Mac/Linux, we never saw issues.
Hope this helps
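For readers unfamiliar with the pattern described above, here is a small sketch of the write-to-temp-then-rename step and of a file listing that skips bytecode. It is illustrative only: `write_atomically` and `source_files` are hypothetical helpers, the timestamp suffix follows the description in this comment rather than CPython's actual implementation, and the filter merely mimics what the rules_python globs are said to do.

```python
import os
import tempfile
import time
from pathlib import Path


def write_atomically(target: Path, data: bytes) -> None:
    """Write to a uniquely named sibling file, then rename it over the target."""
    tmp = target.with_name(f"{target.name}.{time.time_ns()}")
    tmp.write_bytes(data)
    # os.replace is an atomic rename when source and target are on the same
    # filesystem, so readers see either the old file or the complete new one.
    os.replace(tmp, target)


def source_files(root: Path):
    """List files under root, ignoring .pyc files and __pycache__ directories."""
    result = []
    for p in root.rglob("*"):
        rel = p.relative_to(root)
        if p.is_file() and "__pycache__" not in rel.parts and p.suffix != ".pyc":
            result.append(rel.as_posix())
    return sorted(result)
```

While the temporary file exists, the directory briefly contains an extra entry, which is why a concurrent recursive delete or a glob over the same tree can observe files it did not expect.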
Digging into why this happens in Docker running on MacOS, I see that it's using I've also checked past issues with Bazel and I'm trying to "prove this" by using Docker with
And... after that failure, I can't remove the bazel cache! 😅
I get the same errors even when trying to remove the folder outside Docker. I need to So, I have a feeling there's one or more bugs in Thanks for all your help! |
BTW, for completeness, I've also tried removing Docker Desktop and running docker cli with Lima:
and making sure the home and
all three builds also work! So, my guess remains as I mentioned in my last comment: something's going on with Docker Desktop and |
Description of the bug:
After upgrading to Bazel 7.2.0 and removing `--experimental_worker_for_repo_fetching=off` from `.bazelrc` (because #21803 was fixed), we started to observe failures of the following form:

These errors appear to be linked with repository fetch restarts, particularly "fetch interrupted due to memory pressure; restarting". In two of two cases where I have seen this error, it happened in a repo where the repo fetch got restarted, judging by either the presence of that log message or the JSON trace profile.
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
No response
Which operating system are you running Bazel on?
Linux
What is the output of `bazel info release`?
release 7.2.0
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse HEAD`?
No response
If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.
No response
Have you found anything relevant by searching the web?
Bazel Slack discussion: https://bazelbuild.slack.com/archives/CA31HN1T3/p1718057176146129
Any other information, logs, or outputs that you want to share?
No response