Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experimental_shell_command sometimes spins for a long time on "Scheduling" step #16825

Closed
danxmoran opened this issue Sep 12, 2022 · 24 comments
Closed
Assignees
Labels
backend: Shell Shell backend-related issues bug

Comments

@danxmoran
Copy link
Contributor

Describe the bug

I've been trying to plug our front-end build process into Pants in a minimal way so we can include static assets in our container builds. I have an experimental_shell_command that runs yarn install as part of this. The command ran once successfully, but e ever since that point when I try to re-run it Pants will spin for awhile on:

Scheduling: Running experimental_shell_command src/projects:yarn-install

When I run with -ldebug I see:

22:19:57.64 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=3
22:19:58.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=4
22:19:59.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=5
22:20:00.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=6
22:20:01.66 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=7
22:20:02.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=8
22:20:03.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=9
22:20:04.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=10
22:20:05.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=11
22:20:06.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=12
22:20:07.68 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=13
22:20:08.68 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=14
22:20:09.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=15
22:20:10.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=16
22:20:11.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=17
22:20:12.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=18
22:20:13.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=19
22:20:14.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=20
22:20:15.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=21
22:20:16.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=22
22:20:17.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=23
22:20:18.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=24
22:20:19.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=25
22:20:20.72 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=26
22:20:21.72 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=27
22:20:22.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=28
22:20:23.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=29
22:20:24.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=30
22:20:25.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=31
22:20:26.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=32
22:20:27.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=33
22:20:28.75 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=34
22:20:29.75 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=35
22:20:30.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=36
22:20:31.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=37
22:20:32.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=38
22:20:33.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=39
22:20:34.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=40
22:20:35.77 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=41
22:20:36.77 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=42
22:20:37.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=43
22:20:38.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=44
22:20:39.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=45
22:20:40.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=46
22:20:41.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=47
22:20:42.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=48
22:20:43.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=49
22:20:44.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=50
22:20:45.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=51
22:20:46.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=52
22:20:47.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=53
22:20:48.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=54
22:20:49.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=55
22:20:50.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=56
22:20:51.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=57
22:20:52.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=58
22:20:53.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=59
22:20:54.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=60
22:20:55.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=61
22:20:56.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=62
22:20:57.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=63
22:20:58.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=64
22:20:59.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=65
22:21:00.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=66
22:21:01.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=67
22:21:02.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=68
22:21:03.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=69
22:21:04.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=70
22:21:05.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=71
22:21:06.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=72
22:21:07.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=73
22:21:08.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=74
22:21:09.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=75
22:21:10.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=76
22:21:11.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=77
22:21:12.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=78
22:21:13.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=79
22:21:14.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=80
22:21:15.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=81
22:21:16.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=82
22:21:17.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=83
22:21:18.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=84
22:21:19.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=85
22:21:20.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=86

The step eventually (after ~200s) finishes and moves on.

Pants version

v2.13.0

OS

MacOS

Additional info

https://app.toolchain.com/organizations/color/repos/color/builds/pants_run_2022_09_12_08_55_13_980_5f4261a10dfe4c219fc7bde55d1031de/

@danxmoran danxmoran added the bug label Sep 12, 2022
@stuhood

This comment was marked as outdated.

@stuhood

This comment was marked as outdated.

@stuhood
Copy link
Member

stuhood commented Sep 13, 2022

From the additional trace-level trace that you provided offline, it looks like at least two things are happening:

  1. eager_fetch_action_cache is taking a very long time to determine that the cache entry is loadable locally.
    • I assume that you are running with cache_content_behavior="fetch" (the default)? If so, you might try capturing another trace with cache_content_behavior="validate" to see how much of a difference that makes. If it's significant, we can probably adjust that codepath to unconditionally use validate for a local cache.
  2. There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.
    • Even after potentially figuring out why the workunit is detached, I don't really see a good workaround for this one today... but see below.

Both of these point to the fact that having tons of small files as inputs/outputs to processes is currently relatively inefficient. Using zips/tars is a workaround, but in the medium term it's possible that we'll want to support using FUSE for these use-cases (see).

@stuhood
Copy link
Member

stuhood commented Sep 13, 2022

  1. There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.

I was wrong about #16848, so this is likely actually the root cause of the display issue: the stray setup_sandbox workunit runs for another ~100s after we've hit the cache for the process.

@stuhood
Copy link
Member

stuhood commented Sep 14, 2022

  • I assume that you are running with cache_content_behavior="fetch" (the default)? If so, you might try capturing another trace with cache_content_behavior="validate" to see how much of a difference that makes. If it's significant, we can probably adjust that codepath to unconditionally use validate for a local cache.

@danxmoran confirmed that switching from "fetch" to "validate" dropped the time for the local cache hit by 95% (which is probably at least partially due to #16627: thanks!).

Without changing the semantics of CacheContentBehavior (because we do still want to guarantee that all content exists locally when "fetch" is set), this suggests that we should optimize the implementation of "fetch" to be more like the implementation of "validate" -- by using exists checks locally, and then a more explicit method (rather than abusing load_bytes_with) for content that doesn't exist locally.


2. There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.

The next step for this one is fuzzier... but it probably involves trying to get a repro in pantsbuild/pants that I can iterate on with logging.

@stuhood
Copy link
Member

stuhood commented Nov 2, 2022

Without changing the semantics of CacheContentBehavior (because we do still want to guarantee that all content exists locally when "fetch" is set), this suggests that we should optimize the implementation of "fetch" to be more like the implementation of "validate" -- by using exists checks locally, and then a more explicit method (rather than abusing load_bytes_with) for content that doesn't exist locally.

Filed #17448 for this one.

@danxmoran
Copy link
Contributor Author

danxmoran commented Nov 2, 2022

I apparently forgot to link it here, but I did set up a public repository that repros the current behavior: https://github.com/danxmoran/pants-yarn-install-repro

@chrisjrn
Copy link
Contributor

chrisjrn commented Nov 8, 2022

@danxmoran I get a bunch of errors of the following form when I try to run :run-yarn-install, both with ./pants as well as pants_with_sources. Is this something you've seen before?

➤ YN0001: │ Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at new NodeError (node:internal/errors:393:5)
    at PassThrough.onclose (node:internal/streams/end-of-stream:142:30)
    at PassThrough.emit (node:events:513:28)
    at emitCloseNT (node:internal/streams/destroy:132:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)

@chrisjrn
Copy link
Contributor

chrisjrn commented Nov 8, 2022

(it looks like the above is an issue on Mac OS w/ arm64; I'm trying under x86 emulation, and behaviour seems to be more in line with the above) Nope, under x86 emulation, I get the same issues, it just takes 6 times longer to get the errors :D

@chrisjrn
Copy link
Contributor

chrisjrn commented Nov 8, 2022

(or not; see edited comment)

@danxmoran
Copy link
Contributor Author

danxmoran commented Nov 8, 2022

What version of node and yarn do you have installed? I have node v16.14.0 and yarn v1.22.10.

(Note that yarn --version from within the repo will print 3.0.2 because of the config + vendored release, so I had to check from outside the repo to see what version is being used to bootstrap running the vendored version)

@chrisjrn
Copy link
Contributor

chrisjrn commented Nov 8, 2022

chrisjrn@chrisjrns-MacBook-Pro ~ % node --version && yarn --version
v19.0.1
1.22.19

@danxmoran
Copy link
Contributor Author

Ugh, my guess is the newer node - are you able to downgrade?

@chrisjrn
Copy link
Contributor

chrisjrn commented Nov 8, 2022

I actually upgraded form a reasonably old node (though I didn't check the version) in order to see if node version was the issue.

I'll see if I can use the same version of node as you

@chrisjrn
Copy link
Contributor

chrisjrn commented Nov 8, 2022

Yup, this appears to be working now!

@chrisjrn
Copy link
Contributor

After some experimentation, I found that there's been a few regressions in experimental_shell_command interplaying here. There's at least one commit on main (be00ed5) where the reproduction case runs surprisingly quickly. I'm trying to work out what's regressed since then so that we can get to a point where we can make it really fast (I think we can).

@chrisjrn
Copy link
Contributor

So the good news is that the bug specific bug reported here was fixed on main at one point -- you can see that in be00ed5. It's fast, and with work on symlinks, it should be even more so.

The bad news is that d8c0488 triggered some other bug that is causing consistent cache misses for the yarn install command.

@chrisjrn
Copy link
Contributor

(To reproduce on main: ./pants_from_sources -ldebug run :run-yarn-install . Look for:

Error loading process execution result from local cache: Computed digest for Snapshot

)

@chrisjrn
Copy link
Contributor

Update: We've got this to a point where on be00ed5, we can get the 3rd build and onwards to run in ~3 seconds by installing node_modules as an immutable_input_digests value, provided you set the pantsd memory limit to 4GiB (which is unusually high, and not entirely explained by the 800k+ files in the node_modules directory).

@stuhood is going to look at #17542 to see if the memory issue has gone away on main before investigating higher memory usage.

chrisjrn pushed a commit that referenced this issue Nov 14, 2022
While working on #16825, we discovered that parts of the trace for our `experimental_run_shell_command` were not captured. That's because there was no work unit preserved.

This adds such a work unit.
@chrisjrn
Copy link
Contributor

Checking back in on the current main, the cache miss issues have gone away when immutable_input_digests are used. Memory usage is still quite high, but I can successfully get an "instant" build if the pantsd memory limit is set to 2GiB instead of 4, which is somewhat of an improvement.

@chrisjrn
Copy link
Contributor

And just for fun, I'm running on an unadorned main, and I think I'm now prepared to close this issue out:

I'm running time ./pants_from_sources run :run-yarn-install:

First run:
./pants_from_sources run :run-yarn-install 87.75s user 125.52s system 239% cpu 1:29.24 total

Second run:
./pants_from_sources run :run-yarn-install 8.78s user 26.67s system 274% cpu 12.913 total

Third run:
./pants_from_sources run :run-yarn-install 8.73s user 27.10s system 282% cpu 12.681 total

@danxmoran to reproduce, try using PANTS_SHA=0a3d096c5e6c2915274826e96b4076bc3a6dfab7. Let me know if you get similar results!

Based on my tests when immutable_input_digests are used, the third+ run should result in ~3s execution times for :run-yarn-install when #17282 is adequately in place, but my feeling is that the current state of main is probably fine to close out this issue. @stuhood?

@stuhood
Copy link
Member

stuhood commented Nov 15, 2022

I think that we should probably still open a ticket about the memory usage, as it is surprising. But yea, this is closer to reasonable.

@chrisjrn
Copy link
Contributor

@stuhood I agree about the memory usage needing a separate ticket, but IMHO performance even without the immutable inputs is probably tolerable at this point.

@chrisjrn
Copy link
Contributor

I think we can close this one. @danxmoran please reopen if you're still experiencing untolerable performance on main.

stuhood pushed a commit to stuhood/pants that referenced this issue Jan 14, 2023
While working on pantsbuild#16825, we discovered that parts of the trace for our `experimental_run_shell_command` were not captured. That's because there was no work unit preserved.

This adds such a work unit.
stuhood added a commit that referenced this issue Jan 14, 2023
)

While working on #16825, we discovered that parts of the trace for our `experimental_run_shell_command` were not captured. That's because there was no work unit preserved.

This adds such a work unit.

Co-authored-by: Christopher Neugebauer <chrisjrn@toolchain.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: Shell Shell backend-related issues bug
Projects
None yet
Development

No branches or pull requests

5 participants