`experimental_shell_command` sometimes spins for a long time on "Scheduling" step #16825

danxmoran · 2022-09-12T12:55:42Z

Describe the bug

I've been trying to plug our front-end build process into Pants in a minimal way so we can include static assets in our container builds. I have an experimental_shell_command that runs yarn install as part of this. The command ran once successfully, but e ever since that point when I try to re-run it Pants will spin for awhile on:

Scheduling: Running experimental_shell_command src/projects:yarn-install

When I run with -ldebug I see:

22:19:57.64 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=3
22:19:58.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=4
22:19:59.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=5
22:20:00.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=6
22:20:01.66 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=7
22:20:02.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=8
22:20:03.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=9
22:20:04.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=10
22:20:05.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=11
22:20:06.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=12
22:20:07.68 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=13
22:20:08.68 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=14
22:20:09.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=15
22:20:10.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=16
22:20:11.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=17
22:20:12.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=18
22:20:13.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=19
22:20:14.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=20
22:20:15.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=21
22:20:16.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=22
22:20:17.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=23
22:20:18.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=24
22:20:19.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=25
22:20:20.72 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=26
22:20:21.72 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=27
22:20:22.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=28
22:20:23.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=29
22:20:24.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=30
22:20:25.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=31
22:20:26.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=32
22:20:27.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=33
22:20:28.75 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=34
22:20:29.75 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=35
22:20:30.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=36
22:20:31.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=37
22:20:32.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=38
22:20:33.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=39
22:20:34.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=40
22:20:35.77 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=41
22:20:36.77 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=42
22:20:37.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=43
22:20:38.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=44
22:20:39.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=45
22:20:40.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=46
22:20:41.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=47
22:20:42.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=48
22:20:43.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=49
22:20:44.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=50
22:20:45.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=51
22:20:46.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=52
22:20:47.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=53
22:20:48.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=54
22:20:49.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=55
22:20:50.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=56
22:20:51.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=57
22:20:52.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=58
22:20:53.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=59
22:20:54.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=60
22:20:55.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=61
22:20:56.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=62
22:20:57.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=63
22:20:58.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=64
22:20:59.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=65
22:21:00.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=66
22:21:01.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=67
22:21:02.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=68
22:21:03.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=69
22:21:04.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=70
22:21:05.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=71
22:21:06.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=72
22:21:07.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=73
22:21:08.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=74
22:21:09.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=75
22:21:10.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=76
22:21:11.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=77
22:21:12.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=78
22:21:13.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=79
22:21:14.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=80
22:21:15.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=81
22:21:16.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=82
22:21:17.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=83
22:21:18.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=84
22:21:19.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=85
22:21:20.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=86

The step eventually (after ~200s) finishes and moves on.

Pants version

v2.13.0

OS

MacOS

Additional info

https://app.toolchain.com/organizations/color/repos/color/builds/pants_run_2022_09_12_08_55_13_980_5f4261a10dfe4c219fc7bde55d1031de/

The text was updated successfully, but these errors were encountered:

stuhood · 2022-09-13T21:27:49Z

From the additional trace-level trace that you provided offline, it looks like at least two things are happening:

eager_fetch_action_cache is taking a very long time to determine that the cache entry is loadable locally.
- I assume that you are running with cache_content_behavior="fetch" (the default)? If so, you might try capturing another trace with cache_content_behavior="validate" to see how much of a difference that makes. If it's significant, we can probably adjust that codepath to unconditionally use validate for a local cache.
There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.
- Even after potentially figuring out why the workunit is detached, I don't really see a good workaround for this one today... but see below.

Both of these point to the fact that having tons of small files as inputs/outputs to processes is currently relatively inefficient. Using zips/tars is a workaround, but in the medium term it's possible that we'll want to support using FUSE for these use-cases (see).

stuhood · 2022-09-13T21:33:04Z

There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.

I was wrong about #16848, so this is likely actually the root cause of the display issue: the stray setup_sandbox workunit runs for another ~100s after we've hit the cache for the process.

stuhood · 2022-09-14T21:08:39Z

I assume that you are running with cache_content_behavior="fetch" (the default)? If so, you might try capturing another trace with cache_content_behavior="validate" to see how much of a difference that makes. If it's significant, we can probably adjust that codepath to unconditionally use validate for a local cache.

@danxmoran confirmed that switching from "fetch" to "validate" dropped the time for the local cache hit by 95% (which is probably at least partially due to #16627: thanks!).

Without changing the semantics of CacheContentBehavior (because we do still want to guarantee that all content exists locally when "fetch" is set), this suggests that we should optimize the implementation of "fetch" to be more like the implementation of "validate" -- by using exists checks locally, and then a more explicit method (rather than abusing load_bytes_with) for content that doesn't exist locally.

2. There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.

The next step for this one is fuzzier... but it probably involves trying to get a repro in pantsbuild/pants that I can iterate on with logging.

stuhood · 2022-11-02T22:24:43Z

Without changing the semantics of CacheContentBehavior (because we do still want to guarantee that all content exists locally when "fetch" is set), this suggests that we should optimize the implementation of "fetch" to be more like the implementation of "validate" -- by using exists checks locally, and then a more explicit method (rather than abusing load_bytes_with) for content that doesn't exist locally.

Filed #17448 for this one.

danxmoran · 2022-11-02T22:27:26Z

I apparently forgot to link it here, but I did set up a public repository that repros the current behavior: https://github.com/danxmoran/pants-yarn-install-repro

chrisjrn · 2022-11-08T18:34:07Z

@danxmoran I get a bunch of errors of the following form when I try to run :run-yarn-install, both with ./pants as well as pants_with_sources. Is this something you've seen before?

➤ YN0001: │ Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at new NodeError (node:internal/errors:393:5)
    at PassThrough.onclose (node:internal/streams/end-of-stream:142:30)
    at PassThrough.emit (node:events:513:28)
    at emitCloseNT (node:internal/streams/destroy:132:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)

chrisjrn · 2022-11-08T18:55:56Z

~~(it looks like the above is an issue on Mac OS w/ arm64; I'm trying under x86 emulation, and behaviour seems to be more in line with the above)~~ Nope, under x86 emulation, I get the same issues, it just takes 6 times longer to get the errors :D

chrisjrn · 2022-11-08T18:58:03Z

(or not; see edited comment)

danxmoran · 2022-11-08T19:00:20Z

What version of node and yarn do you have installed? I have node v16.14.0 and yarn v1.22.10.

(Note that yarn --version from within the repo will print 3.0.2 because of the config + vendored release, so I had to check from outside the repo to see what version is being used to bootstrap running the vendored version)

chrisjrn · 2022-11-08T19:01:51Z

chrisjrn@chrisjrns-MacBook-Pro ~ % node --version && yarn --version
v19.0.1
1.22.19

danxmoran · 2022-11-08T19:02:58Z

Ugh, my guess is the newer node - are you able to downgrade?

chrisjrn · 2022-11-08T19:05:42Z

I actually upgraded form a reasonably old node (though I didn't check the version) in order to see if node version was the issue.

I'll see if I can use the same version of node as you

chrisjrn · 2022-11-08T19:52:48Z

Yup, this appears to be working now!

chrisjrn · 2022-11-10T21:01:07Z

After some experimentation, I found that there's been a few regressions in experimental_shell_command interplaying here. There's at least one commit on main (be00ed5) where the reproduction case runs surprisingly quickly. I'm trying to work out what's regressed since then so that we can get to a point where we can make it really fast (I think we can).

chrisjrn · 2022-11-11T00:20:10Z

So the good news is that the bug specific bug reported here was fixed on main at one point -- you can see that in be00ed5. It's fast, and with work on symlinks, it should be even more so.

The bad news is that d8c0488 triggered some other bug that is causing consistent cache misses for the yarn install command.

chrisjrn · 2022-11-11T00:25:44Z

(To reproduce on main: ./pants_from_sources -ldebug run :run-yarn-install . Look for:

Error loading process execution result from local cache: Computed digest for Snapshot

)

chrisjrn · 2022-11-14T20:32:02Z

Update: We've got this to a point where on be00ed5, we can get the 3rd build and onwards to run in ~3 seconds by installing node_modules as an immutable_input_digests value, provided you set the pantsd memory limit to 4GiB (which is unusually high, and not entirely explained by the 800k+ files in the node_modules directory).

@stuhood is going to look at #17542 to see if the memory issue has gone away on main before investigating higher memory usage.

While working on #16825, we discovered that parts of the trace for our `experimental_run_shell_command` were not captured. That's because there was no work unit preserved. This adds such a work unit.

chrisjrn · 2022-11-15T16:55:17Z

Checking back in on the current main, the cache miss issues have gone away when immutable_input_digests are used. Memory usage is still quite high, but I can successfully get an "instant" build if the pantsd memory limit is set to 2GiB instead of 4, which is somewhat of an improvement.

chrisjrn · 2022-11-15T17:09:17Z

And just for fun, I'm running on an unadorned main, and I think I'm now prepared to close this issue out:

I'm running time ./pants_from_sources run :run-yarn-install:

First run:
./pants_from_sources run :run-yarn-install 87.75s user 125.52s system 239% cpu 1:29.24 total

Second run:
./pants_from_sources run :run-yarn-install 8.78s user 26.67s system 274% cpu 12.913 total

Third run:
./pants_from_sources run :run-yarn-install 8.73s user 27.10s system 282% cpu 12.681 total

@danxmoran to reproduce, try using PANTS_SHA=0a3d096c5e6c2915274826e96b4076bc3a6dfab7. Let me know if you get similar results!

Based on my tests when immutable_input_digests are used, the third+ run should result in ~3s execution times for :run-yarn-install when #17282 is adequately in place, but my feeling is that the current state of main is probably fine to close out this issue. @stuhood?

stuhood · 2022-11-15T17:40:20Z

I think that we should probably still open a ticket about the memory usage, as it is surprising. But yea, this is closer to reasonable.

chrisjrn · 2022-11-15T18:14:12Z

@stuhood I agree about the memory usage needing a separate ticket, but IMHO performance even without the immutable inputs is probably tolerable at this point.

chrisjrn · 2022-11-22T17:35:40Z

I think we can close this one. @danxmoran please reopen if you're still experiencing untolerable performance on main.

While working on pantsbuild#16825, we discovered that parts of the trace for our `experimental_run_shell_command` were not captured. That's because there was no work unit preserved. This adds such a work unit.

) While working on #16825, we discovered that parts of the trace for our `experimental_run_shell_command` were not captured. That's because there was no work unit preserved. This adds such a work unit. Co-authored-by: Christopher Neugebauer <chrisjrn@toolchain.com>

danxmoran added the bug label Sep 12, 2022

This comment was marked as outdated.

Sign in to view

stuhood mentioned this issue Sep 13, 2022

Render experimental_(run_)shell_command at Debug level. #16848

Closed

This comment was marked as outdated.

Sign in to view

thejcannon added the backend: Shell Shell backend-related issues label Sep 30, 2022

stuhood assigned benjyw Oct 12, 2022

benjyw assigned chrisjrn Nov 1, 2022

stuhood mentioned this issue Nov 2, 2022

Adjust cache_content_behavior="fetch" to skip loading of data #17448

Closed

stuhood unassigned benjyw Nov 10, 2022

chrisjrn assigned stuhood Nov 11, 2022

chrisjrn mentioned this issue Nov 14, 2022

experimental_shell_command can trigger consistent cache misses #17542

Closed

chrisjrn mentioned this issue Nov 14, 2022

Adds workunit for interactive processes. #17544

Merged

chrisjrn mentioned this issue Nov 15, 2022

Immutably hardlink "large" files in a sandbox #17520

Merged

chrisjrn mentioned this issue Nov 15, 2022

Large node_modules directories exhaust pantsd's memory limit #17553

Open

chrisjrn closed this as completed Nov 22, 2022

stuhood mentioned this issue Jan 14, 2023

Adds workunit for interactive processes. (Cherry-pick of #17544) #17996

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`experimental_shell_command` sometimes spins for a long time on "Scheduling" step #16825

`experimental_shell_command` sometimes spins for a long time on "Scheduling" step #16825

danxmoran commented Sep 12, 2022

This comment was marked as outdated.

This comment was marked as outdated.

stuhood commented Sep 13, 2022

stuhood commented Sep 13, 2022 •

edited

Loading

stuhood commented Sep 14, 2022 •

edited

Loading

stuhood commented Nov 2, 2022

danxmoran commented Nov 2, 2022 •

edited

Loading

chrisjrn commented Nov 8, 2022

chrisjrn commented Nov 8, 2022 •

edited

Loading

chrisjrn commented Nov 8, 2022

danxmoran commented Nov 8, 2022 •

edited

Loading

chrisjrn commented Nov 8, 2022

danxmoran commented Nov 8, 2022

chrisjrn commented Nov 8, 2022

chrisjrn commented Nov 8, 2022

chrisjrn commented Nov 10, 2022

chrisjrn commented Nov 11, 2022

chrisjrn commented Nov 11, 2022

chrisjrn commented Nov 14, 2022

chrisjrn commented Nov 15, 2022

chrisjrn commented Nov 15, 2022

stuhood commented Nov 15, 2022

chrisjrn commented Nov 15, 2022

chrisjrn commented Nov 22, 2022

experimental_shell_command sometimes spins for a long time on "Scheduling" step #16825

experimental_shell_command sometimes spins for a long time on "Scheduling" step #16825

Comments

danxmoran commented Sep 12, 2022

This comment was marked as outdated.

This comment was marked as outdated.

stuhood commented Sep 13, 2022

stuhood commented Sep 13, 2022 • edited Loading

stuhood commented Sep 14, 2022 • edited Loading

stuhood commented Nov 2, 2022

danxmoran commented Nov 2, 2022 • edited Loading

chrisjrn commented Nov 8, 2022

chrisjrn commented Nov 8, 2022 • edited Loading

chrisjrn commented Nov 8, 2022

danxmoran commented Nov 8, 2022 • edited Loading

chrisjrn commented Nov 8, 2022

danxmoran commented Nov 8, 2022

chrisjrn commented Nov 8, 2022

chrisjrn commented Nov 8, 2022

chrisjrn commented Nov 10, 2022

chrisjrn commented Nov 11, 2022

chrisjrn commented Nov 11, 2022

chrisjrn commented Nov 14, 2022

chrisjrn commented Nov 15, 2022

chrisjrn commented Nov 15, 2022

stuhood commented Nov 15, 2022

chrisjrn commented Nov 15, 2022

chrisjrn commented Nov 22, 2022

`experimental_shell_command` sometimes spins for a long time on "Scheduling" step #16825

`experimental_shell_command` sometimes spins for a long time on "Scheduling" step #16825

stuhood commented Sep 13, 2022 •

edited

Loading

stuhood commented Sep 14, 2022 •

edited

Loading

danxmoran commented Nov 2, 2022 •

edited

Loading

chrisjrn commented Nov 8, 2022 •

edited

Loading

danxmoran commented Nov 8, 2022 •

edited

Loading