Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Android builds fail on Darwin due to watchman issues #13783

Closed
jakubgs opened this issue Aug 9, 2022 · 17 comments
Closed

Android builds fail on Darwin due to watchman issues #13783

jakubgs opened this issue Aug 9, 2022 · 17 comments
Assignees
Labels

Comments

@jakubgs
Copy link
Member

jakubgs commented Aug 9, 2022

Currently if someone tries to build Android on a Darwin OS using a multi-user Nix installation they will get this from Gradle:

> Task :app:bundleReleaseJsAndAssets
warning: the transform cache was reset.
2022-08-09T13:21:32,914: [cli] unable to talk to your watchman on /tmp/tmp-status-mobile-a94797710/jenkins-state/sock! (Permission denied)

Watchman:  watchman --no-pretty get-sockname returned with exit code=1, signal=null, stderr= 2022-08-09T13:21:32,914: [cli] unable to talk to your watchman on /tmp/tmp-status-mobile-a94797710/jenkins-state/sock! (Permission denied)

Which is caused by limited permissions of the watchman socket:

jenkins@maci7-03.ms-eu-dublin.slave.ci:~/status-mobile % ls -l /tmp/tmp-status-mobile-a94797710/jenkins-state/sock 
srw-------  1 jenkins  wheel  0 Aug  9 14:14 /tmp/tmp-status-mobile-a94797710/jenkins-state/sock
jenkins@maci7-03.ms-eu-dublin.slave.ci:~/status-mobile % ls -l /tmp/tmp-status-mobile-a94797710/             
total 40
-rw-r--r--  1 jenkins  wheel  16527 Aug  9 15:52 env-vars
drwx------  5 jenkins  wheel    160 Aug  9 14:14 jenkins-state

Created by the scripts/build-android.sh script:

if [[ "$(uname -s)" =~ Darwin ]]; then
# Start a watchman instance if not started already and store its socket path.
# In order to get access to the right versions of watchman and jq,
# we start an ad-hoc nix-shell that imports the packages from nix/nixpkgs-bootstrap.
WATCHMAN_SOCKFILE=$(watchman get-sockname --no-pretty | jq -r .sockname)

Since builds are run as nixbld* users we either have to fix permissions, or change how we start Watchman in the first place.

@jakubgs jakubgs self-assigned this Aug 9, 2022
@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

If I try to drop starting Watchman entirely, and remove it from the Android build, I get this:

> Task :app:bundleReleaseJsAndAssets
warning: the transform cache was reset.
                 Welcome to React Native!
                Learn once, write anywhere


node:events:505
      throw er; // Unhandled 'error' event
      ^

Error: EMFILE: too many open files, watch
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:204:21)
Emitted 'error' event on NodeWatcher instance at:
    at NodeWatcher.checkedEmitError (/private/tmp/nix-build-status-mobile-build-nightly-android.drv-0/node_modules/sane/src/node_watcher.js:143:12)
    at FSWatcher.emit (node:events:527:28)
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:210:12) {
  errno: -24,
  syscall: 'watch',
  code: 'EMFILE',
  filename: null
}

> Task :app:bundleReleaseJsAndAssets FAILED

Which makes sense, since that's kinda what Watchman fixes for a lot of people:

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

Normally, we provide a wrapped Watchman instance which already has a socket path specified:

makeWrapper ${watchman}/bin/watchman \
$out/bin/watchman \
--add-flags "--sockname=${watchmanSockPath}"

But if I try to provide unwrapped Watchman directly to the Android build, it fails with:

[cli] Failed to open /var/empty/Library/LaunchAgents/com.github.facebook.watchman.plist for write: No such file or directory

Watchman:  watchman --no-pretty get-sockname returned with exit code=null, signal=SIGABRT, stderr= 2022-08-09T15:47:23,705: [cli] Failed to open /var/empty/Library/LaunchAgents/com.github.facebook.watchman.plist for write: No such file or directory

Error! Failed to open file: /private/tmp/nix-build-status-mobile-build-nightly-android.drv-0/android/app/build/generated/assets/react/release/index.android.bundle

> Task :app:bundleReleaseJsAndAssets FAILED

Which is something I have not seen before.

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

Interestingly enough, this was originally introduced to Fix Android build on macOS by Pedro: 46e3b52

So I assume this was done to fix the EMFILE: too many open files error seen above.

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

It appears the com.github.facebook.watchman.plist path can be found in the spawn_via_launchd() function:

  snprintf(
      plist_path,
      sizeof(plist_path),
      "%s/Library/LaunchAgents/com.github.facebook.watchman.plist",
      pw->pw_dir);

https://github.com/facebook/watchman/blob/208a092dc895fcbea9d52de2ebaf84947ba24c77/watchman/main.cpp#L466-L470

And inside of try_spawn_watchman() function we can see this:

#if defined(__APPLE__)
  return spawn_via_launchd();
#elif defined(_WIN32)
  return spawn_win32(daemon_argv);
#else
  return run_service_as_daemon();
#endif

https://github.com/facebook/watchman/blob/208a092dc895fcbea9d52de2ebaf84947ba24c77/watchman/main.cpp#L969-L976

So it appears the default way Watchman wants to spawn under Darwin is using Launchd, but that can't work in Nix sandbox.

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

The docs aren't very helpful in understanding how the path is actually determined:

If you’re integrating against watchman using the unix socket and either the JSON or BSER protocol, you may need to discover the correct socket path. Rather than hard-coding the path or replicating the logic discussed in Command Line, you can simply execute the CLI to determine the path. This has the side effect of spawning the service for your user if it was not already running–bonus!

https://facebook.github.io/watchman/docs/cmd/get-sockname.html

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

The get_unix_sock_name and get_named_pipe_sock_path functions just return values from flags:

const std::string& get_unix_sock_name() {
  return flags.unix_sock_name;
}

const std::string& get_named_pipe_sock_path() {
  return flags.named_pipe_path;
}

https://github.com/facebook/watchman/blob/208a092dc895fcbea9d52de2ebaf84947ba24c77/watchman/sockname.cpp#L25-L31

The unix_sock_name flag is the --unix-listener-path CLI flag:

    {"unix-listener-path",
     'u',
#ifdef _WIN32
     "Specify alternate unix domain socket path (specifying this will disable"
     " named pipes unless `--named-pipe-path` is specified)",
#else
     "Specify alternate unix domain socket path",
#endif
     REQ_STRING,
     &flags.unix_sock_name,
     "PATH",
     IS_DAEMON},

https://github.com/facebook/watchman/blob/208a092dc895fcbea9d52de2ebaf84947ba24c77/watchman/Options.cpp#L85-L96

Which default value is generated by compute_file_name:

  compute_file_name(flags.unix_sock_name, user, "sock", "sockname");

https://github.com/facebook/watchman/blob/208a092dc895fcbea9d52de2ebaf84947ba24c77/watchman/main.cpp#L802

Which is defined here:
https://github.com/facebook/watchman/blob/208a092dc895fcbea9d52de2ebaf84947ba24c77/watchman/main.cpp#L698

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

According to the socket docs:

The server will create a unix domain socket for communication with its clients. The location of the socket depends on compile time options and command line flags. It is recommended that you invoke watchman get-sockname to discover the location, or if you are being invoked via a trigger (since version 2.9.7) you will find the location in the $WATCHMAN_SOCK environmental variable.

https://facebook.github.io/watchman/docs/socket-interface.html

We might be able to use the WATCHMAN_SOCK env variable to enforce our own path.

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

No, that doesn't seem to work:

 > export WATCHMAN_SOCK=/totally/legit/path
 > watchman get-sockname                   
{
    "version": "4.9.0",
    "sockname": "/tmp/jakubgs-state/sock"
}

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

This flag might be of interest:

    {"no-site-spawner",
     'S',
     "Don't use the site or system spawner",
     OPT_NONE,
     &flags.no_site_spawner,
     NULL,
     IS_DAEMON},

https://github.com/facebook/watchman/blob/208a092dc895fcbea9d52de2ebaf84947ba24c77/watchman/Options.cpp#L40-L46

Which does this:

  if (flags.no_site_spawner) {
    // The astute reader will notice this we're calling run_service_as_daemon()
    // here and not the various other platform spawning functions in the block
    // further below in this function.  This is deliberate: we want
    // to do the most simple background running possible when the
    // no_site_spawner flag is used.   In the future we plan to
    // migrate the platform spawning functions to use the site_spawn
    // functionality.
    return run_service_as_daemon();
  }

https://github.com/facebook/watchman/blob/208a092dc895fcbea9d52de2ebaf84947ba24c77/watchman/main.cpp#L950-L959

Which could possibly a way to have it spawn a daemon without using Launchd.

But it appears this flag was introduced in:

Which was only introduced in release v2020.05.11.00.

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

I tried building most recent v2022.08.08.00 and the v2020.05.11.00 but for both I get:

Assessing boost...
Traceback (most recent call last):
  File "/build/source/build/fbcode_builder/getdeps.py", line 1275, in <module>
    sys.exit(main())
  File "/build/source/build/fbcode_builder/getdeps.py", line 1258, in main
    return args.func(args)
  File "/build/source/build/fbcode_builder/getdeps.py", line 109, in run
    self.run_project_cmd(args, loader, manifest)
  File "/build/source/build/fbcode_builder/getdeps.py", line 575, in run_project_cmd
    cached_project = CachedProject(cache, loader, m)
  File "/build/source/build/fbcode_builder/getdeps.py", line 227, in __init__
    self.cache_file_name = "-".join(
TypeError: sequence item 1: expected str instance, NoneType found

Which appears to be due to them having some kind of weird custom Facebook-specific dependency fetching code:

The main entry point is the getdeps.py script. This script has several subcommands, but the most notable is the build command. This will download and build all dependencies for a project, and then build the project itself.

https://github.com/facebook/watchman/tree/v2022.08.08.00/build/fbcode_builder

Because we can never reinvent the wheel enough times...

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

So, what are my options:

  1. Try to get new release of Watchman built with Nix and use the new --no-site-spawner flag.
    • This might not even be doable with the fbcode_builder garbage in the way.
  2. Try to start Watchman in the Nix build sandbox context instead of user context.
    • Might be cleaner and simpler than trying to pass config values and paths into the sandbox.
  3. Try to fix the EMFILE: too many open files error and bypass Watchman entirely.
    • Sounds like a good idea, but could be quite the rabbit hole to go down.
  4. Leave current behavior and just adjust permissions of the created socket.
    • Ugly, but would work, and would not mess with a well tested way of doing things.

I'm learning towards either option 2 or 3.

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

I tried editing the NumberOfFiles setting for the nix-daemon on the MacOS host, but that doesn't solve the error:

admin@maci7-03.ms-eu-dublin.slave.ci:~ % sudo grep -A3 SoftResourceLimits /Library/LaunchDaemons/org.nixos.nix-daemon.plist
    <key>SoftResourceLimits</key>
    <dict>
      <key>NumberOfFiles</key>
      <integer>65535</integer>

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

If I run ulimit -Sf in the Nix build sandbox shell I just get unlimited, which is useless.

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

One idea I had to fix MFILE: too many open files was to use --no-watch-fs/org.gradle.vfs.watch=false but unfortunately this option is only available starting from Gradle 6:
https://docs.gradle.org/6.8.3/userguide/gradle_daemon.html#sec:daemon_watch_fs

And we are stuck on Gradle 5 until we are able to upgrade to React-Native v68:

And that is blocked on some react-native-navigation issues:

And also this might not be the right way to go about it, since the watching might be done by React Native and not Gradle.

@jakubgs
Copy link
Member Author

jakubgs commented Aug 9, 2022

While looking into this issue:

I found this suggestion from 2019:

sed -i -e 's/watch: true/watch: false/' node_modules/metro/src/node-haste/DependencyGraph.js

But when I look at the same file in our node_modules I can see this:

      watch: watch == null ? !ci.isCI : watch

Which uses the ci-info package, which checks environment variables, like CI:

exports.isCI = !!(
  env.CI || // Travis CI, CircleCI, Cirrus CI, Gitlab CI, Appveyor, CodeShip, dsari
  env.CONTINUOUS_INTEGRATION || // Travis CI, Cirrus CI
  env.BUILD_NUMBER || // Jenkins, TeamCity
  env.RUN_ID || // TaskCluster, dsari
  exports.name ||
  false
)

https://github.com/watson/ci-info/blob/18fe88088b626a6dcefc66e66fb32badb05ea216/index.js#L52-L59

So setting CI=true might be the ticket to stopping metro from using jest-haste-map with watch=true.

jakubgs added a commit that referenced this issue Aug 9, 2022
This passing of Watchman socket was implemented in order to avoid this:
```
Error: EMFILE: too many open files, watch
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:204:21)
Emitted 'error' event on NodeWatcher instance at:
    at NodeWatcher.checkedEmitError (/private/tmp/nix-build-status-mobile-build-nightly-android.drv-0/node_modules/sane/src/node_watcher.js:143:12)
    at FSWatcher.emit (node:events:527:28)
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:210:12) {
  errno: -24,
  syscall: 'watch',
  code: 'EMFILE',
  filename: null
}
```
Which is caused by `jest-haste-map` used by `metro` starting to watch
the filesystem for file changes, which is pointless when doing a
one-off build using Nix.

By entirely dropping use of Watchman we also fix the following issue:
```
[cli] unable to talk to your watchman on /tmp/tmp-status-mobile-ABC/jenkins-state/sock! (Permission denied)
```
Which happens on multi-user Nix installations becuase the user that the
Nix build is executed as is not the same as the user that starts
Watchman and creates the socket file.

Issue: #13783

Signed-off-by: Jakub Sokołowski <jakub@status.im>
@chrispader
Copy link

chrispader commented Aug 9, 2022

So setting CI=true might be the ticket to stopping metro from using jest-haste-map with watch=true.

But this wouldn't fix the issue for local builds on Darwin right?

jakubgs added a commit that referenced this issue Aug 9, 2022
This passing of Watchman socket was implemented in order to avoid this:
```
Error: EMFILE: too many open files, watch
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:204:21)
Emitted 'error' event on NodeWatcher instance at:
    at NodeWatcher.checkedEmitError (/private/tmp/nix-build-status-mobile-build-nightly-android.drv-0/node_modules/sane/src/node_watcher.js:143:12)
    at FSWatcher.emit (node:events:527:28)
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:210:12) {
  errno: -24,
  syscall: 'watch',
  code: 'EMFILE',
  filename: null
}
```
Which is caused by `jest-haste-map` used by `metro` starting to watch
the filesystem for file changes, which is pointless when doing a
one-off build using Nix.

But by setting `CI=true` we can make `metro` not start this waching of
files in the first place, removing the need for use of Watchman entirely.

By entirely dropping use of Watchman we also fix the following issue:
```
[cli] unable to talk to your watchman on /tmp/tmp-status-mobile-ABC/jenkins-state/sock! (Permission denied)
```
Which happens on multi-user Nix installations becuase the user that the
Nix build is executed as is not the same as the user that starts
Watchman and creates the socket file.

Issue: #13783

Signed-off-by: Jakub Sokołowski <jakub@status.im>
jakubgs added a commit that referenced this issue Aug 9, 2022
This passing of Watchman socket was implemented in order to avoid this:
```
Error: EMFILE: too many open files, watch
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:204:21)
Emitted 'error' event on NodeWatcher instance at:
    at NodeWatcher.checkedEmitError (/private/tmp/nix-build-status-mobile-build-nightly-android.drv-0/node_modules/sane/src/node_watcher.js:143:12)
    at FSWatcher.emit (node:events:527:28)
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:210:12) {
  errno: -24,
  syscall: 'watch',
  code: 'EMFILE',
  filename: null
}
```
Which is caused by `jest-haste-map` used by `metro` starting to watch
the filesystem for file changes, which is pointless when doing a
one-off build using Nix.

But by setting `CI=true` we can make `metro` not start this waching of
files in the first place, removing the need for use of Watchman entirely.

By entirely dropping use of Watchman we also fix the following issue:
```
[cli] unable to talk to your watchman on /tmp/tmp-status-mobile-ABC/jenkins-state/sock! (Permission denied)
```
Which happens on multi-user Nix installations becuase the user that the
Nix build is executed as is not the same as the user that starts
Watchman and creates the socket file.

Issue: #13783

Signed-off-by: Jakub Sokołowski <jakub@status.im>
jakubgs added a commit that referenced this issue Aug 10, 2022
This passing of Watchman socket was implemented in order to avoid this:
```
Error: EMFILE: too many open files, watch
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:204:21)
Emitted 'error' event on NodeWatcher instance at:
    at NodeWatcher.checkedEmitError (/private/tmp/nix-build-status-mobile-build-nightly-android.drv-0/node_modules/sane/src/node_watcher.js:143:12)
    at FSWatcher.emit (node:events:527:28)
    at FSEvent.FSWatcher._handle.onchange (node:internal/fs/watchers:210:12) {
  errno: -24,
  syscall: 'watch',
  code: 'EMFILE',
  filename: null
}
```
Which is caused by `jest-haste-map` used by `metro` starting to watch
the filesystem for file changes, which is pointless when doing a
one-off build using Nix.

But by setting `CI=true` we can make `metro` not start this waching of
files in the first place, removing the need for use of Watchman entirely.

By entirely dropping use of Watchman we also fix the following issue:
```
[cli] unable to talk to your watchman on /tmp/tmp-status-mobile-ABC/jenkins-state/sock! (Permission denied)
```
Which happens on multi-user Nix installations becuase the user that the
Nix build is executed as is not the same as the user that starts
Watchman and creates the socket file.

Issue: #13783

Signed-off-by: Jakub Sokołowski <jakub@status.im>
@jakubgs
Copy link
Member Author

jakubgs commented Aug 10, 2022

Fixed in: #13784

@jakubgs jakubgs closed this as completed Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants