Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invoking bazel with a different PATH re-fetches pip_repository #2551

Closed
Ubehebe opened this issue Jan 8, 2025 · 6 comments · Fixed by #2552
Closed

Invoking bazel with a different PATH re-fetches pip_repository #2551

Ubehebe opened this issue Jan 8, 2025 · 6 comments · Fixed by #2552

Comments

@Ubehebe
Copy link
Contributor

Ubehebe commented Jan 8, 2025

🐞 bug report

Forked from #2283 (comment)

Affected Rule

The issue is caused by the rule: pip_repository

Is this a regression?

Yes, the previous version in which this bug was not present was: ~0.32.2

Description

I run bazel build ... from a git pre-commit hook as part of my development workflow. Although it is not well-documented, git modifies the PATH env var when running hooks. (You can see this for yourself by putting echo "$PATH" in a pre-commit hook, then comparing the output of git hook run pre-commit versus running the hook script directly.)

Meanwhile, since #2000, the pip_repository rule has internal code that calls repository_ctx.getenv("PATH"): https://github.com/bazelbuild/rules_python/blob/main/python/private/repo_utils.bzl#L271. repository_ctx.getenv is documented:

any change to the value of the variable named by name will cause this repository to be re-fetched.

Putting the git and rules_python behaviors together causes the bug. Doing a bazel build ... directly and then doing it indirectly via git hook run pre-commit re-fetches the pip_repository. (The reverse order also works.)

🔬 Minimal Reproduction

https://github.com/Ubehebe/rules_python_2283_mvce

🔥 Exception or Error

Console messages printed when pip_repository is unexpectedly re-fetched. Example:

Collecting pandas==2.1.4 (from -r /var/folders/28/qqqxx4y55dz08x1bzcc9z7jm0000gn/T/tmpyann5bc8 (line 1))

🌍 Your Environment

Operating System:

$ uname -a
Darwin 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:03:11 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6020 arm64

Output of bazel version:

Bazelisk version: development
Build label: 8.0.0
Build target: @@//src/main/java/com/google/devtools/build/lib/bazel:BazelServer
Build time: Mon Dec 09 18:02:00 2024 (1733767320)
Build timestamp: 1733767320
Build timestamp as int: 1733767320

Rules_python version:

1.0.0

@Ubehebe
Copy link
Contributor Author

Ubehebe commented Jan 8, 2025

Here is the function that causes the re-fetch: https://github.com/bazelbuild/rules_python/blob/main/python/private/repo_utils.bzl#L256-L282

def _which_unchecked(mrctx, binary_name):
    """Tests to see if a binary exists.

    This is also watch the `PATH` environment variable.

    Args:
        binary_name: name of the binary to find.
        mrctx: repository context.

    Returns:
        `struct` with attributes:
        * `binary`: `repository_ctx.Path`
        * `describe_failure`: `Callable | None`; takes no args. If the
          binary couldn't be found, provides a detailed error description.
    """
    path = _getenv(mrctx, "PATH", "")
    binary = mrctx.which(binary_name)
    if binary:
        _watch(mrctx, binary)
        describe_failure = None
    else:
        describe_failure = lambda: _which_describe_failure(binary_name, path)

    return struct(
        binary = binary,
        describe_failure = describe_failure,
    )

The fix is simple. The result of the _getenv(mrctx, "PATH", "") call is only used inside the failure branch, so move the call there. I've verified this prevents re-fetching pip_repository when invoking bazel build ... directly vs. a git hook.

It's possible I'm missing the larger purpose of this code. Is it desirable to watch PATH even when the binary is resolved? cc @rickeylev, who wrote it in #2000.

@Ubehebe
Copy link
Contributor Author

Ubehebe commented Jan 8, 2025

It's reasonable to ask what the expected behavior of Bazel (rulesets, etc.) should be when PATH changes. Should the various Bazel caches be sensitive to PATH changes or not? There's a flag for exactly this problem: https://bazel.build/reference/command-line-reference#flag--incompatible_strict_action_env:

If true, Bazel uses an environment with a static value for PATH

I have to set this flag in my .bazelrc to preserve the action cache between regular builds and git hook builds. (It doesn't affect repository rules.)

Years ago, the Bazel team tried to enable this flag by default, but it was reverted and they haven't tried since: bazelbuild/bazel#7026. So the expected behavior is unclear.

As anecdotal evidence, I use lots of other rulesets (java, kotlin, js, go), and I've never observed repository rule re-fetching when PATH changes.

@Ubehebe
Copy link
Contributor Author

Ubehebe commented Jan 8, 2025

Sent #2552 as a potential fix. Maintainers, happy to discuss on this issue or the PR. I'm not sure how/whether to write a test, or whether the change should be considered breaking.

@rickeylev
Copy link
Collaborator

Thanks for the well researched and detailed report!

Although it is not well-documented, git modifies the PATH env var when running hooks.

Huh. TIL. I wonder if this explains another report of frequent invalidation that has been hard to figure out.

why call getenv(path) before calling which()? Is it desirable to watch PATH even when the binary is resolved?

I can't remember if this is just sloppy coding on my part, or if I intentionally did it. I rewrote that code quite a few times over a long time and lost track of the what/why of many changes. I'm more inclined to chalk this up to sloppy coding. I'm generally pretty good about adding comments when for a non-obvious side-effect type of thing like this. Assigning the variable in an outer block when it's only used in an inner block is also suspicious (I avoid doing that).

If I did do it intentionally, then it'd be because...uhhhh...probably under the assumption that which() is based upon PATH, so changes to PATH should invalidate which() results. This makes sense for setting up a local toolchain, where a main use case is to use the environment to find python and figure things out. It does seem overkill for every case, though -- touch, chmod, and id are some of the other tools that are looked up, and looking those up again if PATH changes seems unwarranted. (install_name_tool is another; not sure if that should also watch path or not).

So yes, I think it generally makes sense to not watch PATH for every which() call. I think #2552 makes sense.

It's reasonable to ask what the expected behavior of Bazel (rulesets, etc.) should be when PATH changes.

Yeah, this is a tricky question in light of git's behavior to modify PATH. For a local toolchain, it makes sense to watch PATH, however, this would cause the same unnecessary invalidation if one were to run a python tool in a git hook (a pretty reasonable thing). Hrmph.

I think in the abstract, rules don't care if PATH itself changes, they just care if the program they're looking for changed. Which, translated, is more like asking, "watch for if $program shows up in an earlier PATH directory". I'm not sure how, or if, we could approximate that on the Starlark side without watching PATH. It is probably an optimization that Bazel itself would need to implement.

@Ubehebe
Copy link
Contributor Author

Ubehebe commented Jan 8, 2025

Although it is not well-documented, git modifies the PATH env var when running hooks.

This is hard to google for. Here's my best reference: https://github.com/git/git/blob/master/exec-cmd.c#L333-L350:

void setup_path(void)
{
	const char *exec_path = git_exec_path();
	const char *old_path = getenv("PATH");
	struct strbuf new_path = STRBUF_INIT;

	git_set_exec_path(exec_path);
	add_path(&new_path, exec_path);

	if (old_path)
		strbuf_addstr(&new_path, old_path);
	else
		strbuf_addstr(&new_path, _PATH_DEFPATH);

	setenv("PATH", new_path.buf, 1);

	strbuf_release(&new_path);
}

If I'm reading this correctly, git prepends GIT_EXEC_PATH to PATH when running any command. If it later forks a process (which it presumably does when running a hook), the child process will see the modified PATH.

@aignas
Copy link
Collaborator

aignas commented Jan 8, 2025

Thank you @Ubehebe for taking time to document this. This is an amazing bug report.

Ubehebe added a commit to Ubehebe/rules_python that referenced this issue Jan 8, 2025
Fixes bazelbuild#2551.

Currently, the _which_unchecked helper unconditionally watches the
`PATH` env var via repository_ctx.getenv. getenv is documented
https://bazel.build/rules/lib/builtins/repository_ctx#getenv:

> any change to the value of the variable named by name will cause
> this repository to be re-fetched.

Thus, any change to `PATH` will cause any repository rule that
transitively calls _which_unchecked to be re-fetched. This includes
python_repository and whl_library.

There are reasonable development workflows that modify `PATH`. In
particular, when git runs a hook, it adds the value of `GIT_EXEC_PATH`
to `PATH` before invoking the hook. If the hook invokes bazel (for
example, a pre-commit hook running `bazel build ...`), it will cause
the Python repository rules to be re-fetched.

This commit lowers the repository_ctx.getenv("PATH") call to its only
use site in _which_unchecked, which happens to be a failure case (when
the binary is not found). This allows the success case to not watch
`PATH`, and therefore not to re-fetch the repository rule when it
changes.
github-merge-queue bot pushed a commit that referenced this issue Jan 9, 2025
Currently, the _which_unchecked helper unconditionally watches the
`PATH` env var via repository_ctx.getenv. getenv is documented
https://bazel.build/rules/lib/builtins/repository_ctx#getenv:

> any change to the value of the variable named by name will cause this
repository to be re-fetched.

Thus, any change to `PATH` will cause any repository rule that
transitively calls _which_unchecked to be re-fetched. This includes
python_repository and whl_library.

There are reasonable development workflows that modify `PATH`. In
particular, when git runs a hook, it adds the value of `GIT_EXEC_PATH`
to `PATH` before invoking the hook. If the hook invokes bazel (for
example, a pre-commit hook running `bazel build ...`), it will cause the
Python repository rules to be re-fetched.

This commit lowers the repository_ctx.getenv("PATH") call to its only
use site in _which_unchecked, which happens to be a failure case (when
the binary is not found). This allows the success case to not watch
`PATH`, and therefore not to re-fetch the repository rule when it
changes.

Fixes #2551.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants