-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Output directory hash differs if the same build settings are applied via transition vs. command line #12171
Comments
cc @juliexxia, this is a regression we found in Bazel 3.5 in the configuration workstream, and I think you're the right person to triage this? |
Yep! apologies, I'm a little confused by the issue: historically, starlark transitions always updated the output directory. What 6360ff7 did was make it so that this only happened in cases where options were actually given new values. The output directory of starlark transitioned targets being different isn't new so I'm confused about what the regression is here. Are you saying the output directory wasn't being updated in 3.2.0 and now is?
Are you saying this is undesired behavior? |
FWIW, you're right, the repro above should not have an updated transition directory fragment because as you say, it's the same configuration. But not having the updated transition directory fragment is new as of 6360ff7. |
So I think I can sum up the issue as: "bazel is using multiple output directories for configurations that are otherwise identical". This means targets are getting configured & built multiple times for identical configurations, and can mess up things like rules that try to do IDE configuration, because targets that should be unique in fact are showing up under multiple output directories. This makes starlark configurations path dependent, in a sense, where the incoming configuration can have an effect on the output configuration's config hash. I don't think this is intentional? My expectation as a rule author is that "identical" configurations should always map to the config hash, and the set of |
Yep! This all makes sense and I agree. Before 6360ff7, this wasn't the case, we were hashing all the values that were set by transitions regardless of they were the same value as they came into the transition. After 6360ff7, what you're saying should hold true as we slimmed down the group of post-transition configurations that get an updated output directory to those where the transition actually changed the value of an option (by comparing the output value to the input value). What I'm confused about is the timeline/regression. Starlark transitions have historically always updated the output directory and only recently, with 6360ff7, have started not always updating the output directory, only doing so when a flag is actually changed. So I'm confused about how you weren't seeing updated output directory paths before 6360ff7 and are seeing them now since it should be exactly the opposite. I wanted to clarify that's indeed what you're seeing so I can dig further. |
I think we maybe were seeing updating output directory paths, but they were being updated to the same value? Because all the outputs were being taken into consideration? Again, pure speculation, but I feel like what needs to be done is look at the entire output configuration diffed against the "default" configuration, rather than only considering the diff of transition input/output flags against the incoming configuration? Not sure if that makes any sense.
From looking at the diff, I have a hunch that the change from 6360ff7#diff-01af620735f7a29048977650332a354cR415-R416 to only looking at inputs to looking at inputs and outputs might be the trigger? |
That's the purpose of the "affected by starlark transition" bit. So the way the logic works:
The key here is that we maintain a list of affected options over an entire build and we only update the transition directory name fragment when a transition is performed. So given the following graph structure: A If you build A with --foo=bar, then in theory* you get the following configured targets:
All this does is give the input dict to the transition access to the default values of the output settings. Build settings that are set to their default value are not stored in the configuration (so that configurations with non-set build settings and explicitly-set-to-default build settings look the same) so if we don't have a recorded value for the build setting in the configuration, we need to know what the default value of the setting is so we can either record the new value or recognize the new value as the default and continue to ignore it. |
I said in theory* in the previous post because it sounds like you're getting different results. But again, need some clarity on what you're seeing. |
What would help provide that clarity? Given I have the sort of single-file repro, I'm happy to do whatever debugging / logging you think would be helpful |
Do you mind verifying for me that you're running Bazel 3.5.0? Is it possible to run it with Bazel 3.5.0 proper and see if you get the same results? I notice you're running bazelisk - are you setting the bazel version anywhere? See: https://github.com/bazelbuild/bazelisk#how-does-bazelisk-know-which-bazel-version-to-run I'm running the following repro (adapted from yours) with regular bazel 3.5.1 and I'm not seeing the same results you are:
|
Another helpful verification would be at what bazel version were you not seeing this behavior because I would expect this behavior to be in all bazel versions before 3.5.0 |
https://github.com/segiddins/bazel-config-multiple-hashes.git shows this is a regression between 3.4.0 and 3.5.0 |
There seem to be some build failures In that repro. Where are the config hashes that you use at the end coming from? I don't see them show up anytime before the config command: "bazel config 1908601e8157a3765e39ede2fc98d23c9c1be24080e5589fc71fa6a15d7d9ff1 28e83bd7c518774a6378a7ee5024c969ab10fdacb062577d0f528bb70b3e6917" Can you run the repro I wrote above? It's smaller and should give a clear signal. |
The build failures are intentional -- that's how I grab the pair of configs to diff. I or @amberdixon will try to run your repro. |
Hi Julie! Sorry about the delayed response here. Here are my responses to your comments and my notes about the repro case.
That's fine that the output directory gets updated. But it seems like the parameters used to compute the new directory hash should be a function of all the build settings in the new configuration, not just the changed settings. If you only look at the changed settings on the transition, then different hashes will end up getting used for targets that are actually configured with the same settings.
The intent of 6360ff7 is to not change the output directory hash on a no-op. That is perfectly fine. However the bug (we think) is that this commit computes the hash from the changed build settings, not from all of them. This becomes a problem with one or more build settings changes. Therefore, this commit is not just affecting no-op transitions .. it's affecting all transitions. You presented an example with these targets and transitions:
For D, the transition directory name fragment will only be hash_of("zoom=zop"). This is beacuse foo=blargh has not changed on the transition between C -> D. Let's consider the following scenario. In your dependency graph, let's say there is another way of getting to D from E, as follows: E In either case, D is configured with --zoom=zop, --foo=blargh. But we see that transitioning from C->D vs. from E->D results in a different output directory has for D. E: transition directory name fragment = null, affected by starlark transition = {}
Perhaps the problem here is that the code doesn't actually know what the "default" settings are. Therefore it regards any unchanged setting as a "default." Confirmed we are using Bazel 3.5.0 with bazelisk, as follows:
In your repro case, you ran Subsequently, I ran But now unfortunately we find that there are different hashes being used for FW, even though in either case, it's being built with the same CPU flag. I hope that helps illustrate the problem! On my end, I will try to repro this problem on an older bazel version and also on 3.5.1. |
This isn't true because the "affected by starlark transition" is a list that is maintained and grown over the entire build. This list is what's used to determine what options will be hashed to form the output directory. So what happens in the code is that D should have the same output directory regardless of it was built via E or C (assuming they set the same values) because in both cases, the list of "affected by starlark transition" contains both Here's the code that adds the newly updated options to the running list of options affected by starlark options Here's the code that runs over the entire running list of changed options, not just the newly updated options:
What is the unexpected behavior here? The behavior of the output directory hash changing without --cpu=ios_x86_64 set on the command line and not changing with --cpu=ios_x86_64 set on the command line is what I would expect to happen. |
Ah yes, you are right, the "affected by starlark transition" does seem to aggregate settings that are affected by transitions. However, I still think there are some problems with the current implementation. In my company's repo, I have a scenario where all the build settings for a target are the same, regardless of where we traverse into it from: AppTarget -> D In each case, D ends up with the same set of build settings. And yet 3 different directory hashes are generated and the "affected by starlark transition" list looks different in each case. I'll work on providing you with a repro case for this scenario.
In the original repro case I presented, we see the FW is being built with the same value for cpu. We see that there is no diff in /tmp/a and /tmp/b, other than which fields were affected by starlark transition. At the end of the day, the built artifact is being built with the same settings and therefore should map to the same output directory hash (I think). |
Gotcha, that doesn't sound right. Thanks in advance for the repro. That would be very helpful. Is it possible the flags on the original command line/blazerc is different in the different cases? That would also lead to different directory hashes even if D ended up in the same configuration across the builds.
The last sentence is correct! Thanks for clarifying. Using your repro I was unable to get the same results from my end, I got the expected same output directory hash. Not sure what's going on but hopefully a further repro will help us get to the bottom of it. It would be helpful if you could try setting up and running the series of command I posted above as well (in #12171 (comment)) |
I have tried the following on bazel versions 3.5.0 and 3.5.1, but am seeing the same behavior.
I think this particular issue has to do with the way the AppleCrosstoolTransition from the AppleBinary rule in the bazel repo gets applied. (There is different from the issue I reported in the initial description in the bug report, where the directory output hash differs, depending on whether or not the build setting was applied via the command line or via a transition.) Here's how to repro the issue with AppleCrosstoolTransition problem:
In this case, you see that D is hashing to 3 different directory hashes:
Notice that D has 3 different configs. For the config bb37f927e397a8b32cb6c9b78a0ff679ed31cb2fe60c868e9034d1c1337b3ece, D is getting built via an ongoing edge transition from TestTarget, which applies the build setting "cpu." For the config e00315f4608662cd39dfa498f714a535dca6a0152986056deab77e9fc1fc2511 however, D is getting built via AppTarget, which applies AppleCrosstoolTransition. In either case, the build settings for D do not differ. But in the first case, the "affected by starlark transition" array includes
This is a problem, since in the second case, the AppleCrossTool transition is actually updating the
In the example above, we don't have different command line or bazelrc setting ... but isn't it a bug that would result in different directory hashes, even if D ended up in the same configuration across builds?
I ran your repro steps. I saw the same result as you. However, in order to repro this issue, I think you would need to run the following steps. First, run everything with the cpu command line flag, just like you did. Then try running it without the command line flag. You will notice that FW is now pointing to another output directory hash ... despite the fact that the build settings for FW have not changed (in the second case, the cpu build setting for FW gets applied from the transition, not the command line).
|
Hi @juliexxia ! I was wondering if my latest comment was helpful. I was hoping that the bazel team could address the issue where different directory hashes are used for artifacts built with the same set of build settings. We do see that different hashes are getting used when the build settings are applied from the command line vs. from a transition. It has necessitated a number of workarounds on our end. Thanks! |
Hi! Apologies it's been a busy couple of weeks. I'll read through you last couple of responses and get back to you in the next couple of days. |
AppleCrosstoolTransition is a native transition defined in java code: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/rules/objc/AppleCrosstoolTransition.java;l=39?q=AppleCrosstoolTransition&ss=bazel. The
It's not ideal. Ideally we would have perfectly deduplicated outputs, but it's not technically a bug. Bazel doesn't actually guarantee any output directory structure; how we do output directories has changed over time and will continue to change. When it's difficult to get the de-duping just right, we prefer to play it safe and create new output paths instead of risking clobbering existing artifacts. For example, we use the
But the build settings for FW have changed in the second case because the default value of --cpu is not ios_x86_64, it's one of the values here depending on your OS: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/analysis/config/AutoCpuConverter.java;l=35;drc=a5a4f47295ccb6cd81cf8732772c849b1a8b9d58. Let's say the default value of --cpu is detected as "k8" for examples sake. In the bazelisk cquery --cpu=ios_x86_64 :all case, FW and other have the same configuration hash because the transition sets --cpu to to same value as the command line - we register that as a no change and don't update anything. Both FW and other are built with --cpu=ios_x86_64. In the ~/sample-bug-report-bazel bazelisk cquery :all case, FW and other have different configuration hashes because for :FW, --cpu=ios_x86_64 and for :other, --cpu=k8. The key here is |
I think the motivating problem we're facing is that, before this change, it was possible for a starlark transition to force a target into a canonical configuration, so when it appeared in the configured target graph multiple times, it would always be a single target. That meant, for example, you wouldn't try to link two (identical) static libraries -- which is super important when building native code! Now, because of the native options, there's no way to force a single configuration for a target -- either you get a consistent value for the native option (and an inconsistent I think that gets at the crux of the issue -- with this change, there's no way for us to write custom starlark transitions that interplay safely with the native transitions. |
A couple questions - Does the bug title still accurately represent what we're talking about here? If not, can it be updated?
Which change are you talking about specifically? Can you give a minimal example of the multiple configured targets that used to all have the same configuration and now don't? Historically it's never been possible for starlark transitions to put a configured target into any configuration that a native transition could but it in. Historically, |
Sorry, I didn't mean put in the same configuration as a native transition. I meant it was possible to force a target with a star lark transition into a single (given a reasonable universe of incoming configurations) output configuration. Eg something that looked like a constant transition (same output hash given some restricted set of inputs) would lead to the same output configuration. This is no longer true because now only changed settings get taken into consideration when computing the directory fragment, meaning a constant transition can still result in multiple fragments, depending on the input configurations. @amberdixon To correct me if I'm wrong |
Thank you, that's clarifying. Is there a scenario in a single build where this is the case? For example:
it would be helpful to have a repro and command line that shows this happening. |
Hey all, I think I've finally wrangled a small repro of the situation you're discussing and a fix is on the way. Stay tuned. |
This transtion when used in conjunction with rules_apple's transition causes `apple_split_cpu` to be 100% removed. This PR propagates it to ensure that `swift_library`'s have the same configuration from the CLI. This is the last known instance of something that looks like bazelbuild/bazel#12171 that I could find Testing - it's not clear how to exactly test this. The analysis test mentioned in (#188) will have some benefit but this is mainly an issue with CLI invocations. For now, please test on the CLI ``` bazel clean bazel build -s tests/ios/app:AppWithSelectableCopts tests/ios/app:SwiftLib --apple_platform_type=ios find bazel-out/ -name \*.swiftmodule ```
This test asserts that transitive dependencies have identical outputs for different transition paths. In particular, a rules_apple ios_application and an a apple_framework that share a swift_library, :SwiftLib_swift. This test ensures that the actions in both builds have functionally equal transitions applied by normalizing their output directories into a set. For instance these tests will fail if there is any delta and requires both: - adding apple_common.multi_arch_split to apple_framework.deps - #188 - the transition yields the same result when used w/rules_apple - #196 Note: The gist of Bazel's configuration resolver is that it will apply relevant transitions to keys that are used by a given action. e.g. ios_multi_cpus. src/main/java/com/google/devtools/build/lib/analysis/config/ConfigurationResolver.java In order to get the same configuration for a rule, a given transition has to produce the same values for dependent keys for all possible combinations of edges in a given build.
I spent some time looking at this issue in the context of rules_ios and it looks like there are a few cases that result in "different transitions on the CLI". rules_ios VS rules_apple transition deltasFirst different transitions were applied differently when
There is a test case in this PR bazel-ios/rules_ios#196 Redundant builds of ios_application deps due to no transitionsBuilding a dependency of an |
* Pass through apple_split_cpu This transtion when used in conjunction with rules_apple's transition causes `apple_split_cpu` to be 100% removed. This PR propagates it to ensure that `swift_library`'s have the same configuration from the CLI. This is the last known instance of something that looks like bazelbuild/bazel#12171 that I could find Testing - it's not clear how to exactly test this. The analysis test mentioned in (#188) will have some benefit but this is mainly an issue with CLI invocations. For now, please test on the CLI ``` bazel clean bazel build -s tests/ios/app:AppWithSelectableCopts tests/ios/app:SwiftLib --apple_platform_type=ios find bazel-out/ -name \*.swiftmodule ``` * Analysis test for bazelbuild/bazel#12171 This test asserts that transitive dependencies have identical outputs for different transition paths. In particular, a rules_apple ios_application and an a apple_framework that share a swift_library, :SwiftLib_swift. This test ensures that the actions in both builds have functionally equal transitions applied by normalizing their output directories into a set. For instance these tests will fail if there is any delta and requires both: - adding apple_common.multi_arch_split to apple_framework.deps - #188 - the transition yields the same result when used w/rules_apple - #196 Note: The gist of Bazel's configuration resolver is that it will apply relevant transitions to keys that are used by a given action. e.g. ios_multi_cpus. src/main/java/com/google/devtools/build/lib/analysis/config/ConfigurationResolver.java In order to get the same configuration for a rule, a given transition has to produce the same values for dependent keys for all possible combinations of edges in a given build.
Any news on this issue? I believe I have run into the same issue using my OCaml rules, and unfortunately it may be a show-stopper. The problem in a nutshell:
This means that Tok.cmi and Gramlib__Plexing.cmi were compiled using different Util.cmi files, which we can verify by using a tool (ocamlobjinfo) to inspect the contents - each will have an entry for Util associated with a different hash (so the error message is not quite accurate.) Util gets compiled twice because transition functions create different output directories, even though Util is compiled in only one way. The Bazel problem boils down to a diamond dependency. The primary build target depends (separately} on Tok and Gramlib__Plexing, which both depend on Util. But only Gramlib__Plexing involves transition functions, and I have been unable to find a way to make both Tok and Gramlib__Plexing depend on the same build of Util. I've tried making the transition function reset the configs to their initial state, matching the state in effect when Tok is built, to no avail. Here's my cquery output:
Let me put this in the form of a question: what is the complete set of factors determining the output directory? The "config state" (build settings) obviously; anything else? Some of the OCaml rules involve actions to copy/rename/preprocess source files; could that have an effect? It would be very, very helpful if I could find a way to make transitions work so I don't have to throw away months of work. Thanks, Gregg |
Sorry for the delay, @mobileink. Julie, who's the Bazel resident expert on this stuff, has shifted her focus off Bazel. We're trying to pick up the slack. Check out my comment at #13317 (comment). Does that help? Short story is there's lots we can, and should, and hopefully will (in a timely way) do to improve this. One of Julie's last contributions was to lay out a clear analysis of the status quo and possible improvements. Sorry, I'm skimming through your comment and can't immediately tell if your scenario falls into
Could you summarize your exact transition logic, and how/where it's consumed, a bit more? |
Thanks for the response. FYI I'm hoping to get back to this next week...
…On Wed, Apr 14, 2021 at 4:25 PM Greg ***@***.***> wrote:
Sorry for the delay, @mobileink <https://github.com/mobileink>. Julie,
who's the Bazel resident expert on this stuff, has shifted her focus off
Bazel. We're trying to pick up the slack. Check out my comment at #13317
(comment)
<#13317 (comment)>.
Does that help?
Short story is there's lots we can, and should, and hopefully *will* (in
a timely way) do to improve this. One of Julie's last contributions was to
lay out a clear analysis of the status quo and possible improvements.
Sorry, I'm skimming through your comment and can't immediately tell if
your scenario falls into
1. pure transition inefficiency that the above can improve
2. a dep structure that Bazel doesn't understand means the same target
can be shared in both instances (the diamond scenario you describe). This
would be harder.
Could you summarize your exact transition logic, and how/where it's
consumed, a bit more?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12171 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAES2NG2KXWKNMAHJSNKCMDTIYB5JANCNFSM4RY3VHDA>
.
|
We currently use transition on the "//command_line_option:output directory name" to work around similar issues where different transition sequences are giving different output directory names (thus duplicating actions in D): A build-setting set by B but set back to default value before reaching D gave different hashes for D in the different branches depending on if this build-setting was included in the hash or not. In this case, we use this patch to avoid action conflicts due to different hashes: #13587 |
Hi all. @sdtwigg and I are actively looking into improvements for this problem. I think we all agree that a better implementation would set hashes as a pure diff from a baseline configuration, vs. examining the exact sequence of transitions down the graph and trying to keep their results consistent. In other words, we don't really care how we get to a particular configuration. We ultimately just care how that configuration differs from a baseline. There are two promising candidates for that baseline:
For example, if you're running on a Mac the default value for I lean toward 2) because I think it offers more consistency across builds, which may be helpful for remote execution caching. @juliexxia sketched up some great thoughts about this a while ago. Maybe we could prep up a Google doc to consolidate our thoughts and approach. I'm still unsure how this mixes with native transitions, which have their own independent logic for path names. Ideally we'd merge them all into the same algorithm. But that's a separate project. Tactically, @sdtwigg is focusing on adding a new I also want to highlight an interesting tension in this challenge:
|
FYI for my part: I think #13580 is the solution. |
I'm closing this for #13580. There remain other optimizations to make, as reviewed at #12171 (comment). Let's port over remaining concerns to better-scoped issues. |
@mobileink @jerrymarino can we sync to clarify the nature of your problems from the above conflicts? I'm trying to prioritize improvements as mentioned in my previous comments. Understanding which use cases they would and wouldn't solve would help clarify. |
Sorry for the delayed response. I don't really have much to add beyond my previous message. Basically I need to be able to cancel a transition. So if I have A -> B -> X, and C -> X, with a transition between A and B, I need to be able to pop the transition config state so that B and C depend on the same (sole) build of X. Maybe this has been addressed in the meantime, it's been a while since I dealt with this use case. |
As far as I can tell this may be possible to achieve today, but would definitely be cumbersome. Even if you reset the state after the transition with another transition, your two builds will differ by the value of the In order to get a single build, you would also have to build the previously untransitioned target through a chain of transitions that first change the same set of config settings and then reset them. Since Bazel composes subsequent transitions in a chain, this might only work of you separate them with a rule. All of this would become much easier with the new transition logic mentioned above which would do away with the |
Description of the problem / feature request:
When the output directory hash for the bazel-out subdirectory for generated artifacts is computed on a transition, it is based upon a delta of the build settings before and after the transition. As a result, multiple output directories with different hashes are getting generated for targets that are built with the exact same build settings.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Observe that the transition directory name fragments are different, even though the other build settings are the same.
What operating system are you running Bazel on?
Mac OS 10.15.6.
What's the output of
bazel info release
?release 3.5.0
If
bazel info release
returns "development version" or "(@non-git)", tell us how you built Bazel.It does not include development version.
What's the output of
git remote get-url origin ; git rev-parse master ; git rev-parse HEAD
?I am not running bazel from a local git checkout of the bazel repo. I am using the release build.
Have you found anything relevant by searching the web?
The FunctionTransitionUtil.java file in bazel is intentionally computing the output directory hash only based upon the options that are affected by the starlark transition. It looks like this commit updated bazel to look at the delta of build settings: 6360ff7 cc @juliexxia
Any other information, logs, or outputs that you want to share?
In this branch, run repro.sh to reproduce the problem: https://github.com/segiddins/bazel-config-multiple-hashes.git
My team has confirmed this is a regression since bazel version 3.2.0.
In this PR, I also reproduced the problem: bazel-ios/rules_ios#124
The text was updated successfully, but these errors were encountered: