
Fix double build of kernel.o #281

Merged 1 commit on May 20, 2021

Conversation

nbdd0121
Member

Reported by @TheSven73

@ksquirrel
Member

Review of 9fd4e5778b83:

  • ✔️ Commit 9fd4e57: Looks fine!

@ojeda
Member

ojeda commented May 20, 2021

Can you please add a couple lines to the commit message about why it was getting built twice?

@nbdd0121
Member Author

I am not a Kbuild expert, so I don't know why it's built twice.

But I read https://www.kernel.org/doc/html/latest/kbuild/makefiles.html, which says the target has to be listed in $(targets); to get there, it needs to be added to one of obj-y/m, lib-y/m, extra-y/m, always-y/m. Since it shouldn't get linked, it can't go into obj-y/m or lib-y/m, so I determined that extra-y is probably the best place.
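
The fix described here can be sketched as a Kbuild fragment. This is illustrative only: the real rules live in rust/Makefile, and the config symbol name below is an assumption, not taken from the actual tree.

```make
# Sketch: build_error.o must appear in $(targets) so Kbuild's if_changed
# machinery can track it; otherwise it (and everything depending on it)
# is rebuilt on every make invocation.
ifdef CONFIG_RUST_BUILD_ASSERT_ALLOW
obj-y   += build_error.o    # linked in: runtime-panic fallback is used
else
extra-y += build_error.o    # compiled for checking, but never linked in
endif
```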

@TheSven73
Collaborator

My apologies for missing the discussion in #257.

Gary is clearly doing some great work there, but I wonder what the use-case is for allowing build-time errors to optionally panic or oops the kernel via defconfig?

My understanding is that kernel runtime panics (=kill the whole kernel) or oopses (=kill the current process, hopefully, or perhaps corrupt the whole system instead) are considered toxic. Linus holds extremely strong views on this.

In the handful of places where an oops or panic does happen in the kernel, it's always done explicitly via BUG()/panic(). The thought that Rust code may trigger these implicitly, e.g. by a hard-coded string not fitting its bounds, may provoke some very strong feelings upstream.

I'm also a little bit scared by the Kconfig defaults for this. The choice between build time error / runtime panic seems to depend on the Rust optimization level. That in turn depends on the C optimization level. It's not intuitively clear (to me) that all existing defconfigs will default to "build time". Danger!

There are of course many dangerous pre-existing defconfig options. But they all state "say N if unsure", clearly default to N, and explain why someone would want to switch them on.

Perhaps this has been discussed before with Greg or other senior kernel people, and my understanding is out of date? If so, I apologize in advance for the strong opinions expressed here :)

@nbdd0121
Member Author

The reason is to support -C opt-level=0, for which inlining is not turned on. In that case the optimizer may not be able to optimize away the assertion even if it holds. RUST_BUILD_ASSERT_DENY is the default if opt-level is non-zero.

@ojeda
Member

ojeda commented May 20, 2021

@nbdd0121 Ah, that is fine, do not worry -- I am definitely not one either ;) make is building it because build_error.o gets rebuilt (so we actually have 2 double builds going on), and that one rebuilds because of what you mention: we use if_changed.

Perhaps we could avoid building it at all and compile conditionally in kernel but it does not seem worth the complexity. extra is fine.

@TheSven73
Collaborator

Confirmed that the double build no longer happens on my system when this PR is applied.

Put build_error.o in `extra-y` if it's not in `obj-y`
so it's always in `targets`.

Reported-by: Sven Van Asbroeck <thesven73@gmail.com>
Signed-off-by: Gary Guo <gary@garyguo.net>
@ksquirrel
Member

Review of 5972f3e86266:

  • ✔️ Commit 5972f3e: Looks fine!

@ojeda
Member

ojeda commented May 20, 2021

@TheSven73 As Gary said, this was needed to have both the build assert + keep the "debug" builds. So people should not be hitting this, and the options are in the Hacking menu (so "normal" users should not touch it either).

Even if someone manages to compile them by mistake, they should never happen at runtime, since they are normally build-time errors (and therefore the asserted conditions should always hold).

Now, it is true that it is not trivial to understand (I would perhaps remove the Kconfig options and do it automatically exclusively based on the optimization level, and add a comment inside the opt-level 0 option).

It is also true I know some people in the kernel may prefer to drop support for non-optimized builds. So we could simply drop it and simplify and have an easier life :)

Since Gary had already done it, I took the PR as it was, but this can of course still be discussed.

@ojeda ojeda merged commit ffdacf7 into Rust-for-Linux:rust May 20, 2021
@TheSven73
Collaborator

TheSven73 commented May 20, 2021

It is also true I know some people in the kernel may prefer to drop support for non-optimized builds. So we could simply drop it and simplify and have an easier life :)

Powerful build-time assertions (via const fn) are one of Rust's awesome advantages. Implicitly triggered panics and oopses, however, are the kernel community's kryptonite!
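
The build-time variant is easy to illustrate with Rust's const evaluation. This is a minimal sketch, not the kernel's actual build_assert machinery; the string and bound are made up:

```rust
// A compile-time check via const evaluation: if the condition is false,
// the *build* fails, and nothing can ever trip at runtime.
const MESSAGE: &str = "rust-for-linux";
const _: () = assert!(MESSAGE.len() <= 64);

fn main() {
    // Reaching this point proves the assertion held at compile time.
    println!("ok: {} fits its bound", MESSAGE);
}
```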

Strong runtime correctness checks make perfect sense in academic research projects, or during very specific phases of development. But if you have thousands of Linux devices out on a manufacturing production line, and the customer calls and says "each time we do this thing in our line, your fancy device freezes, and our whole line is blocked for 5 minutes until it restarts!" then one can get quite upset at the person who felt that a (detected) string overflow in some unimportant driver is serious enough to blow up the whole device... Been there, done that. Linux's success is tightly bound to how it behaves in the real world, and mailing list opinions reflect that.

So in this case, I think less might be more. If build time assertions don't work for certain optimization levels, perhaps we should get rid of those levels. @nbdd0121 I think you've done some great work in this PR. Would you be open to a PR that got rid of these dangerous optimization levels?

So people should not be hitting this, and the options are in the Hacking menu (so "normal" users should not touch it either).

Stranger things have happened! It's good to be paranoid sometimes :) Especially because we have no hope of testing each and every defconfig in the tree.

When building the Raspberry Pi defconfig, I noticed that "Overflow Checks" are on by default. I tried the following bit of code:

pr_info!("overflow = {}", 0xffff_u16 + core::hint::black_box(1));

... and it resulted in an oops. Overflows are really tricky in production code. You might not see them for weeks, until some internal counter overflows. But it would be a bit excessive to bring the whole system down for that.
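
If an implicit panic is too heavy a hammer, Rust's explicit integer methods give per-call control over overflow behaviour regardless of the global overflow-checks setting. A small sketch of the alternatives for the 0xffff example above:

```rust
fn main() {
    let x: u16 = 0xffff;

    // Detect: returns None instead of panicking on overflow.
    assert_eq!(x.checked_add(1), None);

    // Wrap: two's-complement wraparound, the release-mode default.
    assert_eq!(x.wrapping_add(1), 0);

    // Saturate: clamp at the type's maximum value.
    assert_eq!(x.saturating_add(1), u16::MAX);

    println!("all overflow-handling variants behaved as expected");
}
```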

@ojeda @nbdd0121 would you be open to marking "Overflow Checks" as dangerous and default N?

@nbdd0121
Member Author

Arithmetic overflow can be a critical error depending on the situation. Combined with unsafe code, it can easily lead to security vulnerabilities. Obvious counters are not in that category. I think it's not too bad to oops the driver.

Ideally it would be configured as a runtime warning instead of an oops, but I am not sure if that is currently possible without tweaking rustc.

@TheSven73
Collaborator

Arithmetic overflow can be a critical error depending on the situation. [...] I think it's not too bad to oops the driver.

I disagree in the strongest possible (respectful) way. And you'll find that the kernel mailing list will disagree even more strongly than I do, and perhaps not as respectfully :)

@ojeda
Member

ojeda commented May 21, 2021

@TheSven73 This is a bit off-topic for the PR -- probably the best place would be a meeting.

Implicitly triggered panics and oopses, however, are the kernel community's kryptonite!

"Implicit" panics/oopses are not the problem, but unneeded ones. The kernel already panics/oopses when you attempt to access the start of the address space, for instance.

Strong runtime correctness checks make perfect sense in academic research projects, or during very specific phases of development.

Not really. It all depends on the system. Like you, I have worked on systems that bleed money when stopped, with people on-call and/or operators monitoring them 24/7, and we had a lot of strong runtime correctness checks. Sometimes even independent systems cross-checking results. Plus hardware fail-safes for the safety-critical bits etc.

in some unimportant driver is serious enough to blow up the whole device...

When this was raised, one plan/idea was to kill the current thread, not the entire device.

But anyway, even that depends on the particular case and the particular system (and I would prefer it to be configurable). For instance, I would like my desktop to not die due to a faulty driver so that I can save my work, since it is most likely just that: a bug. But for a public facing server? Surely I prefer it dies if someone managed to find a way to trigger an OOB store.

Linux's success is tightly bound to how it behaves in the real world, and mailing list opinions reflect that.

That is not my reading of the mailing list nor of Linus Torvalds. He complained about unneeded panics, not panics in general.

You could interpret one of his lines ("or due to anything else") as complaining about all panics, but that does not really make sense. There are, after all, cases where the kernel calls panic()/BUG() for similar reasons.

So instead I believe he has good judgement and did not intend to imply every single panic including detected OOB accesses etc.

I think you've done some great work in this PR. Would you be open to a PR that got rid of these dangerous optimization levels?

They are not "dangerous". They may be e.g. not worth the complexity cost, but why "dangerous"?

Stranger things have happened! It's good to be paranoid sometimes :)

For that to happen, you need to:

  1. Misconfigure a kernel for production, including going into the Hacking menu etc. etc.
  2. Not notice any performance degradation.
  3. Have a build-time assert that somehow fails only at runtime (compiler bug?) or only in some particular configuration (likely a bug in the code to begin with that you just luckily discovered :).
  4. Hit it in a path where you do not notice right away nor in testing, e.g. outside the happy path and in an edge case.

Sure, it can happen. But if you are building a system that requires that level of paranoia, probably you should not have been reconfiguring it, nor running your own built kernels, nor even running the Linux kernel at all.

Especially because we have no hope of testing each and every defconfig in the tree.

I am not sure I understand. The overflow check configuration is on by default, so it should be always being tested for any use of a particular driver. Why would we need to test the defconfigs for that?

(Related: what happened with the defconfigs discussion? Years ago Linus was saying he wanted to remove all of them...)

Overflows are really tricky in production code. You might not see them for weeks, until some internal counter overflows. But it would be a bit excessive to bring the whole system down for that.

It all depends what the counter is counting. If it is something that can silently destroy your production line, or your data, or open your server to others, etc. it is not so clear.

Of course, killing the thread or the device will also disrupt your current data or your production line. What is the best trade-off depends. The option is there for those that want it one way or the other. We could take it out of the Hacking menu if people want it properly supported.

would you be open to marking "Overflow Checks" as dangerous and default N?

If someone overrides the kernel security people, sure, but otherwise no, sorry: I actually enabled that because Kees suggested it. See commit f6fc80f ("Enable overflow checks by default and for release builds in the CI").

@TheSven73
Collaborator

TheSven73 commented May 21, 2021

Thank you @ojeda for the constructive discussion and your lucid arguments, I really appreciate it!

On the “build assertions as runtime assertions” issue, I think we are actually 98% in agreement. Yes, oopses/panics have their place, either implicit (e.g. invalid memory accesses) or explicit (via BUG()/panic()). And strong runtime checks are crucially important, the right ones in the right places that is, no question about it.

I would not want to over-interpret Linus’s words. As you correctly bring up, his prohibition is against “unneeded” panics and oopses. Do we have a clear and unambiguous use-case for turning Rust build time assertions into runtime assertions (oopses)? If not, these oopses are “unneeded”, and I believe we should consider removing the Kconfig options that activate them.

For the Kconfig rules, I think you are correct here - “build assertions as runtime assertions” can never happen implicitly. I mistakenly thought that a change in C optimization level could implicitly switch these on, but that’s not possible, because RUST_OPT_LEVEL_SIMILAR_AS_CHOSEN_FOR_C can never select Rust opt level 0. So that’s fine.

If Kees suggested turning on “Overflow Checks” by default, then I certainly would not want to argue with that. I am very interested however in learning the motivation behind the decision, to improve my own understanding. Does this originate from a mailing list discussion? All I can find is a post from Kees where he believes that the default overflow behaviour should be saturate-with-WARN?

@bjorn3
Member

bjorn3 commented May 21, 2021

Do we have a clear and unambiguous use-case for turning Rust build time assertions into runtime assertions (oopses)?

Disabling optimizations for a kernel module for easier debugging. When optimizations are disabled, the build-time assertions will always fail, because the optimizations that normally remove the reference to the non-existent symbol (used to implement build-time assertions) don't run.
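
The fallback behaviour described here can be sketched as follows. Note this is a hypothetical stand-in, not the kernel's actual macro: the real implementation references an undefined symbol that the optimizer must eliminate, whereas this sketch shows the runtime-panic mode that remains when the proof cannot happen at compile time.

```rust
// Sketch of the RUST_BUILD_ASSERT_ALLOW fallback: at opt-level 0 the
// compiler cannot fold the condition away, so the check degrades from
// a build/link-time error into a runtime panic.
#[inline(always)]
fn build_assert_sketch(cond: bool) {
    if !cond {
        panic!("build assertion failed at runtime (unoptimized fallback)");
    }
}

fn main() {
    // Provably true: with optimizations, the whole call folds away;
    // without them, it still passes, just checked at runtime.
    build_assert_sketch(core::mem::size_of::<u64>() == 8);
    println!("assertion held");
}
```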

@ojeda
Member

ojeda commented May 21, 2021

@TheSven73 No, thanks to you for raising these things! I am always afraid that any answer like mine above can be seen as "aggressive" and make people not want to raise questions. So please keep raising questions, I appreciate a lot that you question how things are done and give your perspective. I encourage everyone to do so. (Of course, also excuse me/us if any reply etc. seems like "over the top").

@ojeda
Member

ojeda commented May 21, 2021

(let us follow on the other topic you created)

@TheSven73
Collaborator

@ojeda I am always conscious of the same issue for myself. I tend to have strong opinions, which could sometimes be misinterpreted as "aggressive". Nothing could be further from the truth. Thank you for being open to constructive discussion! I did not interpret your reply as over-the-top or aggressive at all.

(yes let's go over to the other topic)

@nbdd0121 nbdd0121 deleted the link-time-panic branch May 26, 2021 02:25