Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std library thread.rs: undo attempt at NetBSD "find parallelism" #116665

Closed
wants to merge 1 commit into from

Conversation

he32
Copy link
Contributor

@he32 he32 commented Oct 12, 2023

First off, I have it on good authority this code is Just Wrong: on NetBSD, by default a given thread does not have affinity to any specific set of CPUs.

This particular change came with this pull request:

Rust pull request 112226

However, even worse, this code causes a segmentation fault for certain NetBSD target architectures in the "bootstrap" program when building rust natively on those platforms. So far armv7/9.0, powerpc/10.0_BETA, and i386/9.3 all crash with a segmentation fault. However, for some strange reason, this isn't consistent across the board: riscv64/current, amd64/10.0_BETA, aarch64/9.0 and sparc64/10.0_BETA all pass this hurdle. A trivial C reimplementation also doesn't crash on any of these systems, ref. the thread which starts at

NetBSD current-users postings about this issue

but also always prints 0. However, if we get a SEGV running this code, the entire build fails, of course.

So ... while I do not have a full explanation for the SEGVs, this undoes the addition from pull request 112226, and restores the ability to build rust natively on the above flagged-as-problematical platforms.

…ism".

First off, I have it on good authority this code is Just Wrong: by
default a given thread does not have affinity to any specific set
of CPUs.

This particular change came with this pull request:

  rust-lang#112226

However, even worse, this code causes a segmentation fault for certain
NetBSD target architectures in the "bootstrap" program when building
rust natively on those platforms.  So far armv7/9.0, powerpc/10.0_BETA,
and i386/9.3 all crash with a segmentation fault.  However, for some
strange reason, this isn't consistent across the board: riscv64/current,
amd64/10.0_BETA, aarch64/9.0 and sparc64/10.0_BETA all pass this
hurdle.  A trivial C reimplementation also doesn't crash on any of
these systems, ref. the thread which starts at

  https://mail-index.netbsd.org/current-users/2023/10/10/msg044510.html

but also always prints 0.  However, if we get a SEGV running this
code, the entire build fails, of course.

So ... while I do not have a full explanation for the SEGVs, this
undoes the addition from pull request 112226, and restores the ability
to build rust natively on the above flagged-as-problematical platforms.
@rustbot
Copy link
Collaborator

rustbot commented Oct 12, 2023

r? @m-ou-se

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added O-unix Operating system: Unix-like S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 12, 2023
@he32
Copy link
Contributor Author

he32 commented Oct 12, 2023

However, even worse, this code causes a segmentation fault for certain NetBSD target architectures in the "bootstrap" program when building rust natively on those platforms. So far armv7/9.0, powerpc/10.0_BETA, and i386/9.3 all crash with a segmentation fault. However, for some strange reason, this isn't consistent across the board: riscv64/current, amd64/10.0_BETA, aarch64/9.0 and sparc64/10.0_BETA all pass this hurdle.

Thinking a bit more about this, all those targets which pass this hurdle are 64-bit platforms, while all which crash are 32-bit platforms. I'll bet that is at least part of the explanation.

wip-sync pushed a commit to NetBSD/pkgsrc-wip that referenced this pull request Oct 12, 2023
This mirrors the upstream pull request I just filed at

  rust-lang/rust#116665

I have applied this to a 1.72.1 rust checkout, and have re-built
the bootstrap kits for our NetBSD ports with that change applied,
and this commit bumps the bootstrap requirement for 1.73.0 from
1.72.0 to use the newly generated 1.72.1 bootstrap kits for all
the NetBSD targets.

This gets us building again on NetBSD/evbarmv7hf 9.0, NetBSD/macppc
10.0_BETA, and most probably NetBSD/i386 9.3 as well, whereas with
the 1.72.0 bootstrap code, the "bootstrap" rust program would crash
with SEGV on those particular platforms.
@he32
Copy link
Contributor Author

he32 commented Oct 12, 2023

It looks like we found another part of this problem. The problem is that the definition of cpuid_t in

https://github.com/rust-lang/libc/blob/0dbadb4dad68c7017772c4888700d43f1609493c/src/unix/bsd/netbsdlike/netbsd/mod.rs#L13

is wrong for 32-bit ports. 32-bit ports are typically ILP32, i.e. both "int", "long" and "pointer" are 32 bits. And the C definition is

typedef unsigned long   cpuid_t;

So explicitly setting this to a 64-bit type for all platforms is wrong. Will return with a separate request for that later.

he32 added a commit to he32/libc that referenced this pull request Oct 13, 2023
...in particular for 32-bit CPUs / ports, such as 32-bit arm,
i386, and powerpc.

In the C header files on NetBSD, this is defined as

  typedef unsigned long   cpuid_t;

and on ILP32 CPUs, that ends up being a 32-bit quantity.
Defining this as a 64-bit type wrecks havoc on our 32-bit ports
when e.g. _cpuset_isset() is used (as was introduced with 1.72.0),
causing immediate SEGV due to NULL pointer de-reference, as observed
in

  rust-lang/rust#116665

So, instead, define it as ::c_ulong, and let the CPU-specific type
definitions take care of the sizing.
bors added a commit to rust-lang/libc that referenced this pull request Oct 13, 2023
NetBSD's mod.rs: fix cpuid_t definition.

...in particular for 32-bit CPUs / ports, such as 32-bit arm, i386, and powerpc.

In the C header files on NetBSD, this is defined as

  typedef unsigned long   cpuid_t;

and on ILP32 CPUs, that ends up being a 32-bit quantity. Defining this as a 64-bit type wrecks havoc on our 32-bit ports when e.g. _cpuset_isset() is used (as was introduced with rust 1.72.0), causing immediate SEGV due to NULL pointer de-reference, as observed in

  rust-lang/rust#116665

So, instead, define it as ::c_ulong, and let the CPU-specific type definitions take care of the sizing.
@he32
Copy link
Contributor Author

he32 commented Oct 14, 2023

Hm, on second consideration, perhaps ripping this out is ... not the right thing. While it is true that by default a thread does not have affinity to a particular set of CPUs, the process may have a non-null set.

However, following the application of rust-lang/libc#3386 this code will now most probably get a type error on at least some targets, in particular the 32-bit targets, so that part needs to be fixed if this code is to stay.

Brijeshkrishna pushed a commit to Brijeshkrishna/libc that referenced this pull request Oct 22, 2023
...in particular for 32-bit CPUs / ports, such as 32-bit arm,
i386, and powerpc.

In the C header files on NetBSD, this is defined as

  typedef unsigned long   cpuid_t;

and on ILP32 CPUs, that ends up being a 32-bit quantity.
Defining this as a 64-bit type wrecks havoc on our 32-bit ports
when e.g. _cpuset_isset() is used (as was introduced with 1.72.0),
causing immediate SEGV due to NULL pointer de-reference, as observed
in

  rust-lang/rust#116665

So, instead, define it as ::c_ulong, and let the CPU-specific type
definitions take care of the sizing.
JohnTitor pushed a commit to JohnTitor/libc that referenced this pull request Oct 28, 2023
...in particular for 32-bit CPUs / ports, such as 32-bit arm,
i386, and powerpc.

In the C header files on NetBSD, this is defined as

  typedef unsigned long   cpuid_t;

and on ILP32 CPUs, that ends up being a 32-bit quantity.
Defining this as a 64-bit type wrecks havoc on our 32-bit ports
when e.g. _cpuset_isset() is used (as was introduced with 1.72.0),
causing immediate SEGV due to NULL pointer de-reference, as observed
in

  rust-lang/rust#116665

So, instead, define it as ::c_ulong, and let the CPU-specific type
definitions take care of the sizing.
@m-ou-se
Copy link
Member

m-ou-se commented Nov 28, 2023

Hm, on second consideration, perhaps ripping this out is ... not the right thing.

@he32 What's the status of this? Do you still want this reviewed and merged?

@m-ou-se m-ou-se added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 28, 2023
@he32
Copy link
Contributor Author

he32 commented Nov 28, 2023

Hm, on second consideration, perhaps ripping this out is ... not the right thing.

@he32 What's the status of this? Do you still want this reviewed and merged?

In NetBSD you might administratively run with a smaller-than-max affinity set, although the default is that the affinity set is empty. So ... in principle the code itself isn't wrong, as long as the type for cpuid_t is correct (which it wasn't and which caused a crash on ILP32 platforms).

So ... this patch exists in our packaging system, but will be removed once the correct definition of cpuid_t has percolated through the system (its definition is in the libc crate, and rust carries multiple versions of it inside...).

So ... I think I'll just abandon / cancel this pull request as an upstream request, and we'll deal with the issues in patches in our packaging system as long as is necessary.

@bors
Copy link
Contributor

bors commented Jan 13, 2024

☔ The latest upstream changes (presumably #117285) made this pull request unmergeable. Please resolve the merge conflicts.

@he32
Copy link
Contributor Author

he32 commented Jan 13, 2024

As mentioned above, on second (third, fourth) take, the code isn't wrong, but the sibling fix to cpuid_t was the real culprit. I'll therefore now just close this pull request, since it is agreed this is no longer the right thing to do; an admin can apportion a subset of CPUs to a user, and restricting parallelism to that set is ~ok, even though it's a quite uncommon usage model.

So with that, this is pull request is abandoned / canceled / closed.

@he32 he32 closed this Jan 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-unix Operating system: Unix-like S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants