-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PTX support, take 2 #38559
PTX support, take 2 #38559
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @arielb1 (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
This looks like a good start to me. Thank you so much for coming back to this, right when I'd given up hope 😄 I am pleasantly surprised that core works now. Given that there were previously missing LLVM intrinsics, what changed? Did the LLVM backend just get better, or did something change on our end? (The diff doesn't include any obvious canidates.) An built-in target would be nice eventually but given the stability issue I'd say it's best to punt on it now. Eventually, when we've solved other open issues and are ready to e.g. ship pre-built core binaries for it, we can talk about it again. For now, using this target is already more involved than ordinary targets, so lugging around a JSON file as well isn't a big burden. |
☔ The latest upstream changes (presumably #38314) made this pull request unmergeable. Please resolve the merge conflicts. |
- `--emit=asm --target=nvptx64-nvidia-cuda` can be used to turn a crate into a PTX module (a `.s` file). - intrinsics like `__syncthreads` and `blockIdx.x` are exposed as `"platform-intrinsics"`. - "cabi" has been implemented for the nvptx and nvptx64 architectures. i.e. `extern "C"` works. - a new ABI, `"ptx-kernel"`. That can be used to generate "global" functions. Example: `extern "ptx-kernel" fn kernel() { .. }`. All other functions are "device" functions.
Rebased.
Apparently, it fixed itself. 😄
There's the unstable built-in target option though. It would not only help in this case, but could also be used for the WIP msp430 target. |
Ok discussed during tools triage today, and decision was to merge! @bors: r+ |
📌 Commit aac5ff7 has been approved by |
⌛ Testing commit aac5ff7 with merge 90718f1... |
💔 Test failed - status-appveyor |
Ok well that's one that I haven't seen before. @japaric I think this is what you get for writing such long and detailed commit messages! The actual failure there was:
Where if we check the logs we find:
That message is super long because apparently AppVeyor put the whole commit message in an environment variable (escaped newlines and all). The weird part is that the env var is in the output. In Rust, however, we expect the env var value to end with So I'm reading that as:
I just double-checked the implementation of I... guess we could try trimming the commit message and trying again? |
I believe the message included is in fact the PR description? One way to work around this may be moving the text into a commit, leaving the PR description minimal. |
PTX support, take 2 - You can generate PTX using `--emit=asm` and the right (custom) target. Which then you can run on a NVIDIA GPU. - You can compile `core` to PTX. [Xargo] also works and it can compile some other crates like `collections` (but I doubt all of those make sense on a GPU) [Xargo]: https://github.com/japaric/xargo - You can create "global" functions, which can be "called" by the host, using the `"ptx-kernel"` ABI, e.g. `extern "ptx-kernel" fn kernel() { .. }`. Every other function is a "device" function and can only be called by the GPU. - Intrinsics like `__syncthreads()` and `blockIdx.x` are available as `"platform-intrinsics"`. These intrinsics are *not* in the `core` crate but any Rust user can create "bindings" to them using an `extern "platform-intrinsics"` block. See example at the end. - Trying to emit PTX with `-g` (debuginfo); you get an LLVM error. But I don't think PTX can contain debuginfo anyway so `-g` should be ignored and a warning should be printed ("`-g` doesn't work with this target" or something). - "Single source" support. You *can't* write a single source file that contains both host and device code. I think that should be possible to implement that outside the compiler using compiler plugins / build scripts. - The equivalent to CUDA `__shared__` which it's used to declare memory that's shared between the threads of the same block. This could be implemented using attributes: `#[shared] static mut SCRATCH_MEMORY: [f32; 64]` but hasn't been implemented yet. - Built-in targets. This PR doesn't add targets to the compiler just yet but one can create custom targets to be able to emit PTX code (see the example at the end). The idea is to have people experiment with this feature before committing to it (built-in targets are "insta-stable") - All functions must be "inlined". IOW, the `.rlib` must always contain the LLVM bitcode of all the functions of the crate it was produced from. Otherwise, you end with "undefined references" in the final PTX code but you won't get *any* linker error because no linker is involved. IOW, you'll hit a runtime error when loading the PTX into the GPU. The workaround is to use `#[inline]` on non-generic functions and to never use `#[inline(never)]` but this may not always be possible because e.g. you could be relying on third party code. - Should `--emit=asm` generate a `.ptx` file instead of a `.s` file? TL;DR Use Xargo to turn a crate into a PTX module (a `.s` file). Then pass that PTX module, as a string, to the GPU and run it. The full code is in [this repository]. This section gives an overview of how to run Rust code on a NVIDIA GPU. [this repository]: https://github.com/japaric/cuda - Create a custom target. Here's the 64-bit NVPTX target (NOTE: the comments are not valid because this is supposed to be a JSON file; remove them before you use this file): ``` js // nvptx64-nvidia-cuda.json { "arch": "nvptx64", // matches LLVM "cpu": "sm_20", // "oldest" compute capability supported by LLVM "data-layout": "e-i64:64-v16:16-v32:32-n16:32:64", "llvm-target": "nvptx64-nvidia-cuda", "max-atomic-width": 0, // LLVM errors with any other value :-( "os": "cuda", // matches LLVM "panic-strategy": "abort", "target-endian": "little", "target-pointer-width": "64", "target-vendor": "nvidia", // matches LLVM -- not required } ``` (There's a 32-bit target specification in the linked repository) - Write a kernel ``` rust extern "platform-intrinsic" { fn nvptx_block_dim_x() -> i32; fn nvptx_block_idx_x() -> i32; fn nvptx_thread_idx_x() -> i32; } /// Copies an array of `n` floating point numbers from `src` to `dst` pub unsafe extern "ptx-kernel" fn memcpy(dst: *mut f32, src: *const f32, n: usize) { let i = (nvptx_block_dim_x() as isize) .wrapping_mul(nvptx_block_idx_x() as isize) .wrapping_add(nvptx_thread_idx_x() as isize); if (i as usize) < n { *dst.offset(i) = *src.offset(i); } } ``` - Emit PTX code ``` $ xargo rustc --target nvptx64-nvidia-cuda --release -- --emit=asm Compiling core v0.0.0 (file://..) (..) Compiling nvptx-builtins v0.1.0 (https://github.com/japaric/nvptx-builtins) Compiling kernel v0.1.0 $ cat target/nvptx64-nvidia-cuda/release/deps/kernel-*.s // // Generated by LLVM NVPTX Back-End // .version 3.2 .target sm_20 .address_size 64 // .globl memcpy .visible .entry memcpy( .param .u64 memcpy_param_0, .param .u64 memcpy_param_1, .param .u64 memcpy_param_2 ) { .reg .pred %p<2>; .reg .s32 %r<5>; .reg .s64 %rd<12>; ld.param.u64 %rd7, [memcpy_param_2]; mov.u32 %r1, %ntid.x; mov.u32 %r2, %ctaid.x; mul.wide.s32 %rd8, %r2, %r1; mov.u32 %r3, %tid.x; cvt.s64.s32 %rd9, %r3; add.s64 %rd10, %rd9, %rd8; setp.ge.u64 %p1, %rd10, %rd7; @%p1 bra LBB0_2; ld.param.u64 %rd3, [memcpy_param_0]; ld.param.u64 %rd4, [memcpy_param_1]; cvta.to.global.u64 %rd5, %rd4; cvta.to.global.u64 %rd6, %rd3; shl.b64 %rd11, %rd10, 2; add.s64 %rd1, %rd6, %rd11; add.s64 %rd2, %rd5, %rd11; ld.global.u32 %r4, [%rd2]; st.global.u32 [%rd1], %r4; LBB0_2: ret; } ``` - Run it on the GPU ``` rust // `kernel.ptx` is the `*.s` file we got in the previous step const KERNEL: &'static str = include_str!("kernel.ptx"); driver::initialize()?; let device = Device(0)?; let ctx = device.create_context()?; let module = ctx.load_module(KERNEL)?; let kernel = module.function("memcpy")?; let h_a: Vec<f32> = /* create some random data */; let h_b = vec![0.; N]; let d_a = driver::allocate(bytes)?; let d_b = driver::allocate(bytes)?; // Copy from host to GPU driver::copy(h_a, d_a)?; // Run `memcpy` on the GPU kernel.launch(d_b, d_a, N)?; // Copy from GPU to host driver::copy(d_b, h_b)?; // Verify assert_eq!(h_a, h_b); // `d_a`, `d_b`, `h_a`, `h_b` are dropped/freed here ``` --- cc @alexcrichton @brson @rkruppe > What has changed since rust-lang#34195? - `core` now can be compiled into PTX. Which makes it very easy to turn `no_std` crates into "kernels" with the help of Xargo. - There's now a way, the `"ptx-kernel"` ABI, to generate "global" functions. The old PR required a manual step (it was hack) to "convert" "device" functions into "global" functions. (Only "global" functions can be launched by the host) - Everything is unstable. There are not "insta stable" built-in targets this time (\*). The users have to use a custom target to experiment with this feature. Also, PTX instrinsics, like `__syncthreads` and `blockIdx.x`, are now implemented as `"platform-intrinsics"` so they no longer live in the `core` crate. (\*) I'd actually like to have in-tree targets because that makes this target more discoverable, removes the need to lug around .json files, etc. However, bundling a target with the compiler immediately puts it in the path towards stabilization. Which gives us just two cycles to find and fix any problem with the target specification. Afterwards, it becomes hard to tweak the specification because that could be a breaking change. A possible solution could be "unstable built-in targets". Basically, to use an unstable target, you'll have to also pass `-Z unstable-options` to the compiler. And unstable targets, being unstable, wouldn't be available on stable. > Why should this be merged? - To let people experiment with the feature out of tree. Having easy access to the feature (in every nightly) allows this. I also think that, as it is, it should be possible to start prototyping type-safe single source support using build scripts, macros and/or plugins. - It's a straightforward implementation. No different that adding support for any other architecture.
I have reduced the PR description to match the commit message, which is much shorter. I'll open a new thread with more details in the user forum after this lands and then add a link to that thread to the PR description. @bors r=alexcrichton |
💡 This pull request was already approved, no need to approve it again.
|
📌 Commit aac5ff7 has been approved by |
@bors retry |
Er, I think the thread should be created and referenced before the PR lands, as the PR message would be frozen into git history at landing. So that people who prefer looking at commits could immediately link to the details without having to click through the branch information. In short: please ensure that |
⌛ Testing commit aac5ff7 with merge 2be4971... |
@bors: retry |
Version 1.16.0 (2017-03-16) =========================== Language -------- * Lifetimes in statics and consts default to `'static`. [RFC 1623] * [The compiler's `dead_code` lint now accounts for type aliases][38051]. * [Uninhabitable enums (those without any variants) no longer permit wildcard match patterns][38069] * [Clean up semantics of `self` in an import list][38313] * [`Self` may appear in `impl` headers][38920] * [`Self` may appear in struct expressions][39282] Compiler -------- * [`rustc` now supports `--emit=metadata`, which causes rustc to emit a `.rmeta` file containing only crate metadata][38571]. This can be used by tools like the Rust Language Service to perform metadata-only builds. * [Levenshtein based typo suggestions now work in most places, while previously they worked only for fields and sometimes for local variables][38927]. Together with the overhaul of "no resolution"/"unexpected resolution" errors (#[38154]) they result in large and systematic improvement in resolution diagnostics. * [Fix `transmute::<T, U>` where `T` requires a bigger alignment than `U`][38670] * [rustc: use -Xlinker when specifying an rpath with ',' in it][38798] * [`rustc` no longer attempts to provide "consider using an explicit lifetime" suggestions][37057]. They were inaccurate. Stabilized APIs --------------- * [`VecDeque::truncate`] * [`VecDeque::resize`] * [`String::insert_str`] * [`Duration::checked_add`] * [`Duration::checked_sub`] * [`Duration::checked_div`] * [`Duration::checked_mul`] * [`str::replacen`] * [`str::repeat`] * [`SocketAddr::is_ipv4`] * [`SocketAddr::is_ipv6`] * [`IpAddr::is_ipv4`] * [`IpAddr::is_ipv6`] * [`Vec::dedup_by`] * [`Vec::dedup_by_key`] * [`Result::unwrap_or_default`] * [`<*const T>::wrapping_offset`] * [`<*mut T>::wrapping_offset`] * `CommandExt::creation_flags` * [`File::set_permissions`] * [`String::split_off`] Libraries --------- * [`[T]::binary_search` and `[T]::binary_search_by_key` now take their argument by `Borrow` parameter][37761] * [All public types in std implement `Debug`][38006] * [`IpAddr` implements `From<Ipv4Addr>` and `From<Ipv6Addr>`][38327] * [`Ipv6Addr` implements `From<[u16; 8]>`][38131] * [Ctrl-Z returns from `Stdin.read()` when reading from the console on Windows][38274] * [std: Fix partial writes in `LineWriter`][38062] * [std: Clamp max read/write sizes on Unix][38062] * [Use more specific panic message for `&str` slicing errors][38066] * [`TcpListener::set_only_v6` is deprecated][38304]. This functionality cannot be achieved in std currently. * [`writeln!`, like `println!`, now accepts a form with no string or formatting arguments, to just print a newline][38469] * [Implement `iter::Sum` and `iter::Product` for `Result`][38580] * [Reduce the size of static data in `std_unicode::tables`][38781] * [`char::EscapeDebug`, `EscapeDefault`, `EscapeUnicode`, `CaseMappingIter`, `ToLowercase`, `ToUppercase`, implement `Display`][38909] * [`Duration` implements `Sum`][38712] * [`String` implements `ToSocketAddrs`][39048] Cargo ----- * [The `cargo check` command does a type check of a project without building it][cargo/3296] * [crates.io will display CI badges from Travis and AppVeyor, if specified in Cargo.toml][cargo/3546] * [crates.io will display categories listed in Cargo.toml][cargo/3301] * [Compilation profiles accept integer values for `debug`, in addition to `true` and `false`. These are passed to `rustc` as the value to `-C debuginfo`][cargo/3534] * [Implement `cargo --version --verbose`][cargo/3604] * [All builds now output 'dep-info' build dependencies compatible with make and ninja][cargo/3557] * [Build all workspace members with `build --all`][cargo/3511] * [Document all workspace members with `doc --all`][cargo/3515] * [Path deps outside workspace are not members][cargo/3443] Misc ---- * [`rustdoc` has a `--sysroot` argument that, like `rustc`, specifies the path to the Rust implementation][38589] * [The `armv7-linux-androideabi` target no longer enables NEON extensions, per Google's ABI guide][38413] * [The stock standard library can be compiled for Redox OS][38401] * [Rust has initial SPARC support][38726]. Tier 3. No builds available. * [Rust has experimental support for Nvidia PTX][38559]. Tier 3. No builds available. * [Fix backtraces on i686-pc-windows-gnu by disabling FPO][39379] Compatibility Notes ------------------- * [Uninhabitable enums (those without any variants) no longer permit wildcard match patterns][38069] * In this release, references to uninhabited types can not be pattern-matched. This was accidentally allowed in 1.15. * [The compiler's `dead_code` lint now accounts for type aliases][38051]. * [Ctrl-Z returns from `Stdin.read()` when reading from the console on Windows][38274] * [Clean up semantics of `self` in an import list][38313] [37057]: rust-lang/rust#37057 [37761]: rust-lang/rust#37761 [38006]: rust-lang/rust#38006 [38051]: rust-lang/rust#38051 [38062]: rust-lang/rust#38062 [38062]: rust-lang/rust#38622 [38066]: rust-lang/rust#38066 [38069]: rust-lang/rust#38069 [38131]: rust-lang/rust#38131 [38154]: rust-lang/rust#38154 [38274]: rust-lang/rust#38274 [38304]: rust-lang/rust#38304 [38313]: rust-lang/rust#38313 [38314]: rust-lang/rust#38314 [38327]: rust-lang/rust#38327 [38401]: rust-lang/rust#38401 [38413]: rust-lang/rust#38413 [38469]: rust-lang/rust#38469 [38559]: rust-lang/rust#38559 [38571]: rust-lang/rust#38571 [38580]: rust-lang/rust#38580 [38589]: rust-lang/rust#38589 [38670]: rust-lang/rust#38670 [38712]: rust-lang/rust#38712 [38726]: rust-lang/rust#38726 [38781]: rust-lang/rust#38781 [38798]: rust-lang/rust#38798 [38909]: rust-lang/rust#38909 [38920]: rust-lang/rust#38920 [38927]: rust-lang/rust#38927 [39048]: rust-lang/rust#39048 [39282]: rust-lang/rust#39282 [39379]: rust-lang/rust#39379 [`<*const T>::wrapping_offset`]: https://doc.rust-lang.org/std/primitive.pointer.html#method.wrapping_offset [`<*mut T>::wrapping_offset`]: https://doc.rust-lang.org/std/primitive.pointer.html#method.wrapping_offset [`Duration::checked_add`]: https://doc.rust-lang.org/std/time/struct.Duration.html#method.checked_add [`Duration::checked_div`]: https://doc.rust-lang.org/std/time/struct.Duration.html#method.checked_div [`Duration::checked_mul`]: https://doc.rust-lang.org/std/time/struct.Duration.html#method.checked_mul [`Duration::checked_sub`]: https://doc.rust-lang.org/std/time/struct.Duration.html#method.checked_sub [`File::set_permissions`]: https://doc.rust-lang.org/std/fs/struct.File.html#method.set_permissions [`IpAddr::is_ipv4`]: https://doc.rust-lang.org/std/net/enum.IpAddr.html#method.is_ipv4 [`IpAddr::is_ipv6`]: https://doc.rust-lang.org/std/net/enum.IpAddr.html#method.is_ipv6 [`Result::unwrap_or_default`]: https://doc.rust-lang.org/std/result/enum.Result.html#method.unwrap_or_default [`SocketAddr::is_ipv4`]: https://doc.rust-lang.org/std/net/enum.SocketAddr.html#method.is_ipv4 [`SocketAddr::is_ipv6`]: https://doc.rust-lang.org/std/net/enum.SocketAddr.html#method.is_ipv6 [`String::insert_str`]: https://doc.rust-lang.org/std/string/struct.String.html#method.insert_str [`String::split_off`]: https://doc.rust-lang.org/std/string/struct.String.html#method.split_off [`Vec::dedup_by_key`]: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.dedup_by_key [`Vec::dedup_by`]: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.dedup_by [`VecDeque::resize`]: https://doc.rust-lang.org/std/collections/vec_deque/struct.VecDeque.html#method.resize [`VecDeque::truncate`]: https://doc.rust-lang.org/std/collections/vec_deque/struct.VecDeque.html#method.truncate [`str::repeat`]: https://doc.rust-lang.org/std/primitive.str.html#method.repeat [`str::replacen`]: https://doc.rust-lang.org/std/primitive.str.html#method.replacen [cargo/3296]: rust-lang/cargo#3296 [cargo/3301]: rust-lang/cargo#3301 [cargo/3443]: rust-lang/cargo#3443 [cargo/3511]: rust-lang/cargo#3511 [cargo/3515]: rust-lang/cargo#3515 [cargo/3534]: rust-lang/cargo#3534 [cargo/3546]: rust-lang/cargo#3546 [cargo/3557]: rust-lang/cargo#3557 [cargo/3604]: rust-lang/cargo#3604 [RFC 1623]: https://github.com/rust-lang/rfcs/blob/master/text/1623-static.md
UPDATE Documentation
--emit=asm --target=nvptx64-nvidia-cuda
can be used to turn a crateinto a PTX module (a
.s
file).intrinsics like
__syncthreads
andblockIdx.x
are exposed as"platform-intrinsics"
."cabi" has been implemented for the nvptx and nvptx64 architectures.
i.e.
extern "C"
works.a new ABI,
"ptx-kernel"
. That can be used to generate "global"functions. Example:
extern "ptx-kernel" fn kernel() { .. }
. Allother functions are "device" functions.
ExampleOutdated. The examples have been move. See link at the top.