Implement SHA2 in Rust. #199
Conversation
I had a go at writing some simple code for the SHA2 family. I know the original issues specified fast implementations, so these might not cut it. Roughly 2x slower than the previous version and a bit slower than rust-crypto. Timings on x86_64:
Ring after:
rust-crypto:
If you have any suggestions for optimisations, I'm happy to give them a go.
Haha, #187 wasn't there a few days ago when I first looked for something to solve 😄 Well, I was looking for something to get my hands dirty anyway.
Coverage decreased (-0.5%) to 81.044% when pulling 727aad5cd5cae3e4017ef34c0bc74340f396a175 on samscott89:SHA2_impl into dcdf473 on briansmith:master.
src/digest/sha2.rs
Outdated
// SHA512 functions
#[inline] fn big_s0_512(x: W64) -> W64 { x.rotr(28) ^ x.rotr(34) ^ x.rotr(39) }
#[inline] fn big_s1_512(x: W64) -> W64 { x.rotr(14) ^ x.rotr(18) ^ x.rotr(41) }
NIT: your alignment is messed up here.
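For reference, the snippet under review covers two of the four SHA-512 sigma functions from FIPS 180-4 §4.1.3. Here is a self-contained sketch of all four using the standard library's `Wrapping` and `rotate_right` in place of the PR's own `W64`/`rotr` helpers (so this is an illustration, not ring's actual code):

```rust
use std::num::Wrapping;

type W64 = Wrapping<u64>;

// FIPS 180-4 §4.1.3: the four SHA-512 sigma functions.
#[inline]
fn big_s0_512(x: W64) -> W64 {
    Wrapping(x.0.rotate_right(28) ^ x.0.rotate_right(34) ^ x.0.rotate_right(39))
}

#[inline]
fn big_s1_512(x: W64) -> W64 {
    Wrapping(x.0.rotate_right(14) ^ x.0.rotate_right(18) ^ x.0.rotate_right(41))
}

#[inline]
fn small_s0_512(x: W64) -> W64 {
    Wrapping(x.0.rotate_right(1) ^ x.0.rotate_right(8) ^ (x.0 >> 7))
}

#[inline]
fn small_s1_512(x: W64) -> W64 {
    Wrapping(x.0.rotate_right(19) ^ x.0.rotate_right(61) ^ (x.0 >> 6))
}

fn main() {
    // rotate_right(1, n) == 1 << (64 - n), so each XOR term is easy to check by hand.
    assert_eq!(big_s0_512(Wrapping(1)).0, (1u64 << 36) ^ (1 << 30) ^ (1 << 25));
    assert_eq!(big_s1_512(Wrapping(1)).0, (1u64 << 50) ^ (1 << 46) ^ (1 << 23));
    assert_eq!(small_s0_512(Wrapping(1)).0, (1u64 << 63) ^ (1 << 56)); // 1 >> 7 == 0
    assert_eq!(small_s1_512(Wrapping(1)).0, (1u64 << 45) ^ (1 << 3)); // 1 >> 6 == 0
}
```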
Thanks for submitting this! The duplication of effort with #187 is unfortunate. Maybe you and @gsoltis can work together to determine a path forward? Like I said in #187, since the rust-crypto code is faster, I think we need to choose between (a) adding more commits that optimize the code further (either the code in #187 or the code here), or (b) simply using the rust-crypto implementation of the block function, if we don't think we can implement something that is as efficient as rust-crypto.

Also, unless/until the pure-Rust implementation is as fast as the assembly-language implementations in ring, we'll continue to use the assembly-language implementations on whatever platforms we have faster assembly-language implementations for. (Ideally, we'd figure out a way to do a pure-Rust implementation that is faster than the ASM code, but at least on x86-64 it looks to be not even close. Maybe on ARM the story is different.)

This means, in particular, that the final PR needs to include a commit that uses #[cfg()] to choose between an asm implementation and the pure-Rust implementation based on what's available for that platform. (If we can't benchmark on every platform yet, then we can just assume that the asm implementation is faster when an asm implementation is available, as the OpenSSL team has spent a lot of effort optimizing the asm code on many different platforms.)
Interestingly, I think that was also the case with the SHA-1 implementation that @SimonSapin contributed. Is there a difference between using
We don't need an implementation that is as fast as the ASM implementations. We just need a Rust implementation to fall back to on platforms that don't have ASM implementations, which is about as fast as rust-crypto (otherwise we can use rust-crypto).
I did a little more profiling, and it seems like
I think it is probably worth studying what other implementations do, regarding reducing the state size (I think) and regarding unrolling. Most implementations look much different than this because they are partially or often fully unrolled, which allows for some really important optimizations. Also, reformulating the structure of the implementation might enable some auto-vectorization to happen. You're not going to reduce the number of additions or rotations by any significant amount, because they are fundamental to the algorithm.
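To make the unrolling point concrete, here is a hedged sketch (an illustration, not ring's actual code) of the standard SHA-256 round from FIPS 180-4, written as a macro so that a fully unrolled loop can rotate the working-variable names at each call site instead of shuffling the state between rounds:

```rust
use std::num::Wrapping;

type W32 = Wrapping<u32>;

#[inline] fn ch(x: W32, y: W32, z: W32) -> W32 { (x & y) ^ (!x & z) }
#[inline] fn maj(x: W32, y: W32, z: W32) -> W32 { (x & y) ^ (x & z) ^ (y & z) }
#[inline] fn big_s0(x: W32) -> W32 {
    Wrapping(x.0.rotate_right(2) ^ x.0.rotate_right(13) ^ x.0.rotate_right(22))
}
#[inline] fn big_s1(x: W32) -> W32 {
    Wrapping(x.0.rotate_right(6) ^ x.0.rotate_right(11) ^ x.0.rotate_right(25))
}

// One SHA-256 round. Unrolled callers pass the working variables in rotated
// order (a..h, then h,a..g, ...) so no state words need to be moved.
macro_rules! round {
    ($a:ident, $b:ident, $c:ident, $d:ident,
     $e:ident, $f:ident, $g:ident, $h:ident, $k:expr, $w:expr) => {{
        let t1 = $h + big_s1($e) + ch($e, $f, $g) + $k + $w;
        let t2 = big_s0($a) + maj($a, $b, $c);
        $d += t1;
        $h = t1 + t2;
    }};
}

fn main() {
    let (a, b, c, e, f, g) = (Wrapping(0u32), Wrapping(0), Wrapping(0),
                              Wrapping(0), Wrapping(0), Wrapping(0));
    let (mut d, mut h) = (Wrapping(0u32), Wrapping(0));
    round!(a, b, c, d, e, f, g, h, Wrapping(0u32), Wrapping(0u32));
    // An all-zero state, key constant, and message word leave the state unchanged.
    assert_eq!((d.0, h.0), (0, 0));
}
```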
Will have a look at implementing these suggestions to improve performance and, ultimately, at adding the cfg flags where necessary.
I played around with some optimisations. The commits above made the following changes:
Interestingly, removing

At the bottom I've shown the times for the included openssl/boringssl asm when disabling different extensions. I assume Rust has limited access to those instructions, so is the best we could hope to achieve the top line?
With this last push, performance for SHA-256 is now equivalent (on my machine) to rust-crypto, whereas the SHA-512 performance is actually better (by just over 5%). However, as I mentioned in the other PR's comments, if rust-crypto were to use the optimised Sigma variants, they would again be ahead.
Coverage decreased (-0.3%) to 81.598% when pulling e5d8b07391c0ed9c498ca73e317b3b93e04c56ca on samscott89:SHA2_impl into 7490753 on briansmith:master.
Coverage decreased (-0.3%) to 81.642% when pulling aac3226e97a7a2b30fe86abbfd83a1982cd17a42 on samscott89:SHA2_impl into 7490753 on briansmith:master.
Coverage decreased (-0.3%) to 81.598% when pulling ae3ef1b64ed9350fe400f60da816093c2752f5fc on samscott89:SHA2_impl into 7490753 on briansmith:master.
I really appreciate the benchmarking done here and the evidence-based optimization work to get this into awesome shape. I'm a little bit unsure what's going on with this PR and the other PR, #187. Did we decide to do everything in this PR (#199)? It seems like we'll still need to add the conditional logic to use the assembly-language implementations on the platforms that we have them for, right?

The main goal I had when I filed the issue for adding a Rust SHA-2 implementation was to provide a fallback for platforms where we don't have asm implementations. Later, I got the idea we might just use rust-crypto's implementations for all those platforms we need fallbacks for, so I investigated that path a bit. That path seems attractive because it would eliminate some duplication of effort. But rust-crypto does a lot of stuff we intentionally avoid doing in ring, and also its license isn't as nice as the ISC-style license we're using in ring. Also, I contacted the rust-crypto author about splitting out the "core" of rust-crypto into a library we might share, but understandably that's kind of inconvenient for everybody too. Anyway, investigating those things in the background was why I was slow to get back to this. Now it seems like we should go ahead and just add our own SHA-2, Poly1305, ChaCha20, etc. implementations.

The remaining thing I'm uncertain about is this: how are we going to test this? On Travis CI we have four architectures: x86, x86-64, ARM, and AArch64. And we have asm implementations for every one. We could pick one platform, probably x86-64, and modify travis.sh so that it rebuilds ring on that platform with the ASM implementations disabled, and then re-tests. Or, we could try adding MIPS to Travis CI via qemu like we did for ARM. Since we don't have any asm code for MIPS, that port would then serve as the platform for testing this.

I would prefer the latter (adding MIPS), but I'm also fine with the former (adjusting the Travis script for x86-64 to also test this).
The reason I pushed some changes to the other PR branch was because there was an incredibly simple optimisation trick which improved performance significantly, so I thought it would be interesting to see if it had the same effect for the other implementation. I'll continue pushing things along in this branch, and if @gsoltis is okay with it we'll just keep things here.

Personally, I'm a big fan of the approach and philosophy you have in this project, so I think it is definitely worth writing new implementations. I would also argue there is value in having pure-Rust implementations of these algorithms for auditing purposes. For a ~50% performance penalty, some applications may consider it worthwhile to use the simpler code.

That being said, I'll have a play with qemu, because I'm interested to see whether the optimisations I made were only effective because they allowed the compiler to leverage some of the more modern instructions which might not be available to other architectures anyway. In which case, the naive code might be preferable?
Ok, that should do it. I have tests passing on my machine and on qemu when I merge #256 followed by this. |
OK, I had a serious look at this. Before we dive deep into it, do you think any of these transformations are a good idea? https://github.com/briansmith/ring/tree/sha2
My first thought was that they all looked to be sensible modifications. However, according to the benchmarks, splitting the loop is a good idea, but the refactoring of the message schedule to only use 16 values hurts performance. Perhaps the extra arithmetic needed for the indices isn't worth it. Splitting the loop (commits 08d0c45 ... 73f282f) gives you a ~5% speedup, but shrinking the message schedule (commit 2f18b5b) ends up losing a full 10% (from the base case, i.e. over 15% overall).

One minor thing: I get that Cargo features are supposed to be additive, so the syntax of
Most likely, it's inhibiting vectorization. I guess we won't do that then. Thanks for measuring it.
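For anyone following along, the trade-off being measured can be sketched as follows. This is an illustrative comparison of the two SHA-256 message-schedule shapes (a fully expanded 64-entry array versus a rolling 16-word window), not the PR's actual code; the extra index masking in the rolling version is the arithmetic suspected of inhibiting vectorization:

```rust
// SHA-256 small sigma functions (FIPS 180-4 §4.1.2).
fn s0(x: u32) -> u32 { x.rotate_right(7) ^ x.rotate_right(18) ^ (x >> 3) }
fn s1(x: u32) -> u32 { x.rotate_right(17) ^ x.rotate_right(19) ^ (x >> 10) }

// Fully expanded: W[t] = s1(W[t-2]) + W[t-7] + s0(W[t-15]) + W[t-16].
fn expand_full(block: &[u32; 16]) -> Vec<u32> {
    let mut w = vec![0u32; 64];
    w[..16].copy_from_slice(block);
    for t in 16..64 {
        w[t] = s1(w[t - 2])
            .wrapping_add(w[t - 7])
            .wrapping_add(s0(w[t - 15]))
            .wrapping_add(w[t - 16]);
    }
    w
}

// Rolling 16-word window: the same recurrence with every index reduced
// mod 16, at the cost of a mask on each access.
fn expand_rolling(block: &[u32; 16]) -> Vec<u32> {
    let mut w = *block;
    let mut out = w.to_vec();
    for t in 16..64 {
        let i = t & 15;
        w[i] = s1(w[(t - 2) & 15])
            .wrapping_add(w[(t - 7) & 15])
            .wrapping_add(s0(w[(t - 15) & 15]))
            .wrapping_add(w[i]); // w[(t - 16) & 15] is w[i], read before the store
        out.push(w[i]);
    }
    out
}

fn main() {
    let block: [u32; 16] = std::array::from_fn(|i| (i as u32).wrapping_mul(0x9e37_79b9));
    // Both shapes must produce the identical 64-word schedule.
    assert_eq!(expand_full(&block), expand_rolling(&block));
}
```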
Let's say there's a program P that uses crate C, which uses ring. Let's say that C's dependency on ring is just a normal dependency using the defaults. If we have "no_asm" then P can depend on ring with the "no_asm" feature, and then the build will work as long as C doesn't use any features that are disabled by "no_asm". If we have "asm" instead, then would there be a way for P to get the same effect? I don't think so, but I'm not sure. So, this seems like evidence in favor of "no_asm".

On the other hand, with "asm", crates could, in theory, advertise that they work in the no-asm mode in their Cargo.toml, but with "no_asm" there's no such way. This seems like a minor benefit of "asm", though.

On the other other hand, let's say that we have "no_asm" and then C depends on ring with "no_asm" enabled. Then P and any other crates that P uses won't be able to use any of the features that require the asm optimizations, even if they would work! Thus, if we go with the "no_asm" design we would have to document that it shouldn't be used by libraries, only by the top-level program.

It seems like it will be common for library crates to ignore the asm/no_asm issue, so we should do whatever lets the top-level program override the default choice. If there's a way to do that overriding with "asm" then I think that's best; otherwise, I think "no_asm" is best.
Agreed. One particularly ugly option would be to use "asm" as a default feature and add a further "no_asm_override" feature or something, which when activated would take precedence over the "asm" feature. This would require

At which point, the question would be why there is an asm option in the first place. Another option would be to just change the syntax slightly: as opposed to "no_asm", it could be "use_rust_code".
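To illustrate the constraint being discussed (a hypothetical sketch, not ring's actual manifest): Cargo features are additive and dependents can only enable them, so a default "asm" feature that the top-level program opts out of looks like the fragment below, and it silently comes back if any other crate in the graph depends on ring with default features enabled.

```toml
# Hypothetical Cargo.toml fragment, not ring's actual manifest.
[features]
default = ["asm"]   # opting out requires default-features = false
asm = []

# A top-level program opting out would depend on ring like this:
# [dependencies]
# ring = { version = "...", default-features = false }
```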
Please check out https://github.com/briansmith/ring/tree/sha2-3. I dropped my proposed no_asm -> asm change. I also dropped all of my proposed optimizations, including the splitting. It isn't clear to me why the splitting would make things so much faster, unless it is purely due to saving some load-stores or allowing better reordering of operations. In any case, if the splitting is really faster, we can do it later.

I did some experimenting with trying to reduce the duplicate code without decreasing performance. Could you check which changes in the sha2-3 branch decrease performance? You mentioned earlier that using

I suggested adding a dependency on the num-traits crate so that we can write generic functions. I prefer to write generic functions rather than copy-paste if we can do so without hurting performance. I'm not that excited about depending on an external crate just for this; maybe in the future we can create a trait similar to (and simpler than)

You said that using the precomputed tables is significantly faster, right? If so, we might consider using such a precomputed table for SHA-1 too, as the slow SHA-1 is actually a huge slowdown when it comes to running the test suite. Anyway, I think this is getting close to landing.
I think the use of traits kills performance. Commit 644d870 slows everything down to barely 1MB/s. Previously I've tested inline functions vs macros and there didn't seem to be any difference, so I can only assume it's the trait object which is causing the slowdown. Replacing the trait functions with macros in the head of your branch recovers performance again as I've done here: samscott89@cb8752f.
I don't think that was me. Could we simply remove the SHA-1 tests from the Travis set, since SHA-1 is mostly deprecated in ring anyway?
@samscott89 maybe you can file a rustc bug about this. It seems like the compiler should be able to monomorphize the traits for this case to improve the performance. (I'd do it myself, but you'll probably be able to more accurately describe the problem and answer any questions they might have.)
It's a bit bizarre: I haven't been able to reproduce it with a minimal example. Checking the assembly code, the problem is fairly obvious: the

However, as I said before, I can't reproduce this with a minimal example. In any case, perhaps we should just leave a comment explaining why macros are used instead of traits over generics?
Does adding
Adding
Is there any way of forcing
I don't believe so, so I filed rust-num/num#218 about it.
Cool, thanks! |
@samscott89 There are many things we can do:
I think #1 is good enough. I would eventually like to get rid of all the macros but it seems we need language improvements to the type system before it would be possible.
OK, I feel pretty strongly now that we should avoid the num-traits crate, at least until we have more significant need for it. I don't want to duplicate that effort, but I also don't want to add that dependency for such a small thing. Please let me know if you can update the PR to use the nested-function approach inside the

Regardless, this has dragged on for a long time (mostly my fault, it seems) so LMK what I can do to help finish this off.
Will do. |
Something like this: samscott89@d3711b4 ? |
Yes, that seems fine. Will look at it again after the Travis queue clears a bit.
See #256 (comment). This is very good work, but it is in conflict with some other things we're trying to do with respect to increasing confidence in the code quality, so we're going to park it for now.
IMO the most important thing for this code (aside from avoiding side channels) is that, when ring is compiled optimized for size, it should be small. Most people interested in this code at the present time don't care as much about the performance as they care about the size.
I think now is a good time to revisit this work and use this as a fallback implementation for platforms where we don't have (and don't want to maintain) assembly language implementations, in particular WebAssembly and MIPS.
Is anybody interested in this for any other reason?
@samscott89 Are you still interested in this? I wouldn't be surprised if you are or aren't. If you are, once it is updated I can approve and merge it right away.
type W32 = Wrapping<u32>;
type W64 = Wrapping<u64>;

macro_rules! ch {
I believe all these uses of macros can be replaced by const fn generic functions now.
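Sketching that suggestion concretely (a hedged illustration; whether ring's current code does exactly this is not shown here): in modern Rust the macro bodies can become plain const fns, usable in both const and runtime contexts.

```rust
// Hypothetical const fn forms of the macro-based helpers. u64::rotate_right
// is callable in const contexts in stable Rust.
const fn big_s0_512(x: u64) -> u64 {
    x.rotate_right(28) ^ x.rotate_right(34) ^ x.rotate_right(39)
}

const fn ch(x: u64, y: u64, z: u64) -> u64 {
    (x & y) ^ (!x & z)
}

fn main() {
    // Evaluated at compile time:
    const S: u64 = big_s0_512(1);
    assert_eq!(S, (1u64 << 36) ^ (1 << 30) ^ (1 << 25));
    assert_eq!(ch(u64::MAX, 7, 9), 7);
}
```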
#[inline]
fn small_s1_512(x: W64) -> W64 { rotr!((rotr!(x, 42) ^ x), 19) ^ (x >> 6) }

pub fn block_data_order_256(state: &mut [u64; MAX_CHAINING_LEN / 8],
It would be better to have SHA-256 and SHA-512 in separate files.
@@ -37,6 +37,19 @@ pub fn wrapping_rotate_left_u32(x: core::num::Wrapping<u32>, n: u32)
pub mod slice {
    use core;

    #[cfg(feature="no_asm")]
    #[inline(always)]
    pub fn u64_from_be_u8(buffer: &[u8; 8]) -> u64 {
This function can be removed in favor of using the endian submodule.
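For what it's worth, in current Rust the conversion this helper performs is also available directly in core; a minimal equivalent sketch:

```rust
// Equivalent of the PR's u64_from_be_u8 helper using u64::from_be_bytes
// (stable since Rust 1.32, so later than this PR's era).
fn u64_from_be_u8(buffer: &[u8; 8]) -> u64 {
    u64::from_be_bytes(*buffer)
}

fn main() {
    assert_eq!(u64_from_be_u8(&[0, 0, 0, 0, 0, 0, 0, 1]), 1);
    assert_eq!(u64_from_be_u8(&[1, 0, 0, 0, 0, 0, 0, 0]), 1 << 56);
}
```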
#[cfg(feature="no_asm")]
Rather than have a no_asm feature, I think it makes sense to do this conditionally based on whether the target is not x86, x86_64, arm or aarch64.
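The target-based selection being suggested can be sketched like this (function names are hypothetical; a real dispatch would wire up the block functions rather than return strings):

```rust
// Compile one implementation or the other based purely on the target
// architecture, with no Cargo feature involved.
#[cfg(any(target_arch = "x86", target_arch = "x86_64",
          target_arch = "arm", target_arch = "aarch64"))]
fn sha256_impl_name() -> &'static str {
    "asm" // platforms with assembly-language block functions
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64",
              target_arch = "arm", target_arch = "aarch64")))]
fn sha256_impl_name() -> &'static str {
    "pure-rust" // fallback, e.g. on MIPS or wasm32
}

fn main() {
    // Exactly one of the two definitions exists on any given target.
    let which = sha256_impl_name();
    assert!(which == "asm" || which == "pure-rust");
}
```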
PR #863 has a new implementation of SHA-2. Please take a look. Thanks again for the effort here.
The code is a simple adaptation of the pre-existing SHA-1 code.
The algorithm is implemented as specified in [FIPS 180-4].
This addresses #61 and #62.