Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch the BLAKE2 implementation to blake2b_simd/blake2s_simd #88

Closed
wants to merge 3 commits into from

Conversation

oconnor663
Copy link

[This is a very large change which I haven't discussed with anyone yet, so I'm
not sure it's the right choice for the project. But hopefully this is a good starting
point for discussion.]

This is mostly a large performance improvement. The BLAKE2b bench_10000
case is improved by about 30%. This implementation also detects SIMD
support at runtime, so the feature flags related to SIMD support are
removed.

The only performance loss is in the bench_10 cases, where the caller
repeatedly feeds input slices less than one block long. The BLAKE2s
bench_10 case is almost 20% slower. I'm not sure exactly why, but this
implementation optimizes for avoiding copies on long runs of input, so
it might just be that it's doing more math up front. This performance
issue disappears if the inputs are a full block or longer.

The only API consequence of this change is that the undocumented
with_parameter_block constructor is no longer supported. Callers who
need other parameters might prefer to use the blake2b_simd/blake2s_simd
APIs directly, which expose them in a safer way through a Params object.

@oconnor663
Copy link
Author

Ah, it seems like one issue is that blake2b_simd and blake2s_simd don't support rustc 1.21. Currently they don't even have a rustc support target besides stable. It probably wouldn't make sense to move them back to the 2015 edition, but if we could target a rustc version that supported the 2018 edition I could probably fix them to be compatible with something older than current stable.

@newpavlov
Copy link
Member

Greetings!

I would strongly prefer to merge your implementations into blake2 instead of depending on blake2b_simd and blake2s_simd. Plus IIRC we can share some code between B and S variants. Are you willing to do it? Have you compared performance to code in #72? How performance of SIMD-less implementation compares to one in this crate?

Also how do you handle feature detection? Are you using CPUID directly or utilize is_x86_feature_detected! macro? Note that the latter is unavailable on no_std targets right now.

Regarding MSRV, in v0.9 I plan to update it to 1.32 and migrate crates to 2018 edition.

@oconnor663
Copy link
Author

Have you compared performance to code in #72?

Ah no I hadn't. Looking at that branch now, the BLAKE2b long input performance seems to be about 1.5% improved relative to master. However, enabling the coresimd feature seems to result in a 12% drop in performance. I'm not sure why.

I also notice that that branch adds BLAKE2bp and BLAKE2sp implementations, with comparable performance to BLAKE2b and BLAKE2s. However, this way of implementing the parallel variants (calling compress in a loop on each of the inner instances) isn't the best. It turns out that if you reorganize your SIMD vectors so that they hold a state word from each input, the performance goes through the roof, I think mainly because you never need to diagonalize the vectors. Here are the relevant numbers from blake2_simd:

bench_1mb_blake2b_avx2      989,743 ns/iter (+/- 9,049) = 1059 MB/s
bench_1mb_blake2bp          471,941 ns/iter (+/- 3,686) = 2221 MB/s
bench_1mb_blake2s_sse41   1,529,309 ns/iter (+/- 15,261) = 685 MB/s
bench_1mb_blake2sp          429,325 ns/iter (+/- 12,186) = 2442 MB/s

Also how do you handle feature detection? Are you using CPUID directly or utilize is_x86_feature_detected! macro? Note that the latter is unavailable on no_std targets right now.

I use the macro. With no_std (which you get at using --no-default-features), dynamic feature detection is currently disabled. That means SIMD is disabled, unless you build with something like RUSTFLAGS="-C target-cpu=native", in which case SIMD is statically enabled.

How performance of SIMD-less implementation compares to one in this crate?

I haven't done a very careful comparison yet. Trying it just now, it seems like with --no-default-features on my x86_64 laptop, my crate is 5-10% slower than this one. However, my suspicion is that that's because this crate's implementation is written in a vectorized style that compiles well to SSE2, which is statically enabled on x86_64. When I compare benchmarks on a 64-bit ARM machine (an AWS a1.xlarge instance), I find that my crate is 3-4% faster. It might be that my portable implementation is generally faster on architectures without SIMD, but I definitely haven't done enough testing to make any claims.

I would strongly prefer to merge your implementations into blake2 instead of depending on blake2b_simd and blake2s_simd. Plus IIRC we can share some code between B and S variants.

I'm not sure we can share that much code between B and S. The SIMD implementations -- particularly message loads that differ from round to round -- are done "by hand" and can't really be generated from a shared macro. We could maybe share some of the boilerplate in the higher level types like State and Params, but that's not so much code compared to all the SIMD intrinsics. And we'd have to code around little issues like how B and S pack the node_offset parameter differently. Those issues together are why the blake2_simd project just duplicates everything between the two versions.

Can you say more about why you prefer to vendor the implementation? In my mind it would be a bit of a maintenance burden for both crates. Note that the compression functions have both received substantial optimizations as recently as a couple weeks ago, see oconnor663/blake2_simd@32065b5 and oconnor663/blake2_simd@e26796e. The higher level APIs are also a core dependency of my Bao project, and they're getting tweaks as I get more experience using them. The HashManyJob API in particular is kind of novel, and it's gone through a few iterations. RustCrypto might not care to include that part of the API, but again the more divergence there is between the two codebases the harder it'd be to port changes.

@oconnor663
Copy link
Author

oconnor663 commented Aug 2, 2019

I went ahead and added BLAKE2bp and BLAKE2sp support to this PR, to mirror what was done in #72, since it's not very much additional code. (Note though, I haven't figured out how to add test vectors for those yet.) Here's where things stand performance-wise, as measured with the benchmarks in this crate on my laptop:

  • In master, BLAKE2b was already the fastest hash function, at 809 MB/s. The next fastest secure hash function (so excluding SHA1) is SHA-512 at 359 MB/s. I measure OpensSSL's SHA-512 implementation to be 666 MB/s, so there's definitely a lot of room for improvement there, but it'll probably never catch up to BLAKE2b.
  • With these changes, BLAKE2b is now 1057 MB/s, a 30% improvement over master.
  • The fastest hash function is now BLAKE2sp at 2313 MB/s.

If something like this lands, these figures would change the recommendation I'd make for application developers. In particular, I'd recommend that most applications concerned about hash performance use BLAKE2sp, if they have a choice. In addition to being faster on modern x86 machines with AVX2 support, it's also faster on 32-bit embedded systems without SIMD, because of its 32-bit words. The main downside of BLAKE2sp is reduced throughput for very short messages compared to BLAKE2s and BLAKE2b. That's probably not a big deal for most applications, since bottlenecking on hashing performance usually means you're hashing something big, but it's the kind of thing you need to measure.

@newpavlov
Copy link
Member

Sorry for the late reply!

What do you think about moving your crates into this repository? I am ready to provide you a full access to it and using the new code owners feature all PRs affecting BLAKE2 crates will go through you.

@oconnor663
Copy link
Author

My first instinct is that it wouldn't make sense to merge them. The blake2s_simd and blake2b_simd crates expose an API that's different from and larger than the APIs of existing RustCrypto crates. For example, BLAKE2s and BLAKE2b have variable length output, but not fully extensible output (BLAKE2X being a separate standard). They also support keying and a number of other customization parameters. And each crate exposes a separate hash_many interface for high-performance SIMD parallelism.

Rather than having these odd ducks in the RustCrypto repo, wouldn't it make more sense to just wrap their APIs? If taking a dependency is an issue, I suppose vendoring the relevant parts of the code could also make sense. The core hash implementations are likely to be stable for the foreseeable future, at least until Rust stabilizes AVX-512 and NEON intrinsics.

@newpavlov
Copy link
Member

For example, BLAKE2s and BLAKE2b have variable length output

This case is covered by the digest::VariableOutput trait.

They also support keying and a number of other customization parameters.

Keying is supported via crypto_mac::Mac trait.

Personalized blocks and hash_many can be supported separately from traits via inherent methods (e.g. variable output was initially implemented via inherent methods and was moved to traits later). It could break synchronization of versions. It will be a bit unfortunate, but it's not a deal breaker in my opinion. Even today it can happen, since blake2 depends on both crypto-mac and digest for public API, so breaking change in either one of them will result in a breaking release for blake2.

If taking a dependency is an issue, I suppose vendoring the relevant parts of the code could also make sense.

It's about improving visibility of crates and reducing a number of groups to which people have to trust. And I would like to avoid copying code around if possible, so vendoring the code will be sub-optimal as well.

@tarcieri
Copy link
Member

@oconnor663 would you like to continue to work on this PR, and if so, can you rebase?

@oconnor663
Copy link
Author

I'd be happy to rebase it on master, sure. I've made a note to take a shot at it this weekend.

@oconnor663 oconnor663 force-pushed the blake2_simd branch 2 times, most recently from fb0ba5e to 0a91489 Compare June 15, 2020 03:04
@oconnor663
Copy link
Author

I've rebased (more like rewritten) the branch on top of current master. As before, I've added support and benchmarks for BLAKE2bp and BLAKE2sp, but I haven't yet added new test vectors for those. There's some complexity due to the fact that my *_simd crates don't support personalization or salting for BLAKE2bp or BLAKE2sp. (I've never seen any implementation that does, and I'm not aware of any test vectors covering those cases, official or otherwise. So I'm hesitant to "innovate" in that direction.)

Replace the internal implementation of BLAKE2b and BLAKE2s with calls to
the blake2b_simd and blake2s_simd crates. Those crates contain optimized
implementations for SSE4.1 and AVX2, and they use runtime CPU feature
detection to select the best implementation.

Running the long-input benchmarks on an Intel i9-9880H with AVX2
support, this change is a performance improvement of about 1.5x for
BLAKE2b and 1.35x for BLAKE2s.

This change deletes the undocumented `with_parameter_block` method, as
the raw parameter block is not exposed by blake2b_simd or blak2s_simd.
Callers who need BLAKE2 tree mode parameters can use the upstream crates
directly. They provide a complete set of parameter methods.

This change also deletes the `finalize_last_node` method. This method
was arguably attached to the wrong types, `VarBlake2b` and `VarBlake2s`,
where it would panic with a non-default output length. It's not very
useful without the other tree parameters, so rather than moving it to
the fixed-length `Blake2b` and `Blake2s` types where it belongs, we just
delete it. This also simplifies the addition of BLAKE2bp and BLAKE2sp
support in the following commit, as those algorithms use the last node
flag internally and cannot expose it.
On an Intel i9-9880H with AVX2 support, both BLAKE2bp and BLAKE2sp are
about 1.75x faster than BLAKE2b. Note that while these algorithms can be
implemented with multi-threading, these implementations from
blake2b_simd and blake2s_simd are single-threaded, using only SIMD
parallelism.

The blake2b_simd and blake2s_simd crates don't support salting or
personalization for BLAKE2bp and BLAKE2sp, so the `with_params` methods
are moved out into blake2b.rs and blake2s.rs.
On x86 targets, SSE4.1 and AVX2 implementations are always compiled.
With the `std` feature enabled, runtime CPU feature detection is used to
select between them. With `std` disabled (e.g. --no-default-features),
the only way to activate SIMD is something like

    export RUSTFLAGS="-C target-cpu=native"
@tarcieri
Copy link
Member

tarcieri commented Nov 2, 2020

@oconnor663 taking a look at this again, I think I also share @newpavlov's concerns that this is relying on external crates.

Would you consider merging your work directly into the blake2 crate?

@oconnor663
Copy link
Author

I've leaned against it in the past, but I could change my mind on that. I guess the best long term outcome would be to deprecate the *_simd crates, and to port their internals into this crate. Maintaining them there vs maintaining them here shouldn't really make much of a difference in practice, as long as we can avoid needing to maintain them in two places. The biggest hurdle for me would be figuring out how to keep something like the Params API that the *_simd crates currently expose. This repo has a hidden with_parameter_block API that kind of provides the same functionality, but in my experience even experts have trouble assembling the parameter block correctly, particularly around endianness issues. Do you think we could port over Params more or less intact?

More broadly/philosophically, I think we have different preference for trait methods vs inherent methods. My preference has been to make update / finalize inherent methods, and to have trait methods wrap them, rather than requiring callers to import the digest traits in cases where they're not treating algorithms generically. Is that something you all are strictly opposed to? What's your current leaning on those sorts of API questions? Also at a very high level, I'm curious if/how digest plans to change when const generics land. That's one reason I've held off from tagging 1.0. (That, and the fact that no one so far has said that they need it to be 1.0.)

One minor issue: I decided to split BLAKE2b/BLAKE2s into two repos in the past, mainly because building all the intrinsics implementations for different SIMD instruction sets takes a non-trivial amount of time, and it's nice not to double it. Presumably we'd be combining them here and paying that cost in build times for callers don't need both. (I think in practice most callers just want BLAKE2b.) That's definitely not the end of the world, but we might want to measure it and make an explicit judgment call.

@tarcieri
Copy link
Member

tarcieri commented Nov 2, 2020

Do you think we could port over Params more or less intact?

Probably. You can always add inherent methods that accept them.

More broadly/philosophically, I think we have different preference for trait methods vs inherent methods. My preference has been to make update / finalize inherent methods, and to have trait methods wrap them, rather than requiring callers to import the digest traits in cases where they're not treating algorithms generically. Is that something you all are strictly opposed to? What's your current leaning on those sorts of API questions?

There's not a dichotomy here: you can always have inherent methods with the same name, and have the traits call the inherent methods (we do this quite a bit in the elliptic-curve crates).

Also at a very high level, I'm curious if/how digest plans to change when const generics land. That's one reason I've held off from tagging 1.0. (That, and the fact that no one so far has said that they need it to be 1.0.)

We also have held off on a 1.0 until const generics land. The goal is definitely to use them ASAP when they become available.

One minor issue: I decided to split BLAKE2b/BLAKE2s into two repos in the past, mainly because building all the intrinsics implementations for different SIMD instruction sets takes a non-trivial amount of time, and it's nice not to double it. Presumably we'd be combining them here and paying that cost in build times for callers don't need both. (I think in practice most callers just want BLAKE2b.) That's definitely not the end of the world, but we might want to measure it and make an explicit judgment call.

Any reason this can't be solved with cargo features?

@tarcieri tarcieri requested review from newpavlov and tarcieri and removed request for newpavlov November 2, 2020 21:51
@oconnor663
Copy link
Author

There's not a dichotomy here: you can always have inherent methods with the same name, and have the traits call the inherent methods (we do this quite a bit in the elliptic-curve crates).

Ah cool, I wasn't sure whether this was something you all were intentionally trying to avoid.

Going more extreme in that direction, how would you feel about an incompatible version bump in this crate that wholesale replaces this crate's API with the existing *_simd ones (plus digest::* implementations)? That might be unnecessarily extreme -- for example, keeping method names unchanged where possible would be kinder to existing callers -- but I'm curious how you'd feel about it as a reference point.

Any reason this can't be solved with cargo features?

That's a good point. Maybe blake2{b,s,bp,sp} could all be features? Enabled by default probably makes the most sense, but I could also see an argument for enabling only blake2b by default.

Another open question here: the blake2*_simd:::many APIs. I put those together back when bao was based on BLAKE2, but since the switch to BLAKE3 I haven't had any use for them. I don't know if there are any callers in the wild, and if not I wonder if they're worth keeping. On the one hand, they share most of their code with the BLAKE2sp and BLAKE2bp implementations, so from a binary size perspective they're not a burden. But on the other hand, that shared code is a bit more complicated than it would be if it wasn't shared. But on the other other hand, it's been working fine for a long time, and the work to delete things simplify it might not be worth it at this point.

@tarcieri
Copy link
Member

tarcieri commented Nov 2, 2020

Going more extreme in that direction, how would you feel about an incompatible version bump in this crate that wholesale replaces this crate's API with the existing _simd ones (plus digest:: implementations)?

Done correctly it shouldn't be "breaking" so long as the types are named the same and the same traits are impl'd.

Merging your existing code in and adding the appropriate trait impls sounds good to me.

@newpavlov
Copy link
Member

newpavlov commented Nov 3, 2020

Do you think we could port over Params more or less intact?

Yes. I don't see a problem in exposing additional functionality not covered by the existing traits via inherent methods and additional types.

Is that something you all are strictly opposed to? What's your current leaning on those sorts of API questions?

Personally I dislike duplicating functionality in both inherent and trait methods. And I think it's worth to keep doing so for consistency sake. Also note that such approach is used not only by our crates, but also by std, e.g. you will not find public "duplicated" methods for types implementing io::Read/Write.

But I do understand why some would prefer to have inherent methods, but I think a better solution would be to have "inherent traits" in the language.

@oconnor663
Copy link
Author

but I think a better solution would be to have "inherent traits" in the language.

Yeah strongly agreed on that one.

Also note that such approach is used not only by our crates, but also by std, e.g. you will not find public "duplicated" methods for types implementing io::Read/Write.

I think std is of two minds here. The io traits avoid duplication, like you said (making them the poster child for inherent traits proposals). But From/Into conversions are sometimes duplicated by inherent methods, as in let v: Vec<T> = slice.into() vs slice.to_vec().

By the way, there are other little differences in the APIs between our two crates that will come up as we look at this more closely. Another one is that the *_simd crates have a couple of fluent method interfaces, including the update method. So for example you can write:

let hash = Params::new()
    .hash_length(16)
    .to_state()
    .update(b"foo")
    .update(b"bar")
    .update(b"baz")
    .finalize();

I don't think the digest traits would want to change to support that kind of thing, but I'm currently in favor of keeping that behavior in inherent methods.

@tarcieri
Copy link
Member

tarcieri commented Nov 3, 2020

...have a couple of fluent method interfaces, including the update method. So for example you can write...

The digest equivalent of this is chain.

@oconnor663
Copy link
Author

Oh neat! Though it looks like we took different positions on the classic debate of -> Self vs -> &mut Self :)

@tarcieri
Copy link
Member

tarcieri commented Nov 3, 2020

The chain method takes self and returns Self rather than &mut self/-> &mut Self so finalize can consume self as well.

@oconnor663
Copy link
Author

Ah yeah, there's another difference. Digest::finalize and Digest::finalize_reset both assume that finalization might need to poison the state in some way, which makes sense for traits that need to abstract over a set of algorithms. In the case of BLAKE2, we know we can finalize efficiently without mutating the state, so blake2*_simd::State::finalize takes &self.

@tarcieri
Copy link
Member

tarcieri commented Nov 3, 2020

It's not just an efficiency problem: we're trying to catch errors like doing update, update, finalize, update using the type system, where the final update is lost.

Digest::finalize_reset takes &mut self

@oconnor663
Copy link
Author

Update after a few months: I'm still supportive of merging these crates, but I'm going to be busy with other things for the foreseeable future. If anyone else wants to tackle this, I'm happy to make suggestions, review PRs, and update README's as needed.

@tarcieri
Copy link
Member

tarcieri commented Feb 2, 2021

@oconnor663 great!

I probably won't have time to look myself in the short term, but I think it shouldn't be too difficult to make this transition in an almost purely additive manner, copying over your existing crates and then adding digest trait impls in a purely additive manner. If things do overlap in weird ways we can potentially address that.

If I do have some time to take a look at this, I'll make a mention on this issue before proceeding.

@tarcieri
Copy link
Member

tarcieri commented Feb 2, 2021

Question for both @oconnor663 and @newpavlov: should we preserve the original git history or not?

@oconnor663
Copy link
Author

After a merger, I'd certainly keep the original history around at https://github.com/oconnor663/blake2_simd, maybe as an "archived" repo or maybe just with some prominent comments and links about how development has moved. So I don't feel strongly about whether the same history is migrated over vs whether everything gets squashed.

@tarcieri
Copy link
Member

tarcieri commented Feb 2, 2021

Cool, in that case, it sounds like we could just do a single-commit import of the current sources.

@tarcieri
Copy link
Member

tarcieri commented Feb 4, 2021

I started poking at this locally.

The first consideration is that RustCrypto owns the blake2 and blake2s crates, but not blake2b. I think ideally we could just s/blake2b_simd/blake2b/ and s/blake2s_simd/blake2s/ but unfortunately we only own the latter so this isn't possible.

I have reached out to the current owner of the blake2b crate (which has all releases yanked) in order to see if we can get that name. If so, I think the merger will be much more straightforward.

Failing that, locally I am working on trying to shoehorn them both into a blake2 crate with blake2b and blake2s features. This seems less-than-ideal, especially given the current blake2b_simd and blake2s_simd implementations, but otherwise I'm not sure what else to do.

I'll wait a bit to hear back from the blake2b crate owner, and if it seems like we won't be able to get the crate, I'll go ahead and push up my WIP for discussion.

@tarcieri
Copy link
Member

tarcieri commented Feb 4, 2021

Having said all that, I went ahead and tried to do a quick-and-dirty integration: #228

Closing this out in favor of that PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants