Add SHA256TREE: a version of SHA-256 that supports chunking #235
Conversation
Hey @EricBurnett @bergsieker @mostynb @sstriker @moroten,
This PR implements the first part. I will update PR #233 to provide the second part. It would be nice if this change could land in the near future, as that will already allow me to release changes on both the Bazel and Buildbarn side to add the new hashing algorithm. Any further changes to do actual chunking can then be limited to just Buildbarn components.
I have managed to implement this on the Buildbarn side. I will publish these changes as soon as this proposal gets accepted/merged. While working on that, I also did some benchmarking. Notice how for objects of 8 KiB or smaller, BLAKE3CONCAT is about as fast as SHA256 and SHA512. Above 8 KiB it tends to become a lot faster. This is because BLAKE3CONCAT is able to use AVX2 to hash up to eight 1 KiB chunks in parallel. On my laptop, I manage to compute BLAKE3CONCAT hashes at up to 3.7 GB/s, whereas SHA256 maxes out at 500 MB/s. SHA384 and SHA512 obviously perform equally, as they use the same algorithm; SHA384 is just truncated. These measurements are all without using any multi-threading, which would give BLAKE3CONCAT an even bigger advantage.
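For anyone who wants to reproduce rough single-stream throughput numbers like the ones above, here is a minimal Go benchmark sketch. It only measures the standard library's SHA-256; the object sizes are illustrative and the BLAKE3CONCAT side is omitted, so this is not the exact harness behind the graphs.

```go
package hashbench

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// BenchmarkSHA256 measures single-stream SHA-256 throughput for a range of
// object sizes, similar in spirit to the comparison described above. The
// chosen sizes are illustrative; swap in a BLAKE3/SHA256TREE implementation
// to compare against other digest functions.
func BenchmarkSHA256(b *testing.B) {
	for _, size := range []int{1 << 10, 8 << 10, 64 << 10, 1 << 20} {
		buf := make([]byte, size)
		b.Run(fmt.Sprintf("%dB", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				sha256.Sum256(buf)
			}
		})
	}
}
```

Running it with `go test -bench=.` reports MB/s per object size, which makes it easy to see where a given digest function starts to pull ahead.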
While doing test builds against a Buildbarn cluster supporting BLAKE3CONCAT, also using an updated version of bb_clientd, I noticed that having colon characters in pathnames isn't always safe. For example,
So I've been doing some more benchmarking over the last couple of days. On my MacBookPro16,1 with a 2.3 GHz 8-core Intel Core i9, I see that BLAKE3CONCAT (blue) provides acceptable performance compared to SHA-256 (green). On ARM64, it's a different story. Implementing BLAKE3CONCAT in NEON and running it on a Mac Studio (model Mac13,2), I see that BLAKE3CONCAT (blue) is significantly slower than SHA-256 (green); about 10 times as slow for <8 KB objects. The reason is that the M1 CPU in the Mac Studio implements the ARMv8 cryptographic extensions, meaning that native CPU instructions are available for computing SHA-256 hashes. My assumption is that we will see similar performance characteristics on x86 systems that implement the Intel SHA extensions. So my takeaway is that BLAKE3 is optimized towards large objects, making it a bad fit for our use case: the CAS tends to contain lots of small objects as well. Because of that, I have decided to update my proposal to add a hashing algorithm that's essentially a mixture of SHA-256 and BLAKE3, named SHA256TREE. It uses SHA-256's compression function in combination with BLAKE3's tree structure. This means that we can make use of native hardware instructions where available, while still achieving a significant speedup when only generic SIMD features (such as AVX2) are available. In the worst case, SHA256TREE is (16+1+1)/16 = +12.5% slower than plain SHA-256. It is shown in orange in the graphs above.
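To make the idea concrete, below is a minimal Go sketch of a SHA-256-based tree hash in the spirit of what is described above. It is not the actual SHA256TREE specification: the chunk size, the split rule, and the parent-node encoding are assumptions for illustration, and a real tree hash also needs domain separation between leaf and parent nodes, which is omitted here.

```go
package sha256tree

import "crypto/sha256"

// chunkSize is an assumption for illustration; the real SHA256TREE
// specification fixes its own chunk size and node encodings.
const chunkSize = 1024

// hashTree hashes data as a binary tree: leaves are SHA-256 over
// fixed-size chunks, and each parent is SHA-256 over the concatenation of
// its two children's digests. This mirrors BLAKE3's tree shape while
// reusing SHA-256, so hardware SHA instructions can still be used.
func hashTree(data []byte) [sha256.Size]byte {
	if len(data) <= chunkSize {
		return sha256.Sum256(data)
	}
	// Put the largest possible complete subtree (a power-of-two number of
	// chunks) on the left, as BLAKE3 does.
	split := chunkSize
	for split*2 < len(data) {
		split *= 2
	}
	left := hashTree(data[:split])
	right := hashTree(data[split:])
	return sha256.Sum256(append(left[:], right[:]...))
}
```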
As discussed during Tuesday's meeting, the preference was to not alter the shape of
For the Bytestream methods, the plan was to still use the scheme where we prefix the hashes with
The reason we used the short
@juergbi @mostynb @EricBurnett Please let me know whether this approach is acceptable for you. If not, I will fall back to using
Google's internal build backend uses a content hash function called "PSHA2", which is a tree derivative of SHA-256. Inputs < 1 KB are hashed with plain SHA-256. Inputs >= 1 KB are striped 4 bytes at a time over 16 lanes, each lane is hashed with SHA-256, and then the lane hashes are combined. Inputs > 2 MB are split into 2 MB blocks, the blocks are hashed with SHA256x16, and then the block hashes are combined. A vectorized implementation of the SHA256x16 level is available at https://github.com/intel/isa-l_crypto/tree/master/mh_sha256. One peculiarity of the PSHA2 format is that the input length appears in the digest, and the whole-file hash for an input > 2 MB is identical to the summary hash except for this length field. (The idea was to allow CAS operations to use the input to the summary hash as a manifest of chunks, and trivially derive the name of the chunk manifest from the name of the whole file, but we haven't taken much advantage of this yet.) This means it isn't a general-purpose cryptographic hash, as there are applications where you'd want to conceal the input length. If you're contemplating something like SHA256TREE, perhaps we should publish the spec and the code for PSHA2, since it's already part of the internal branch of Bazel.
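To make the striping concrete, here is a rough Go sketch of the >= 1 KB case as described above (4-byte stripes over 16 round-robin lanes, each lane hashed with plain SHA-256). How PSHA2 actually folds the lane digests and the input length into the final digest is not spelled out here, so the combining step below is an assumption, not the published format.

```go
package psha2sketch

import (
	"crypto/sha256"
	"encoding/binary"
	"hash"
)

const (
	lanes       = 16
	stripeWidth = 4 // bytes written to a lane before moving to the next lane
)

// laneHashes stripes the input 4 bytes at a time over 16 lanes and hashes
// each lane with plain SHA-256, roughly as described for PSHA2 above.
func laneHashes(data []byte) [][]byte {
	states := make([]hash.Hash, lanes)
	for i := range states {
		states[i] = sha256.New()
	}
	for off, lane := 0, 0; off < len(data); off, lane = off+stripeWidth, (lane+1)%lanes {
		end := off + stripeWidth
		if end > len(data) {
			end = len(data)
		}
		states[lane].Write(data[off:end])
	}
	sums := make([][]byte, lanes)
	for i, s := range states {
		sums[i] = s.Sum(nil)
	}
	return sums
}

// combine is an assumption for illustration: the real PSHA2 format encodes
// the input length in the digest and defines its own way of folding the
// lane digests. Here we simply hash the concatenated lane digests plus the
// length.
func combine(laneSums [][]byte, inputLen uint64) [sha256.Size]byte {
	h := sha256.New()
	for _, s := range laneSums {
		h.Write(s)
	}
	var lenBuf [8]byte
	binary.LittleEndian.PutUint64(lenBuf[:], inputLen)
	h.Write(lenBuf[:])
	var out [sha256.Size]byte
	copy(out[:], h.Sum(nil))
	return out
}
```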
Intel's own library switches from the dedicated SHA-256 instructions to a vectorized implementation at 6 lanes. My understanding is that the SHA-256 helper instructions on the M1 have twice the throughput of SHA256RNDS2, so the gap with a generic vectorized approach is larger.
Can you use a cheaper variant for very small inputs? (That's what we did in PSHA2; the constant-factor overhead of SHA256x16 isn't worth it for messages where every lane has only one block.) I think BLAKE2bp or BLAKE3 are also reasonable choices here. Your BLAKE3CONCAT is relatively slow on the M1 because it only has a 128-bit vector unit. In Google's builds, most of the content hashing is performed on servers with AVX2 or AVX-512, and clients hash primarily small messages (source files and remote action descriptions). I would guess that server CPUs will continue to have relatively wide vector units, so I'm not sure you should expect accelerated single-stream SHA-256 to catch up. One virtue of single-stream SHA-256 in this context is that you might get a fast implementation without JNI.
Hey a6f, First of all, thanks for your insightful feedback. I find it interesting how we are all running into these issues, but have had to resort to coming up with in-house solutions. AWS Glacier also has a tree hash that is based on SHA-256. Let me respond to your message in random order.
I think this is fine for the remote execution protocol itself, as CAS objects are identified by their hash and size. Unfortunately, I have seen a fair share of tooling around REv2 that assumes that the hash itself is good enough to act as a key. Maybe we ought to protect users of the protocol from that, though I don't care strongly.
Interesting. Could you tell me more about how the 2 MB block hashes are combined? Is it done like VSO PagedHash, where the block hashes are combined sequentially? Or is some tree structure used here? For SHA256TREE I decided to use a tree-based solution, as I don't really want to commit to a certain block size up front. What constitutes a good block size tends to depend on latency, storage media, etc.
Great! I found some other sources online, but wasn't aware that Intel also published an implementation.
So there's no denying that BLAKE2bp, BLAKE3, etc. can be made to run fast on server hardware that has AVX2/AVX-512. However, for us it's also pretty important to have a hashing function that works well at desks and on workers. As we obviously use a lot of Mac hardware for that, we need something that performs well both on x86 and ARM64. I think that something based on SHA-256 is the only serious option there. With regards to single-stream vs. striped, I fully understand that the 16-way striping that PSHA2 performs is a good fit for AVX-512. My only minor concern is that this is likely a pessimisation on ARM64, where the striping needs to be undone before the data can be fed to the SHA-256 hardware instructions.
Exactly. And I think we also need to keep in mind how algorithms like these can be implemented in languages like Python, JavaScript (Node.js), etc. Those languages have good intrinsics for computing plain SHA-256, but anything else becomes problematic. Even applying/undoing striping might be expensive there.
I'm at this point seriously contemplating it. The nice thing about PSHA2 is that there's already code for Bazel readily available. Being able to reuse that would be nice. My only concerns/unknowns about PSHA2 at this point are:
If it wouldn't be too much of a hassle, could you publish a spec on PSHA2, including some test vectors? I'd be happy to take a look. It doesn't need to be 100% polished; just good enough for me to reproduce. Thanks!
Apologies for the delay. Here is the PSHA2 spec [Edit: corrected link] that a6f references (and led the creation of), as discussed in the last monthly. I also personally lean towards favouring SHA256 over adopting something from the Blake family, for reasons of it being already in wide use, getting hardware implementations, and having been strongly validated by the security community. To that end, PSHA2 could be suitable for use as-is if it meets your needs, probably with a differently-defined framing for use in the REAPI - e.g. I think we could define a PSHA2_RAW encoding that's the CHUNK_HASH value directly without the framing (the
Hey! The spec linked above doesn't seem to be public yet. I just requested access for myself, but maybe it should be opened up altogether.
Apologies, please check again - I accidentally had the wrong link originally but it should be openable now. For convenience, here's the link again.
Hi @a6f and @EricBurnett, Thanks for sending me a copy of the PSHA2 specification. Really appreciated! Even though I think the design of PSHA2 makes a lot of sense, it looks like it lacks some of the flexibility that I am looking for. To clarify, I would like to use some hashing scheme that allows decomposition at many different levels of 2^k, and gives both the client and server the freedom to pick the 2^k that works best for them. For example, for a centralised CAS it is likely sufficient to decompose large objects into ~2 MB chunks. But on a client that offers access to the CAS through a virtual file system (FUSE/NFSv4), it would be nice if objects on the client could get decomposed all the way down to the page/block device sector size, just like how modern file systems such as ZFS/Btrfs provide data integrity and deduplication. With PSHA2, we would only be capable of decomposing objects into chunks that are either 2 MiB, 128 GiB or 4 PB in size, which is far too coarse for what I'm trying to achieve. With that in mind, I would like to go ahead with this PR as is, namely to add the SHA256TREE digest function. If Google is at some point interested in adding support for PSHA2 to the Remote Execution protocol, I would be in favour, but it would not be in scope for this PR. That said, once both #235 and #236 land, it should be easier to add new algorithms going forward.
Buildbarn has invested heavily in using virtual file systems. Both on the worker and client side it's possible to lazily fault in data from the CAS. As Buildbarn implements checksum verification where needed, randomly accessing large files may be slow. To address this, this change adds support for composing and decomposing CAS objects, using newly added ConcatenateBlobs() and SplitBlobs() operations. If implemented naively (e.g., using SHA-256), these operations would not be verifiable. To rephrase: when merely given the checksums of smaller objects, there is no way to obtain that of their concatenation. This is why we suggest that these operations only be used in combination with SHA256TREE (see bazelbuild#235). With these new operations present, there is no true need to use the Bytestream protocol any longer. Writes can be performed by uploading smaller parts through BatchUpdateBlobs(), followed by calling ConcatenateBlobs(). Conversely, reads of large objects can be performed by calling SplitBlobs() and downloading individual parts through BatchReadBlobs(). For compatibility, we still permit the Bytestream protocol to be used. This is a decision we can revisit in REv3.
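To sketch why a tree digest makes operations like SplitBlobs()/ConcatenateBlobs() verifiable: if a composed object's digest is defined in terms of its children's digests, a client can check a split or a concatenation without rehashing the underlying data. The parent rule below reuses the illustrative convention from the earlier hashTree() sketch and is an assumption, not the actual SHA256TREE parent-node encoding.

```go
package sha256tree

import "crypto/sha256"

// verifySplit checks that two child digests combine to the expected parent
// digest, using the same illustrative parent rule as the hashTree() sketch
// above: SHA-256 over the concatenation of the child digests. With a plain
// flat SHA-256 digest, no such check is possible without redownloading and
// rehashing all of the data.
func verifySplit(parent, left, right [sha256.Size]byte) bool {
	combined := sha256.Sum256(append(left[:], right[:]...))
	return combined == parent
}
```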
I deliberated quite a bit about the block size for PSHA2. I ended up picking 2MB for a few reasons:
I also debated whether to use 4-byte or 64-byte lanes. It cost about 4% more to transpose the input (when using AVX2), and I expected that only slow fallback implementations would prefer 64-byte lanes. This was decided in 2018; if I had the M1 as a target, I might have chosen differently.
In PR bazelbuild#233 I proposed the addition of two new ContentAddressableStorage methods (ConcatenateBlobs and SplitBlobs) that allow one to gain random access to large CAS objects, while still providing a way to verify data integrity. As part of that change, I added a new digest function to help with that, named SHA256TREE. This PR adds just this digest function, without bringing in any support for chunking. This will be done separately, as it was requested that both these features land independently. I have also included test vectors for the SHA256TREE digest function. I have derived these by implementing three different versions in the Go programming language:
- One version that uses regular arithmetic in Go.
- One version for x86-64 that uses AVX2.
- One version for ARM64 that uses the ARMv8 cryptography extensions.
All three versions behave identically.
Looking forward to this PR as well -- both for the SHA256TREE support and for the digest_function changes. Does remote_asset.proto need changes as well to support disambiguating the digest function?
👍
That sounds like a good idea. It looks like the protocol is currently oblivious to digest functions. It also offers no way for clients/servers to negotiate a digest function (read: there is no equivalent of GetCapabilities()). How is a digest function selected right now? Is there some implicit assumption that the CAS only supports exactly one digest function, and that the Remote Asset API uses that one unconditionally? Unfortunately, I don't use this protocol myself, so I'm slightly out of the loop here. Maybe it's best if this PR is merged as is, and people interested in using the Remote Asset API have a wider discussion on this topic? I suppose the best solution would be if the Remote Asset API gained its own GetCapabilities() method, and clients interested in using both the CAS and a Remote Asset service intersected the lists of supported digest functions to pick the one they want.
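As a small sketch of that negotiation idea, assuming the Remote Asset API were to expose its own list of supported digest functions, a client could intersect the two capability lists and pick its preferred common entry. The function below uses plain strings as placeholders for the DigestFunction enum values; it is not based on any existing field in the protos.

```go
package negotiation

// intersectDigestFunctions returns the digest functions supported by both
// the CAS and a hypothetical Remote Asset capabilities response, preserving
// the client's preference order (the order of the first argument).
func intersectDigestFunctions(preferred, assetSupported []string) []string {
	supported := make(map[string]bool, len(assetSupported))
	for _, f := range assetSupported {
		supported[f] = true
	}
	var common []string
	for _, f := range preferred {
		if supported[f] {
			common = append(common, f)
		}
	}
	return common
}
```

A client would then pick the first element of the result, e.g. preferring SHA256TREE over SHA256 when both services support it.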
I'm happy to take a look at the remote_asset API once this PR is merged.
In the following PR we extended most of the REv2 *Request messages to take an explicit digest function: bazelbuild/remote-apis#235 This eliminates the ambiguity between digest functions that have the same hash length (e.g., MD5 and MURMUR3, SHA256 and SHA256TREE). This change extends the REv2 client in Bazel to set the digest_function field explicitly, so that the server does not need to use any heuristics to pick a digest function when processing requests. As we are now going to call DigestUtil.getDigestFunction() far more frequently, alter DigestUtil to precompute the value upon construction. This change was tested by letting Bazel forward all traffic through an instance of bb_clientd that was patched up to require the use of `digest_function`. Closes #17772. PiperOrigin-RevId: 520074622 Change-Id: Ie92dc368c502b9bc3fddef56a5eb0a238760f673
In PR #233 I proposed the addition of two new ContentAddressableStorage methods (ConcatenateBlobs and SplitBlobs) that allow one to gain random access to large CAS objects, while still providing a way to verify data integrity. As part of that change, I added a new digest function to help with that, named ~~BLAKE3CONCAT~~ SHA256TREE.

This PR adds just this digest function, without bringing in any support for chunking. This will be done separately, as it was requested that both these features land independently.

I have also included test vectors for the ~~BLAKE3CONCAT~~ SHA256TREE digest function. I have derived these by modifying the BLAKE3 reference implementation written in Rust, and rerunning the tool that emits the official test vectors: https://github.com/BLAKE3-team/BLAKE3/blob/master/test_vectors/test_vectors.json

Furthermore, I have been able to validate the newly obtained test vectors using a custom ~~BLAKE3CONCAT~~ SHA256TREE implementation that I have written in Go, which will become part of Buildbarn.