SYCL accelerated BLAKE3 Hash Implementation
In recent times I've been exploring data parallel programming domain using SYCL, which is a heterogeneous accelerator programming API. Few weeks back I completed writing Zk-STARK friendly Rescue Prime Hash using SYCL, then I decided to take a look at BLAKE3, because blake3's algorithmic construction naturally lends itself for heavy parallelism. Compared to Rescue Prime Hash, BLAKE3 should be able to much better harness accelerator's compute capability when input size is relatively large ( say >= 1MB ).
SYCL -backed Rescue Prime implementation shines when there are lots of (short) indepedent inputs and multiple Rescue Prime Hashes can be executed independently on each of them, because Rescue Prime can be vectorized but doesn't provide with good scope of (multi-threaded/ OpenCL work-item based) parallelism inherently.
On the other hand SYCL implementation of BLAKE3 performs good when (single) input size is >= 1MB, then each 1KB chunk of input can be compressed parallelly --- very good fit for data parallel acceleration. After that BLAKE3 is simply Binary Merkle Tree construction, which itself is highly parallelizable, though multi-phase kernel enqueue required (increasing host-device interaction) due to hierarchical structure of Binary Merkle Tree, which results into data dependence.
In following implementation I heavily use SYCL2020's USM, which allows me to work with much familiar pointer arithmetics. I also use SYCL's vector intrinsics ( i.e. 4 -element array of type sycl::uint4
) for representing/ operating on hash state of BLAKE3. Another way to accelerate BLAKE3 (as proposed in specification) is compressing multiple chunks in parallel by distributing hash state of those participating chunks across 16 vectors, each with N -lanes, where N = # -of chunks being compressed together. N can generally be {2, 4, 8, 16}. I've implemented that scheme under namespace blake3::v2::*
, while simpler variant is placed under namespace blake3::v1::*
.
I've also written Binary Merklization implementation using BLAKE3 2-to-1 hash function, which takes N -many leaf nodes of some binary tree and produces all intermediate nodes. Note, here N = 2 ^ i | i = {1, 2, ...}
. For binary merklization, each BLAKE3 hash invocation takes 64 -bytes of input and produces 32 -bytes of output. Those 64 -bytes of input is nothing but two concatenated BLAKE3 digests.
I strongly suggest you go through (hyperlinked below) BLAKE3 specification's section 5.3 for understanding where I got this idea from.
I followed BLAKE3 specification and used Rust reference implementation as my guide while writing SYCL implementation.
Note, at this moment to keep Merkle Tree construction both easy and simple, this SYCL implementation can only generate BLAKE3 digest when input has power of 2 -many chunks, given each chunk of size 1KB. That means minimum input size should be 2KB, after that it can be increased as 4KB, 8KB ....
If input size is not >= 1MB, you probably don't want to use this implementation, because submitting job ( read enqueuing kernels ) to accelerator is not cheap and all those (required) ceremonies might defeat the whole purpose and essence of acceleration.
- Ensure you've Intel SYCL/ DPC++ compiler toolchain. See here for downloading precompiled binaries.
- If you happen to be interested in running on Nvidia GPU; you have to compile Intel's open-source llvm-based SYCL implementation from source; see here.
- For running test cases, which uses Rust Blake3 implementation for assertion, you'll need to have Rust
cargo
toolchain installed; get that here - I'm on
$ lsb_release -d
Description: Ubuntu 20.04.3 LTS
- Using Intel's SYCL/ DPC++ compiler version
$ dpcpp --version
Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2022.0.1/linux/bin-llvm
- For CUDA backend on Nvidia Tesla V100 GPU, I used Intel's
clang++
version
$ clang++ --version
clang version 14.0.0 (https://github.com/intel/llvm dc9bd3fafdeacd28528eb4b1fef3ad9b76ef3b92)
Target: x86_64-unknown-linux-gnu
Thread model: posix
- I'm on
rustc
version
$ rustc --version
rustc 1.59.0-nightly (efec54529 2021-12-04)
- You'll also need
make
utility for running test/ benchmark etc. - For formatting
C++
source consider usingclang-format
tool
make format
This is a header only library; so clone this repo and include blake3.hpp in your SYCL project.
// Find full example https://github.com/itzmeanjan/blake3/blob/1de036a/test/src/main.cpp
#include "blake3.hpp"
#include <iostream>
int main() {
sycl::device d{ sycl::default_selector{} }; // choose sycl device
sycl::queue q{ d }; // make sycl queue
// @note
// At this moment only power of 2 -many chunks are supported
// meaning input size will be `chunk_count * chunk_size` -bytes
//
// chunk_size = 1024 bytes
// chunk_count = 2^i, where i = {1, 2, ...}
// allocate input/ output memory
// fill input with data
// see https://github.com/itzmeanjan/blake3/blob/095e80f/test/src/main.cpp#L15-L37
// invoke hasher; last argument denotes execution doesn't need to be timed
blake3::v1::hash(q, in_d, i_size, chunk_count, wg_size, out_d, nullptr); // either
blake3::v2::hash(q, in_d, i_size, chunk_count, wg_size, out_d, nullptr); // or
// see https://github.com/itzmeanjan/blake3/blob/095e80f/test/src/main.cpp#L40-L43
// deallocate heap memory
return 0;
}
For Binary Merklization implementation consider including merklize.hpp into your SYCL project. You may want to see this for example.
For executing accompanying test cases run
BLAKE3_SIMD_LANES=2 make; make clean
BLAKE3_SIMD_LANES=4 make; make clean
BLAKE3_SIMD_LANES=8 make; make clean
BLAKE3_SIMD_LANES=16 make; make clean
which prepares random input of 1MB; then applies BLAKE3 using Rust implementation and both of my SYCL implementations of BLAKE3. Finally both of these 32 -bytes digests are asserted. It also asserts BLAKE3 2-to-1 hashing implementation which is used for Binary Merklization. ✅
Implementation | Comment |
---|---|
blake3::v1::hash(...) |
Each SYCL work-item compresses one and only one chunk |
blake3::v2::hash(...) |
Each SYCL work-item can compress either 2/ 4/ 8/ 16 contiguous chunks; selectable using BLAKE3_SIMD_LANES |
blake3::v1::merge(...) |
Takes 64 -bytes input ( two BLAKE3 digests ) and produces 32 -bytes output digest, it's called BLAKE3 2-to-1 hashing, which is used in Binary Merklization |
For running test cases inside Docker container (without installing any dependencies on your host, expect docker
itself) consider using Dockerfile provided with.
Build image
docker build -t blake3-test . # can be time consuming
Then run test cases inside container
docker run blake3-test
Following benchmark results denote what was
- kernel execution time
- time required to transfer input bytes to device
- time needed to transfer 32 -bytes digest back to host
when computing BLAKE3 hash ( v1 & v2 ) using SYCL implementation and input was of given size on first column. Input is generated on host; then explicitly transferred to accelerator because I'm using sycl::malloc_host
and sycl::malloc_device
for heap allocation; finally computed BLAKE3 digest ( i.e. 32 -bytes ) is transferred back to host. None of these data transfer costs are included in kernel execution time. For benchmarking purposes, I enable profiling in SYCL queue and sum of all differences between kernel enqueue event's start and end times are taken. I've also used a static SYCL work-group size of 32 for each of these executions rounds; total of 8 rounds are executed for each row before taking average of obtained kernel execution time/ host <-> device data transfer time.
Below I'm presenting benchmark results of Binary Merklization using BLAKE3 2-to-1 hashing. Four columns which are shown are as follows
Field | Description |
---|---|
leaf count | input binary tree's leaf count [ note, this is always power of 2 ] |
execution time | time spent executing all kernels which are enqueued for computing all intermediate nodes of specified binary tree with N -many leaf nodes |
host-to-device data tx cost | time required to transfer (leaf_count << 5) -bytes random input to accelerator [ because each leaf node is a BLAKE3 digest ] |
device-to-host data tx cost | time spent on transferring back all (leaf_count - 1) -many intermediate nodes back to host |
I prepare random input of (leaf_count << 5) -bytes on host, which is explicitly transferred to accelerator using SYCL USM API. As soon as input is ready to be operated on, binary merklization begins and computes all intermediate nodes of Merkle Tree in multiple rounds. At end, all these intermediate nodes are brought back to host. I've enabled SYCL queue profiling, which I make use of for timing all events, I get after enqueuing commands i.e. data transfer/ kernel execution etc..
Note, this Binary Merklization implementation only works with leaf count which is power of 2 value.
For all these benchmarking, I'm using static SYCL work-group size 32. [ changing it to runtime decision should be explored ! ]