Matrix multiplication of square bf16 matrices, accumulated in fp32.
N=4096: Kernel 763 TFLOPs, cuBLAS 716 TFLOPs
N=8192: Kernel 808 TFLOPs, cuBLAS 795 TFLOPs
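For context, a cuBLAS baseline for this problem can be invoked roughly as below with cublasGemmEx. The layout, output type (fp32 here), and handle setup are assumptions for the sketch; the actual baseline used by the benchmark lives in matmul.cu and may differ.

```cuda
#include <cublas_v2.h>
#include <cuda_bf16.h>

// Hedged sketch of a cuBLAS bf16 GEMM with fp32 compute for square N x N
// matrices. Column-major layout and an fp32 output C are assumptions.
void cublas_baseline(cublasHandle_t handle,
                     const __nv_bfloat16* A, const __nv_bfloat16* B,
                     float* C, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 N, N, N,
                 &alpha,
                 A, CUDA_R_16BF, N,
                 B, CUDA_R_16BF, N,
                 &beta,
                 C, CUDA_R_32F, N,
                 CUBLAS_COMPUTE_32F,       // accumulate in fp32
                 CUBLAS_GEMM_DEFAULT);
}
```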
A full explanation is in the worklog at https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog
Build and run: make matmul && out/matmul
Example kernels are in examples/matmul/ and the orchestration code is in matmul.cu.
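For reference, the operation being benchmarked reduces to the naive kernel below: bf16 inputs promoted to fp32 and accumulated in fp32. Row-major layout and an fp32 output are assumptions of this sketch; it defines the math, not the optimized kernels in examples/matmul/.

```cuda
#include <cuda_bf16.h>

// Naive reference: C = A * B for square N x N matrices.
// bf16 inputs are converted to fp32 and accumulated in fp32.
__global__ void matmul_reference(const __nv_bfloat16* A,
                                 const __nv_bfloat16* B,
                                 float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;  // fp32 accumulator
    for (int k = 0; k < N; ++k) {
        acc += __bfloat162float(A[row * N + k]) * __bfloat162float(B[k * N + col]);
    }
    C[row * N + col] = acc;
}
```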
We compute the sum of 2^30 elements.
Build and run: make sum && out/sum
Kernel 3240.11 GB/s, CUB library 3193 GB/s
Example kernels are in sum.cu.
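For reference, the CUB baseline amounts to a cub::DeviceReduce::Sum call. The element type (fp32) and pointer names below are assumptions for the sketch; the actual benchmark setup is in sum.cu.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hedged sketch of the CUB baseline: sum n device-resident floats.
float cub_sum(const float* d_in, int n) {
    float* d_out = nullptr;
    cudaMalloc((void**)&d_out, sizeof(float));

    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    // First call only queries the required temporary-storage size.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    // Second call performs the reduction.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_out);
    return result;
}
```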