- Target hardware: GraySkull e75
- Install software stack for TT-Metalium: https://github.com/tenstorrent/tt-metal/blob/main/INSTALLING.md
- Follow step 1 to 4.
$ cd tenstorrent-tiny-examples
$ git submodule update --init --recursive
$ export ARCH_NAME=grayskull
$ export TT_METAL_HOME=$PWD/external/tt-metal
$ mkdir build
$ cd build
$ cmake -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_C_COMPILER=clang-17 -GNinja ..
$ ninja install
$ cd conv-example-tt-grayskull
$ export ARCH_NAME=grayskull
$ export TT_METAL_HOME=$PWD/external/tt-metal
$ ./bin/tiny_tt_examples
$ find ./src/ -name "*.h" -or -name "*.cpp" -exec clang-format -style=file -i {} \;
IEEE Standard for Floating-Point Arithmetic (IEEE 754) specifies that a 32-bits float consists of:
- 1 bit for sign.
- 8 bits for exponent.
- 23 bits for fraction (mantissa).
bfloat16
consists of:
- 1 bit for sign.
- 8 bits for exponent.
- 7 bits for fraction (mantissa).
tt-metalium keeps bfloat16
in a unsigned 16-bits integer. We can convert a float to bfloat16
by right-shifting it (simply dropping the lower 16-bits of the float value). We can convert
bfloat16
to a float by left-shifting it (adding lower 16-bits of zeros).
When we run a matrix multiplication A * B
where A
and B
are matrices, we have to run
dot-projects of a row of A
with all columns of B
. If we naively implement it, our program
can access each row of A
as many as the number of columns of B
. Ideally, we want to keep
a row of A
in registers (or fast data storage like L1 cache), but we will experience the
shortage of the fast data storage (register spilling issue).
"Tiling" is a technique to access a sub-matrix (we call it "tile") of A
and a sub-matrix
of B
for the multiplication. In that way, we can keep all elements of the tiles into the
limited fast data storage, and we can also conduct all multiplications between elements
on the tiles. By sliding tiles, we can continue.
When we have multiple layers of cache like GPU, we can utilize more storage classes depending on thread groups. See this blog post from Nvidia for details.
Other references are:
A Tensix core on a Tenstorrent device has a matrix engine. It supports the matrix multiplication whose dimension is 32 by 32. Tenstorrent named this unit matrix "tile". I guess the term is from the "tiling" of matrix multiplication.
Note that Tenstorrent documents call a group of tiles "block" when the "block" is the size of "tile" in the above tiling technique.
The tilization means reorder the elements of an array or vector like std::vector
to provide the data in the order of tiles to Tensix cores.
For example, for a simpler explanation, let me assume a dimension of a tile is
4 by 4, and we want to tilize a matrix A
whose values are
1 1 1 1 5 5 5 5
2 2 2 2 6 6 6 6
3 3 3 3 7 7 7 7
4 4 4 4 8 8 8 8
that is kept by a std::vector<int>
like {1,1,1,1,5,5,5,5,2,2,2,2,6,6,6,6,...}
.
We have two tiles
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
and
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
so after tilizing it must be {1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,...}
.
The only difference of the tilization on Tenstorrent device from the above example is the dimension of the tile matrix. As explained above, it is 32 by 32.
The untilization is the opposite conversion.
src/4_single_tile_matmul/single_tile_matmul.cpp
implements a matrix
multiplication between two tiles. It uses a single Tensix core {0, 0}
.
src/4_single_tile_matmul/kernels/single_tile_matmul_reader.cpp
reads two input matrices from device DRAM and passes them to CB::c_in0
and
CB::c_in1
buffers that are on L1 cache.
On its compute kernel src/4_single_tile_matmul/kernels/single_tile_matmul.cpp
,
it uses matmul_tiles(..)
that conducts DST = A * B
like
DST = tile on CB::c_in0 * tile on CB::c_in1
.
Note that it uses DST
, so we have to acquire_dst(..)
and release_dst(..)
.
When I asked why we need to "acquire" DST
register, Tenstorrent folks answered
that we have 3 RISC-V processors on a single Tensix core, and they can cause a
data race (access to DST
in a random order).
Also note that I guess matmul_tiles(..)
conducts DST += A * B
from this
example. It is a compute kernel to multiply
two large matrices with multiple tiles on each row/column. In that case, for each
output tile, we have to slide all tiles on the corresponding row of A
and all
tiles on the corresponding column of B
, and we have to add all the mat-mul
results between their tiles. In the kernel, we do not have "add" operation, but
it works. That's why I concluded that matmul_tiles(..)
conducts DST += A * B
instead of DST = A * B
.
src/multicast_matmul.cpp
is now working:
- Create two matrices that have dimensions
number of cores * tile height (32)
bytile width
andtile width
bynumber of cores * tile height (32)
. - Each i-th core reads i-th row tile of the first matrix (
A
) and i-th column of the second matrix (B
). - The reader kernel iterates
i
from[0, number of cores)
. Wheni
is the core itself (I gave core id to all of them from[0, number of cores)
), it becomes "sender" of the multicast. Otherwise, it becomes "receiver".- "sender" sends the tile it reads for the i-th column of
B
to all other cores. - Each compute kernel runs the mat-mul for two tiles for the i-th row of
A
and the i-th column ofB
.
- "sender" sends the tile it reads for the i-th column of
A compute kernel looks like a single kernel, but tt::tt_metal::CreateKernel(..)
actually generates three kernels: UNPACK, MATH, PACK kernels.
You can check it with the following steps:
- For a simple TT program, intentionally add a syntax error to the compute kernel.
- Running the program will fail with compile errors like:
cd /path/to/conv-example-tt-grayskull/external/tt-metal//built/2052/kernels/simple_multicast/1584599061800683236/trisc2/ && /path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++ -mgrayskull -march=rv32iy -mtune=rvtt-b1 -mabi=ilp32 -std=c++17 -flto -ffast-math -fno-use-cxa-atexit -fno-exceptions -Wall -Werror -Wno-unknown-pragmas -Wno-error=multistatement-macros -Wno-error=parentheses -Wno-error=unused-but-set-variable -Wno-unused-variable -Wno-unused-function -O3 -DARCH_GRAYSKULL -DTENSIX_FIRMWARE -DLOCAL_MEM_EN=0 -DDEBUG_PRINT_ENABLED -DUCK_CHLKC_PACK -DNAMESPACE=chlkc_pack -DCOMPILE_FOR_TRISC=2 -DKERNEL_BUILD -I. -I.. -I/path/to/conv-example-tt-grayskull/external/tt-metal// -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/include -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/inc -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/inc/debug -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/inc/grayskull -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/inc/grayskull/grayskull_defines -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/inc/grayskull/noc -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/umd/device/grayskull -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/ckernels/grayskull/metal/common -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/ckernels/grayskull/metal/llk_io -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/tt_llk_grayskull/common/inc -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/tt_llk_grayskull/llk_lib -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/ckernels/grayskull/inc -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/ckernels/grayskull/metal/common -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/ckernels/grayskull/metal/llk_io -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/ckernels/grayskull/metal/llk_api -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/ckernels/grayskull/metal/llk_api/llk_sfpu -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/sfpi/include -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/firmware/src -I/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/tt_llk_grayskull/llk_lib -c -o trisck.o /path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/hw/firmware/src/trisck.cc
- If you check the directory e.g.,
/path/to/conv-example-tt-grayskull/external/tt-metal//built/2052/kernels/simple_multicast/1584599061800683236/trisc2/
you will see the build artifact. For example,trisc2/trisc2.elf
is an ELF binary for RISC-V. We can run/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-objdump
and/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-readelf
to investigate the binary. - We can use
-E
option for the above/path/to/conv-example-tt-grayskull/external/tt-metal//tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++
command to get the preprocess result.- Unpack kernel option:
-DUCK_CHLKC_UNPACK -DNAMESPACE=chlkc_unpack
- Math kernel option:
-DUCK_CHLKC_MATH -DNAMESPACE=chlkc_math
- Pack kernel option:
-DUCK_CHLKC_PACK -DNAMESPACE=chlkc_pack
- Unpack kernel option:
Interestingly, for the following kernel (simple_multicast kernel):
static inline SliceRange hw_all() {
return SliceRange{.h0 = 0, .h1 = 32, .hs = 1, .w0 = 0, .w1 = 32, .ws = 1};
}
namespace NAMESPACE {
void MAIN {
acquire_dst(tt::DstMode::Tile);
cb_wait_front(tt::CB::c_in0, /* number of tiles */ 1);
copy_tile_to_dst_init_short();
copy_tile(tt::CB::c_in0, 0, /* DST */ 0);
#if TINY_DEBUG
DPRINT_UNPACK(DPRINT << TSLICE(tt::CB::c_in0, 0, hw_all()) << ENDL());
#endif
cb_pop_front(tt::CB::c_in0, /* number of tiles */ 1);
// PACK(( llk_pack_hw_configure_disaggregated<false,
// DST_ACCUM_MODE>(tt::CB::c_out0) ));
PACK((llk_pack_init(tt::CB::c_out0)));
// PACK(( llk_setup_outputs() ));
PACK((llk_pack_dest_init<false, DST_ACCUM_MODE>(tt::CB::c_out0)));
LOG(DPRINT << "[COMPUTE] pack tile" << ENDL());
cb_reserve_back(tt::CB::c_out0, /* number of tiles */ 1);
pack_tile(/* DST */ 0, tt::CB::c_out0);
cb_push_back(tt::CB::c_out0, /* number of tiles */ 1);
#if TINY_DEBUG
DPRINT_PACK(DPRINT << TSLICE(tt::CB::c_out0, 0, hw_all()) << ENDL());
#endif
release_dst(tt::DstMode::Tile);
LOG(DPRINT << "[COMPUTE] done" << ENDL());
}
} // namespace NAMESPACE
The PACK kernel uses only the packing part of APIs. For example,
cb_wait_front(..)
is not related to packing. The preprocessed
cb_wait_front(..)
for PACK kernel is empty:
inline __attribute__((always_inline)) void cb_wait_front(uint32_t cbid, uint32_t ntiles) {
;
}
On the other hand, the UNPACK kernel has the following
cb_wait_front(..)
:
inline __attribute__((always_inline)) void cb_wait_front(uint32_t cbid, uint32_t ntiles) {
( llk_wait_tiles(cbid, ntiles) );
}
To implement this different preprocessed results, tt-metal uses
PACK(..), MATH(..), UNPACK(..)
macro.
Another interesting part is acquire_dst(tt::DstMode mode)
.
The UNPACK kernel has an empty one:
inline __attribute__((always_inline)) void acquire_dst(tt::DstMode mode) {
;
;
}
The MATH kernel waits for DEST available:
inline __attribute__((always_inline)) void acquire_dst(tt::DstMode mode) {
( llk_math_wait_for_dest_available() );
;
}
The UNPACK kernel waits for the end of MATH kernel:
inline __attribute__((always_inline)) void acquire_dst(tt::DstMode mode) {
;
( llk_packer_wait_for_math_done() );
}
Its implementation matches the preprocessed code:
ALWI void acquire_dst(tt::DstMode mode) {
MATH(( llk_math_wait_for_dest_available() ));
PACK(( llk_packer_wait_for_math_done() ));
}
Based on the implementation of acquire_dst(..)
, if we use it,
we can guess it executes UNPACK, MATH, PACK in order.
- At the moment I am writing this document, I am working for Google. Based on Google's open-source policy, I open-sourced this project with Apache 2.0 license, but this is my personal side project and this project is not related to any project I am working on at Google.