This repo is inspired by a series of recent posts by François Fleuret on X. The goal is to implement the PScan algorithm in a simple yet efficient way
Let's consider a tensor
And let
Where
Let's denote a binary upper triangular matrix of size
Note that
Knowing that
Following the recursion, for every
Now lets denote
Or using torch
notation:
Y = X + torch.bmm(M, X)
Unfortunately, the product torch.bmm(M, X)
requires
First, note the first row of
M = torch.cat([torch.zeros(N, 1, D), Z], dim=1)
The matrix
Then
Z = Z_ * L_T
By def torch.log
in the complex space, then
Now let's express
or in PyTorch:
D[n, :, :] = torch.cat([U_t, torch.zeros(T - t, T)], dim=0)
Z_[n, t, :] = D[n, :, :] @ torch.log(A)[n, :]
It turns out this product can be computed efficiently using either FFT or CumSum
. Turns out we can extend matrix CumSum
. However, even though torch.bmm(M, X)
entirely.
First let's have a look at
or, in other words:
As we said earlier, the product of any vector on the matrix A_log.cumsum(dim=-1)
as CS
, then Z[n, T - 1, :] = CS[:, -1].unsqueeze(-1) + A_log - CS
. Now let's note the following fact, for any
Z[n, t - 2, :] = (Z[n, t - 1, :] - ln(A[n, t - 1])) * torch.cat([torch.ones(t), torch.zeros(T-t)])
So this leads to the conclusion that knowing torch.cat([torch.ones(t), torch.zeros(T-t)])
. Let's now rewrite this in a tensor form. Lets denote
where
or
Moreover, we know that we can compute the product of
We know that both
Y = X + (L_T * torch.exp(V.unsqueeze(1) - W.unsqueeze(2))) @ X
Let's have a closer look for any fixed
or
The final step is to use the following trick. For any vector
Or using the CumSum
operation:
CumSum
we can achieve different performances. In the next few sections, we explain different CumSum
algorithms, their tradeoffs, and their complexities.
See pytorch official docs for more details
Turns out we can extend matrix FFT.md
To run the code pytorch
and tqdm
are required. The recommended way is to use docker, the image can be built running:
docker build -it scan .
Here we compare cusum
implementation with the one proposed by François Fleuret. Benchmarking conducted with NVIDIA V100, see Dockerfile for the environment details, and benchmark.py.