
parallelize part of logits processing #5

Open
wants to merge 2 commits into base: opt

Conversation

@kroggen kroggen commented Aug 3, 2023

Apply the temperature and softmax to the logits using the GPU

The argmax and sample functions were not changed

This gives a drastic speed improvement of ~76% when the temperature is not 0

Tested using the stories110M.bin model on an RTX 3090 with PCIe 4.0 at 24 GB/s
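
For reference, a minimal single-block sketch of this step (the kernel name, launch shape, and the serialized max pass are illustrative assumptions, not the exact code in this PR):

__global__ void softmax_logits_kernel(float* logits, int n, float temperature) {
    __shared__ float max_val, sum;
    int tid = threadIdx.x;

    // 1. apply the temperature in parallel
    for (int i = tid; i < n; i += blockDim.x)
        logits[i] /= temperature;
    __syncthreads();

    // 2. find the max for numerical stability (serial here for brevity;
    //    a shared-memory reduction would be faster)
    if (tid == 0) {
        max_val = logits[0];
        for (int i = 1; i < n; i++)
            max_val = fmaxf(max_val, logits[i]);
        sum = 0.0f;
    }
    __syncthreads();

    // 3. exponentiate in parallel, accumulating the normalizer
    for (int i = tid; i < n; i += blockDim.x) {
        logits[i] = expf(logits[i] - max_val);
        atomicAdd(&sum, logits[i]);
    }
    __syncthreads();

    // 4. normalize
    for (int i = tid; i < n; i += blockDim.x)
        logits[i] /= sum;
}

// assumed launch: softmax_logits_kernel<<<1, 256>>>(logits_gpu, vocab_size, temperature);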

@kroggen kroggen (Author) commented Aug 3, 2023

Note: the above performance increase is with all the other PRs merged into a single file.

@kroggen kroggen (Author) commented Aug 3, 2023

Now I have also implemented the argmax on the GPU, so it also increases performance when using temperature = 0
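
For illustration, the usual way to do an argmax in a single block is a shared-memory tree reduction; the sketch below uses assumed names and a fixed 256-thread launch, and is not necessarily the PR's actual kernel:

__global__ void argmax_kernel(const float* __restrict__ v, int n, int* result) {
    __shared__ float best_val[256];   // must match blockDim.x
    __shared__ int   best_pos[256];
    int tid = threadIdx.x;

    // each thread scans a strided slice of the logits
    float val = -INFINITY;
    int pos = 0;
    for (int i = tid; i < n; i += blockDim.x)
        if (v[i] > val) { val = v[i]; pos = i; }
    best_val[tid] = val;
    best_pos[tid] = pos;
    __syncthreads();

    // tree reduction over the per-thread candidates
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && best_val[tid + s] > best_val[tid]) {
            best_val[tid] = best_val[tid + s];
            best_pos[tid] = best_pos[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) *result = best_pos[0];
}

// assumed launch: argmax_kernel<<<1, 256>>>(logits_gpu, vocab_size, max_pos_gpu);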

@ankan-ban ankan-ban (Owner) commented
I knew logits processing on the CPU was a bottleneck and was planning to address it; so far I have been benchmarking with temp = 0. Thank you for the change. I will hopefully spend some time testing/understanding it and merge it soon.

@kroggen kroggen (Author) commented Aug 4, 2023

The argmax function can be enhanced further: the return value can be allocated once on the GPU, together with the other arrays in the RunState, and released with them.

I just made it fast, but I verified that it is working as expected: the result is the same as before with temperature 0.
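
Sketched out, the suggestion looks like this; the RunState fields and function names follow llama2.c conventions and are assumptions, not the repo's exact code:

#include <cuda_runtime.h>

typedef struct {
    float* logits;        // device buffer, vocab_size floats
    int*   argmax_result; // device buffer, written by the argmax kernel
} RunState;

void malloc_run_state(RunState* s, int vocab_size) {
    cudaMalloc((void**)&s->logits, vocab_size * sizeof(float));
    cudaMalloc((void**)&s->argmax_result, sizeof(int));
}

void free_run_state(RunState* s) {
    cudaFree(s->logits);
    cudaFree(s->argmax_result);
}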

@kroggen kroggen (Author) commented Aug 4, 2023

There is a simpler way to implement a parallel argmax, using an atomic max operation (CUDA has no atomicMax overload for float, so it is emulated here with an atomicCAS loop):

__device__ void atomicMaxFloat(float* addr, float val) {
    int* p = (int*)addr;  // CAS on the float's bit pattern
    int old = *p, assumed;
    do {
        assumed = old;
        old = atomicCAS(p, assumed, __float_as_int(fmaxf(val, __int_as_float(assumed))));
    } while (assumed != old);
}

__global__ void argmax32_kernel(const float* __restrict__ v, int n, int* max_pos, float* max_val) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        atomicMaxFloat(max_val, v[tid]);
        __syncthreads();  // caution: syncs only within one block, not across blocks
        if (*max_val == v[tid]) {
            *max_pos = tid;
        }
    }
}

max_val and max_pos are allocated on the GPU.

In this case it can be launched with many blocks.
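
A possible host-side launch (buffer names like logits_gpu and vocab_size are assumptions):

// max_val must be reset before every call, or a stale maximum from the
// previous token wins; FLT_MAX comes from <float.h>
float neg_inf = -FLT_MAX;
cudaMemcpy(max_val, &neg_inf, sizeof(float), cudaMemcpyHostToDevice);
int threads = 256;
int blocks  = (vocab_size + threads - 1) / threads;
argmax32_kernel<<<blocks, threads>>>(logits_gpu, vocab_size, max_pos, max_val);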

But I suspected that it would be slower because of the many threads contending for the same address, with just one making progress at a time (that is how I imagine it works; maybe I am wrong).

I did not benchmark the two to compare, though.

@kroggen kroggen (Author) commented Aug 9, 2023

My suggestion: merge this PR, and later you can attempt other approaches for even higher performance.
