
parallelize part of logits processing #5

Open
wants to merge 2 commits into base: opt

Conversation

@kroggen kroggen commented Aug 3, 2023

Apply the temperature and softmax to the logits using the GPU

The argmax and sample functions were not changed

This gives a drastic speed improvement of ~76% when the temperature is not 0

Tested using the stories110M.bin model on an RTX 3090 with PCIe 4.0 at 24 GB/s
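
For reference, a minimal single-block sketch of this step (the kernel name, launch shape, and the serialized max pass are illustrative assumptions, not the exact code in this PR):

__global__ void softmax_logits_kernel(float* logits, int n, float temperature) {
    __shared__ float max_val, sum;
    int tid = threadIdx.x;

    // 1. apply the temperature in parallel
    for (int i = tid; i < n; i += blockDim.x)
        logits[i] /= temperature;
    __syncthreads();

    // 2. find the max for numerical stability (serial here for brevity;
    //    a shared-memory reduction would be faster)
    if (tid == 0) {
        max_val = logits[0];
        for (int i = 1; i < n; i++)
            max_val = fmaxf(max_val, logits[i]);
        sum = 0.0f;
    }
    __syncthreads();

    // 3. exponentiate in parallel, accumulating the normalizer
    for (int i = tid; i < n; i += blockDim.x) {
        logits[i] = expf(logits[i] - max_val);
        atomicAdd(&sum, logits[i]);
    }
    __syncthreads();

    // 4. normalize
    for (int i = tid; i < n; i += blockDim.x)
        logits[i] /= sum;
}

// assumed launch: softmax_logits_kernel<<<1, 256>>>(logits_gpu, vocab_size, temperature);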

@kroggen kroggen (Author) commented Aug 3, 2023

Note: the above performance increase is with all the other PRs merged into a single file.

@kroggen kroggen (Author) commented Aug 3, 2023

Now I have also implemented the argmax on the GPU, so it also increases performance when using temperature = 0
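
For illustration, the usual way to do an argmax in a single block is a shared-memory tree reduction; the sketch below uses assumed names and a fixed 256-thread launch, and is not necessarily the PR's actual kernel:

__global__ void argmax_kernel(const float* __restrict__ v, int n, int* result) {
    __shared__ float best_val[256];   // must match blockDim.x
    __shared__ int   best_pos[256];
    int tid = threadIdx.x;

    // each thread scans a strided slice of the logits
    float val = -INFINITY;
    int pos = 0;
    for (int i = tid; i < n; i += blockDim.x)
        if (v[i] > val) { val = v[i]; pos = i; }
    best_val[tid] = val;
    best_pos[tid] = pos;
    __syncthreads();

    // tree reduction over the per-thread candidates
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && best_val[tid + s] > best_val[tid]) {
            best_val[tid] = best_val[tid + s];
            best_pos[tid] = best_pos[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) *result = best_pos[0];
}

// assumed launch: argmax_kernel<<<1, 256>>>(logits_gpu, vocab_size, max_pos_gpu);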

@ankan-ban ankan-ban (Owner) commented
I knew logits processing on the CPU was a bottleneck and was planning to address it; so far I have been benchmarking with temp = 0. Thank you for the change. I will hopefully spend some time testing/understanding it and merge it soon.

@kroggen kroggen (Author) commented Aug 4, 2023

The argmax function can be enhanced further: the return value can be allocated once on the GPU, together with the other arrays in the RunState, and released with them.

I just made it fast, but I verified that it is working as expected: the result is the same as before with temperature 0.
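
Sketched out, the suggestion looks like this; the RunState fields and function names follow llama2.c conventions and are assumptions, not the repo's exact code:

#include <cuda_runtime.h>

typedef struct {
    float* logits;        // device buffer, vocab_size floats
    int*   argmax_result; // device buffer, written by the argmax kernel
} RunState;

void malloc_run_state(RunState* s, int vocab_size) {
    cudaMalloc((void**)&s->logits, vocab_size * sizeof(float));
    cudaMalloc((void**)&s->argmax_result, sizeof(int));
}

void free_run_state(RunState* s) {
    cudaFree(s->logits);
    cudaFree(s->argmax_result);
}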

@kroggen kroggen (Author) commented Aug 4, 2023

There is a simpler way to implement a parallel argmax, using an atomic max operation (CUDA has no atomicMax overload for float, so it is emulated here with an atomicCAS loop):

__device__ void atomicMaxFloat(float* addr, float val) {
    int* p = (int*)addr;  // CAS on the float's bit pattern
    int old = *p, assumed;
    do {
        assumed = old;
        old = atomicCAS(p, assumed, __float_as_int(fmaxf(val, __int_as_float(assumed))));
    } while (assumed != old);
}

__global__ void argmax32_kernel(const float* __restrict__ v, int n, int* max_pos, float* max_val) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        atomicMaxFloat(max_val, v[tid]);
        __syncthreads();  // caution: syncs only within one block, not across blocks
        if (*max_val == v[tid]) {
            *max_pos = tid;
        }
    }
}

max_val and max_pos are allocated on the GPU.

In this case it can be launched with many blocks.
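
A possible host-side launch (buffer names like logits_gpu and vocab_size are assumptions):

// max_val must be reset before every call, or a stale maximum from the
// previous token wins; FLT_MAX comes from <float.h>
float neg_inf = -FLT_MAX;
cudaMemcpy(max_val, &neg_inf, sizeof(float), cudaMemcpyHostToDevice);
int threads = 256;
int blocks  = (vocab_size + threads - 1) / threads;
argmax32_kernel<<<blocks, threads>>>(logits_gpu, vocab_size, max_pos, max_val);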

But I suspected that it would be slower because of the many threads contending for the same address, with just one making progress at a time (that is how I imagine it works; maybe I am wrong).

I did not benchmark the two to compare, though.

@kroggen kroggen (Author) commented Aug 9, 2023

My suggestion: merge this PR, and later you can attempt other approaches for even higher performance.
