llama: Add support for RWKV v7 architecture #11452

Open
wants to merge 19 commits into master
Conversation

@MollySophia (Collaborator) commented Jan 27, 2025

@BlinkDL's explanation of RWKV v7:
RWKV-7 as a meta-in-context learner
There are also plenty of test results for trained models (currently 0.1B and 0.4B) posted on his X account. Larger models are coming in the next several days.

Currently available RWKV v7 model repos in HF format:
https://huggingface.co/SmerkyG/RWKV7-Goose-0.1B-World2.8-HF (not an officially published one; tensor names are expected to change in the future)
https://huggingface.co/mollysama/rwkv-7-world-0b4-hf
https://huggingface.co/mollysama/rwkv-7-world-1b5-hf
https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1 (hybrid model with RWKV v7 "attn" and Qwen2.5 7B's MLP, distilled from Qwen2.5)

This PR contains:

  • GGML_OP_L2_NORM, which applies PyTorch-style L2 normalization along the rows. Tested with the CPU, CUDA, SYCL, Vulkan, and Metal backends (a reference sketch follows this list).
  • GGML_OP_RWKV_WKV7, the core of the RWKV v7 architecture. The naive recurrent wkv7 kernel is implemented for CPU, CUDA, SYCL, Vulkan, and Metal (a recurrence sketch also follows below).
  • Support for inference of RWKV7 and ARWKV7 models.
  • A simple Metal kernel for the old WKV6.
  • Skipping unused tokens in the last layer's FFN computation for RWKV models (8000 t/s -> 8100 t/s prefill for the 7B v7 model; see the sketch after this list).
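For illustration, a minimal sketch of what PyTorch-style L2 normalization along the rows computes (mirroring torch.nn.functional.normalize with p=2 over the last dimension). The function name and the eps handling are assumptions for the sketch, not the actual ggml kernel:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Normalize each row of an [n_rows x n_cols] fp32 matrix to unit L2 norm,
// PyTorch-style: x / max(||x||_2, eps). Hypothetical reference loop.
void l2_norm_rows(float * data, std::size_t n_rows, std::size_t n_cols,
                  float eps = 1e-12f) {
    for (std::size_t r = 0; r < n_rows; ++r) {
        float * row = data + r * n_cols;
        float sumsq = 0.0f;
        for (std::size_t c = 0; c < n_cols; ++c) {
            sumsq += row[c] * row[c];
        }
        const float scale = 1.0f / std::max(std::sqrt(sumsq), eps);
        for (std::size_t c = 0; c < n_cols; ++c) {
            row[c] *= scale;
        }
    }
}
```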
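Likewise, a sketch of one token step of the naive wkv7 recurrence for a single head of size N, with r, w, k, v, a, b named after the op's inputs. This illustrates the recurrence under those assumptions; it is not the per-backend kernel code:

```cpp
#include <vector>

// One token of the naive wkv7 recurrence, single head of size N.
// state is an N x N matrix (value dim i, key dim j), updated in place.
void wkv7_step(std::vector<float> & state,        // N*N floats, row-major
               const float * r, const float * w,  // each of length N
               const float * k, const float * v,
               const float * a, const float * b,
               float * y, int N) {                // y: output, length N
    for (int i = 0; i < N; ++i) {
        float * s = state.data() + i * N;

        // project the current state row onto a (the in-context "removal" term)
        float sa = 0.0f;
        for (int j = 0; j < N; ++j) {
            sa += s[j] * a[j];
        }

        // per-channel decay w, rank-1 state correction via b,
        // and the new (k, v) outer-product update
        float yi = 0.0f;
        for (int j = 0; j < N; ++j) {
            s[j] = s[j] * w[j] + sa * b[j] + v[i] * k[j];
            yi  += s[j] * r[j];  // read out with the receptance r
        }
        y[i] = yi;
    }
}
```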
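And the last bullet's optimization is the usual trick of running the final layer's FFN only on rows whose outputs are consumed (during prefill, typically just the last token). A hypothetical, self-contained sketch of the idea; the names are illustrative, not the PR's actual graph code:

```cpp
#include <cstdio>
#include <vector>

// Stand-in for the real FFN; its cost scales with n_rows.
static void ffn_forward(const float * x, int n_rows, int n_embd) {
    (void) x;
    std::printf("FFN over %d rows of width %d\n", n_rows, n_embd);
}

int main() {
    const int n_tokens = 512, n_embd = 4096;
    std::vector<float> hidden((std::size_t) n_tokens * n_embd, 0.0f);

    // naive: FFN over every prompt token in the last layer
    // ffn_forward(hidden.data(), n_tokens, n_embd);

    // optimized: only the last token's logits are needed during prefill
    const int n_needed = 1;
    ffn_forward(hidden.data() + (std::size_t)(n_tokens - n_needed) * n_embd,
                n_needed, n_embd);
}
```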

TODO:

  • (within this PR or in the future) Implement chunkwise wkv7 (and possibly wkv6 as well), following flash-linear-attention's implementation.

Note: Current benchmark of ARWKV7-7B f16

# molly @ molly-workstation in ~/llama.cpp on git:rwkv-v7 x [9:49:42] 
$ ./build-test/bin/llama-bench -m ../ARWKV-7B-Preview-0_1-NoG/ARWKV-7B-Preview-0_1-NoG-F16.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| arwkv7 7B F16                  |  15.42 GiB |     8.27 B | CUDA       |  99 |         pp512 |      8105.20 ± 15.34 |
| arwkv7 7B F16                  |  15.42 GiB |     8.27 B | CUDA       |  99 |         tg128 |         50.62 ± 0.01 |

build: 76219859 (4579)

This is much faster than RWKV v6 7B at prefill (though still a bit slower than Qwen2.5 7B).

MollySophia and others added 11 commits January 27, 2025 12:22
There isn't much performance gain, though. Just for more op coverage.

@MollySophia MollySophia marked this pull request as ready for review January 27, 2025 13:33
@MollySophia MollySophia marked this pull request as draft January 27, 2025 14:09
@MollySophia MollySophia marked this pull request as ready for review January 28, 2025 09:10
@MollySophia (Collaborator, Author)

Update: added support for fla-hub's rwkv7 HF model format (https://huggingface.co/fla-hub/rwkv7-1.5B-world).

@ggerganov (Owner)

Just a heads up, this will likely take some time to merge - I want to finish #11213 first and then figure out how to fit RWKV in the new code, likely with its own implementation of llama_context.

@MollySophia (Collaborator, Author)

> Just a heads up, this will likely take some time to merge - I want to finish #11213 first and then figure out how to fit RWKV in the new code, likely with its own implementation of llama_context.

That’s great! I can help with that too

@ggerganov (Owner)

Great, keep an eye on the #11213 PR. It's still very messy, but I hope it will soon start to make sense.

They pass on my M2 and M4 devices :|
