[Bug]: TypeError: 'NoneType' object is not callable when loading Gemma 2 9B with new 0.5.1 version #6169
Comments
Hmmm, which version of FlashInfer are you using? Is it the latest release, https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.8?
In the Docker image I assume yes. In my Python env I also use v0.0.8.
Using Python and manually installing v0.0.7 of flashinfer, I got this:
This is the result of collect env:
In a Python shell, can you run the following to see what error it raises?
from flashinfer import BatchDecodeWithPagedKVCacheWrapper
from flashinfer.decode import CUDAGraphBatchDecodeWithPagedKVCacheWrapper
from flashinfer.prefill import BatchPrefillWithPagedKVCacheWrapper
It must be some ImportError that caused these to be set to None.
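For readers unfamiliar with how a swallowed ImportError turns into the TypeError in the issue title, here is a minimal sketch of the optional-import pattern. It is not vLLM's actual code: the module name `some_missing_attention_lib` and the helper `build_wrapper` are hypothetical, chosen so the import is guaranteed to fail.

```python
# Hypothetical optional-import guard. The module name is made up on purpose
# so that the import fails on any machine.
try:
    from some_missing_attention_lib import BatchDecodeWrapper
except ImportError:
    # The failure is swallowed and the name is bound to None instead.
    BatchDecodeWrapper = None


def build_wrapper():
    # Calling the None placeholder raises the TypeError from the issue title.
    return BatchDecodeWrapper()


try:
    build_wrapper()
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # 'NoneType' object is not callable
```

Running the three imports directly, as suggested above, bypasses the guard and surfaces the underlying ImportError if there is one.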
Actually I can repro this. It seems we missed the FlashInfer installation in the Docker image. Adding it now.
No problem executing this.
The |
The FlashInfer version must be 0.0.8, not 0.0.7.
OK, I have updated the Docker image on the hub with FlashInfer 0.0.8 pre-installed. It should help resolve this!
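Since the fix here came down to which FlashInfer version was installed, a quick way to check a local environment is to query the package metadata. This is only a sketch: it assumes the wheel is registered under the distribution name `flashinfer`, and the helper `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution name is "flashinfer"; adjust if your wheel differs.
print("flashinfer:", installed_version("flashinfer"))
```

A result of None means the package is not installed in the active environment at all, which matches what happened with the original Docker image.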
Thanks, I can confirm it works now :) My local Python installation works now too (I had downloaded the wrong version of flashinfer). A little off topic, but is it normal that the speed in tokens/s is the same as the llama.cpp version of Gemma 9B fp16?
Your current environment
I don't know how to run it inside Docker.
🐛 Describe the bug
Simply run the following command:
docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<my secret>" --env "VLLM_ATTENTION_BACKEND=FLASHINFER" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model google/gemma-2-9b-it
It doesn't work.
Log:
I'm on a Google Cloud VM with an Nvidia L4, 4 cores and 32 GB RAM, using the Deep Learning CUDA 12.1 image provided by Google. The same error comes up when running via standard Python with the following command:
python -m vllm.entrypoints.openai.api_server --model "google/gemma-2-9b-it"
I had of course already set the environment variables for the FlashInfer backend and my Hugging Face token.