[Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) #13987
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
In the current implementation I am assuming there is no chunked prefill, so that prefill and decode jobs won't appear in the same batch.
I understood that, at least for the Decode node. Is it also the case for the Prefill node? Anyway, I think a warning and/or set…
I guess you are referring to the case where the prefill node performs chunked prefill. Just to clarify: chunked prefill + disaggregated prefill can be useful, but from what I heard it is not the default use case, so we deprioritized the support. But yeah, it's definitely better to have a warning there.
> Does it work with cuda graphs? I saw this piece of code that indicates that padding can be added when cuda graphs are used. Could it create silent bugs like this one?

KV cache receive is synchronous on the decode node, so it is compatible with cuda graphs. KV cache send is asynchronous, but since the prefill node does not actually use cuda graphs, I guess this is OK.
No. |
Is there a reason why we need min_length slicing? This…
LGTM.
Hello 👋,

While experimenting with disaggregated serving with NCCL and vLLM, I encountered a bug when running mistral-large.

After a lot of debugging I realized that inflight batching was the issue. On the decode node you can have a request that is still prefilling (and whose KVCache needs to be fetched) batched together with decoding requests.

The current implementation is bugged; here is an example:
vllm/vllm/distributed/kv_transfer/kv_connector/simple_connector.py
Lines 214 to 216 in 58d1b2a
Say the batch contains three sequences, where `S1` is a prefilling one while `S2` and `S3` are decode ones:

- `model_input.attn_metadata.seq_lens` will however have values that look like that: `[18, 17, 16]`
- `query_lens` (here) will have values like this: `[18, 1, 1]`
- the size of `input_tokens_tensor` is 20 (18 + 1 + 1), so using `seq_lens` seems wrong in the first place.

vllm/vllm/distributed/kv_transfer/kv_connector/simple_connector.py
Lines 231 to 238 in 58d1b2a
So, in this example, the KVCache of `S1` will be properly fetched. But when reaching `S2`, the `current_tokens` of `input_tensor` will be the slice `input_tensor[18:35]`... which doesn't make sense because the size of `input_tensor` is 20. PyTorch won't throw an IndexError though; it will return a tensor of size 2 containing the tokens of `S2` and `S3`. This then causes a hang on the prefill node, because the `self.select()` call is done with invalid tokens (see the sketch below).

I think the patch is pretty self-explanatory otherwise, but do not hesitate if you have questions. 😉
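To make the failure mode concrete, here is a minimal, self-contained sketch (plain PyTorch, not the actual connector code) reproducing the arithmetic above: offsets derived from `seq_lens` overrun the 20-token input, and PyTorch silently clamps the out-of-range slice instead of raising.

```python
import torch

# Mixed batch on the decode node: S1 is still prefilling (18 new tokens),
# S2 and S3 are decoding (1 new token each).
seq_lens = [18, 17, 16]    # total context length per request
query_lens = [18, 1, 1]    # tokens actually present in this batch
input_tokens_tensor = torch.arange(sum(query_lens))  # 20 tokens

# Offsets computed from seq_lens (what the current code effectively does):
start, end = seq_lens[0], seq_lens[0] + seq_lens[1]   # 18, 35 -- past the end
current_tokens = input_tokens_tensor[start:end]
print(current_tokens.shape)  # torch.Size([2]): no IndexError, just the wrong tokens

# Offsets computed from query_lens stay consistent with the input tensor:
start, end = query_lens[0], query_lens[0] + query_lens[1]  # 18, 19
current_tokens = input_tokens_tensor[start:end]
print(current_tokens.shape)  # torch.Size([1]): the single new token of S2
```

The wrong slice is what ends up being hashed for `self.select()`, which is why the prefill node waits forever.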
I do however have some follow-up questions:
What would be the proper patch to support inflight batching here? It feels a bit dangerous to modify `model_input` like you suggest here:

vllm/vllm/distributed/kv_transfer/kv_connector/simple_connector.py
Line 289 in 58d1b2a
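To give an idea of what I mean, here is a purely illustrative sketch (a hypothetical helper, not the actual connector API) of iterating the flattened batch by `query_lens` offsets instead of mutating `model_input`:

```python
from typing import Iterator, List, Tuple
import torch

def iter_request_token_slices(
    input_tokens_tensor: torch.Tensor,
    query_lens: List[int],
) -> Iterator[Tuple[int, torch.Tensor]]:
    """Hypothetical helper: yield (request_index, tokens) pairs.

    Slices the flattened batch by per-request query lengths, so a mixed
    prefill/decode batch never produces out-of-range slices.
    """
    start = 0
    for idx, qlen in enumerate(query_lens):
        yield idx, input_tokens_tensor[start:start + qlen]
        start += qlen

# With the batch from the example above (S1 prefill, S2/S3 decode):
tokens = torch.arange(20)
for idx, toks in iter_request_token_slices(tokens, [18, 1, 1]):
    print(idx, toks.shape)  # (0, [18]), (1, [1]), (2, [1])
```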
Is there a reason why we need `min_length` slicing? I thought about using a dict instead of a queue in SimpleBuffer, but this line confuses me:

vllm/vllm/distributed/kv_transfer/kv_lookup_buffer/simple_buffer.py
Lines 76 to 77 in 58d1b2a
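To illustrate the dict idea (a hypothetical sketch, not how SimpleBuffer works today, and ignoring the NCCL transport entirely): key the buffer by a hash of the full token ids, so a lookup matches exactly one entry and no `min_length` prefix slicing is needed.

```python
from typing import Dict, Optional, Tuple
import torch

class DictKVBuffer:
    """Hypothetical sketch of a dict-backed KV lookup buffer."""

    def __init__(self) -> None:
        self._entries: Dict[int, Tuple[torch.Tensor, torch.Tensor]] = {}

    @staticmethod
    def _key(input_tokens: torch.Tensor) -> int:
        # Hash the full token ids; no min_length slicing of the prefix.
        return hash(tuple(input_tokens.tolist()))

    def insert(self, input_tokens: torch.Tensor,
               kv: torch.Tensor, hidden: torch.Tensor) -> None:
        self._entries[self._key(input_tokens)] = (kv, hidden)

    def select(self, input_tokens: torch.Tensor
               ) -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
        # pop() so each entry is consumed exactly once, like the queue does.
        return self._entries.pop(self._key(input_tokens), None)
```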
Does it work with cuda graphs? I saw this piece of code that indicates that padding can be added when cuda graphs are used. Could it create silent bugs like this one?
vllm/vllm/attention/backends/flash_attn.py
Line 494 in 58d1b2a
Is there an issue in the SimpleConnector if chunked prefill is used by the prefill node? I have seen this in the LMCache connector, which suggests that a problem might be present, but I'm not sure if that's also the case for SimpleConnector:

vllm/vllm/distributed/kv_transfer/kv_connector/lmcache_connector.py
Line 70 in 58d1b2a
Will SimpleConnector be replaced entirely by the LMCache connector at some point?