Some improvements for KV caching #1891
Conversation
force-pushed from 69d6d6f to a65a96d
Can somebody help with the failing tests? I don't understand why the tests fail on Windows but pass on all other systems. I also don't understand why the GPU tests are failing.
Hello @mseeger, thank you for another PR. Yeah, there is always something with Windows. I'll check it tomorrow.
Hello @mseeger, it's quite a PR 🫠 🙂. (I'll take a look later at why the GPU tests are failing.)
force-pushed from 7f2c2ce to 3226323
OK, I reacted to the comments. I also did a small change in model.py, in particular removing the forward copies.
Cool, we are almost there 🙂. On my side I'll try to find and fix the issues with the failing GPU tests, hopefully this year 😃.
- Shrink buffers returned by KVCache to just cover input_pos entries
- Refactor child classes of model.py classes to avoid copy and paste
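To make the first item concrete, here is a minimal sketch of the buffer-shrinking idea; the tensor layout and function name are my assumptions, not the actual LitGPT code:

```python
import torch

def shrink_kv(buf_k: torch.Tensor, buf_v: torch.Tensor, input_pos_maxp1: int):
    # Buffers are assumed to have shape (batch, n_heads, max_seq_length, head_dim).
    # Instead of returning the full preallocated buffers, return only the
    # prefix that has actually been written, so attention touches fewer entries.
    return buf_k[:, :, :input_pos_maxp1, :], buf_v[:, :, :input_pos_maxp1, :]
```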
force-pushed from 3226323 to 3702b03
Overall, the issue with GPU+Thunder is something specific to the latter. Thanks again for the PR (and for the patience 😊). Happy New Year! 🚀
@mseeger unfortunately the "Shrink buffers returned by KVCache to just cover input_pos entries" part does not work as intended: introducing data-dependent control flow (here, making a tensor size out of a tensor value) typically breaks compilation just as much as using CPU integers does. The other aspect is that, in my experience, the performance impact of this has been very limited when the attention implementation works reasonably, so I would tend to revert this part of the change.
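As an illustration of that point, here is a minimal, self-contained sketch (not LitGPT code; all names are made up) of why slicing by a runtime value is hard on shape-specializing compilers:

```python
import torch

def slice_mask(mask: torch.Tensor, maxp1: int) -> torch.Tensor:
    # The output shape depends on the *value* of maxp1, not only on the
    # shapes of the inputs.
    return mask[..., :maxp1]

compiled = torch.compile(slice_mask)
mask = torch.ones(1, 1, 8, 8, dtype=torch.bool)

# Each distinct value can specialize or recompile the graph; deriving the
# bound from a tensor inside the compiled region (e.g. int(pos.max()) + 1)
# would additionally force a graph break to read the value on the host.
for maxp1 in (3, 5, 7):
    print(compiled(mask, maxp1).shape)
```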
Hello, my original PR did not make a tensor size out of a tensor value. In fact, this is why input_pos_maxp1 is passed in as a Python int. But sure, if you would like to revert this, go ahead (the other changes I think are useful). I am already working on something that would make this obsolete.
My intent is that input_pos_maxp1 is computed outside the model and passed down as a plain Python int, so that no tensor value is turned into a size inside the model.
With what I am preparing, KV caches would receive a proper abstract definition (with a very simple implementation for what is currently given, namely the exact KV cache), and this input_pos_maxp1 argument would then become obsolete.
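For illustration only, a hypothetical sketch of what such an abstract definition could look like; the interface names and tensor layout are assumptions, not the actual design being prepared:

```python
from abc import ABC, abstractmethod
import torch

class KVCache(ABC):
    @abstractmethod
    def forward(self, input_pos: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        """Store new entries at input_pos; return the keys/values to attend over."""

class DenseKVCache(KVCache):
    """The 'exact' cache: static buffers covering the full max sequence length."""

    def __init__(self, shape, device=None, dtype=None):
        # shape is assumed to be (batch, n_heads, max_seq_length, head_dim)
        self.k = torch.zeros(shape, device=device, dtype=dtype)
        self.v = torch.zeros(shape, device=device, dtype=dtype)

    def forward(self, input_pos, k, v):
        # Scatter the new entries into the static buffers along the sequence
        # dimension; returning the full buffers keeps shapes static for compilers.
        self.k.index_copy_(2, input_pos, k)
        self.v.index_copy_(2, input_pos, v)
        return self.k, self.v
```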
```python
if input_pos_maxp1 > self.max_seq_length:
    raise ValueError(f"Positions in 'input_pos' must be in [0,{self.max_seq_length})")
mask = mask[..., :input_pos_maxp1]
```
So in these lines, even when input_pos_maxp1 is a Python int, the slice makes the mask shape depend on its value. I know these things are subtle. In #1912 we specifically drop input_pos_maxp1 from generate when dealing with ThunderModules.
(And needless to add, I do greatly appreciate your work on KVCaches in LitGPT.)
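For completeness, a hedged sketch of how the bound can be computed on the host so the model only ever sees a Python int; the call shape is illustrative, not generate.py verbatim:

```python
import torch

input_pos = torch.arange(0, 5)               # positions written this step
input_pos_maxp1 = int(input_pos.max()) + 1   # plain Python int, computed on host
# The model would then be called along the lines of:
# logits = model(idx, input_pos=input_pos, input_pos_maxp1=input_pos_maxp1)
```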