-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine large LLM with small LLM for faster inference #630
Comments
One idea, which would require adding extra model parameters and doing some training: Because llama (and other transformer language models) are residual networks, i.e. each block adds to the output of the previous instead of replacing it. This can be thought of each block refining the model's current understanding of what token will occur next. Maybe what might be possible is to add a linear layer after each block, one that produces its own estimate for the token distribution. The last block already has this; it is the output of the network. I'm thinking if each block can estimate the token probability, maybe you could compute it at each block, and stop early if it's converged. That is, if the subsequent blocks are not appreciably changing the token probabilities we can just stop early. I did an experiment where I limited llama.cpp to only use the first N blocks of the network. it is surprisingly coherent at 27/32 blocks. Deepmind has a paper called "pondernet" that has a similar idea, but I believe doing this would require more changes to the model. |
If this approach worked reliably, couldn't you just run only the first few layers from a big model and just skip the rest? I actually played with this kind of thing a while, randomly disabling chunks of layers, only evaluating every other layer, stuff like that. Unfortunately, it seemed to make a pretty big difference in quality and also doing something like only evaluating the first 10 layers of a 30 layer model produced pretty incoherent output. It may depend on the model and which layers are skipped. Unscientifically, it seemed like the first layers were the most important. |
@ggerganov Relevant paper https://arxiv.org/pdf/2303.08112.pdf |
@ggerganov I think point 2 is possible, it sounds similar to Confident Adaptive Language Modeling (CALM) from Google AI. Instead of a smaller pre-trained LLM they trained an early-exit classifier on intermediate layers which allowed them achieve roughly 50% reduction in inference time. |
I think that might like the full output using huge layers be interrupted after the first some layers, Like if output 65b is reasonable, Output65b * layers7b have/ layers s65b have May not creat a reasonable result, because the 65b output need the rest layers to be fully adjusted. |
I had the following idea: And if I understand correctly, you don't need to load all layers for such experiments. You can use the mapped address of the mmap function to load and unload the layers you want to determine. This would also mean that a large model can be used with a small amount of RAM. Just load the layers in chunks, save the result in scratchpad and using the result as input to the next chunks of layers. Loading and unloading the chunks would of course affect performance. I'm also not sure if this would work, as my understanding of the ggml library is that the calculations are done lazily instead of eagerly. Is this correct? |
Hello I had the same idea: meta-llama/llama#256 There are many ways to modify this algorithm including using the small LMM to predict also the second most likely word and so on. |
This idea of using a small model to predict multiple steps ahead and batch the scoring of generated sequence is called Speculative Sampling. Here's a relevant paper: https://arxiv.org/abs/2302.01318 It's a completely different method than CALM that tries to train a model to output a prediction of the next token at each layer of the LLM and exit early if the confidence is high enough. Both methods could be implemented separately and might even work together. I think it's worth it to create another issue for optimization of inference through early exit (CALM-style). |
BTW I tried Speculative Sampling, and while the small model guessed correct about 70% of the time so it could jump ahead 2 tokens at a time I think it didn't quite compensate the extra time needed to infer a double batch. But the proof of concept worked and it would definitely give a speed boost of about 60% if you had enough GPUs. (FUnny I thought I invented something new but that paper came out a month ago!) |
OK. I thought of a new idea. Called "chaining". Which is a modification of Speculative Sampling but chaining more models together. You take models of size 1B, 4B, 16B The idea is that if a model half the size runs twice as quick you can chain together these models because 1+1/2+1/4+1/8+....=2 You could play around with the numbers of models, their sizes, how many steps each one does etc. to get the best speed. Even better you could use a genetic algorithm to get the best set up. These speculative sampling techniques will work best with low temperatures. |
I've been considering something similar but with other non LLM networks using the same technique, so thank you for posting. I'm happy to have read the other comments regarding papers which extend the concepts. """If this estimate indicates a high probability for that token (i.e. above some threshold) - we stop and directly say that this is the new token. At this point we would have consumed (5ms for the small LLM + ~50ms for the large LLM)""" In this case would if you wanted to continue, then would you insert that chosen token as an input token to allow the large model to continue with proper state of logits? Would this also require more static topk/rand values, and verification of best values for temp across each model (the small, and large)? I'm guessing the newer API for setting/getting states can be used to rewind the 10% and reinsert the token you've chosen with comparison to the smaller LLM. |
IMO The paper https://arxiv.org/abs/2302.01318 overcomplicates things by calculating conditional probabilities etc. I can't see any advantage in calculating all these conditional probabilities. Unless someone can prove me wrong? |
As mentioned by @etiennebalit, this idea is related to speculative sampling (and what pauldog suggested at the end here is exactly that I believe). Speculative sampling was recently landed in HF transformers (they called it assisted generation), so that might be useful as a reference: huggingface/transformers#22211 Their implementation uses a heuristic to dynamically figure out how many tokens to generate at a time. |
So to recap, essentially the way this works is that we generate a sequence using the smaller model, and then evaluate it as a batch with the larger model. Then we check the logits token by token, and if we would have generated the same token with the larger model we know that we can keep it, otherwise we have to discard the rest of the sequence and continue from that point. I see a few problems with this:
Using BLAS may not be an option because the batches would have to be too big. Currently we only use BLAS when the batch size is at least 32, and with such a large batch a failure in one of the first tokens would be very costly. |
For most commercial GPUs it won't be faster. This is for situations where you have a rack of GPUs and you can max them out to get past the "theoretical limit". i.e. GPUs can only run so fast, but with this trick you can get past this speed limit. e.g. you could put one batch on each GPU. With this method, "theoretically" you could run a language model arbitrarily fast. But in practice you would need too many GPUs. |
To expedite the process, we can bypass evaluating the larger model if the smaller one is sufficiently confident. This can be identified by the decoder's low entropy distribution and high-valued logits. This could handle the most grammar-controlled part of sentences. |
You could do this but it will give different results in some cases. In other words, that method doesn't just speed things up, but is really a different model. Therefor the model will be slightly worse. (Maybe not great deal) |
Transformers became the leading model because it's highly parallel. During training we can get the logits for the whole sequence in parallel: you can efficiently train it on more tokens/s than any other model (today). Speculative generation basically seeks the same advantage by making inference more like training, where you already "know" the tokens you want to get the logits for.
Right. Although I wouldn't use the word "batches". Maybe I'm just arguing semantics here but in the transformer context, batching usually refers to checking multiple different sequences in parallel. So to be really clear, that's not a requirement for speculative sampling because with a batch size of 1 you can still get the parallelism time savings of checking multiple tokens in a single sequence at once, if you know what to put in that sequence. (And if you have lots of compute to spare you could combine that with batching on top and check multiple predicted sequences in parallel.)
You can make your own! You could use 2 bit quantization on the 7B model for example, or even lower. The only thing that matters is that your assistant model gets it right often enough to pay for itself.
You're right: if you are already utilising 100% of your FLOPs, speculative sampling won't help you. There's gotta be enough underutilised compute you can fill up with parallel token evaluation to make up for the cost of running the assistant.
Okay this is more speculative but a fun question to consider. What if we did have infinite GPUs, could we get infinite speed? I think unfortunately not. Even in a best case scenario where the assistant model is 100% correct in predicting the output of the full model, you still have to wait for it to predict X tokens, one at a time. Past some point adding more GPUs won't make the assistant model go any faster because the kernel launch time and data transfer delays overtake the benefit of adding more parallelisation to the evaluation. And the assistant model can't generate the second token until it's generated the first token. You could have another layer of speculation on top with an even smaller model, but at some point you won't be able to find a smaller model that is correct sufficiently often. (Given a truly infinite budget I guess you could just try every possible sequence in parallel but then you don't need an assistant model at all, just brute force all possible continuations on the big model, you've got inf FLOPs!) So roughly with speculative sampling you'll approach In theory the big model evaluates all the tokens in a sequence "at once", but as you rightly point out, that's only true for a machine with lots of multiprocessing spare capacity. In the context of running on CPUs like with llama.cpp, you'd need some bizarre number of CPU cores for this to hold true even for the 7B model. Every extra token in the sequence adds uhm, back of the napkin calculation here, at least Bottom line is that I don't think it's helpful for CPU eval unless you do what @Piezoid said and say "okay that generation is 'good enough' for some definition of good, I'm going to skip the big model run". But even that has limited opportunity because if you're not confident about the whole generation you're eventually going to consult with the big model and due to the autoregressive nature of transformers it's going to have to evaluate the tokens you "skipped" then anyhow. I'm not a contributor here, just interested in the project, so someone else is probably in a better position to answer whether llama.cpp is able to use much of the theoretical FLOPs of the CPU. Only if there's a bunch of spare capacity being left on the table due to sequential bottlenecks speculative sampling would be worth a look, I think. |
Well the point is if you have enough GPUs, you could check every combination of 1 to 20 tokens (there are about maybe 10^90 such sequences!) . Run each one on a different GPU. Then from this data you would predict 20 tokens at once. So "theoretically" with 10^90 GPUs, you could predict 20 tokens a time in the same time you could predict 1. This just shows that theoretically you can make a LLM run as fast as you like. But practically this is another matter. However using this silly method the number of GPUs you need wouldn't even fit in the observable Universe. With using a smaller model you can reduce the amount of GPUs you need by guessing only the most likely sequences instead of all 10^90 with the expense of sometimes missing the actual sequence. I have some doubts whether a 2bit Llama model would output anything useful or just garbage. But there is probably someway to truncate the llama model somehow. |
So technically you'd only skip the topk/rand part. Rewinding the logits to insert the lower models token (if you even limit at lower layers) might cost too much cpu time considering the 200-500meg copy (state). Even if you attempted to optimize by using intel instructions for SHA in quarter regions of the state, or modified the ggml to keep track of changes during the initial X layers.. I'm not sure if it would really be worth it in the end. |
Can you guys make a cache layer for the most frequently words? I guess that would be fast in many ways. |
You mean given input X, always output token Y? You can stick a hash map in there containing the most frequent starting contexts and the greedy output, but the value would rapidly diminish as the context size grows beyond a few tokens. Soon enough you have unfathomable numbers of possible inputs. Maybe it's possible to find a few situations where basically nothing in the context matters except the last few tokens in terms of picking the next one, e.g maybe |
Yep that's what I meant by brute-forcing it. And if you have such vast resources you can just run lots of independent instances of the full size model, no need for any tricks like speculative sampling. Like you could get an LLM to generate 2048 tokens in a single go. Even then, that's not infinite speed! As you point out, at the limit you won't get it to be faster than generating 1 token. The layers still have to be processed sequentially even if the sequence itself is processed in parallel on your infinite GPU array.
It's been tried! Results not great so far, yet not sure I'd bet against it given the rapid improvement on 3 and 4 bit quantisation. Of course even if it's roughly in the right ballpark eventually, such a model might still be useless as an assistant model. Need pretty good prediction accuracy to be worth it. |
Well... Later today.. I plan on saving the state of general starting context/prompts and loading that to speed up using more than what I have on my current test platform, It is just a hack so I don't have to wait to load the entire 'prompt' or 'roles' if I attempt them. I think the save/load state will work great for that but beyond static starting points then we're back to other strategies like this would have been.. |
@ejones Great job :> thank you sir! |
I think the method is very well described here: https://huggingface.co/blog/assisted-generation, and there are some benchmarks with real-life gaining on different GPUs and tasks. TL;DR: it's helpful most times even when the small model runs in greedy mode and the large model doesn't. |
I've found a new way to do this which doesn't involve batching so is very fast.
This way you can predict up to 4 tokens at a time. This assumes that the probability vector at position N is not much affected by the tokens after N+1. I don't know how true this is but it seems to work quite well for low temperature text. I have tested it with Cerbras models and get about 25% speed up and up to 2-3x speed up for very predictable text. One could also keep track of the probabilities of the tokens given by the tiny model and stop prematurely if the probabilities get too low as to become useless. |
We should add a PoC / example of this. Can use 117M GPT-2 as a helper LLM: https://github.com/ggerganov/ggml/tree/master/examples/gpt-2 But at ~5ms / token it is still quite inefficient. Probably using @xaedes's baby LLaMAs would be highly effective: ggerganov/ggml#8 (comment) Another alternative is using the TinyStories models |
As a proof of concept you could just generate some text using the greedy algorithm. Then store this text. Then run it again using your stored text as your "predictions" of the tiny model. You should get back the same text but just faster. This will prove the algorithm works for a "perfect" fast predictor. Then provided this works, you can worry about actually finding a suitable tiny model later. For a good tiny model, it would be best to have something with the same tokenization (otherwise you'd have to retokenize the text which slows things down). I don't know which ones would be compatible with Llama. Perhaps just using Llama but with a very small window might be suitable for some cases. Or one could build a very simple model and train it using Llama as input. Doesn't even have to be a transformer model. This concept would work well with a prompt like "The alphabet: A, B, C, D, " (which could run 10x faster as its very predictable) |
I have no doubts that the approach works. Intuitively, this approach has the potential to speed-up drastically source-code generation since code is deterministic to a large extend. One can potentially have tiny C++ / Python / Javascript models and switch between these tiny models based on the current source file to achieve an efficient version of a local Github Copilot for example |
I wonder if a simple markov-chain model works for the small one. |
It would work to some extent. Maybe using the previous two tokens to predict the next one. One problem is the model gets rapidly big once you use more than about 2 tokens. One could alleviate this problem a bit by only storing probabilities that reach a certain threshold. e.g. We would store that "peng" is usually followed by "uin". But not store what might come after "the". It's worth a try but I think even a very small attention based model would do much better. Since if your text is about unicorns it would have a good guess that the word following "the" would be "unicorn" which a Markov model won't do. A Markov model would be approximately the same as an attention model with a very small window I believe. It would definitely help having different Markov models depending on the domain. On the other hand a Markov model will be very fast. So it's worth considering even if it just gives a 5% speed up. (To be precise, because we are using the greedy algorithm which selects the highest probability each time, this would be equivalent to a look up table in a database). |
yep. key is tok0+tok1 -> value |
100MiB is nothing really 😀 My guess is it would be a speed up of 5%. But still better than a slap in the face with a wet fish. 🐟 |
If local CPU RAM >> GPU RAM, could you get much speedup using model that fits on GPU and then feeding those results into model that fits on CPU (per this thread)? |
I found this article on twitter with a PoC for a similar idea, not sure how helpful it will be: |
These papers are super interesting. Although I don't think I understand the constants well enough. I've been obsessing over these little AMD 6600M boxes. 32GB CPU RAM + 8GB GPU RAM + 1TB SSD in a quiet, attractive form factor you can put next to TV in living room for $712. Assuming thermal issues can be addressed this vendor could use AMD 6800M. 64GB CPU RAM + 12GB GPU RAM + 2TB SSD in a quiet, attractive form factor for not much more. So, assuming that leaving bigger model in CPU RAM and smaller one in GPU RAM gets you 4x speedup that would be cool. What's unclear, however, if using CPU for smaller model would work just as well? Skimming papers I can't tell. Thus my comment about constants. https://twitter.com/wait_sasha/status/1652340638745247745?t=hLnS4khr1meYkZEvwmkAGg&s=19 |
just enough to run 65B q3_K_S |
tok0+tok1->value(tok3) will only works once, because tok4 ~p(tok4|tok1,tok2,tok3) not tok4 ~p(tok4|tok1,tok2), |
An example of using a small LLM to draft tokens for a larger LLM is demonstrated here: #2926 |
Just fyi, here is someone that currently trains a 1.1B Llama model. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
So I was thinking about the following idea.
It is probably completely bogus, but I would definitely investigate it when and if I had the time to, so maybe someone else would be interested as well.
Large LLM takes a lot of time to perform token inference. Lets say it takes 500ms per token.
A small LLM (or some other approach) can infer a token very fast. Lets say < 5ms.
Lets assume that the small LLM is correct 80-90% of the time.
The idea is the following:
In the described process, I would reach step 4 only for 10-20% of the tokens, but for the rest - I will take the shortcut in step 3.
Hence, I will have an efficient inference with a Large LLM.
Obviously, the biggest question is if step 2 is possible at all.
I suppose the answer is "no", but who knows.
The text was updated successfully, but these errors were encountered: