I heard about a new feature coming to Llama: speculative sampling, a method for speeding up a model's inference. The reported benefit varies, up to around 2x the speed, though probably closer to 1.5x in practice. It works by having a big target model, like a 34b, use a smaller draft model, like a 7b, to propose tokens that the big model then verifies. The GitHub thread has a video showing the performance benefits of the method.
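As I understand it, the loop looks roughly like this. A minimal greedy sketch (the real method uses a rejection-sampling rule on the probability ratios so the target's output distribution is preserved exactly; the toy `target_model`/`draft_model` distributions here are made up for illustration):

```python
# Toy stand-ins for the models: each maps a context to a next-token
# distribution over a tiny vocabulary. In practice these would be
# e.g. a 34b target and a 7b draft.
def target_model(ctx):
    if len(ctx) < 3:
        return {"a": 0.6, "b": 0.3, "c": 0.1}
    return {"a": 0.2, "b": 0.7, "c": 0.1}

def draft_model(ctx):
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def greedy(dist):
    return max(dist, key=dist.get)

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them with the target.

    Accept the longest prefix where the target agrees with the draft;
    one (batched) target pass can thus yield several tokens, which is
    where the speedup comes from.
    """
    drafted, ctx = [], list(context)
    for _ in range(k):
        tok = greedy(draft_model(ctx))
        drafted.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in drafted:
        want = greedy(target_model(ctx))
        accepted.append(want)
        ctx.append(want)
        if want != tok:  # first disagreement: keep the target's token, stop
            break
    return accepted

print(speculative_step(["a"]))  # ['a', 'a', 'b']
```

The key point is that verifying k drafted tokens is one batched forward pass through the big model, not k sequential ones.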
My thoughts immediately jumped to Airoboros's LMoE. Would it be possible to integrate an "inference" LMoE into vanilla Airoboros to benefit from speculative sampling?
One of the people posting in the Llama GitHub thread mentioned that chaining draft models might have potential: something like 3b -> 7b -> 13b -> 34b -> 70b. My gut says that, much like parameter counts themselves, there is probably a sweet spot in the number of draft models and their respective sizes in that configuration.
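The sweet-spot intuition can be sanity-checked on paper with the simple cost model used in the speculative decoding literature, before touching real hardware (the 80% acceptance rate and the 10x draft/target cost ratio below are assumed numbers, not measurements):

```python
def expected_speedup(alpha, gamma, c):
    """Idealized speedup of speculative sampling for one draft model.

    alpha: chance the target accepts each drafted token
    gamma: tokens drafted per target pass
    c:     cost of one draft step relative to one target step
    """
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_pass = gamma * c + 1  # gamma draft steps + 1 target pass
    return expected_tokens / cost_per_pass

# Sweep draft lengths for an assumed 80% acceptance rate and a draft
# 10x cheaper than the target: speedup rises, peaks, then falls again,
# so "more speculation" is not monotonically better.
for gamma in range(1, 9):
    print(gamma, round(expected_speedup(0.8, gamma, 0.1), 2))
```

A chain of drafts would nest this: each stage's acceptance rate and relative cost shifts the curve, which is exactly why I'd expect a sweet spot in both chain length and model sizes.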
Fortunately, I believe it would be relatively easy to test multiple permutations objectively: the metric is speed, which is easy to record. Provided that speculative sampling doesn't impact output quality, it should be a painless concept to test.
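A measurement harness could be as small as this; `run_config` is a hypothetical placeholder for however a given draft-chain configuration is actually invoked, and the sleeps just simulate per-token costs for illustration:

```python
import time

def tokens_per_second(run_config, n_tokens=100):
    """Time one configuration and report throughput in tokens/sec."""
    start = time.perf_counter()
    run_config(n_tokens)
    return n_tokens / (time.perf_counter() - start)

# Fake configurations: pretend the baseline costs 2 ms/token and the
# speculative setup 1 ms/token. Swap in real generation calls to
# compare permutations like 7b->34b vs 3b->13b->70b.
def baseline(n):
    for _ in range(n):
        time.sleep(0.002)

def speculative(n):
    for _ in range(n):
        time.sleep(0.001)

base = tokens_per_second(baseline)
spec = tokens_per_second(speculative)
print(f"baseline: {base:.0f} t/s, speculative: {spec:.0f} t/s "
      f"({spec / base:.2f}x)")
```

To cover the "doesn't impact quality" assumption, the same sweep could also record perplexity on a fixed eval set per configuration.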
speculative : PoC for speeding-up inference via speculative sampling