Support for local instances of llama 3 and changes to support reusing trie and homomorphism #74
Here are some changes I've made for a project I'm working on. If you like any of them I can split out those changes to merge in.
Loosened the name check for llama3 models so that local copies of the weights (specified by a path) and finetuned versions of the models are picked up properly.
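To illustrate what a looser check could look like, here is a minimal sketch. It is not the code from this PR: `looks_like_llama3` and the regular expression are hypothetical, and the assumption is simply that the model name or local path still contains some spelling of "llama 3".

```python
import re

# Hypothetical pattern, not the library's actual check: match "llama3",
# "llama-3", "llama_3" or "llama 3" anywhere in a model id or filesystem
# path. The negative lookahead avoids matching sizes such as "llama-30b".
_LLAMA3_PATTERN = re.compile(r"llama[-_ ]?3(?!\d)", re.IGNORECASE)


def looks_like_llama3(name_or_path: str) -> bool:
    """Return True if the name or path appears to refer to a llama3 model."""
    return _LLAMA3_PATTERN.search(name_or_path) is not None
```

With a substring-style check like this, a hub id such as `meta-llama/Meta-Llama-3-8B` and a local directory such as `/data/weights/llama3-finetuned` are both recognized, whereas an exact comparison against a fixed list of names would miss the second.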
Added support for specifying the device to use for tensors during the masking process. I needed this because the LRU cache was holding GPU memory indefinitely. I could also have moved the scores to CPU before calling the masking function, or shrunk the cache size, so this change is non-essential. Perhaps the cache size should be customizable.
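On the cache size point, a customizable limit could be as simple as the sketch below. It is only an illustration under assumptions: `make_cached_mask_fn` and `toy_compute_mask` are hypothetical names, and plain tuples stand in for the mask tensors that the real cache would hold.

```python
from functools import lru_cache
from typing import Callable, FrozenSet, Tuple

Mask = Tuple[bool, ...]


def make_cached_mask_fn(
    compute_mask: Callable[[FrozenSet[int]], Mask],
    cache_size: int = 128,
):
    """Wrap compute_mask in an LRU cache whose size the caller chooses.

    At most `cache_size` results are kept alive; older entries are evicted
    and become eligible for garbage collection.
    """
    return lru_cache(maxsize=cache_size)(compute_mask)


VOCAB_SIZE = 8  # toy vocabulary size, for illustration only


def toy_compute_mask(accepted: FrozenSet[int]) -> Mask:
    """Hypothetical stand-in for the real masking computation."""
    return tuple(i in accepted for i in range(VOCAB_SIZE))


# A small cache bounds how many masks are retained at once.
cached_mask = make_cached_mask_fn(toy_compute_mask, cache_size=2)
```

If the cached values were GPU tensors, a smaller `cache_size` would bound how much device memory the cache can pin, at the cost of more recomputation.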
Added support for passing in the trie and homomorphism. In my application we have a server handling many requests, and rebuilding the trie for each one is unnecessary overhead. This lets you reuse the results of some of the expensive initialization work.
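The reuse pattern looks roughly like the sketch below. Everything in it is a hypothetical stand-in rather than this library's API: `TokenTrie` is a toy trie, `RequestProcessor` is a placeholder for whatever object is created per request, and only the trie is shown (the homomorphism would be passed in the same way).

```python
from typing import Dict, List, Optional


class TokenTrie:
    """Toy trie over token strings (hypothetical stand-in for the real one)."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self.root: dict = {}
        for token, token_id in vocab.items():
            node = self.root
            for ch in token:
                node = node.setdefault(ch, {})
            node[None] = token_id  # the None key marks the end of a token

    def ids_with_prefix(self, prefix: str) -> List[int]:
        """Return the sorted ids of all tokens that start with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        ids, stack = [], [node]
        while stack:
            current = stack.pop()
            for key, child in current.items():
                if key is None:
                    ids.append(child)
                else:
                    stack.append(child)
        return sorted(ids)


class RequestProcessor:
    """Hypothetical per-request object that accepts a prebuilt trie."""

    def __init__(self, vocab: Dict[str, int], trie: Optional[TokenTrie] = None) -> None:
        # Reuse the caller's trie when given; otherwise build one (the slow path).
        self.trie = trie if trie is not None else TokenTrie(vocab)


VOCAB = {"a": 0, "ab": 1, "abc": 2, "b": 3}
SHARED_TRIE = TokenTrie(VOCAB)  # built once, e.g. at server startup


def handle_request(prefix: str) -> List[int]:
    # Each request gets a fresh processor but shares the same trie.
    processor = RequestProcessor(VOCAB, trie=SHARED_TRIE)
    return processor.trie.ids_with_prefix(prefix)
```

Keeping the argument optional means existing callers that do not pass a trie keep their current behavior.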