Maybe? fixes #768 (discuss)
## Original Issue

The suggestion to execute `reduced_vocabulary` and model loading in parallel presents challenges:

- For `models.llamacpp`, the model must be loaded before accessing the tokenizer, preventing any benefits.
- `models.transformers_vision` would require complex changes.

So I looked at `reduced_vocabulary` and saw a performance issue similar to one I've seen previously. I applied a similar fix from before (described below) and it works well.

## Problem
In `main`, `reduced_vocabulary` constructs a numba `List` in pure-python mode. There is a serious performance issue with `numba.typed.List.append()` calls made in pure-python mode. `reduced_vocabulary` only runs once per model load, but the cost is more annoying now that models are starting to ship with 100,000-200,000 token vocabularies.
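To make the cost concrete, here is a minimal, self-contained sketch (not taken from this PR) contrasting appends to a `numba.typed.List` from the interpreter with the same loop inside an `@njit` function, at a vocabulary-like size:

```python
import time

from numba import njit
from numba.typed import List

N = 200_000  # on the order of a modern tokenizer vocabulary


def build_pure_python(n):
    # Pure-python mode: each append crosses the Python/native boundary
    # and boxes the value, which is what makes this loop so slow.
    out = List()
    for i in range(n):
        out.append(i)
    return out


@njit
def build_nopython(n):
    # The same loop compiled to native code: appends are cheap.
    out = List()
    for i in range(n):
        out.append(i)
    return out


build_nopython(10)  # trigger compilation outside the timed region

t0 = time.perf_counter()
build_pure_python(N)
t1 = time.perf_counter()
build_nopython(N)
t2 = time.perf_counter()
print(f"pure python: {t1 - t0:.3f}s  nopython: {t2 - t1:.3f}s")
```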
## Solution

Inside an `@njit`-compiled function, convert the dictionary to a list of tuples, where each tuple contains (`normalized_token`: `unicode_type`, `token_ids`: `int64[:]`).
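A minimal sketch of that conversion, assuming the vocabulary has already been gathered into a `numba.typed.Dict` keyed by normalized token (the actual code in this PR differs):

```python
import numpy as np
from numba import njit, types
from numba.typed import Dict, List


@njit
def dict_to_tuple_list(vocab):
    # Runs in nopython mode, so these appends are cheap. Each element is
    # a (normalized_token: unicode_type, token_ids: int64[:]) tuple.
    out = List()
    for token, token_ids in vocab.items():
        out.append((token, token_ids))
    return out


# Tiny stand-in vocabulary; real ones have 100k+ entries and the
# token IDs here are arbitrary illustrative values.
vocab = Dict.empty(key_type=types.unicode_type, value_type=types.int64[:])
vocab["hello"] = np.array([31373], dtype=np.int64)
vocab["world"] = np.array([6894, 995], dtype=np.int64)

pairs = dict_to_tuple_list(vocab)
```

The trade-off is a one-time compilation cost for the `@njit` function on first use, in exchange for native-speed appends over the whole vocabulary.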
## Benchmarks

### New benchmarks

### Old benchmarks
## Open Questions

Why is the numba-compilation benchmark faster? This raises an eyebrow.