Add BPE cache to improve efficiency of BPE Tokenizers #590
Conversation
Need to measure the perf more, since the bpe method now does some extra work around the cache.
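For reference, a minimal sketch of one way such a measurement could look with std::chrono; BpeEncode here is a hypothetical stand-in, not the real bpe method in the repo:

#include <chrono>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for the real bpe method; the real merge loop goes here.
std::vector<std::string> BpeEncode(const std::string& token) {
  return {token};
}

int main() {
  const std::vector<std::string> tokens = {"hello", "world", "tokenization"};
  const auto start = std::chrono::high_resolution_clock::now();
  for (const auto& t : tokens) {
    BpeEncode(t);
  }
  const auto end = std::chrono::high_resolution_clock::now();
  const auto us =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::cout << "avg per token: " << us / static_cast<double>(tokens.size())
            << " us\n";
}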
std::list<std::string> dq;

// Store references of keys in cache for efficiency
std::unordered_map<std::string, std::list<std::string>::iterator> references;
why is the map unordered?
because that map just stores iterators to the keys (effectively their addresses), so ordering isn't needed there. dq is what we use as the queue in the Queue+Map LRU implementation.
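To make the Queue+Map shape concrete, here is a minimal self-contained sketch of that pattern. The member names follow the snippets quoted in this thread, but the value type (a list of merged tokens) and the Get/Put interface are assumptions, not the PR's exact code:

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class LRUCache {
  std::size_t capacity;
  // Store keys of cache; most recently used at the front.
  std::list<std::string> dq;
  // Store references of keys in cache for efficiency; maps a key to its
  // cached value and to its position in dq.
  std::unordered_map<std::string,
                     std::pair<std::vector<std::string>,
                               std::list<std::string>::iterator>> references;

 public:
  explicit LRUCache(std::size_t cap) : capacity(cap) {}

  bool Get(const std::string& key, std::vector<std::string>& value) {
    auto it = references.find(key);
    if (it == references.end()) return false;
    // Splice the key to the front of the queue to mark it most recently used.
    dq.splice(dq.begin(), dq, it->second.second);
    value = it->second.first;
    return true;
  }

  void Put(const std::string& key, const std::vector<std::string>& value) {
    auto it = references.find(key);
    if (it != references.end()) {
      dq.splice(dq.begin(), dq, it->second.second);
      it->second.first = value;
      return;
    }
    if (dq.size() == capacity) {
      // Evict the least recently used key from the back of the queue.
      references.erase(dq.back());
      dq.pop_back();
    }
    dq.push_front(key);
    references.emplace(key, std::make_pair(value, dq.begin()));
  }
};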
// We use an LRU cache algorithm for the same in C++ in order to save compute.

// Current cache capacity is set to a relatively small 500 in order to support mobile platforms.
LRUCache bpe_cache = LRUCache(500);
can it be used for the next Compute call?
no - if we move it outside to reuse it for the next Compute call, it throws errors because the type of BPE tokenizer (GPT2/CLIP/Roberta) differs between calls, and the tests fail because the ids/offsets come out wrong.
Added an average tokenization time in microseconds using …
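One way to read that constraint: scoping the cache to each tokenizer instance, rather than making it global, keeps GPT2/CLIP/Roberta entries from ever mixing. A hedged sketch reusing the LRUCache sketch above; the class name and Compute signature are illustrative, not the repo's:

// Sketch only: the real class names and Compute signature differ in the repo.
class BpeTokenizer {
  LRUCache bpe_cache{500};  // per-instance, so tokenizer variants never share entries

 public:
  std::vector<std::string> Compute(const std::string& token) {
    std::vector<std::string> result;
    if (bpe_cache.Get(token, result)) return result;  // cache hit
    result = {token};  // placeholder for the real BPE merge loop
    bpe_cache.Put(token, result);
    return result;
  }
};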
// We use an LRU cache algorithm for the same in C++ in order to save compute.

// Current cache capacity is set to a relatively small 500 in order to support mobile platforms.
LRUCache bpe_cache = LRUCache(500);
Add a flag to enable/disable the cache feature.
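Such a switch could look like the following; a hedged sketch, with the flag name and plumbing assumed rather than taken from the repo:

// Sketch of an on/off switch for the cache; enable_cache is an assumed name.
class BpeTokenizerWithFlag {
  bool enable_cache;
  LRUCache bpe_cache{500};

 public:
  explicit BpeTokenizerWithFlag(bool enable) : enable_cache(enable) {}

  std::vector<std::string> Compute(const std::string& token) {
    std::vector<std::string> result;
    if (enable_cache && bpe_cache.Get(token, result)) return result;
    result = {token};  // placeholder for the real BPE work
    if (enable_cache) bpe_cache.Put(token, result);
    return result;
  }
};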
@@ -26,6 +27,10 @@ const char BpeModelConf::kModel_GPT2[] = "GPT2";
const char BpeModelConf::kModel_Roberta[] = "Roberta";
const char BpeModelConf::kModel_CLIP[] = "CLIP";

// We specifically measure performance for tokenization as our BPE implementation includes a number of optimizations
long long total_tokenization_time = 0;
This time-measurement code should be removed.
class LRUCache {
  // Store keys of cache
  std::list<std::string> dq;
Given it's a limited-capacity cache, the container can be a vector here.
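A hedged sketch of that suggestion: with a small fixed capacity, a flat vector scanned linearly can be cache-friendlier than list + unordered_map, at the cost of O(n) lookups. The names and the single-string value type are assumptions:

#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class VectorLRUCache {
  std::size_t capacity;
  // Entries ordered by recency; most recently used at the back.
  std::vector<std::pair<std::string, std::string>> entries;

 public:
  explicit VectorLRUCache(std::size_t cap) : capacity(cap) {}

  bool Get(const std::string& key, std::string& value) {
    for (std::size_t i = 0; i < entries.size(); ++i) {
      if (entries[i].first == key) {
        value = entries[i].second;
        // Rotate the hit entry to the back to mark it most recently used.
        std::rotate(entries.begin() + i, entries.begin() + i + 1, entries.end());
        return true;
      }
    }
    return false;
  }

  void Put(const std::string& key, const std::string& value) {
    std::string dummy;
    if (Get(key, dummy)) {  // already present: Get moved it to the back
      entries.back().second = value;
      return;
    }
    if (entries.size() == capacity) {
      entries.erase(entries.begin());  // evict the least recently used entry
    }
    entries.emplace_back(key, value);
  }
};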
@@ -230,3 +230,70 @@ class TokenWithRegularExp {
 private:
  std::u32string_view m_text;
};

class LRUCache {
Can you find some high-quality C++ LRUCache implementations on GitHub for reference?
HF also implements a cache for BPE: https://github.com/huggingface/transformers/blob/6f316016877197014193b9463b2fd39fa8f0c8e4/src/transformers/models/gpt2/tokenization_gpt2.py#L216C6-L216C6
We use an LRU cache algorithm for the same in C++.
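For comparison, the HF GPT2 tokenizer cache linked above is, at that revision, a plain unbounded dict: every bpe(token) result is memoized for the tokenizer's lifetime. A rough C++ analogue (sketch only; unbounded growth is exactly why a capped LRU suits mobile memory budgets better):

#include <string>
#include <unordered_map>

std::unordered_map<std::string, std::string> bpe_cache;

std::string BpeWithCache(const std::string& token) {
  auto it = bpe_cache.find(token);
  if (it != bpe_cache.end()) return it->second;  // memoized result
  std::string word = token;  // placeholder for the real BPE merge loop
  return bpe_cache.emplace(token, word).first->second;
}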