Add BPE cache to improve efficiency of BPE Tokenizers #590
Conversation
Need to measure the perf more, since the bpe method now does some extra work around the cache.
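For reference, a minimal sketch of one way such a measurement could look with std::chrono; BpeEncode here is a hypothetical stand-in, not the real bpe method in the repo:

#include <chrono>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for the real bpe method; the real merge loop goes here.
std::vector<std::string> BpeEncode(const std::string& token) {
  return {token};
}

int main() {
  const std::vector<std::string> tokens = {"hello", "world", "tokenization"};
  const auto start = std::chrono::high_resolution_clock::now();
  for (const auto& t : tokens) {
    BpeEncode(t);
  }
  const auto end = std::chrono::high_resolution_clock::now();
  const auto us =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::cout << "avg per token: " << us / static_cast<double>(tokens.size())
            << " us\n";
}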
std::list<std::string> dq;

// Store references of keys in cache for efficiency
std::unordered_map<std::string, std::list<std::string>::iterator> references;
why is the map unordered?
because that map just stores iterators to the keys (effectively their addresses), so ordering isn't needed there. dq is what we use as the queue in the Queue+Map LRU implementation.
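To make the Queue+Map shape concrete, here is a minimal self-contained sketch of that pattern. The member names follow the snippets quoted in this thread, but the value type (a list of merged tokens) and the Get/Put interface are assumptions, not the PR's exact code:

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class LRUCache {
  std::size_t capacity;
  // Store keys of cache; most recently used at the front.
  std::list<std::string> dq;
  // Store references of keys in cache for efficiency; maps a key to its
  // cached value and to its position in dq.
  std::unordered_map<std::string,
                     std::pair<std::vector<std::string>,
                               std::list<std::string>::iterator>> references;

 public:
  explicit LRUCache(std::size_t cap) : capacity(cap) {}

  bool Get(const std::string& key, std::vector<std::string>& value) {
    auto it = references.find(key);
    if (it == references.end()) return false;
    // Splice the key to the front of the queue to mark it most recently used.
    dq.splice(dq.begin(), dq, it->second.second);
    value = it->second.first;
    return true;
  }

  void Put(const std::string& key, const std::vector<std::string>& value) {
    auto it = references.find(key);
    if (it != references.end()) {
      dq.splice(dq.begin(), dq, it->second.second);
      it->second.first = value;
      return;
    }
    if (dq.size() == capacity) {
      // Evict the least recently used key from the back of the queue.
      references.erase(dq.back());
      dq.pop_back();
    }
    dq.push_front(key);
    references.emplace(key, std::make_pair(value, dq.begin()));
  }
};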
// We use an LRU cache algorithm for the same in C++ in order to save compute.

// Current cache capacity is set to a relatively small 500 in order to support mobile platforms.
LRUCache bpe_cache = LRUCache(500);
can it be used for the next Compute call?
no - if we move it outside to reuse it for the next Compute call, it throws errors because the type of BPE tokenizer (GPT2/CLIP/Roberta) differs between calls, and the tests fail because the ids/offsets come out wrong.
Added an average tokenization time in microseconds using …
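One way to read that constraint: scoping the cache to each tokenizer instance, rather than making it global, keeps GPT2/CLIP/Roberta entries from ever mixing. A hedged sketch reusing the LRUCache sketch above; the class name and Compute signature are illustrative, not the repo's:

// Sketch only: the real class names and Compute signature differ in the repo.
class BpeTokenizer {
  LRUCache bpe_cache{500};  // per-instance, so tokenizer variants never share entries

 public:
  std::vector<std::string> Compute(const std::string& token) {
    std::vector<std::string> result;
    if (bpe_cache.Get(token, result)) return result;  // cache hit
    result = {token};  // placeholder for the real BPE merge loop
    bpe_cache.Put(token, result);
    return result;
  }
};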
// We use an LRU cache algorithm for the same in C++ in order to save compute.

// Current cache capacity is set to a relatively small 500 in order to support mobile platforms.
LRUCache bpe_cache = LRUCache(500);
Add a flag to enable/disable the cache feature.
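Such a switch could look like the following; a hedged sketch, with the flag name and plumbing assumed rather than taken from the repo:

// Sketch of an on/off switch for the cache; enable_cache is an assumed name.
class BpeTokenizerWithFlag {
  bool enable_cache;
  LRUCache bpe_cache{500};

 public:
  explicit BpeTokenizerWithFlag(bool enable) : enable_cache(enable) {}

  std::vector<std::string> Compute(const std::string& token) {
    std::vector<std::string> result;
    if (enable_cache && bpe_cache.Get(token, result)) return result;
    result = {token};  // placeholder for the real BPE work
    if (enable_cache) bpe_cache.Put(token, result);
    return result;
  }
};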
@@ -26,6 +27,10 @@ const char BpeModelConf::kModel_GPT2[] = "GPT2";
const char BpeModelConf::kModel_Roberta[] = "Roberta";
const char BpeModelConf::kModel_CLIP[] = "CLIP";

// We specifically measure performance for tokenization as our BPE implementation includes a number of optimizations
long long total_tokenization_time = 0;
This time-measurement code should be removed.
class LRUCache {
  // Store keys of cache
  std::list<std::string> dq;
Given it's a limited-capacity cache, the container can be a vector here.
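A hedged sketch of that suggestion: with a small fixed capacity, a flat vector scanned linearly can be cache-friendlier than list + unordered_map, at the cost of O(n) lookups. The names and the single-string value type are assumptions:

#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class VectorLRUCache {
  std::size_t capacity;
  // Entries ordered by recency; most recently used at the back.
  std::vector<std::pair<std::string, std::string>> entries;

 public:
  explicit VectorLRUCache(std::size_t cap) : capacity(cap) {}

  bool Get(const std::string& key, std::string& value) {
    for (std::size_t i = 0; i < entries.size(); ++i) {
      if (entries[i].first == key) {
        value = entries[i].second;
        // Rotate the hit entry to the back to mark it most recently used.
        std::rotate(entries.begin() + i, entries.begin() + i + 1, entries.end());
        return true;
      }
    }
    return false;
  }

  void Put(const std::string& key, const std::string& value) {
    std::string dummy;
    if (Get(key, dummy)) {  // already present: Get moved it to the back
      entries.back().second = value;
      return;
    }
    if (entries.size() == capacity) {
      entries.erase(entries.begin());  // evict the least recently used entry
    }
    entries.emplace_back(key, value);
  }
};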
@@ -230,3 +230,70 @@ class TokenWithRegularExp {
 private:
  std::u32string_view m_text;
};

class LRUCache {
Can you find some high-quality C++ LRUCache implementations on GitHub for reference?
HF also implements a cache for BPE: https://github.com/huggingface/transformers/blob/6f316016877197014193b9463b2fd39fa8f0c8e4/src/transformers/models/gpt2/tokenization_gpt2.py#L216C6-L216C6
We use an LRU cache algorithm for the same in C++.
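For comparison, the HF GPT2 tokenizer cache linked above is, at that revision, a plain unbounded dict: every bpe(token) result is memoized for the tokenizer's lifetime. A rough C++ analogue (sketch only; unbounded growth is exactly why a capped LRU suits mobile memory budgets better):

#include <string>
#include <unordered_map>

std::unordered_map<std::string, std::string> bpe_cache;

std::string BpeWithCache(const std::string& token) {
  auto it = bpe_cache.find(token);
  if (it != bpe_cache.end()) return it->second;  // memoized result
  std::string word = token;  // placeholder for the real BPE merge loop
  return bpe_cache.emplace(token, word).first->second;
}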