Commit

update README.md
cahya-wirawan committed Aug 17, 2024
1 parent b2336e8 commit 80a4d9e
Showing 3 changed files with 6 additions and 4 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -74,3 +74,4 @@ docs/_build/
 tmp/
 tools/rwkv_tokenizers_bpe.ipynb
 data/wiki-en-tiny.jsonl
+target/
3 changes: 2 additions & 1 deletion README.md
@@ -56,7 +56,8 @@ tokenizer is around 17x faster than the original tokenizer and 9.6x faster than
 
 ![performance-comparison](data/performance-comparison.png)
 
-We compared also the multithreading/batch encoding performance using the [Huggingface Tokenizers comparison script](https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py):
+We also compared the multithreading/batch encoding performance using a [script](tools/test_tiktoken-huggingface-rwkv.py)
+which is based on the [Huggingface Tokenizers](https://github.com/huggingface/tokenizers) comparison script:
 ![performance-comparison](data/performance-comparison-multithreading.png)
 
 *The simple English Wikipedia dataset can be downloaded as jsonl file from
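For context, a minimal sketch of the batch-encoding comparison the updated README line points to. The tiktoken and Huggingface calls mirror the script changed below; the rwkv-tokenizer import and constructor are assumptions about this repo's Python binding and are left commented out:

```python
# Minimal sketch of the multithreaded batch-encoding comparison, assuming a
# GPT-2 vocabulary for tiktoken and Huggingface. Not the full benchmark script.
import time

import tiktoken
from tokenizers import Tokenizer

documents = ["Hello world! "] * 10_000  # stand-in for the wiki-en jsonl lines
num_bytes = sum(len(d.encode("utf-8")) for d in documents)

enc = tiktoken.get_encoding("gpt2")
hf_enc = Tokenizer.from_pretrained("gpt2")


def throughput(label: str, fn) -> None:
    """Time one batch-encoding call and report bytes per second."""
    start = time.perf_counter_ns()
    fn()
    end = time.perf_counter_ns()
    print(f"{label}\t{num_bytes / (end - start) * 1e9:,.0f} B/s")


# tiktoken parallelizes internally across num_threads worker threads.
throughput("tiktoken", lambda: enc.encode_ordinary_batch(documents, num_threads=8))
# encode_batch_fast skips offset bookkeeping, so it is the fast path.
throughput("huggingface", lambda: hf_enc.encode_batch_fast(documents))
# Assumed binding for this repo's tokenizer -- check the README for the
# actual package and constructor names:
# import pyrwkv_tokenizer
# rwkv_enc = pyrwkv_tokenizer.RWKVTokenizer()
# throughput("rwkv", lambda: rwkv_enc.encode_batch(documents))
```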
6 changes: 3 additions & 3 deletions tools/test_tiktoken-huggingface-rwkv.py
@@ -78,20 +78,20 @@ def benchmark_batch(model: str, documents: list[str], num_threads: int, document
     enc.encode_ordinary_batch(documents, num_threads=num_threads)
     end = time.perf_counter_ns()
 
-    readable_size, unit = format_byte_size(num_bytes / (end - start) * 1e9)
+    readable_size, unit = format_byte_size(int(num_bytes / (end - start) * 1e9))
     print(f"tiktoken \t{readable_size}/s")
 
 
     start = time.perf_counter_ns()
     hf_enc.encode_batch_fast(documents)
     end = time.perf_counter_ns()
-    readable_size, unit = format_byte_size(num_bytes / (end - start) * 1e9)
+    readable_size, unit = format_byte_size(int(num_bytes / (end - start) * 1e9))
     print(f"huggingface \t{readable_size}/s")
 
     start = time.perf_counter_ns()
     rwkv_enc.encode_batch(documents)
     end = time.perf_counter_ns()
-    readable_size, unit = format_byte_size(num_bytes / (end - start) * 1e9)
+    readable_size, unit = format_byte_size(int(num_bytes / (end - start) * 1e9))
     print(f"rwkv \t\t{readable_size}/s")
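All three changes in this file are the same fix: the bytes-per-second expression is a float, and it is now truncated with `int()` before being handed to `format_byte_size`. A plausible reading, assuming the helper is annotated to take an `int`, is sketched below; the actual `format_byte_size` is defined earlier in the script and may differ:

```python
# Hypothetical format_byte_size consistent with the call sites above; the
# real helper is not shown in this diff.
def format_byte_size(num_bytes: int) -> tuple[str, str]:
    """Render a byte count as a human-readable size string plus its unit."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.2f} {unit}", unit
        size /= 1024
    return f"{size:.2f} TB", "TB"


# num_bytes / (end - start) * 1e9 is a float; int() truncates it so the
# argument matches the int annotation.
readable_size, unit = format_byte_size(int(3.2e9))
print(f"rwkv \t\t{readable_size}/s")  # -> rwkv         2.98 GB/s
```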
