diff --git a/README.md b/README.md
index 4e13cef..38dd117 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ $ cd rwkv-tokenizer
 $ pytest
 ```
 
-We did a performance comparison on [the simple English Wikipedia dataset 20220301.en](https://huggingface.co/datasets/legacy-datasets/wikipedia) among following tokenizer:
+We did a performance comparison on [the simple English Wikipedia dataset 20220301.en](https://huggingface.co/datasets/legacy-datasets/wikipedia)* among the following tokenizers:
 - The original RWKV tokenizer (BlinkDL)
 - Huggingface implementaion of RWKV tokenizer
 - Huggingface LLama tokenizer
@@ -55,6 +55,9 @@ tokenizer is around 17x faster than the original tokenizer and 9.6x faster than
 
 ![performance-comparison](data/performance-comparison.png)
 
+*The simple English Wikipedia dataset can be downloaded as a JSONL file from
+https://huggingface.co/datasets/cahya/simple-wikipedia/resolve/main/simple-wikipedia.jsonl?download=true
+
 ## Bugs
 ~~There are still bugs where some characters are not encoded correctly.~~ The bug have been fixed in the version 0.3.0.
 *This tokenizer is my very first Rust program, so it might still have many bugs and silly codes :-)*
\ No newline at end of file
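
As context for this change: below is a minimal sketch, separate from the diff itself, of how the simple-wikipedia JSONL dump linked in the added footnote could be downloaded and streamed for a benchmark like the one described. The local file name and the `text` field name are assumptions for illustration, not taken from the repository.

```python
# Minimal sketch: fetch the simple-wikipedia JSONL dump referenced in the
# diff above and stream it article by article. The "text" field name is an
# assumption about the dataset schema, not confirmed by the repository.
import json
import urllib.request

URL = ("https://huggingface.co/datasets/cahya/simple-wikipedia/"
       "resolve/main/simple-wikipedia.jsonl?download=true")

# Download the dump once and cache it locally.
urllib.request.urlretrieve(URL, "simple-wikipedia.jsonl")

# Each line of a JSONL file is one standalone JSON object.
with open("simple-wikipedia.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines defensively
        article = json.loads(line)
        text = article["text"]  # assumed field name
        # ...feed `text` to each tokenizer under test and time its encode()...
```

Streaming the file line by line keeps memory use flat regardless of dump size, which matters when timing several tokenizers over the same corpus.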