izikeros/count_tokens

Count tokens

A simple tool with one purpose: counting tokens in a text file.

Requirements

This package uses the tiktoken library for tokenization.

Installation

For command-line use, install the package in an isolated environment with pipx:

pipx install count-tokens

or install it in your current environment with pip:

pip install count-tokens

Usage

Open a terminal and run:

count-tokens document.txt

You should see something like this:

File: document.txt
Encoding: cl100k_base
Number of tokens: 67

If you want to see only the token count, run:

count-tokens document.txt --quiet

and the output will be:

67

To use an encoding other than the default cl100k_base, pass the -e or --encoding argument:

count-tokens document.txt -e r50k_base

NOTE: tiktoken supports the following encodings used by OpenAI models:

Encoding name        OpenAI models
o200k_base           gpt-4o, gpt-4o-mini
cl100k_base          gpt-4, gpt-3.5-turbo, text-embedding-ada-002
p50k_base            Codex models, text-davinci-002, text-davinci-003
r50k_base (or gpt2)  GPT-3 models like davinci

(source: OpenAI Cookbook)
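
The command-line calls shown above can also be driven from a script. The following is a minimal sketch, not part of the package: the helper names (MODEL_ENCODINGS, build_command, parse_quiet_output, count_tokens_cli) are hypothetical, the lookup covers only the models named in the table, and it assumes that --quiet and -e can be combined and that --quiet prints a single integer.

```python
import subprocess

# Model-to-encoding lookup copied from the table above (hypothetical helper,
# limited to the models that the table names explicitly).
MODEL_ENCODINGS = {
    "gpt-4o": "o200k_base",
    "gpt-4o-mini": "o200k_base",
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def build_command(path, model=None):
    """Build the count-tokens argument list for a file and an optional model."""
    cmd = ["count-tokens", path, "--quiet"]
    if model is not None:
        # Assumption: -e may be combined with --quiet.
        cmd += ["-e", MODEL_ENCODINGS[model]]
    return cmd


def parse_quiet_output(stdout):
    """Convert the --quiet output (assumed to be a bare integer) to an int."""
    return int(stdout.strip())


def count_tokens_cli(path, model=None):
    """Run the CLI and return the token count; requires count-tokens on PATH."""
    result = subprocess.run(
        build_command(path, model), capture_output=True, text=True, check=True
    )
    return parse_quiet_output(result.stdout)
```

With the package installed, count_tokens_cli("document.txt", model="davinci") would run count-tokens document.txt --quiet -e r50k_base and return the printed count as an integer.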

Approximate number of tokens

If you need results faster and do not need an exact count, use the --approx parameter with w to approximate from the number of words, or c to approximate from the number of characters.

count-tokens document.txt --approx w

The approximation assumes 4/3 (1 and 1/3) tokens per word and 4 characters per token.
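
The arithmetic behind the two approximation modes can be sketched as follows. This is an illustration of the stated ratios only, not the package's actual implementation: the function names are hypothetical, and splitting on whitespace and rounding to the nearest integer are assumptions.

```python
TOKENS_PER_WORD = 4 / 3   # ratio stated above for word-based approximation
CHARS_PER_TOKEN = 4       # ratio stated above for character-based approximation


def approx_tokens_from_words(text: str) -> int:
    """Estimate tokens from the word count (assumption: words are whitespace-separated)."""
    words = len(text.split())
    return round(words * TOKENS_PER_WORD)


def approx_tokens_from_chars(text: str) -> int:
    """Estimate tokens from the character count (assumption: round to nearest)."""
    return round(len(text) / CHARS_PER_TOKEN)


if __name__ == "__main__":
    sample = "This is a string."
    print(approx_tokens_from_words(sample))
    print(approx_tokens_from_chars(sample))
```

The two estimates generally differ from each other and from the exact count, so they are best suited to rough sizing rather than enforcing hard token limits.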

Programmatic usage

from count_tokens.count import count_tokens_in_file

num_tokens = count_tokens_in_file("document.txt")

from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.")

For both functions, you can use the encoding parameter to specify the encoding used by the model:

from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.", encoding="cl100k_base")

The default value for encoding is cl100k_base.

Related Projects

  • tiktoken - tokenization library used by this package
  • ttok - count and truncate text based on tokens

Credits

Thanks to the authors of the tiktoken library for open sourcing their work.

License

MIT © Krystian Safjan.