Make tokenize CLI tool have nicer command line arguments. #6188

Noeda · 2024-03-20T23:00:17Z

The tokenize CLI tool is one of the tools in examples/*. It's a pretty short and simple tool that takes arguments like this:

  tokenize MODEL_FILENAME PROMPT [--ids]

It loads the model, reads the prompt, and then prints the list of tokens it interpreted, or, if --ids was given, just the integer values.

This changeset makes the command a bit more sophisticated with more options:

mikkojuola@Mikkos-Mac-Studio ~/llama.cpp> ./build/bin/tokenize
usage: ./build/bin/tokenize [options]

The tokenize program tokenizes a prompt using a given model,
and prints the resulting tokens to standard output.

It needs a model file, a prompt, and optionally other flags
to control the behavior of the tokenizer.

Invoke './build/bin/tokenize' like this:

    ./build/bin/tokenize MODEL_FNAME PROMPT [--ids]

  or this:

    ./build/bin/tokenize [options], where options are:

    -h, --help                           print this help and exit
    -m MODEL_PATH, --model MODEL_PATH    path to model.
    --ids                                if given, only print numerical token IDs, and not token strings.
    -f PROMPT_FNAME, --file PROMPT_FNAME read prompt from a file.
    -p PROMPT, --prompt PROMPT           read prompt from the argument.
    --stdin                              read prompt from standard input.
    --no-bos                             do not ever add a BOS token to the prompt, even if normally the model uses a BOS token.
    --log-disable

It will still recognize the old form (i.e. simple positional arguments) just to not surprise people. Although I would myself like to remove it entirely, to simplify the thing. Not sure anyone actually uses this tool except for ad-hoc testing like I do. Opinions on completely removing the "old style arguments"?

Motivation: I've been using this tool for my own tests while investigating tokenization divergences. I find it useful for quick ad-hoc tests on text tokenization and comparisons. In particular I wanted it to behave nicely if you give it a filename or pipe into it from stdin.

I took my hacks and cleaned them up into nicer looking command line arguments, following the style and argument names of some other CLI tools I saw. Also in general I added some error checking etc. so you are more likely to get a readable error than a segfault if you did something wrong.
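For illustration, the kind of error checking described above could look roughly like the sketch below. This is not the code from this PR: the `tokenize_args` struct, the `parse_args` function, the error strings, and the rule that exactly one prompt source must be given are all assumptions made for the example. Only the flag names come from the help text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the parsed options (names are illustrative).
struct tokenize_args {
    std::string model_path;
    std::string prompt;
    std::string prompt_file;
    bool ids         = false;
    bool from_stdin  = false;
    bool no_bos      = false;
    bool log_disable = false;
};

// Returns true on success. On failure, returns false and puts a readable
// message in `error`, so the caller can print it instead of crashing later.
bool parse_args(const std::vector<std::string> & argv, tokenize_args & out, std::string & error) {
    tokenize_args args;
    int prompt_sources = 0; // how many of --prompt / --file / --stdin were given

    for (std::size_t i = 1; i < argv.size(); i++) {
        const std::string & arg = argv[i];
        // Options that take a value must not be the last argument.
        const bool has_value = i + 1 < argv.size();

        if (arg == "-m" || arg == "--model") {
            if (!has_value) { error = "missing value for " + arg; return false; }
            args.model_path = argv[++i];
        } else if (arg == "-p" || arg == "--prompt") {
            if (!has_value) { error = "missing value for " + arg; return false; }
            args.prompt = argv[++i];
            prompt_sources++;
        } else if (arg == "-f" || arg == "--file") {
            if (!has_value) { error = "missing value for " + arg; return false; }
            args.prompt_file = argv[++i];
            prompt_sources++;
        } else if (arg == "--stdin") {
            args.from_stdin = true;
            prompt_sources++;
        } else if (arg == "--ids") {
            args.ids = true;
        } else if (arg == "--no-bos") {
            args.no_bos = true;
        } else if (arg == "--log-disable") {
            args.log_disable = true;
        } else {
            error = "unknown argument: " + arg;
            return false;
        }
    }

    if (args.model_path.empty()) {
        error = "no model given, use --model";
        return false;
    }
    // Assumption for this sketch: exactly one prompt source is required.
    if (prompt_sources != 1) {
        error = "give exactly one of --prompt, --file, --stdin";
        return false;
    }

    out = args;
    return true;
}
```

A caller would build the vector from `argv`, and on a `false` return print `error` to stderr and exit with a non-zero status.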

Draft because I need to test some of the argument combinations, including on Windows, and I want to see the CI results on GitHub here. I think the stdin reading as it is written might be sketchy on Windows if you try to physically type letters, which would now become a feature of tokenize.

(std::cin does not have .is_open()? I got really confused when I was trying to write code that checks whether we read from stdin properly, without syscall failures, and trying to figure out whether the code checks for syscall failures in a watertight way. I'm a C programmer, not a C++ one, dammit.)

Noeda · 2024-03-20T23:14:16Z

Just noticed the CI doesn't run...maybe because it's my first PR and I'm not on an allowlist? Do I have a way to run the compilation tests somehow myself?

Edit: Oh er just as I wrote this I see things building. NVM.

Before this commit, tokenize was a simple CLI tool like this:

  tokenize MODEL_FILENAME PROMPT [--ids]

This simple tool loads the model, takes the prompt, and shows the tokens llama.cpp is interpreting.

This changeset makes tokenize more sophisticated, and more useful for debugging and troubleshooting:

  tokenize [-m, --model MODEL_FILENAME] [--ids] [--stdin] [--prompt] [-f, --file] [--no-bos] [--log-disable]

It also behaves nicer on Windows now, interpreting and rendering Unicode from command line arguments and pipes no matter what code page the user has set on their terminal.

Noeda · 2024-03-26T21:26:39Z

Added a bunch of stuff since the last commit, as part of wrestling with Windows cmd.exe console crap:

A lot of new code to interpret and render characters properly on Windows. Good lord Windows is annoying. But you now get correctly interpreted and rendered text out of the box without setting any code pages (although you may have to set a font depending on what text you use).
--ids now prints in a format that parses directly as Python or JSON (useful for sketchy pipe shenanigans)
--log-disable to silence stderr (consistent with main)
Fixed the style, like the * pointer stuff, to be more consistent with the rest of the codebase.
Prints "failed utf-8 decode" on tokens that don't parse as UTF-8. It seems fairly common with modern models that individual tokens don't decode to valid UTF-8. I made it print hex codes instead so you see the bytes it wants to decode as, even if we can't render them properly.
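As an illustration of the last item, here is a minimal sketch of checking whether a token's bytes are well-formed UTF-8 and falling back to hex codes when they are not. It is not the code from this PR: the function names `is_valid_utf8`, `to_hex` and `render_piece` are hypothetical, and the exact hex format (space-separated `0x..` bytes) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if `s` is well-formed UTF-8 (no truncated sequences,
// overlong encodings, surrogates, or code points above U+10FFFF).
bool is_valid_utf8(const std::string & s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        std::uint32_t cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false; // stray continuation byte or invalid lead byte
        if (i + len > n) return false; // sequence is cut off
        for (std::size_t k = 1; k < len; k++) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false;             // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                // outside Unicode
        i += len;
    }
    return true;
}

// Renders raw bytes as space-separated hex codes, e.g. "0xe3 0x81".
std::string to_hex(const std::string & s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += "0x";
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// Shows the token text when it is valid UTF-8, otherwise its bytes as hex.
std::string render_piece(const std::string & piece) {
    return is_valid_utf8(piece) ? piece : to_hex(piece);
}
```

A token holding only the first two bytes of a three-byte character, for example, would be rendered as its hex codes rather than as broken text.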

I noticed midway that we had similar code handling Windows stuff in common/console.cpp. Not exactly what I needed for the tokenize but I added a TODO comment about it, and made the Windows bits a bit more general so maybe a later contribution has an easier time moving that to common code.

--help looks like this now:

shannon@junko ~/llama.cpp/build/bin> ./tokenize --help
usage: ./tokenize [options]

The tokenize program tokenizes a prompt using a given model,
and prints the resulting tokens to standard output.

It needs a model file, a prompt, and optionally other flags
to control the behavior of the tokenizer.

Invoke './tokenize' like this:

    ./tokenize MODEL_FNAME PROMPT [--ids]

  or this:

    ./tokenize [options], where options are:

    -h, --help                           print this help and exit
    -m MODEL_PATH, --model MODEL_PATH    path to model.
    --ids                                if given, only print numerical token IDs, and not token strings.
                                         The output format looks like [1, 2, 3], i.e. parseable by Python.
    -f PROMPT_FNAME, --file PROMPT_FNAME read prompt from a file.
    -p PROMPT, --prompt PROMPT           read prompt from the argument.
    --stdin                              read prompt from standard input.
    --no-bos                             do not ever add a BOS token to the prompt, even if normally the model uses a BOS token.
    --log-disable                        disable logs. Makes stderr quiet when loading the model.
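The `[1, 2, 3]` format mentioned for --ids is simple to produce. A minimal sketch, not the code from this PR, with `format_ids` as a hypothetical helper name and token IDs assumed to be 32-bit integers:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Formats token IDs as "[1, 2, 3]", which parses as a Python list
// and as a JSON array.
std::string format_ids(const std::vector<std::int32_t> & ids) {
    std::string out = "[";
    for (std::size_t i = 0; i < ids.size(); i++) {
        if (i > 0) out += ", ";
        out += std::to_string(ids[i]);
    }
    out += "]";
    return out;
}
```

An empty token list comes out as `[]`, which is still valid in both formats.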

Some checking that all is good

Verified that tokens are interpreted the same way on cmd.exe and Linux, and also that Windows renders tokens correctly (when valid UTF-8):

Windows cmd.exe, with --prompt こんにちは

Reading from a file looks fine:

Piping on Windows works too:

Checked that we get same IDs for こんにちは on Mac (got same on Linux too):

If the CI doesn't complain and there's no other feedback to fix I'm done with the PR.

ggerganov

It will still recognize the old form (i.e. simple positional arguments) just to not surprise people. Although I would myself like to remove it entirely, to simplify the thing. Not sure anyone actually uses this tool except for ad-hoc testing like I do. Opinions on completely removing the "old style arguments"?

Yes, let's remove the old style arguments to simplify


mofosyne · 2024-05-22T07:33:10Z

It appears everyone is happy with this PR so far. Will merge once the CI issue in the main branch is sorted out.

ggerganov approved these changes Mar 21, 2024

Noeda force-pushed the tokenizer-nicer-args branch from 0e5b526 to cd7b5f7 Compare March 26, 2024 21:13

Noeda marked this pull request as ready for review March 26, 2024 21:13

Noeda requested a review from ggerganov March 26, 2024 21:31

cebtenzzre reviewed Mar 26, 2024

style fix: strlen(str) == 0 --> *str == 0

a837649

ggerganov approved these changes Mar 27, 2024

Simplify tokenize.cpp; by getting rid of handling positional style arguments. It must now be invoked with long --model, --prompt etc. arguments only. Shortens the code.

71a0867

jdh1166 approved these changes Apr 25, 2024

Merge branch 'master' into tokenizer-nicer-args

877f059

ggerganov reviewed May 9, 2024

mofosyne added the enhancement and Review Complexity : Medium labels May 9, 2024

tokenize.cpp: iostream header no longer required

4aaeb42

github-actions bot added the examples label May 22, 2024

mofosyne added the merge ready label May 22, 2024

mofosyne merged commit 5768433 into ggerganov:master May 25, 2024
57 of 70 checks passed
