-
Notifications
You must be signed in to change notification settings - Fork 10.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make tokenize CLI tool have nicer command line arguments. #6188
Conversation
Just noticed the CI doesn't run...maybe because it's my first PR and I'm not on an allowlist? Do I have a way to run the compilation tests somehow myself? Edit: Oh er just as I wrote this I see things building. NVM. |
Before this commit, tokenize was a simple CLI tool like this: tokenize MODEL_FILENAME PROMPT [--ids] This simple tool loads the model, takes the prompt, and shows the tokens llama.cpp is interpreting. This changeset makes the tokenize more sophisticated, and more useful for debugging and troubleshooting: tokenize [-m, --model MODEL_FILENAME] [--ids] [--stdin] [--prompt] [-f, --file] [--no-bos] [--log-disable] It also behaves nicer on Windows now, interpreting and rendering Unicode from command line arguments and pipes no matter what code page the user has set on their terminal.
0e5b526
to
cd7b5f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will still recognize the old form (i.e. simple positional arguments) just to not surprise people. Although I would myself like to remove it entirely, to simplify the thing. Not sure anyone actually uses this tool except for ad-hoc testing like I do. Opinions on completely removing the "old style arguments"?
Yes, let's remove the old style arguments to simplify
…guments. It must now be invoked with long --model, --prompt etc. arguments only. Shortens the code.
Appears everyone is happy so far with this PR and will merge once CI issue in main branch is all sorted. |
Tokenize CLI tool is one of the tools in
examples/*
. It's a pretty short and simple tool that takes arguments like this:And it would load the model, read the prompt, and then print a list of tokens it interpreted, or if --ids was given, just the integer values.
This changeset makes the command a bit more sophisticated with more options:
It will still recognize the old form (i.e. simple positional arguments) just to not surprise people. Although I would myself like to remove it entirely, to simplify the thing. Not sure anyone actually uses this tool except for ad-hoc testing like I do. Opinions on completely removing the "old style arguments"?
Motivation: I've been using this tool for my own tests with tokenization divergence investigations. I find it useful to do quick ad-hoc tests on text tokenization and comparisons. In particular I wanted it to behave nice if you give it a filename or pipe into it from
stdin
.I took my hacks and cleaned them up into nicer looking command line arguments, following the style and argument names of some other CLI tools I saw. Also in general I added some error checking etc. so you are more likely to get a readable error than a segfault if you did something wrong.
Draft because I need to test some of the argument combinations and also Windows, and I want to see the CI results on GitHub here. I think the stdin reading as it is written might be sketchy on Windows, if you try to physically type letters, which would now become a feature of
tokenize
.(
std::cin
does not have.is_open()
? Got really confused when I was trying to write code to check did we read fromstdin
properly without syscall failures and trying to figure out if the code is checking syscall failures in a waterproof way. I'm a C programmer not a C++ one dammit)