
How to convert tokenizer of SmolLM model as accepted by executorch #6813

Open
Arpit2601 opened this issue Nov 13, 2024 · 6 comments
Labels
- module: extension (Issues related to code under extension/)
- module: user experience (Issues related to reducing friction for users)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@Arpit2601 commented Nov 13, 2024

Hi,
I am trying to convert the SmolLM-135M-Instruct model to .pte format and then run it on an Android device.
I have been able to convert the model, but executorch requires the tokenizer either as a .bin file or as a .model file (which can then be converted to .bin). On Hugging Face, however, no tokenizer.model or tokenizer.bin file is available for this model, only tokenizer.json.

How would I go about converting the tokenizer.json file into the appropriate format?
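
A short sketch of the starting point: confirm which tokenizer files the repo actually ships, then dump the BPE vocab/merges that a converter would need. This assumes the repo id is HuggingFaceTB/SmolLM-135M-Instruct and that SmolLM ships a GPT-2-style byte-level BPE fast tokenizer; both are assumptions, not confirmed here.

# Sketch: inspect the repo's tokenizer files, then save the BPE vocab/merges locally.
from huggingface_hub import list_repo_files
from transformers import AutoTokenizer

repo_id = "HuggingFaceTB/SmolLM-135M-Instruct"  # assumed repo id

# Prints tokenizer.json & co., but no tokenizer.model / tokenizer.bin
print([f for f in list_repo_files(repo_id) if "token" in f.lower()])

# For a GPT-2-style BPE fast tokenizer, save_pretrained also writes
# vocab.json and merges.txt alongside tokenizer.json.
AutoTokenizer.from_pretrained(repo_id).save_pretrained("./smollm_tokenizer")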

cc @mergennachin @byjlw

@GregoryComer added the module: extension and triaged labels and removed the module: examples label on Nov 13, 2024
@larryliu0820 (Contributor) commented

@guangy10 do you know the answer to this?

@guangy10 (Contributor) commented

I tried it a while ago. tokenizer.save_pretrained saves the JSON format; even with legacy=True it doesn't save to a format that the llama_runner can accept. I was trying to use tokenizers' save for the conversion, as shown below. It's WIP and I haven't had a chance to get back to it:

import os
import argparse

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace


def create_tokenizer_model(input_dir, output_file):
    vocab_file = os.path.join(input_dir, "vocab.json")
    merges_file = os.path.join(input_dir, "merges.txt")

    # Create a BPE model from the vocab/merges files
    bpe = BPE.from_file(vocab_file, merges_file)

    # Create the tokenizer
    tokenizer = Tokenizer(bpe)

    # Set the pre-tokenizer
    tokenizer.pre_tokenizer = Whitespace()

    # Save the tokenizer (note: Tokenizer.save writes JSON)
    tokenizer.save(output_file)
    print(f"Tokenizer model saved to {output_file}")

    # Verify the saved tokenizer round-trips
    loaded_tokenizer = Tokenizer.from_file(output_file)
    test_text = "Hello, world! This is a test."
    encoded = loaded_tokenizer.encode(test_text)
    decoded = loaded_tokenizer.decode(encoded.ids)
    print(f"Test encoding/decoding: '{test_text}' -> '{decoded}'")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Create a tokenizer")
    parser.add_argument("--input_dir", help="Directory containing vocab.json and merges.txt")
    parser.add_argument("--output", default="tokenizer.model", help="Output file name (default: tokenizer.model)")
    args = parser.parse_args()
    create_tokenizer_model(args.input_dir, args.output)
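
For reference, a run would look something like this (the script filename and input directory are hypothetical; the input directory is wherever vocab.json and merges.txt were saved, e.g. by save_pretrained above):

python create_tokenizer.py --input_dir ./smollm_tokenizer --output tokenizer.model

Note that tokenizers' Tokenizer.save writes JSON, so the resulting tokenizer.model is still a JSON file rather than the binary format the llama_runner expects, which is presumably why this is still WIP.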

@Arpit2601 (Author) commented

Thanks @guangy10 for sharing your WIP script. I tried iterating on it, but it would be great if you could share some pointers to get it working.

@guangy10 (Contributor) commented

> Thanks @guangy10 for sharing your WIP script. I tried iterating on it, but it would be great if you could share some pointers to get it working.

Will keep you posted when I get back to this work.

@shirishgone commented

So as of today, LlamaRunner only runs Llama models.
Are there any steps on how to run other models, such as the Qwen 0.5 model?
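
(For context, the Llama example runner is invoked roughly as below once a .pte model and a tokenizer.bin exist; the binary path, file names, and flags follow the ExecuTorch Llama example at the time and are illustrative, not exact.)

# Illustrative invocation of the example Llama runner (paths/flags may differ by version)
cmake-out/examples/models/llama2/llama_main --model_path=model.pte --tokenizer_path=tokenizer.bin --prompt="Hello"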

@arjunsingri commented

@Arpit2601

Were you successful in getting it to work on executorch? If yes, are you able to share the steps?

@mergennachin added the module: user experience label on Feb 4, 2025
@github-project-automation bot moved this to To triage in ExecuTorch DevX on Feb 4, 2025