
How to convert tokenizer of SmolLM model as accepted by executorch #6813

Open
Arpit2601 opened this issue Nov 13, 2024 · 6 comments
Labels
- module: extension (Issues related to code under extension/)
- module: user experience (Issues related to reducing friction for users)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@Arpit2601 commented Nov 13, 2024

Hi,
I am trying to convert the SmolLM-135M-Instruct model to .pte format and then run it on an Android device.
I have been able to convert the model, but executorch requires the tokenizer either as a .bin file or as a .model file (which can then be converted to .bin). On Hugging Face, however, no tokenizer.model or tokenizer.bin file is available for this model, only tokenizer.json.

How would I go about converting the tokenizer.json file into the appropriate format?
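
A short sketch of the starting point: confirm which tokenizer files the repo actually ships, then dump the BPE vocab/merges that a converter would need. This assumes the repo id is HuggingFaceTB/SmolLM-135M-Instruct and that SmolLM ships a GPT-2-style byte-level BPE fast tokenizer; both are assumptions, not confirmed here.

# Sketch: inspect the repo's tokenizer files, then save the BPE vocab/merges locally.
from huggingface_hub import list_repo_files
from transformers import AutoTokenizer

repo_id = "HuggingFaceTB/SmolLM-135M-Instruct"  # assumed repo id

# Prints tokenizer.json & co., but no tokenizer.model / tokenizer.bin
print([f for f in list_repo_files(repo_id) if "token" in f.lower()])

# For a GPT-2-style BPE fast tokenizer, save_pretrained also writes
# vocab.json and merges.txt alongside tokenizer.json.
AutoTokenizer.from_pretrained(repo_id).save_pretrained("./smollm_tokenizer")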

cc @mergennachin @byjlw

@GregoryComer added the module: extension and triaged labels and removed the module: examples label on Nov 13, 2024
@larryliu0820 (Contributor) commented

@guangy10 do you know the answer to this?

@guangy10 (Contributor) commented

I tried it a while ago. tokenizer.save_pretrained saves the JSON format; even with legacy=True it doesn't save to a format that the llama_runner can accept. I was trying to use tokenizers' save for the conversion, as shown below. It's WIP and I haven't had a chance to get back to it:

import os
import argparse

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace


def create_tokenizer_model(input_dir, output_file):
    vocab_file = os.path.join(input_dir, "vocab.json")
    merges_file = os.path.join(input_dir, "merges.txt")

    # Create a BPE model from the vocab/merges files
    bpe = BPE.from_file(vocab_file, merges_file)

    # Create the tokenizer
    tokenizer = Tokenizer(bpe)

    # Set the pre-tokenizer
    tokenizer.pre_tokenizer = Whitespace()

    # Save the tokenizer (note: Tokenizer.save writes JSON)
    tokenizer.save(output_file)
    print(f"Tokenizer model saved to {output_file}")

    # Verify the saved tokenizer round-trips
    loaded_tokenizer = Tokenizer.from_file(output_file)
    test_text = "Hello, world! This is a test."
    encoded = loaded_tokenizer.encode(test_text)
    decoded = loaded_tokenizer.decode(encoded.ids)
    print(f"Test encoding/decoding: '{test_text}' -> '{decoded}'")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Create a tokenizer")
    parser.add_argument("--input_dir", help="Directory containing vocab.json and merges.txt")
    parser.add_argument("--output", default="tokenizer.model", help="Output file name (default: tokenizer.model)")
    args = parser.parse_args()
    create_tokenizer_model(args.input_dir, args.output)
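
For reference, a run would look something like this (the script filename and input directory are hypothetical; the input directory is wherever vocab.json and merges.txt were saved, e.g. by save_pretrained above):

python create_tokenizer.py --input_dir ./smollm_tokenizer --output tokenizer.model

Note that tokenizers' Tokenizer.save writes JSON, so the resulting tokenizer.model is still a JSON file rather than the binary format the llama_runner expects, which is presumably why this is still WIP.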

@Arpit2601 (Author) commented

Thanks @guangy10 for sharing your WIP script. I tried iterating on it, but it would be great if you could share some pointers to get it working.

@guangy10 (Contributor) commented

> Thanks @guangy10 for sharing your WIP script. I tried iterating on it, but it would be great if you could share some pointers to get it working.

Will keep you posted when I get back to this work.

@shirishgone commented

So as of today, LlamaRunner only runs Llama models.
Are there any steps on how to run other models, such as the Qwen 0.5 model?
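
(For context, the Llama example runner is invoked roughly as below once a .pte model and a tokenizer.bin exist; the binary path, file names, and flags follow the ExecuTorch Llama example at the time and are illustrative, not exact.)

# Illustrative invocation of the example Llama runner (paths/flags may differ by version)
cmake-out/examples/models/llama2/llama_main --model_path=model.pte --tokenizer_path=tokenizer.bin --prompt="Hello"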

@arjunsingri commented

@Arpit2601

Were you successful in getting it to work on executorch? If yes, are you able to share the steps?

@mergennachin added the module: user experience label on Feb 4, 2025
@github-project-automation bot moved this to To triage in ExecuTorch DevX on Feb 4, 2025