
Vocab size mismatch #3900

Closed
eswarthammana opened this issue Nov 2, 2023 · 10 comments
@eswarthammana

Llama_2_7B-chat: vocab size mismatch (model has -1, but tokenizer.model has 32000).

@yaashwardhan

This is an easy fix.

There should be a .json file (probably params.json) inside the llama-2-7b-chat folder.
Open the JSON file and change "vocab_size" from -1 to 32000.

Here is my params.json file:

{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": 32000}

@eswarthammana

Thanks for the fix. I had downloaded an older version of the repo a few days back and did not face this issue there. :)

@ishowshao

Same issue. Why does Meta write -1?

@lorddoskias

I also hit this, and I think the correct way to fix it is in the convert script: simply remove "vocab_size" when it equals -1, which results in it being derived from tok_embeddings.weight.
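
A sketch of how that could look in the convert script's params loading (illustrative only, not the actual llama.cpp code; the function and argument names here are made up):

import json

def load_params(path, tok_embeddings_rows):
    # tok_embeddings_rows: number of rows in tok_embeddings.weight, i.e. the
    # vocabulary size implied by the checkpoint itself.
    with open(path) as f:
        params = json.load(f)

    # Meta ships vocab_size = -1 as a placeholder in some params.json files.
    # Drop it and fall back to the size derived from the embedding matrix.
    if params.get("vocab_size", -1) == -1:
        params["vocab_size"] = tok_embeddings_rows

    return params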

lorddoskias pushed a commit to lorddoskias/llama.cpp that referenced this issue Nov 6, 2023
When vocab_size is detected to be -1, simply remove its value from the parsed params.json and fall back to using tok_embeddings.weight.

Fixes ggerganov#3900
@kaiwren

kaiwren commented Nov 16, 2023

+1 I ran into this also (Exception: Vocab size mismatch (model has -1, but ../llama/tokenizer.model has 32000)) today with llama-2-7b.

@guertsen

guertsen commented Nov 23, 2023

Had the same issue with llama-2-7b.

@rsbepvb

rsbepvb commented Nov 26, 2023

Confirming this fixed the issue with the most recent Llama download.

@ursachec

ursachec commented Dec 7, 2023

Ran into a similar issue today, using the current tip of master bcc0eb4:

$ python3 llama.cpp/convert.py ./Magicoder-S-DS-6.7B/ --outtype f16 --outfile magicoder-S-DS-6.7B.FP16.gguf
//...
Exception: Vocab size mismatch (model has 32256, but Magicoder-S-DS-6.7B/tokenizer.model combined with Magicoder-S-DS-6.7B/added_tokens.json has 32022)
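
For mismatches like this, one quick diagnostic is to recount the tokenizer side yourself and compare it with the model's reported vocab size (a rough sketch, assuming the sentencepiece package and the file names from the command above):

import json
from sentencepiece import SentencePieceProcessor

# Paths are illustrative; point them at the model directory in question.
sp = SentencePieceProcessor(model_file="Magicoder-S-DS-6.7B/tokenizer.model")
base = sp.vocab_size()

with open("Magicoder-S-DS-6.7B/added_tokens.json") as f:
    added = len(json.load(f))

# This is the "tokenizer.model combined with added_tokens.json" count from the
# exception; compare it with the vocab size the model itself reports (32256 here).
print(f"tokenizer vocab: {base} + {added} added = {base + added}")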

@FotieMConstant

I confirm this fixed the issue with Llama 2 7B models.


This issue was closed because it has been inactive for 14 days since being marked as stale.
