
sha256 check sums to verify original and converted model data #338

Merged
1 commit merged into master from add-sha256sums on Mar 21, 2023

Conversation

@gjmulder (Collaborator) commented Mar 20, 2023

Not a developer, so my git-fu is a bit rusty. Hopefully this pull request covers everything?!?

  • Add a shadow ./model.sha256 dir containing a dir for each model and a corresponding checklist.sha256 with sha256 sums of the *.pth, bin, and *.json files

  • Add a script ./model.sha256/chk_sha256sums.sh that walks the user-supplied ./models subdir and runs sha256sum against the above files, diffing against checklist.sha256 for each model

  • Update README.md with corresponding instructions

@gjmulder added the documentation (Improvements or additions to documentation), enhancement (New feature or request), and model (Model specific) labels Mar 20, 2023
@gjmulder assigned and unassigned gjmulder Mar 20, 2023
@gjmulder changed the title from "sha256 check sums to verify original and converted model data (#238)" to "sha256 check sums to verify original and converted model data" Mar 20, 2023
@prusnak (Collaborator) commented Mar 20, 2023

Good idea, but I have one suggestion.

How about we put all hashes into a single file models/SHA256SUMS?

That's a more standard way to do it, and you can then perform a single check using:

sha256sum -c models/SHA256SUMS

on Linux, or on macOS:

shasum -a 256 -c models/SHA256SUMS

That way you can drop the shell script completely.
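A minimal sketch of this manifest-plus-verify workflow, using a throwaway directory and placeholder file names (none of these are the real model files):

```shell
# Create a small tree and record its hashes in a SHA256SUMS manifest.
mkdir -p demo/models/7B
printf 'fake model data\n' > demo/models/7B/params.json
cd demo
# Paths are stored relative to where you run the command:
sha256sum models/7B/params.json > SHA256SUMS
# Later, one command re-verifies every file listed in the manifest:
sha256sum -c SHA256SUMS
# prints: models/7B/params.json: OK
```

The same manifest verifies on macOS with `shasum -a 256 -c SHA256SUMS`.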

@gjmulder (Collaborator, Author) commented Mar 21, 2023

@prusnak Thx! RTFM on sha256sum would have saved me an hour of unnecessary scripting. 🤦

I put the SHA256SUMS in the root dir as models is user-provided. I'm also not sure if there's a naming convention for the alpaca models, so I am implicitly proposing one.

llama.cpp$ sha256sum -c SHA256SUMS
models/13B/consolidated.00.pth: OK
models/13B/consolidated.01.pth: OK
models/30B/consolidated.00.pth: OK
models/30B/consolidated.01.pth: OK
models/30B/consolidated.02.pth: OK
models/30B/consolidated.03.pth: OK
models/65B/consolidated.00.pth: OK
models/65B/consolidated.01.pth: OK
models/65B/consolidated.02.pth: OK
models/65B/consolidated.03.pth: OK
models/65B/consolidated.04.pth: OK
models/65B/consolidated.05.pth: OK
models/65B/consolidated.06.pth: OK
models/65B/consolidated.07.pth: OK
models/7B/consolidated.00.pth: OK
models/13B/params.json: OK
models/30B/params.json: OK
models/65B/params.json: OK
models/7B/params.json: OK
models/alpaca-30B/params.json: OK
models/13B/ggml-model-f16.bin: OK
models/13B/ggml-model-f16.bin.1: OK
models/30B/ggml-model-f16.bin: OK
models/30B/ggml-model-f16.bin.1: OK
models/30B/ggml-model-f16.bin.2: OK
models/30B/ggml-model-f16.bin.3: OK
models/65B/ggml-model-f16.bin: OK
models/65B/ggml-model-f16.bin.1: OK
models/65B/ggml-model-f16.bin.2: OK
models/65B/ggml-model-f16.bin.3: OK
models/65B/ggml-model-f16.bin.4: OK
models/65B/ggml-model-f16.bin.5: OK
models/65B/ggml-model-f16.bin.6: OK
models/65B/ggml-model-f16.bin.7: OK
models/7B/ggml-model-f16.bin: OK
models/13B/ggml-model-q4_0.bin: OK
models/13B/ggml-model-q4_0.bin.1: OK
models/30B/ggml-model-q4_0.bin: OK
models/30B/ggml-model-q4_0.bin.1: OK
models/30B/ggml-model-q4_0.bin.2: OK
models/30B/ggml-model-q4_0.bin.3: OK
models/65B/ggml-model-q4_0.bin: OK
models/65B/ggml-model-q4_0.bin.1: OK
models/65B/ggml-model-q4_0.bin.2: OK
models/65B/ggml-model-q4_0.bin.3: OK
models/65B/ggml-model-q4_0.bin.4: OK
models/65B/ggml-model-q4_0.bin.5: OK
models/65B/ggml-model-q4_0.bin.6: OK
models/65B/ggml-model-q4_0.bin.7: OK
models/7B/ggml-model-q4_0.bin: OK
models/alpaca-13B/ggml-model-q4_0.bin: OK
models/alpaca-30B/ggml-model-q4_0.bin: OK
models/alpaca-7B/ggml-model-q4_0.bin: OK

@prusnak (Collaborator) commented Mar 21, 2023

> I put the SHA256SUMS in the root dir as models is user-provided

Great!

Can you please sort the file (according to filenames) using sort -k 2 SHA256SUMS > foo ; mv foo SHA256SUMS and update the PR?
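A quick illustration of why `-k 2` is the right key: manifest lines are `<hash>  <path>`, so sorting on the second field orders entries by path instead of by hash. The hashes below are placeholders, not real checksums:

```shell
# Build a tiny unsorted manifest ("<hash>  <path>" per line):
printf '%s\n' \
  'bbbb  models/7B/ggml-model-q4_0.bin' \
  'aaaa  models/13B/ggml-model-q4_0.bin' > SHA256SUMS
# Sort on the path (field 2), not the hash (field 1):
sort -k 2 SHA256SUMS > foo ; mv foo SHA256SUMS
cat SHA256SUMS
# the models/13B line now comes first (lexical order on the path)
```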

@mqy (Contributor) commented Mar 21, 2023

Good job, thanks a lot :)

I personally recommend renaming SHA256SUMS as models.sha256, with the following reasons:

  • Prefix with models, so it's clear that this file belongs to models.
  • Just like .sha1, it's clear that the .sha256 file contains sha256 checksum(s).
  • Prefer lower case if possible, because upper case is a bit hard to read.

@prusnak (Collaborator) commented Mar 21, 2023

The filename SHA256SUMS is pretty standard and people are used to it. Examples:

Also later we can add more hashes to the file, not strictly related to the models, so naming it models.sha256 seems wrong.

@sw (Contributor) commented Mar 21, 2023

Can you add --ignore-missing to the sha256sum command line in the readme? Many people won't have all the files, and the output is less noisy that way. (This is for Linux; I don't know about macOS.)

I had a check failure with models/7B/ggml-model-q4_0.bin; I think a recent commit may have led to some floating point rounding differences? I have b85058443e89dabdf674d5018d979f0d682977f8413f05b5fd235d36d7a8ff82 for that file.
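A small demonstration of what the flag does with GNU coreutils sha256sum (the file names here are made up for illustration):

```shell
# Manifest listing one file that exists and one that does not:
printf 'hello\n' > present.bin
sha256sum present.bin > SHA256SUMS
echo '0000000000000000000000000000000000000000000000000000000000000000  missing.bin' >> SHA256SUMS
# Plain -c complains about missing.bin and exits non-zero;
# --ignore-missing silently skips entries whose files are absent:
sha256sum --ignore-missing -c SHA256SUMS
# prints: present.bin: OK
```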

@gjmulder (Collaborator, Author):

> Can you add --ignore-missing to the sha256sum command line in the readme? Many people won't have all the files, and the output is less noisy that way. (This is for Linux; I don't know about macOS.)
>
> I had a check failure with models/7B/ggml-model-q4_0.bin; I think a recent commit may have led to some floating point rounding differences? I have b85058443e89dabdf674d5018d979f0d682977f8413f05b5fd235d36d7a8ff82 for that file.

@sw I just regenerated everything from the *.pth files and now our checksums agree:

llama.cpp$ git log | head -1
commit 353ec251a42491f5192c48561da4b444ef67f23c
llama.cpp$ grep b85058443e89dabdf674d5018d979f0d682977f8413f05b5fd235d36d7a8ff82 SHA256SUMS 
b85058443e89dabdf674d5018d979f0d682977f8413f05b5fd235d36d7a8ff82  models/7B/ggml-model-q4_0.bin

@prusnak (Collaborator) commented Mar 21, 2023

> (This is for Linux; I don't know about macOS.)

Yes, it works for macOS too. I updated my suggestion here to include the option: #338 (comment)

…fy the downloads

Hashes created using:

sha256sum models/*B/*.pth models/*[7136]B/ggml-model-f16.bin* models/*[7136]B/ggml-model-q4_0.bin* > SHA256SUMS

@prusnak (Collaborator) commented Mar 21, 2023

I went ahead and implemented the suggestions from above and rebased/squashed on top of the current master.

@prusnak prusnak merged commit da0e9fe into master Mar 21, 2023
@prusnak prusnak deleted the add-sha256sums branch March 21, 2023 22:19
@prusnak (Collaborator) commented Mar 21, 2023

Thanks @gjmulder for computing the hashes. Merged!

@anzz1 (Contributor) commented Mar 21, 2023

Not all of these checksums seem to be correct. Are they calculated with the "v2" new model format after the tokenizer change? PR: #252 Issue: #324

For example, "models/alpaca-7B/ggml-model-q4_0.bin"

v1: 1f582babc2bd56bb63b33141898748657d369fd110c4358b2bc280907882bf13
v2: 8d5562ec1d8a7cfdcf8985a9ddf353339d942c7cf52855a92c9ff59f03b541bc

The SHA256SUMS file has the old v1 hash.
Maybe using a naming scheme like "ggml2-model-q4_0.bin" would be good to differentiate between the versions and avoid confusion.

@mqy (Contributor) commented Mar 21, 2023

I can confirm that the ggml-model-q4* files in 7B and 13B mismatch those in SHA256SUMS. Local checksums:

  • b85058443e89dabdf674d5018d979f0d682977f8413f05b5fd235d36d7a8ff82 models/7B/ggml-model-q4_0.bin
  • d3ab1548a2d19d989c1e7ea1130ab8c6300c75941a434a2d333ef564f989131b models/13B/ggml-model-q4_0.bin
  • 38b705ce6c5baba4bb6056f11189a4ad21b70591258422450e30495b6ccd8521 models/13B/ggml-model-q4_0.bin.1

The other files in 7B and 13B are correct. I regenerated the mismatched q4 files with the latest build (make clean; make).

@prusnak (Collaborator) commented Mar 22, 2023

Please open new pull requests if something is wrong.

> Maybe using a naming scheme like "ggml2-model-q4_0.bin" would be good to differentiate between the versions and avoid confusion.

We should keep only the latest hashes in the SHA256SUMS file, generated by the latest version of the tools in the repo. Introducing any versioning scheme can lead to even more confusion. And if you need to check the older hashes you can still check the earlier versions of the SHA256SUMS file. Ideally, the same commit which changes the file format will also regenerate the hashes.

@gjmulder (Collaborator, Author):

> Not all of these checksums seem to be correct. Are they calculated with the "v2" new model format after the tokenizer change? PR: #252 Issue: #324
>
> For example, "models/alpaca-7B/ggml-model-q4_0.bin"
>
> v1: 1f582babc2bd56bb63b33141898748657d369fd110c4358b2bc280907882bf13
> v2: 8d5562ec1d8a7cfdcf8985a9ddf353339d942c7cf52855a92c9ff59f03b541bc
>
> The SHA256SUMS file has the old v1 hash. Maybe using a naming scheme like "ggml2-model-q4_0.bin" would be good to differentiate between the versions and avoid confusion.

Yes, that was why I was delaying merging this pull request. See the model magic and versioning discussion in #352:

llama.cpp/models$ cat chk_versions.sh
#!/bin/sh

# Print the magic and version words from each converted model's header.
for B in *B/ggml-model*bin*; do
	xxd "$B" | head -1 | awk -v model="$B" '{printf("Model: %30s, magic: 0x%8s, version: 0x%4s\n", model, $3$2, $4)}'
done
llama.cpp/models$ ./chk_versions.sh | sort -nk 2
Model:  alpaca-7B/ggml-model-q4_0.bin, magic: 0x67676c6d, version: 0x007d
Model: alpaca-13B/ggml-model-q4_0.bin, magic: 0x67676c6d, version: 0x007d
Model: alpaca-30B/ggml-model-q4_0.bin, magic: 0x67676c6d, version: 0x007d
Model:          7B/ggml-model-f16.bin, magic: 0x6767666d, version: 0x0100
Model:         7B/ggml-model-q4_0.bin, magic: 0x6767666d, version: 0x0100
Model:         13B/ggml-model-f16.bin, magic: 0x6767666d, version: 0x0100
Model:        13B/ggml-model-q4_0.bin, magic: 0x6767666d, version: 0x0100
Model:       13B/ggml-model-f16.bin.1, magic: 0x6767666d, version: 0x0100
Model:      13B/ggml-model-q4_0.bin.1, magic: 0x6767666d, version: 0x0100
Model:         30B/ggml-model-f16.bin, magic: 0x6767666d, version: 0x0100
Model:        30B/ggml-model-q4_0.bin, magic: 0x6767666d, version: 0x0100
Model:       30B/ggml-model-f16.bin.1, magic: 0x6767666d, version: 0x0100
Model:       30B/ggml-model-f16.bin.2, magic: 0x6767666d, version: 0x0100
Model:       30B/ggml-model-f16.bin.3, magic: 0x6767666d, version: 0x0100
Model:      30B/ggml-model-q4_0.bin.1, magic: 0x6767666d, version: 0x0100
Model:      30B/ggml-model-q4_0.bin.2, magic: 0x6767666d, version: 0x0100
Model:      30B/ggml-model-q4_0.bin.3, magic: 0x6767666d, version: 0x0100
Model:         65B/ggml-model-f16.bin, magic: 0x6767666d, version: 0x0100
Model:        65B/ggml-model-q4_0.bin, magic: 0x6767666d, version: 0x0100
Model:       65B/ggml-model-f16.bin.1, magic: 0x6767666d, version: 0x0100
Model:       65B/ggml-model-f16.bin.2, magic: 0x6767666d, version: 0x0100
Model:       65B/ggml-model-f16.bin.3, magic: 0x6767666d, version: 0x0100
Model:       65B/ggml-model-f16.bin.4, magic: 0x6767666d, version: 0x0100
Model:       65B/ggml-model-f16.bin.5, magic: 0x6767666d, version: 0x0100
Model:       65B/ggml-model-f16.bin.6, magic: 0x6767666d, version: 0x0100
Model:       65B/ggml-model-f16.bin.7, magic: 0x6767666d, version: 0x0100
Model:      65B/ggml-model-q4_0.bin.1, magic: 0x6767666d, version: 0x0100
Model:      65B/ggml-model-q4_0.bin.2, magic: 0x6767666d, version: 0x0100
Model:      65B/ggml-model-q4_0.bin.3, magic: 0x6767666d, version: 0x0100
Model:      65B/ggml-model-q4_0.bin.4, magic: 0x6767666d, version: 0x0100
Model:      65B/ggml-model-q4_0.bin.5, magic: 0x6767666d, version: 0x0100
Model:      65B/ggml-model-q4_0.bin.6, magic: 0x6767666d, version: 0x0100
Model:      65B/ggml-model-q4_0.bin.7, magic: 0x6767666d, version: 0x0100

AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request Jan 27, 2024