
Fix Windows KL divergence calculations #5273

Merged
merged 1 commit into ggerganov:master on Feb 2, 2024

Conversation

kalomaze
Contributor

@kalomaze kalomaze commented Feb 2, 2024

In #5166 I described an unexpected issue where KL divergence was segfaulting / not working on Windows.

I think I have identified the issue; it is presumably related to how the file is written / read on Windows.

From: https://stackoverflow.com/questions/26993086/what-the-point-of-using-stdios-basebinary

  • "under Unix, there is no distinction; both are identical. Under Windows, '\n' internally will be mapped to the two character sequence CR, LF (0x0D, 0x0A) externally, and 0x1A will be interpreted as an end of file when reading"

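For context, the kind of change involved looks roughly like the following. This is a minimal sketch with hypothetical helper names, not the actual diff, assuming the logits are streamed through std::ofstream / std::ifstream:

```cpp
// Minimal sketch, not the actual change in this PR: the important detail is
// std::ios::binary. Without it, Windows expands any 0x0A byte in the float data
// to 0x0D 0x0A on write, and treats a stray 0x1A as end-of-file on read.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper names, for illustration only.
static void write_logits(const std::string & fname, const std::vector<float> & logits) {
    std::ofstream out(fname, std::ios::out | std::ios::binary); // binary mode is the fix
    const uint32_t n = (uint32_t) logits.size();
    out.write((const char *) &n, sizeof(n));
    out.write((const char *) logits.data(), n * sizeof(float));
}

static std::vector<float> read_logits(const std::string & fname) {
    std::ifstream in(fname, std::ios::in | std::ios::binary);   // must match the writer
    uint32_t n = 0;
    in.read((char *) &n, sizeof(n));
    std::vector<float> logits(n);
    in.read((char *) logits.data(), n * sizeof(float));
    return logits;
}
```
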
If I make the modifications in this PR, it allows me to use KL divergence as intended on Windows (I am testing with the same model twice for debugging):

kl_divergence: 0.40 seconds per pass - ETA 0.12 minutes

chunk        PPL          ln(PPL(Q)/PPL(base))          KL-Divergence           Same top
   1        8.1417      -0.00002 ±    0.00000      -0.00001 ±    0.00000    1.00000 ± 0.00000
   2       11.9650      -0.00002 ±    0.00000      -0.00001 ±    0.00000    1.00000 ± 0.00000
   3       12.0453      -0.00017 ±    0.00016      -0.00000 ±    0.00000    0.99869 ± 0.00131
   4       11.8829      -0.00013 ±    0.00012      -0.00001 ±    0.00000    0.99902 ± 0.00098
   5       12.6807      -0.00011 ±    0.00010      -0.00001 ±    0.00000    0.99922 ± 0.00078
   6       11.1073      -0.00009 ±    0.00008      -0.00001 ±    0.00000    0.99935 ± 0.00065
   7       11.8471      -0.00008 ±    0.00007      -0.00001 ±    0.00000    0.99944 ± 0.00056
   8       11.5785      -0.00008 ±    0.00006      -0.00001 ±    0.00000    0.99951 ± 0.00049
   9       10.6265      -0.00007 ±    0.00005      -0.00001 ±    0.00000    0.99956 ± 0.00044
  10       11.1318      -0.00006 ±    0.00005      -0.00001 ±    0.00000    0.99961 ± 0.00039
  11       10.6147      -0.00006 ±    0.00004      -0.00001 ±    0.00000    0.99964 ± 0.00036
  12       10.1512      -0.00005 ±    0.00004      -0.00001 ±    0.00000    0.99967 ± 0.00033
  13       10.0229      -0.00005 ±    0.00004      -0.00001 ±    0.00000    0.99970 ± 0.00030
  14       10.1798      -0.00006 ±    0.00010      -0.00000 ±    0.00000    0.99916 ± 0.00049
  15       10.2974      -0.00005 ±    0.00009      -0.00000 ±    0.00000    0.99922 ± 0.00045
  16       10.5658      -0.00002 ±    0.00012       0.00001 ±    0.00000    0.99926 ± 0.00042
  17       10.4636      -0.00002 ±    0.00011       0.00001 ±    0.00000    0.99931 ± 0.00040
  18       10.3634      -0.00008 ±    0.00012       0.00001 ±    0.00000    0.99891 ± 0.00049
  19       10.3194      -0.00005 ±    0.00012       0.00001 ±    0.00000    0.99856 ± 0.00055

===== KL-divergence statistics
Average:   0.000013 ±  0.000001
Median :  -0.000007
Maximum:   0.002893
KLD_99 :   0.000360
KLD_95 :   0.000142
KLD_90 :   0.000065
Minimum:  -0.000045
KLD_01 :  -0.000033
KLD_05 :  -0.000026
KLD_10 :  -0.000023

However, there is another issue with how perplexity tokenizes on Windows that is not fixed by this PR.

If you use -f instead of -bf (as was recommended), the input is tokenized differently, which leads to higher perplexity compared to Linux / WSL:

WSL: 13.4018 +/- 0.59528
Windows: 13.9301 +/- 0.62122

(-c 128 was used for both).

This is not a small, within-margin-of-error difference in perplexity.

In terms of total tokens read, it is ~9400 tokens on Windows (without -bf) vs. ~9800 on Linux. Setting -bf makes them equivalent.
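
A quick way to see the discrepancy is to read the same corpus file in text mode and in binary mode and compare the byte counts. This is a small standalone sketch (the fallback filename is just a placeholder), not code from the repository:

```cpp
// Minimal standalone sketch: compare how many bytes a file yields when read in
// text mode vs binary mode. On Linux the two counts match; on Windows they can
// differ (CR/LF translation, 0x1A treated as end-of-file), so the tokenizer ends
// up seeing different input depending on how the file was opened.
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

static std::string slurp(const char * fname, std::ios::openmode mode) {
    std::ifstream in(fname, mode);
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

int main(int argc, char ** argv) {
    const char * fname = argc > 1 ? argv[1] : "corpus.txt"; // placeholder filename
    const std::string as_text = slurp(fname, std::ios::in);
    const std::string as_bin  = slurp(fname, std::ios::in | std::ios::binary);
    std::cout << "text mode bytes  : " << as_text.size() << "\n";
    std::cout << "binary mode bytes: " << as_bin.size()  << "\n";
    return 0;
}
```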

Here is how the WSL KL divergence reads:

chunk        PPL          ln(PPL(Q)/PPL(base))          KL-Divergence           Same top
   1        8.1417      -0.00002 ±    0.00000      -0.00001 ±    0.00000    1.00000 ± 0.00000
   2       11.9650       0.00007 ±    0.00044       0.00003 ±    0.00000    0.99608 ± 0.00277
   3       12.0453       0.00007 ±    0.00030       0.00002 ±    0.00000    0.99739 ± 0.00185
   4       11.8819      -0.00004 ±    0.00030       0.00003 ±    0.00000    0.99706 ± 0.00170
   5       12.6798      -0.00003 ±    0.00024       0.00002 ±    0.00000    0.99765 ± 0.00136
   6       11.1075       0.00005 ±    0.00023       0.00002 ±    0.00000    0.99739 ± 0.00131
   7       11.8473       0.00004 ±    0.00020       0.00002 ±    0.00000    0.99776 ± 0.00112
   8       11.5787       0.00003 ±    0.00017       0.00001 ±    0.00000    0.99804 ± 0.00098
   9       10.6266       0.00003 ±    0.00015       0.00001 ±    0.00000    0.99826 ± 0.00087
  10       11.1319       0.00002 ±    0.00014       0.00001 ±    0.00000    0.99843 ± 0.00078
  11       10.6148       0.00002 ±    0.00013       0.00001 ±    0.00000    0.99857 ± 0.00071
  12       10.1513       0.00002 ±    0.00011       0.00001 ±    0.00000    0.99869 ± 0.00065
  13       10.0230       0.00001 ±    0.00011       0.00000 ±    0.00000    0.99879 ± 0.00060
  14       10.1799       0.00001 ±    0.00010       0.00000 ±    0.00000    0.99888 ± 0.00056
  15       10.2975       0.00001 ±    0.00009       0.00000 ±    0.00000    0.99895 ± 0.00052
  16       10.5658       0.00001 ±    0.00009       0.00000 ±    0.00000    0.99902 ± 0.00049
  17       10.4637       0.00001 ±    0.00008       0.00000 ±    0.00000    0.99908 ± 0.00046
  18       10.3641       0.00001 ±    0.00008      -0.00000 ±    0.00000    0.99913 ± 0.00044
  19       10.3198       0.00000 ±    0.00007      -0.00000 ±    0.00000    0.99917 ± 0.00041

===== KL-divergence statistics
Average:  -0.000001 ±  0.000001
Median :  -0.000009
Maximum:   0.001216
KLD_99 :   0.000225
KLD_95 :   0.000060
KLD_90 :   0.000000
Minimum:  -0.000045
KLD_01 :  -0.000033
KLD_05 :  -0.000026
KLD_10 :  -0.000023

@ggerganov ggerganov merged commit 1912211 into ggerganov:master Feb 2, 2024
53 checks passed
@Nexesenex
Contributor

Is it possible that your changes broke the HellaSwag computation a bit (with the .txt file, not the .bin)?

Example of the command used:
perplexity -m X:\text-generation-webui\models\MiquMaid-v1-70B.q3_k_m.gguf -f hellaswag_val_full.txt --hellaswag --hellaswag-tasks 1000 -ngl 100 -b 512 -mg 0 -ts 5,2

I used to get HellaSwag scores of 88-90 on 70B models (over 400 or 1000 tasks), and now it has dropped to 83-84 (same model, same quant).
