Q4_0 scale selection using RMSE #835
base: master
Conversation
Very interesting analysis and data 😄 Btw, I've been thinking a little bit about how to determine the scale factor to minimize the RMS and I am fairly certain that there is a straightforward way to compute the optimum value without search. I don't have the formula yet - just a strong intuition lol
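(For what it's worth, a closed form does exist once the assignment of quantized values is fixed: minimizing Σ(x_i − d·q_i)² over d gives d = Σx_i·q_i / Σq_i², so a search only has to decide the assignment. A minimal sketch, not code from this PR:)

```cpp
#include <cstdint>

// Sketch (not from this PR): for a fixed assignment of integer levels q[i],
// the scale d minimizing sum_i (x[i] - d*q[i])^2 is the least-squares fit
// d = sum(x_i * q_i) / sum(q_i^2).
static float least_squares_scale(const float * x, const int8_t * q, int n) {
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        num += (double) x[i] * q[i]; // sum x_i * q_i
        den += (double) q[i] * q[i]; // sum q_i^2
    }
    return den > 0.0 ? (float) (num / den) : 0.0f;
}
```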
So, as mentioned in #397 (comment), I believe I have an RMSE-optimal but very slow implementation of the scaling search... and your implementation gets extremely close! You posted:
"optimal":
That's probably about as good as we can hope for. Full output for verification - if you get a lower RMSE for any layer, I have a bug :)

quantize-stats output:
note: source model is f16
testing 226 layers with max size 131072000
q4_0::layers.0.attention.wk.weight : rmse 0.00292301, maxerr 0.07012939, 95pct<0.0060, median<0.0018
q4_0::layers.0.attention.wo.weight : rmse 0.00100800, maxerr 0.03466797, 95pct<0.0020, median<0.0008
q4_0::layers.0.attention.wq.weight : rmse 0.00299672, maxerr 0.04935372, 95pct<0.0062, median<0.0016
q4_0::layers.0.attention.wv.weight : rmse 0.00110658, maxerr 0.01175963, 95pct<0.0022, median<0.0008
q4_0::layers.0.feed_forward.w1.weight : rmse 0.00137271, maxerr 0.06394231, 95pct<0.0026, median<0.0012
q4_0::layers.0.feed_forward.w2.weight : rmse 0.00167287, maxerr 0.05469465, 95pct<0.0030, median<0.0014
q4_0::layers.0.feed_forward.w3.weight : rmse 0.00132967, maxerr 0.01823677, 95pct<0.0024, median<0.0012
q4_0::layers.1.attention.wk.weight : rmse 0.00282859, maxerr 0.04061890, 95pct<0.0060, median<0.0016
q4_0::layers.1.attention.wo.weight : rmse 0.00097638, maxerr 0.03619385, 95pct<0.0020, median<0.0008
q4_0::layers.1.attention.wq.weight : rmse 0.00276326, maxerr 0.03581724, 95pct<0.0058, median<0.0016
q4_0::layers.1.attention.wv.weight : rmse 0.00094906, maxerr 0.00811867, 95pct<0.0020, median<0.0008
q4_0::layers.1.feed_forward.w1.weight : rmse 0.00174292, maxerr 0.03725962, 95pct<0.0032, median<0.0014
q4_0::layers.1.feed_forward.w2.weight : rmse 0.00171300, maxerr 0.05225383, 95pct<0.0032, median<0.0014
q4_0::layers.1.feed_forward.w3.weight : rmse 0.00165360, maxerr 0.02176375, 95pct<0.0030, median<0.0014
q4_0::layers.10.attention.wk.weight : rmse 0.00225472, maxerr 0.02979095, 95pct<0.0044, median<0.0016
q4_0::layers.10.attention.wo.weight : rmse 0.00143767, maxerr 0.04605806, 95pct<0.0026, median<0.0012
q4_0::layers.10.attention.wq.weight : rmse 0.00222988, maxerr 0.03421440, 95pct<0.0044, median<0.0016
q4_0::layers.10.attention.wv.weight : rmse 0.00144258, maxerr 0.01459024, 95pct<0.0028, median<0.0012
q4_0::layers.10.feed_forward.w1.weight : rmse 0.00183416, maxerr 0.02703372, 95pct<0.0034, median<0.0014
q4_0::layers.10.feed_forward.w2.weight : rmse 0.00174484, maxerr 0.04180530, 95pct<0.0032, median<0.0014
q4_0::layers.10.feed_forward.w3.weight : rmse 0.00177285, maxerr 0.02142334, 95pct<0.0032, median<0.0014
q4_0::layers.11.attention.wk.weight : rmse 0.00233274, maxerr 0.02713823, 95pct<0.0046, median<0.0018
q4_0::layers.11.attention.wo.weight : rmse 0.00150656, maxerr 0.03012497, 95pct<0.0028, median<0.0012
q4_0::layers.11.attention.wq.weight : rmse 0.00229496, maxerr 0.04412842, 95pct<0.0044, median<0.0018
q4_0::layers.11.attention.wv.weight : rmse 0.00151707, maxerr 0.02018456, 95pct<0.0028, median<0.0012
q4_0::layers.11.feed_forward.w1.weight : rmse 0.00182944, maxerr 0.02489303, 95pct<0.0034, median<0.0014
q4_0::layers.11.feed_forward.w2.weight : rmse 0.00175960, maxerr 0.05431067, 95pct<0.0032, median<0.0014
q4_0::layers.11.feed_forward.w3.weight : rmse 0.00178396, maxerr 0.02583313, 95pct<0.0032, median<0.0014
q4_0::layers.12.attention.wk.weight : rmse 0.00222511, maxerr 0.02594195, 95pct<0.0044, median<0.0016
q4_0::layers.12.attention.wo.weight : rmse 0.00147925, maxerr 0.02380922, 95pct<0.0028, median<0.0012
q4_0::layers.12.attention.wq.weight : rmse 0.00218941, maxerr 0.03447523, 95pct<0.0042, median<0.0016
q4_0::layers.12.attention.wv.weight : rmse 0.00146629, maxerr 0.01153804, 95pct<0.0028, median<0.0012
q4_0::layers.12.feed_forward.w1.weight : rmse 0.00183979, maxerr 0.03383104, 95pct<0.0034, median<0.0014
q4_0::layers.12.feed_forward.w2.weight : rmse 0.00176264, maxerr 0.05683154, 95pct<0.0032, median<0.0014
q4_0::layers.12.feed_forward.w3.weight : rmse 0.00179476, maxerr 0.01740211, 95pct<0.0034, median<0.0014
q4_0::layers.13.attention.wk.weight : rmse 0.00217331, maxerr 0.02676816, 95pct<0.0044, median<0.0016
q4_0::layers.13.attention.wo.weight : rmse 0.00153305, maxerr 0.04341370, 95pct<0.0028, median<0.0012
q4_0::layers.13.attention.wq.weight : rmse 0.00213820, maxerr 0.03543091, 95pct<0.0042, median<0.0016
q4_0::layers.13.attention.wv.weight : rmse 0.00153372, maxerr 0.01126552, 95pct<0.0028, median<0.0012
q4_0::layers.13.feed_forward.w1.weight : rmse 0.00183155, maxerr 0.02292150, 95pct<0.0034, median<0.0014
q4_0::layers.13.feed_forward.w2.weight : rmse 0.00177651, maxerr 0.03530073, 95pct<0.0032, median<0.0014
q4_0::layers.13.feed_forward.w3.weight : rmse 0.00181181, maxerr 0.01798833, 95pct<0.0034, median<0.0016
q4_0::layers.14.attention.wk.weight : rmse 0.00217185, maxerr 0.02497105, 95pct<0.0042, median<0.0016
q4_0::layers.14.attention.wo.weight : rmse 0.00153627, maxerr 0.06212232, 95pct<0.0028, median<0.0012
q4_0::layers.14.attention.wq.weight : rmse 0.00215347, maxerr 0.03887939, 95pct<0.0042, median<0.0016
q4_0::layers.14.attention.wv.weight : rmse 0.00154264, maxerr 0.01345214, 95pct<0.0028, median<0.0012
q4_0::layers.14.feed_forward.w1.weight : rmse 0.00182898, maxerr 0.02304077, 95pct<0.0034, median<0.0014
q4_0::layers.14.feed_forward.w2.weight : rmse 0.00178511, maxerr 0.05890521, 95pct<0.0032, median<0.0014
q4_0::layers.14.feed_forward.w3.weight : rmse 0.00181856, maxerr 0.02665675, 95pct<0.0034, median<0.0016
q4_0::layers.15.attention.wk.weight : rmse 0.00219269, maxerr 0.02394998, 95pct<0.0044, median<0.0016
q4_0::layers.15.attention.wo.weight : rmse 0.00154000, maxerr 0.02813050, 95pct<0.0028, median<0.0012
q4_0::layers.15.attention.wq.weight : rmse 0.00215290, maxerr 0.03628540, 95pct<0.0042, median<0.0016
q4_0::layers.15.attention.wv.weight : rmse 0.00154661, maxerr 0.01409675, 95pct<0.0028, median<0.0012
q4_0::layers.15.feed_forward.w1.weight : rmse 0.00182940, maxerr 0.02419187, 95pct<0.0034, median<0.0014
q4_0::layers.15.feed_forward.w2.weight : rmse 0.00178558, maxerr 0.05858561, 95pct<0.0032, median<0.0014
q4_0::layers.15.feed_forward.w3.weight : rmse 0.00181912, maxerr 0.02241516, 95pct<0.0034, median<0.0016
q4_0::layers.16.attention.wk.weight : rmse 0.00217754, maxerr 0.02458954, 95pct<0.0042, median<0.0016
q4_0::layers.16.attention.wo.weight : rmse 0.00163187, maxerr 0.05107081, 95pct<0.0030, median<0.0014
q4_0::layers.16.attention.wq.weight : rmse 0.00212385, maxerr 0.04119629, 95pct<0.0040, median<0.0016
q4_0::layers.16.attention.wv.weight : rmse 0.00164553, maxerr 0.01337417, 95pct<0.0030, median<0.0014
q4_0::layers.16.feed_forward.w1.weight : rmse 0.00184241, maxerr 0.02344798, 95pct<0.0034, median<0.0016
q4_0::layers.16.feed_forward.w2.weight : rmse 0.00178439, maxerr 0.05552104, 95pct<0.0032, median<0.0014
q4_0::layers.16.feed_forward.w3.weight : rmse 0.00181314, maxerr 0.02277143, 95pct<0.0034, median<0.0016
q4_0::layers.17.attention.wk.weight : rmse 0.00212176, maxerr 0.02422421, 95pct<0.0042, median<0.0016
q4_0::layers.17.attention.wo.weight : rmse 0.00165387, maxerr 0.03002930, 95pct<0.0030, median<0.0014
q4_0::layers.17.attention.wq.weight : rmse 0.00207895, maxerr 0.04604350, 95pct<0.0040, median<0.0016
q4_0::layers.17.attention.wv.weight : rmse 0.00165649, maxerr 0.01419830, 95pct<0.0030, median<0.0014
q4_0::layers.17.feed_forward.w1.weight : rmse 0.00184599, maxerr 0.02392328, 95pct<0.0034, median<0.0016
q4_0::layers.17.feed_forward.w2.weight : rmse 0.00179142, maxerr 0.04622682, 95pct<0.0032, median<0.0014
q4_0::layers.17.feed_forward.w3.weight : rmse 0.00181806, maxerr 0.02359099, 95pct<0.0034, median<0.0016
q4_0::layers.18.attention.wk.weight : rmse 0.00208260, maxerr 0.02502441, 95pct<0.0040, median<0.0016
q4_0::layers.18.attention.wo.weight : rmse 0.00164773, maxerr 0.03822631, 95pct<0.0030, median<0.0014
q4_0::layers.18.attention.wq.weight : rmse 0.00205646, maxerr 0.04051746, 95pct<0.0040, median<0.0016
q4_0::layers.18.attention.wv.weight : rmse 0.00165172, maxerr 0.01335841, 95pct<0.0030, median<0.0014
q4_0::layers.18.feed_forward.w1.weight : rmse 0.00186100, maxerr 0.03084695, 95pct<0.0034, median<0.0016
q4_0::layers.18.feed_forward.w2.weight : rmse 0.00178702, maxerr 0.06258377, 95pct<0.0032, median<0.0014
q4_0::layers.18.feed_forward.w3.weight : rmse 0.00181154, maxerr 0.01813507, 95pct<0.0034, median<0.0016
q4_0::layers.19.attention.wk.weight : rmse 0.00204409, maxerr 0.02549587, 95pct<0.0040, median<0.0016
q4_0::layers.19.attention.wo.weight : rmse 0.00171742, maxerr 0.04106662, 95pct<0.0032, median<0.0014
q4_0::layers.19.attention.wq.weight : rmse 0.00202074, maxerr 0.04685394, 95pct<0.0040, median<0.0016
q4_0::layers.19.attention.wv.weight : rmse 0.00173205, maxerr 0.01311102, 95pct<0.0032, median<0.0014
q4_0::layers.19.feed_forward.w1.weight : rmse 0.00187151, maxerr 0.03121948, 95pct<0.0034, median<0.0016
q4_0::layers.19.feed_forward.w2.weight : rmse 0.00178935, maxerr 0.04564381, 95pct<0.0032, median<0.0014
q4_0::layers.19.feed_forward.w3.weight : rmse 0.00180759, maxerr 0.02138457, 95pct<0.0034, median<0.0016
q4_0::layers.2.attention.wk.weight : rmse 0.00310555, maxerr 0.03675859, 95pct<0.0064, median<0.0020
q4_0::layers.2.attention.wo.weight : rmse 0.00115159, maxerr 0.04546779, 95pct<0.0022, median<0.0010
q4_0::layers.2.attention.wq.weight : rmse 0.00298841, maxerr 0.03752440, 95pct<0.0060, median<0.0020
q4_0::layers.2.attention.wv.weight : rmse 0.00112951, maxerr 0.00926531, 95pct<0.0022, median<0.0010
q4_0::layers.2.feed_forward.w1.weight : rmse 0.00183671, maxerr 0.05353853, 95pct<0.0034, median<0.0016
q4_0::layers.2.feed_forward.w2.weight : rmse 0.00170433, maxerr 0.09649658, 95pct<0.0032, median<0.0014
q4_0::layers.2.feed_forward.w3.weight : rmse 0.00167454, maxerr 0.03201294, 95pct<0.0030, median<0.0014
q4_0::layers.20.attention.wk.weight : rmse 0.00207524, maxerr 0.02473852, 95pct<0.0040, median<0.0016
q4_0::layers.20.attention.wo.weight : rmse 0.00176106, maxerr 0.02588722, 95pct<0.0032, median<0.0014
q4_0::layers.20.attention.wq.weight : rmse 0.00204837, maxerr 0.05462646, 95pct<0.0040, median<0.0016
q4_0::layers.20.attention.wv.weight : rmse 0.00178526, maxerr 0.01499712, 95pct<0.0034, median<0.0014
q4_0::layers.20.feed_forward.w1.weight : rmse 0.00188099, maxerr 0.02917725, 95pct<0.0034, median<0.0016
q4_0::layers.20.feed_forward.w2.weight : rmse 0.00179125, maxerr 0.06890869, 95pct<0.0032, median<0.0014
q4_0::layers.20.feed_forward.w3.weight : rmse 0.00180859, maxerr 0.01596069, 95pct<0.0034, median<0.0016
q4_0::layers.21.attention.wk.weight : rmse 0.00200054, maxerr 0.02908368, 95pct<0.0040, median<0.0014
q4_0::layers.21.attention.wo.weight : rmse 0.00177119, maxerr 0.05007464, 95pct<0.0032, median<0.0014
q4_0::layers.21.attention.wq.weight : rmse 0.00198177, maxerr 0.05149466, 95pct<0.0038, median<0.0014
q4_0::layers.21.attention.wv.weight : rmse 0.00179837, maxerr 0.01333202, 95pct<0.0034, median<0.0014
q4_0::layers.21.feed_forward.w1.weight : rmse 0.00189033, maxerr 0.03076535, 95pct<0.0034, median<0.0016
q4_0::layers.21.feed_forward.w2.weight : rmse 0.00178966, maxerr 0.03637502, 95pct<0.0032, median<0.0014
q4_0::layers.21.feed_forward.w3.weight : rmse 0.00180614, maxerr 0.02140096, 95pct<0.0034, median<0.0016
q4_0::layers.22.attention.wk.weight : rmse 0.00203025, maxerr 0.03339660, 95pct<0.0040, median<0.0016
q4_0::layers.22.attention.wo.weight : rmse 0.00177702, maxerr 0.07931513, 95pct<0.0032, median<0.0014
q4_0::layers.22.attention.wq.weight : rmse 0.00201616, maxerr 0.04454328, 95pct<0.0038, median<0.0016
q4_0::layers.22.attention.wv.weight : rmse 0.00178748, maxerr 0.01423188, 95pct<0.0034, median<0.0014
q4_0::layers.22.feed_forward.w1.weight : rmse 0.00189302, maxerr 0.02517700, 95pct<0.0034, median<0.0016
q4_0::layers.22.feed_forward.w2.weight : rmse 0.00179775, maxerr 0.04281616, 95pct<0.0034, median<0.0016
q4_0::layers.22.feed_forward.w3.weight : rmse 0.00181394, maxerr 0.03024019, 95pct<0.0034, median<0.0016
q4_0::layers.23.attention.wk.weight : rmse 0.00195991, maxerr 0.02972737, 95pct<0.0038, median<0.0014
q4_0::layers.23.attention.wo.weight : rmse 0.00182629, maxerr 0.04887556, 95pct<0.0034, median<0.0016
q4_0::layers.23.attention.wq.weight : rmse 0.00195417, maxerr 0.04232788, 95pct<0.0038, median<0.0014
q4_0::layers.23.attention.wv.weight : rmse 0.00185857, maxerr 0.01577342, 95pct<0.0034, median<0.0016
q4_0::layers.23.feed_forward.w1.weight : rmse 0.00189658, maxerr 0.03308105, 95pct<0.0034, median<0.0016
q4_0::layers.23.feed_forward.w2.weight : rmse 0.00180367, maxerr 0.04928589, 95pct<0.0034, median<0.0016
q4_0::layers.23.feed_forward.w3.weight : rmse 0.00181737, maxerr 0.02468872, 95pct<0.0034, median<0.0016
q4_0::layers.24.attention.wk.weight : rmse 0.00196715, maxerr 0.02162942, 95pct<0.0038, median<0.0014
q4_0::layers.24.attention.wo.weight : rmse 0.00184930, maxerr 0.03620195, 95pct<0.0034, median<0.0016
q4_0::layers.24.attention.wq.weight : rmse 0.00195618, maxerr 0.04705903, 95pct<0.0038, median<0.0014
q4_0::layers.24.attention.wv.weight : rmse 0.00188009, maxerr 0.01770980, 95pct<0.0034, median<0.0016
q4_0::layers.24.feed_forward.w1.weight : rmse 0.00189906, maxerr 0.02117351, 95pct<0.0034, median<0.0016
q4_0::layers.24.feed_forward.w2.weight : rmse 0.00181186, maxerr 0.05899048, 95pct<0.0034, median<0.0016
q4_0::layers.24.feed_forward.w3.weight : rmse 0.00182756, maxerr 0.02068704, 95pct<0.0034, median<0.0016
q4_0::layers.25.attention.wk.weight : rmse 0.00202900, maxerr 0.02362627, 95pct<0.0038, median<0.0016
q4_0::layers.25.attention.wo.weight : rmse 0.00186576, maxerr 0.06477863, 95pct<0.0034, median<0.0016
q4_0::layers.25.attention.wq.weight : rmse 0.00200834, maxerr 0.03808594, 95pct<0.0038, median<0.0016
q4_0::layers.25.attention.wv.weight : rmse 0.00188682, maxerr 0.01595676, 95pct<0.0034, median<0.0016
q4_0::layers.25.feed_forward.w1.weight : rmse 0.00190352, maxerr 0.02079988, 95pct<0.0036, median<0.0016
q4_0::layers.25.feed_forward.w2.weight : rmse 0.00181823, maxerr 0.03286743, 95pct<0.0034, median<0.0016
q4_0::layers.25.feed_forward.w3.weight : rmse 0.00183412, maxerr 0.01735053, 95pct<0.0034, median<0.0016
q4_0::layers.26.attention.wk.weight : rmse 0.00199544, maxerr 0.02733952, 95pct<0.0038, median<0.0016
q4_0::layers.26.attention.wo.weight : rmse 0.00192010, maxerr 0.02563220, 95pct<0.0036, median<0.0016
q4_0::layers.26.attention.wq.weight : rmse 0.00197604, maxerr 0.03735352, 95pct<0.0038, median<0.0016
q4_0::layers.26.attention.wv.weight : rmse 0.00194300, maxerr 0.01509885, 95pct<0.0036, median<0.0016
q4_0::layers.26.feed_forward.w1.weight : rmse 0.00190232, maxerr 0.03396144, 95pct<0.0036, median<0.0016
q4_0::layers.26.feed_forward.w2.weight : rmse 0.00183005, maxerr 0.04354858, 95pct<0.0034, median<0.0016
q4_0::layers.26.feed_forward.w3.weight : rmse 0.00184771, maxerr 0.03059387, 95pct<0.0034, median<0.0016
q4_0::layers.27.attention.wk.weight : rmse 0.00198943, maxerr 0.02681477, 95pct<0.0038, median<0.0016
q4_0::layers.27.attention.wo.weight : rmse 0.00196662, maxerr 0.05517289, 95pct<0.0036, median<0.0016
q4_0::layers.27.attention.wq.weight : rmse 0.00198182, maxerr 0.03899045, 95pct<0.0038, median<0.0016
q4_0::layers.27.attention.wv.weight : rmse 0.00197615, maxerr 0.01669417, 95pct<0.0036, median<0.0016
q4_0::layers.27.feed_forward.w1.weight : rmse 0.00190184, maxerr 0.02731323, 95pct<0.0036, median<0.0016
q4_0::layers.27.feed_forward.w2.weight : rmse 0.00184141, maxerr 0.04620361, 95pct<0.0034, median<0.0016
q4_0::layers.27.feed_forward.w3.weight : rmse 0.00185554, maxerr 0.04153442, 95pct<0.0034, median<0.0016
q4_0::layers.28.attention.wk.weight : rmse 0.00194652, maxerr 0.02850908, 95pct<0.0038, median<0.0014
q4_0::layers.28.attention.wo.weight : rmse 0.00198808, maxerr 0.03118306, 95pct<0.0036, median<0.0016
q4_0::layers.28.attention.wq.weight : rmse 0.00194074, maxerr 0.04092407, 95pct<0.0038, median<0.0014
q4_0::layers.28.attention.wv.weight : rmse 0.00198781, maxerr 0.01527815, 95pct<0.0036, median<0.0016
q4_0::layers.28.feed_forward.w1.weight : rmse 0.00189349, maxerr 0.03170776, 95pct<0.0034, median<0.0016
q4_0::layers.28.feed_forward.w2.weight : rmse 0.00185116, maxerr 0.05222714, 95pct<0.0034, median<0.0016
q4_0::layers.28.feed_forward.w3.weight : rmse 0.00186482, maxerr 0.03631857, 95pct<0.0034, median<0.0016
q4_0::layers.29.attention.wk.weight : rmse 0.00193214, maxerr 0.02202798, 95pct<0.0038, median<0.0014
q4_0::layers.29.attention.wo.weight : rmse 0.00204716, maxerr 0.03959709, 95pct<0.0038, median<0.0016
q4_0::layers.29.attention.wq.weight : rmse 0.00192283, maxerr 0.04244995, 95pct<0.0036, median<0.0014
q4_0::layers.29.attention.wv.weight : rmse 0.00204643, maxerr 0.01519237, 95pct<0.0038, median<0.0016
q4_0::layers.29.feed_forward.w1.weight : rmse 0.00189820, maxerr 0.03314209, 95pct<0.0036, median<0.0016
q4_0::layers.29.feed_forward.w2.weight : rmse 0.00186130, maxerr 0.09802246, 95pct<0.0034, median<0.0016
q4_0::layers.29.feed_forward.w3.weight : rmse 0.00187583, maxerr 0.02655141, 95pct<0.0034, median<0.0016
q4_0::layers.3.attention.wk.weight : rmse 0.00257589, maxerr 0.03777078, 95pct<0.0052, median<0.0018
q4_0::layers.3.attention.wo.weight : rmse 0.00133662, maxerr 0.04435936, 95pct<0.0024, median<0.0012
q4_0::layers.3.attention.wq.weight : rmse 0.00246442, maxerr 0.04611765, 95pct<0.0048, median<0.0018
q4_0::layers.3.attention.wv.weight : rmse 0.00133929, maxerr 0.01030663, 95pct<0.0024, median<0.0012
q4_0::layers.3.feed_forward.w1.weight : rmse 0.00185664, maxerr 0.03087639, 95pct<0.0034, median<0.0016
q4_0::layers.3.feed_forward.w2.weight : rmse 0.00171196, maxerr 0.05057278, 95pct<0.0032, median<0.0014
q4_0::layers.3.feed_forward.w3.weight : rmse 0.00170679, maxerr 0.02278137, 95pct<0.0032, median<0.0014
q4_0::layers.30.attention.wk.weight : rmse 0.00195269, maxerr 0.03295821, 95pct<0.0038, median<0.0016
q4_0::layers.30.attention.wo.weight : rmse 0.00204545, maxerr 0.05445015, 95pct<0.0038, median<0.0016
q4_0::layers.30.attention.wq.weight : rmse 0.00194719, maxerr 0.04063878, 95pct<0.0036, median<0.0016
q4_0::layers.30.attention.wv.weight : rmse 0.00202005, maxerr 0.01512921, 95pct<0.0038, median<0.0016
q4_0::layers.30.feed_forward.w1.weight : rmse 0.00191074, maxerr 0.02958679, 95pct<0.0036, median<0.0016
q4_0::layers.30.feed_forward.w2.weight : rmse 0.00191046, maxerr 0.14257812, 95pct<0.0034, median<0.0016
q4_0::layers.30.feed_forward.w3.weight : rmse 0.00189492, maxerr 0.04852676, 95pct<0.0034, median<0.0016
q4_0::layers.31.attention.wk.weight : rmse 0.00201812, maxerr 0.02451627, 95pct<0.0038, median<0.0016
q4_0::layers.31.attention.wo.weight : rmse 0.00184503, maxerr 0.11907780, 95pct<0.0034, median<0.0014
q4_0::layers.31.attention.wq.weight : rmse 0.00197563, maxerr 0.02724165, 95pct<0.0038, median<0.0016
q4_0::layers.31.attention.wv.weight : rmse 0.00182399, maxerr 0.01841706, 95pct<0.0034, median<0.0014
q4_0::layers.31.feed_forward.w1.weight : rmse 0.00199676, maxerr 0.03135899, 95pct<0.0036, median<0.0016
q4_0::layers.31.feed_forward.w2.weight : rmse 0.00191905, maxerr 0.11260986, 95pct<0.0036, median<0.0016
q4_0::layers.31.feed_forward.w3.weight : rmse 0.00197545, maxerr 0.04486084, 95pct<0.0036, median<0.0016
q4_0::layers.4.attention.wk.weight : rmse 0.00252572, maxerr 0.03471547, 95pct<0.0050, median<0.0018
q4_0::layers.4.attention.wo.weight : rmse 0.00133709, maxerr 0.05675527, 95pct<0.0026, median<0.0012
q4_0::layers.4.attention.wq.weight : rmse 0.00250660, maxerr 0.04748535, 95pct<0.0048, median<0.0018
q4_0::layers.4.attention.wv.weight : rmse 0.00133764, maxerr 0.01021584, 95pct<0.0026, median<0.0012
q4_0::layers.4.feed_forward.w1.weight : rmse 0.00188008, maxerr 0.03756605, 95pct<0.0034, median<0.0016
q4_0::layers.4.feed_forward.w2.weight : rmse 0.00170612, maxerr 0.04783656, 95pct<0.0032, median<0.0014
q4_0::layers.4.feed_forward.w3.weight : rmse 0.00171322, maxerr 0.03393555, 95pct<0.0032, median<0.0014
q4_0::layers.5.attention.wk.weight : rmse 0.00238210, maxerr 0.03174898, 95pct<0.0046, median<0.0018
q4_0::layers.5.attention.wo.weight : rmse 0.00135344, maxerr 0.04260254, 95pct<0.0026, median<0.0012
q4_0::layers.5.attention.wq.weight : rmse 0.00236603, maxerr 0.04248789, 95pct<0.0046, median<0.0018
q4_0::layers.5.attention.wv.weight : rmse 0.00136147, maxerr 0.01390839, 95pct<0.0026, median<0.0012
q4_0::layers.5.feed_forward.w1.weight : rmse 0.00191865, maxerr 0.03069225, 95pct<0.0036, median<0.0016
q4_0::layers.5.feed_forward.w2.weight : rmse 0.00168901, maxerr 0.04306030, 95pct<0.0032, median<0.0014
q4_0::layers.5.feed_forward.w3.weight : rmse 0.00170621, maxerr 0.02728271, 95pct<0.0032, median<0.0014
q4_0::layers.6.attention.wk.weight : rmse 0.00243652, maxerr 0.02662471, 95pct<0.0048, median<0.0018
q4_0::layers.6.attention.wo.weight : rmse 0.00136724, maxerr 0.06586111, 95pct<0.0026, median<0.0012
q4_0::layers.6.attention.wq.weight : rmse 0.00238362, maxerr 0.04891968, 95pct<0.0046, median<0.0018
q4_0::layers.6.attention.wv.weight : rmse 0.00137151, maxerr 0.01011706, 95pct<0.0026, median<0.0012
q4_0::layers.6.feed_forward.w1.weight : rmse 0.00189249, maxerr 0.04025269, 95pct<0.0036, median<0.0016
q4_0::layers.6.feed_forward.w2.weight : rmse 0.00170650, maxerr 0.04733700, 95pct<0.0032, median<0.0014
q4_0::layers.6.feed_forward.w3.weight : rmse 0.00172804, maxerr 0.02388000, 95pct<0.0032, median<0.0014
q4_0::layers.7.attention.wk.weight : rmse 0.00236151, maxerr 0.02808842, 95pct<0.0046, median<0.0018
q4_0::layers.7.attention.wo.weight : rmse 0.00139671, maxerr 0.03219414, 95pct<0.0026, median<0.0012
q4_0::layers.7.attention.wq.weight : rmse 0.00234200, maxerr 0.04630763, 95pct<0.0046, median<0.0018
q4_0::layers.7.attention.wv.weight : rmse 0.00141431, maxerr 0.01132994, 95pct<0.0026, median<0.0012
q4_0::layers.7.feed_forward.w1.weight : rmse 0.00187467, maxerr 0.03225514, 95pct<0.0034, median<0.0016
q4_0::layers.7.feed_forward.w2.weight : rmse 0.00171376, maxerr 0.04287861, 95pct<0.0032, median<0.0014
q4_0::layers.7.feed_forward.w3.weight : rmse 0.00173478, maxerr 0.02548218, 95pct<0.0032, median<0.0014
q4_0::layers.8.attention.wk.weight : rmse 0.00230324, maxerr 0.02782059, 95pct<0.0046, median<0.0016
q4_0::layers.8.attention.wo.weight : rmse 0.00139008, maxerr 0.03478485, 95pct<0.0026, median<0.0012
q4_0::layers.8.attention.wq.weight : rmse 0.00230350, maxerr 0.04038759, 95pct<0.0046, median<0.0016
q4_0::layers.8.attention.wv.weight : rmse 0.00140037, maxerr 0.01309800, 95pct<0.0026, median<0.0012
q4_0::layers.8.feed_forward.w1.weight : rmse 0.00187512, maxerr 0.03340167, 95pct<0.0034, median<0.0016
q4_0::layers.8.feed_forward.w2.weight : rmse 0.00171484, maxerr 0.03771973, 95pct<0.0032, median<0.0014
q4_0::layers.8.feed_forward.w3.weight : rmse 0.00173956, maxerr 0.02236661, 95pct<0.0032, median<0.0014
q4_0::layers.9.attention.wk.weight : rmse 0.00224144, maxerr 0.02832547, 95pct<0.0044, median<0.0016
q4_0::layers.9.attention.wo.weight : rmse 0.00137911, maxerr 0.03645405, 95pct<0.0026, median<0.0012
q4_0::layers.9.attention.wq.weight : rmse 0.00222848, maxerr 0.04025269, 95pct<0.0044, median<0.0016
q4_0::layers.9.attention.wv.weight : rmse 0.00139053, maxerr 0.01049893, 95pct<0.0026, median<0.0012
q4_0::layers.9.feed_forward.w1.weight : rmse 0.00184797, maxerr 0.04046337, 95pct<0.0034, median<0.0016
q4_0::layers.9.feed_forward.w2.weight : rmse 0.00172856, maxerr 0.04580688, 95pct<0.0032, median<0.0014
q4_0::layers.9.feed_forward.w3.weight : rmse 0.00175109, maxerr 0.04849243, 95pct<0.0032, median<0.0014
q4_0::output.weight : rmse 0.00165429, maxerr 0.02467346, 95pct<0.0032, median<0.0014
q4_0::tok_embeddings.weight : rmse 0.00163455, maxerr 0.01590976, 95pct<0.0030, median<0.0014
q4_0 : rmse 0.00184913, maxerr 0.14257812, 95pct<0.0034, median<0.0014
Now that the statistics tool has landed in master, I've rebased my branch and updated the tool to accept an …

@unbounded: I will definitely have a look at your approach, thanks a lot.

Edit: pulled in your commit and updated the stats tool. It is indeed slow ;-). 80% of the time is spent in …
Use a sweep-line approach to scan all configurations of quantization, examining every changeover point where a quantized value changes, and find the optimal scaling for each configuration analytically.
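(A hedged sketch of that kind of search, not the actual patch — assumptions: one block x[0..n), integer levels clamped to [-7, 7], reconstruction x_i ≈ d·q_i. It reuses the closed-form refit noted above; the assignment only changes where some |x_i|/d crosses a half-integer, so one candidate per interval covers every case:)

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of a changeover ("sweep line") scale search.
// q_i = clamp(round(x_i / d), -7, 7) only changes when d crosses
// |x_i| / (k + 0.5); between breakpoints the optimal d is the
// least-squares fit for the fixed assignment.
static float best_scale_rmse(const float * x, int n) {
    std::vector<float> breaks;
    for (int i = 0; i < n; i++) {
        const float ax = std::fabs(x[i]);
        if (ax == 0.0f) continue;
        for (int k = 0; k <= 6; k++) {
            breaks.push_back(ax / (k + 0.5f)); // where round(|x_i|/d) changes
        }
    }
    if (breaks.empty()) return 0.0f;
    std::sort(breaks.begin(), breaks.end());

    float  best_d   = 0.0f;
    double best_err = HUGE_VAL;
    std::vector<int> q(n);

    auto try_candidate = [&](float d0) {
        if (d0 <= 0.0f) return;
        double num = 0.0, den = 0.0;
        for (int i = 0; i < n; i++) {
            q[i] = std::clamp((int) std::lround(x[i] / d0), -7, 7);
            num += (double) x[i] * q[i];
            den += (double) q[i] * q[i];
        }
        if (den == 0.0) return;
        const float d = (float) (num / den); // closed-form refit for this assignment
        double err = 0.0;
        for (int i = 0; i < n; i++) {
            const double e = x[i] - d * q[i];
            err += e * e;
        }
        if (err < best_err) { best_err = err; best_d = d; }
    };

    try_candidate(0.5f * breaks.front()); // below all breakpoints: maximal clamping
    for (size_t b = 0; b + 1 < breaks.size(); b++) {
        try_candidate(0.5f * (breaks[b] + breaks[b + 1])); // interval midpoint
    }
    return best_d;
}
```

With O(n) levels per element and a refit per interval, this is roughly quadratic per block, which would be consistent with the "it is indeed slow" observation above.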
@@ -1,7 +1,11 @@
700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d models/7B/consolidated.00.pth
0cc0b0a3dc8cd29f005946f8364ac2bbce797e792a40c0fb4114615e4f825976 models/7B/ggml-model-f16.bin
5dec1979849d73e361a8bcc10bc8f53237cbbe435a572882dc87629e011e24b3 models/7B/ggml-model-q4_0.bin
Could you please remove the quantized models, since everyone would have their own unique quantized models.
The idea would be that model generation is deterministic across platforms and SIMD optimizations, so the files should be identical. Of course if you keep your Q4_0 files without updating to minor version 1, this wouldn't match. I might remove it for this PR, but in the long term I think it's a good idea to ensure everyone uses the same inputs.
OK, I have generated a new quantized model and the checksum matches yours.
Sorry, is this checksum for the q4_0 file that has no minor version yet?
Edit: Oh, I see, it's for minor v1. 4 bytes longer than the previous version 😅
@@ -644,7 +644,7 @@ static bool llama_model_load(
    size_t total_size = 0;
    model.n_loaded = 0;

-   while (true) {
+   while (size_t(fin.tellg()) + 12 < file_size) {
I'd rather do
int offset = 0;
...
offset += sizeof(total_size) + sizeof(model.n_loaded)
`total_size` and `model.n_loaded` are not written to or read from the file, so I don't understand why you would use their `sizeof`.
I admit that the `+ 12` could be written better. It is intended to be `sizeof(n_dims) + sizeof(length) + sizeof(ftype)`, the next three elements being read.
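(One way to make that intent self-documenting — a sketch, assuming the three fields are 4-byte `int32_t`s as the `12` suggests:)

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: name the magic "+ 12" — the size of the three header fields
// read at the start of each tensor record.
constexpr size_t TENSOR_HEADER_SIZE =
      sizeof(int32_t)   // n_dims
    + sizeof(int32_t)   // length
    + sizeof(int32_t);  // ftype

// while (size_t(fin.tellg()) + TENSOR_HEADER_SIZE < file_size) { ... }
```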
        } else {
            fprintf(stderr, "error: %s not in list of implementations\n", argv[i]);
            invalid_param = true;
        }
    } else if (arg == "-v") {
Could you please add `|| arg == "--verbose"`?
Initial perplexity test: q4_0, MINOR 0, w/o BLAS, commit 678e138 (shown as …). Final score: …

Leaving another comment to let you know the final perplexity: [655] 6.5655. Perplexity discussion for previous results.
@ivanstepanovftw Thanks for your effort. The first few values match mine exactly, so I'll trust your results. It's good to see at least a small improvement. But as I said in #397, maybe the RMSE of the quantization is a distraction. This method leads to a mean scale value of 8.092, so there will be clipping of the maximum value. I would like to see us experiment with #729, but with more (larger) scale values instead of just 7 or 8.
This combines some ideas from PR #729 and issue #397 to select a scale factor for Q4_0 with low RMS error.
In order to KISS, I simply made a table of 8 hard-coded values, after analysing the optimum values in steps of 0.1.
The result of that analysis is documented in `examples/quantize/scale.py` and reproduced here:

Error statistics (#728): …

quantize.cpp run time on 7B: …
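(In outline the selection is straightforward — a sketch only. The candidate values below are placeholders, not the PR's actual table, which comes out of the `scale.py` analysis; levels are clamped to [-7, 7] here for simplicity:)

```cpp
#include <algorithm>
#include <cmath>

// Sketch of per-block scale selection from a small table of candidates.
// The 8 values here are placeholders for the hard-coded values derived
// from the RMSE analysis in examples/quantize/scale.py.
static const float k_candidates[8] = { 6.6f, 6.9f, 7.1f, 7.4f, 7.7f, 8.0f, 8.3f, 8.6f };

static float select_scale(const float * x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    float  best_d   = 0.0f;
    double best_err = HUGE_VAL;
    for (const float c : k_candidates) {
        const float d  = amax / c;             // candidate scale
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        double err = 0.0;
        for (int i = 0; i < n; i++) {
            const int    q = std::clamp((int) std::lround(x[i] * id), -7, 7);
            const double e = x[i] - d * q;     // reconstruction error
            err += e * e;
        }
        if (err < best_err) {
            best_err = err;
            best_d   = d;
        }
    }
    return best_d;
}
```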
I introduce a minor version number at the very end of the file.
This allows us to nudge the user to re-create their files without breaking anything.
I had to modify the read loop, as it used to try to read past EOF.
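(A sketch of how such a trailer can work — names illustrative, assuming a 4-byte value appended after the last tensor, which would match the "4 bytes longer" observation earlier in the thread:)

```cpp
#include <cstdint>
#include <fstream>

// Writer: append the minor version after the last tensor (illustrative name).
void write_minor_version(std::ofstream & fout, uint32_t minor) {
    fout.write((const char *) &minor, sizeof(minor));
}

// Reader: the tensor loop now stops while a full tensor header still fits;
// a remaining 4-byte tail is the minor version (0 if the file predates it).
uint32_t read_minor_version(std::ifstream & fin, size_t file_size) {
    uint32_t minor = 0;
    if (size_t(fin.tellg()) + sizeof(minor) <= file_size) {
        fin.read((char *) &minor, sizeof(minor));
    }
    return minor;
}
```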
I removed the test of `ggml_quantize_q4_0`, which I originally wrote and which was quite minimal. This is admittedly lazy, but I couldn't think of a good test right away.
Maybe we just need to provide a model file that's not too big for the CI machines and check for equivalence after quantization.
The alignment macros are a bit of a hack. I don't have Windows to test here and don't want to keep hitting the CI with trial-and-error.
Is there a clean cross-platform way to do it? And come to think of alignment, why are the input `float`s not aligned? (edit: probably because `llama_model_quantize_internal` doesn't use `mmap`; let me see if we can force the alignment of the buffers.)
Currently running perplexity, but it's taking 12 hours here so I may not wait for that.
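(On the cross-platform question, one hedged option: since C++11 there is standard `alignas`, and the usual macro fallback wraps the MSVC and GCC spellings. Sketch below; the alignment value and buffer names are illustrative:)

```cpp
// C++11 and later: standard and portable.
alignas(32) static float quant_buf[256];

// Pre-C++11 fallback, wrapping the compiler-specific spellings:
#if defined(_MSC_VER)
#define ALIGN32 __declspec(align(32))
#else
#define ALIGN32 __attribute__((aligned(32)))
#endif
ALIGN32 static float quant_buf2[256];
```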
This does not obsolete #729, as my PR only changes the method for the model generation.
We might still use @unbounded's work and set the scale to -8 instead of +7 for the other uses of the quantization function.
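(For reference, a sketch of that idea from #729, assuming the asymmetric [-8, 7] nibble range — scale the signed element of largest magnitude onto the -8 end so the full range is used:)

```cpp
#include <cmath>

// Sketch of the "-8 instead of +7" scaling from #729 (not this PR's method).
static float scale_to_minus_8(const float * x, int n) {
    float max_x = 0.0f; // signed value with the largest magnitude
    for (int i = 0; i < n; i++) {
        if (std::fabs(x[i]) > std::fabs(max_x)) {
            max_x = x[i];
        }
    }
    return -max_x / 8.0f; // instead of std::fabs(max_x) / 7.0f
}
```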