CUDA: MMQ code deduplication + iquant support #8495

Merged

Conversation

JohannesGaessler
Collaborator

This PR deduplicates the MMQ CUDA code and adds support for all i-quants other than iq1_m. The deduplication is done by converting the data to q8_0 or q8_1 with a block size of 32 or 16 upon loading. For the int8 tensor core kernels this turned out to be faster; for q6_K and for the kernels using __dp4a it was slower, so I left those unchanged. All i-quants other than iq1_m can then be supported by simply writing functions that load the data and convert it to q8. For iq1_m there is no support, and frankly I don't think it would be worthwhile to add since the quality degradation of that format is very high.
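To make the conversion idea concrete, here is a minimal C++ sketch (hypothetical names and layout, not the actual mmq.cuh code) of what "convert to q8_1 with a block size of 32 on load" amounts to: the i-quant block is dequantized to floats and immediately requantized to 32 int8 values plus a per-block scale and sum, so the int8 tensor core / __dp4a path only ever has to deal with one q8 layout.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Hypothetical q8_1-style block: 32 int8 values plus the scale d and
    // s = d * sum(qs), which the q8_1 dot product needs for the offset term.
    struct block_q8_1_sketch {
        float  d;
        float  s;
        int8_t qs[32];
    };

    // Requantize 32 already-dequantized float values (e.g. decoded from an
    // i-quant block while loading a tile) to the q8_1-style layout above.
    static block_q8_1_sketch quantize_q8_1_sketch(const float * x) {
        float amax = 0.0f;
        for (int i = 0; i < 32; ++i) {
            amax = std::max(amax, std::fabs(x[i]));
        }
        block_q8_1_sketch b;
        b.d = amax / 127.0f;
        const float id = b.d != 0.0f ? 1.0f/b.d : 0.0f;
        int sum = 0;
        for (int i = 0; i < 32; ++i) {
            const int q = (int) std::lround(x[i] * id); // in [-127, 127]
            b.qs[i] = (int8_t) q;
            sum    += q;
        }
        b.s = b.d * sum;
        return b;
    }

The point is that each i-quant then only needs its own load-and-convert function; everything downstream of the q8 data is shared.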

@JohannesGaessler
Collaborator Author

Performance
Model | GPU | Microbatch size | Test | t/s master | t/s PR | Speedup
llama 8B IQ1_S - 1.5625 bpw RX 6800 16 pp2048 71.36 245.83 3.45
llama 8B IQ1_S - 1.5625 bpw RX 6800 32 pp2048 126.88 341.29 2.69
llama 8B IQ1_S - 1.5625 bpw RX 6800 64 pp2048 197.06 394.82 2.00
llama 8B IQ1_S - 1.5625 bpw RX 6800 128 pp2048 344.40 495.76 1.44
llama 8B IQ1_S - 1.5625 bpw RX 6800 256 pp2048 527.19 590.11 1.12
llama 8B IQ1_S - 1.5625 bpw RX 6800 512 pp2048 579.49 598.22 1.03
llama 8B IQ1_S - 1.5625 bpw RX 6800 1024 pp2048 590.47 669.40 1.13
llama 8B IQ1_S - 1.5625 bpw RX 6800 2048 pp2048 578.68 613.85 1.06
llama 8B IQ1_S - 1.5625 bpw RTX 3090 16 pp2048 355.54 1208.00 3.40
llama 8B IQ1_S - 1.5625 bpw RTX 3090 32 pp2048 669.66 1916.56 2.86
llama 8B IQ1_S - 1.5625 bpw RTX 3090 64 pp2048 1272.18 2760.97 2.17
llama 8B IQ1_S - 1.5625 bpw RTX 3090 128 pp2048 2115.73 3326.07 1.57
llama 8B IQ1_S - 1.5625 bpw RTX 3090 256 pp2048 3060.69 3768.10 1.23
llama 8B IQ1_S - 1.5625 bpw RTX 3090 512 pp2048 3597.01 3938.13 1.09
llama 8B IQ1_S - 1.5625 bpw RTX 3090 1024 pp2048 4169.33 4040.79 0.97
llama 8B IQ1_S - 1.5625 bpw RTX 3090 2048 pp2048 4194.84 3933.14 0.94
llama 8B IQ1_S - 1.5625 bpw RTX 4090 16 pp2048 537.90 1956.16 3.64
llama 8B IQ1_S - 1.5625 bpw RTX 4090 32 pp2048 1057.79 4015.56 3.80
llama 8B IQ1_S - 1.5625 bpw RTX 4090 64 pp2048 1962.03 6245.68 3.18
llama 8B IQ1_S - 1.5625 bpw RTX 4090 128 pp2048 3690.13 8027.31 2.18
llama 8B IQ1_S - 1.5625 bpw RTX 4090 256 pp2048 5947.65 10379.01 1.75
llama 8B IQ1_S - 1.5625 bpw RTX 4090 512 pp2048 7911.42 11477.14 1.45
llama 8B IQ1_S - 1.5625 bpw RTX 4090 1024 pp2048 9016.49 11101.78 1.23
llama 8B IQ1_S - 1.5625 bpw RTX 4090 2048 pp2048 8936.54 9919.53 1.11
llama 8B IQ1_S - 1.5625 bpw P40 16 pp2048 37.65 389.08 10.33
llama 8B IQ1_S - 1.5625 bpw P40 32 pp2048 74.28 547.24 7.37
llama 8B IQ1_S - 1.5625 bpw P40 64 pp2048 117.27 616.19 5.25
llama 8B IQ1_S - 1.5625 bpw P40 128 pp2048 188.05 719.08 3.82
llama 8B IQ1_S - 1.5625 bpw P40 256 pp2048 295.37 810.74 2.74
llama 8B IQ1_S - 1.5625 bpw P40 512 pp2048 415.42 850.97 2.05
llama 8B IQ1_S - 1.5625 bpw P40 1024 pp2048 495.29 837.45 1.69
llama 8B IQ1_S - 1.5625 bpw P40 2048 pp2048 522.94 800.22 1.53
llama 8B IQ2_S - 2.5 bpw RX 6800 16 pp2048 70.50 212.31 3.01
llama 8B IQ2_S - 2.5 bpw RX 6800 32 pp2048 125.24 310.58 2.48
llama 8B IQ2_S - 2.5 bpw RX 6800 64 pp2048 195.56 379.75 1.94
llama 8B IQ2_S - 2.5 bpw RX 6800 128 pp2048 347.37 479.09 1.38
llama 8B IQ2_S - 2.5 bpw RX 6800 256 pp2048 533.66 565.46 1.06
llama 8B IQ2_S - 2.5 bpw RX 6800 512 pp2048 585.16 572.06 0.98
llama 8B IQ2_S - 2.5 bpw RX 6800 1024 pp2048 595.01 635.16 1.07
llama 8B IQ2_S - 2.5 bpw RX 6800 2048 pp2048 579.77 581.30 1.00
llama 8B IQ2_S - 2.5 bpw RTX 3090 16 pp2048 322.96 1110.28 3.44
llama 8B IQ2_S - 2.5 bpw RTX 3090 32 pp2048 611.47 1727.85 2.83
llama 8B IQ2_S - 2.5 bpw RTX 3090 64 pp2048 1170.72 2544.22 2.17
llama 8B IQ2_S - 2.5 bpw RTX 3090 128 pp2048 1991.36 3084.78 1.55
llama 8B IQ2_S - 2.5 bpw RTX 3090 256 pp2048 2954.34 3477.51 1.18
llama 8B IQ2_S - 2.5 bpw RTX 3090 512 pp2048 3505.89 3657.13 1.04
llama 8B IQ2_S - 2.5 bpw RTX 3090 1024 pp2048 4091.96 3671.58 0.90
llama 8B IQ2_S - 2.5 bpw RTX 3090 2048 pp2048 4143.53 3527.68 0.85
llama 8B IQ2_S - 2.5 bpw RTX 4090 16 pp2048 515.00 2145.23 4.17
llama 8B IQ2_S - 2.5 bpw RTX 4090 32 pp2048 1012.26 3573.84 3.53
llama 8B IQ2_S - 2.5 bpw RTX 4090 64 pp2048 1888.40 5693.41 3.01
llama 8B IQ2_S - 2.5 bpw RTX 4090 128 pp2048 3572.62 7480.11 2.09
llama 8B IQ2_S - 2.5 bpw RTX 4090 256 pp2048 5830.42 9519.30 1.63
llama 8B IQ2_S - 2.5 bpw RTX 4090 512 pp2048 7597.60 9945.81 1.31
llama 8B IQ2_S - 2.5 bpw RTX 4090 1024 pp2048 8510.08 9521.63 1.12
llama 8B IQ2_S - 2.5 bpw RTX 4090 2048 pp2048 7950.37 8228.50 1.03
llama 8B IQ2_S - 2.5 bpw P40 16 pp2048 36.51 348.53 9.55
llama 8B IQ2_S - 2.5 bpw P40 32 pp2048 72.10 475.48 6.59
llama 8B IQ2_S - 2.5 bpw P40 64 pp2048 113.47 587.33 5.18
llama 8B IQ2_S - 2.5 bpw P40 128 pp2048 179.35 699.83 3.90
llama 8B IQ2_S - 2.5 bpw P40 256 pp2048 285.87 790.54 2.77
llama 8B IQ2_S - 2.5 bpw P40 512 pp2048 406.92 830.63 2.04
llama 8B IQ2_S - 2.5 bpw P40 1024 pp2048 486.02 819.36 1.69
llama 8B IQ2_S - 2.5 bpw P40 2048 pp2048 511.33 781.25 1.53
llama 8B IQ2_XS - 2.3125 bpw RX 6800 16 pp2048 71.09 211.20 2.97
llama 8B IQ2_XS - 2.3125 bpw RX 6800 32 pp2048 126.50 307.75 2.43
llama 8B IQ2_XS - 2.3125 bpw RX 6800 64 pp2048 196.70 372.89 1.90
llama 8B IQ2_XS - 2.3125 bpw RX 6800 128 pp2048 344.00 467.61 1.36
llama 8B IQ2_XS - 2.3125 bpw RX 6800 256 pp2048 526.33 553.45 1.05
llama 8B IQ2_XS - 2.3125 bpw RX 6800 512 pp2048 579.11 563.73 0.97
llama 8B IQ2_XS - 2.3125 bpw RX 6800 1024 pp2048 589.49 625.53 1.06
llama 8B IQ2_XS - 2.3125 bpw RX 6800 2048 pp2048 578.51 575.84 1.00
llama 8B IQ2_XS - 2.3125 bpw RTX 3090 16 pp2048 330.14 1128.10 3.42
llama 8B IQ2_XS - 2.3125 bpw RTX 3090 32 pp2048 621.15 1729.91 2.78
llama 8B IQ2_XS - 2.3125 bpw RTX 3090 64 pp2048 1186.35 2510.04 2.12
llama 8B IQ2_XS - 2.3125 bpw RTX 3090 128 pp2048 1982.17 2999.54 1.51
llama 8B IQ2_XS - 2.3125 bpw RTX 3090 256 pp2048 2937.32 3381.81 1.15
llama 8B IQ2_XS - 2.3125 bpw RTX 3090 512 pp2048 3486.24 3555.15 1.02
llama 8B IQ2_XS - 2.3125 bpw RTX 3090 1024 pp2048 3994.26 3628.69 0.91
llama 8B IQ2_XS - 2.3125 bpw RTX 3090 2048 pp2048 4110.74 3545.34 0.86
llama 8B IQ2_XS - 2.3125 bpw RTX 4090 16 pp2048 529.43 2231.90 4.22
llama 8B IQ2_XS - 2.3125 bpw RTX 4090 32 pp2048 1042.73 3674.42 3.52
llama 8B IQ2_XS - 2.3125 bpw RTX 4090 64 pp2048 1935.80 5707.91 2.95
llama 8B IQ2_XS - 2.3125 bpw RTX 4090 128 pp2048 3631.85 7439.87 2.05
llama 8B IQ2_XS - 2.3125 bpw RTX 4090 256 pp2048 5886.87 9381.76 1.59
llama 8B IQ2_XS - 2.3125 bpw RTX 4090 512 pp2048 7850.31 10109.82 1.29
llama 8B IQ2_XS - 2.3125 bpw RTX 4090 1024 pp2048 8984.33 9941.22 1.11
llama 8B IQ2_XS - 2.3125 bpw RTX 4090 2048 pp2048 8908.32 8988.81 1.01
llama 8B IQ2_XS - 2.3125 bpw P40 16 pp2048 37.62 349.74 9.30
llama 8B IQ2_XS - 2.3125 bpw P40 32 pp2048 74.33 464.01 6.24
llama 8B IQ2_XS - 2.3125 bpw P40 64 pp2048 117.46 592.29 5.04
llama 8B IQ2_XS - 2.3125 bpw P40 128 pp2048 189.35 701.98 3.71
llama 8B IQ2_XS - 2.3125 bpw P40 256 pp2048 296.17 791.88 2.67
llama 8B IQ2_XS - 2.3125 bpw P40 512 pp2048 415.48 833.32 2.01
llama 8B IQ2_XS - 2.3125 bpw P40 1024 pp2048 495.55 823.80 1.66
llama 8B IQ2_XS - 2.3125 bpw P40 2048 pp2048 521.07 788.80 1.51
llama 8B IQ2_XXS - 2.0625 bpw RX 6800 16 pp2048 71.13 210.94 2.97
llama 8B IQ2_XXS - 2.0625 bpw RX 6800 32 pp2048 126.59 310.59 2.45
llama 8B IQ2_XXS - 2.0625 bpw RX 6800 64 pp2048 196.74 386.29 1.96
llama 8B IQ2_XXS - 2.0625 bpw RX 6800 128 pp2048 343.90 486.67 1.42
llama 8B IQ2_XXS - 2.0625 bpw RX 6800 256 pp2048 526.43 579.28 1.10
llama 8B IQ2_XXS - 2.0625 bpw RX 6800 512 pp2048 579.59 586.98 1.01
llama 8B IQ2_XXS - 2.0625 bpw RX 6800 1024 pp2048 592.60 656.27 1.11
llama 8B IQ2_XXS - 2.0625 bpw RX 6800 2048 pp2048 580.42 601.83 1.04
llama 8B IQ2_XXS - 2.0625 bpw RTX 3090 16 pp2048 345.83 1140.41 3.30
llama 8B IQ2_XXS - 2.0625 bpw RTX 3090 32 pp2048 642.45 1856.45 2.89
llama 8B IQ2_XXS - 2.0625 bpw RTX 3090 64 pp2048 1224.86 2797.71 2.28
llama 8B IQ2_XXS - 2.0625 bpw RTX 3090 128 pp2048 2055.07 3488.32 1.70
llama 8B IQ2_XXS - 2.0625 bpw RTX 3090 256 pp2048 2989.49 3936.71 1.32
llama 8B IQ2_XXS - 2.0625 bpw RTX 3090 512 pp2048 3516.58 4161.70 1.18
llama 8B IQ2_XXS - 2.0625 bpw RTX 3090 1024 pp2048 4078.77 4278.76 1.05
llama 8B IQ2_XXS - 2.0625 bpw RTX 3090 2048 pp2048 4144.55 4168.99 1.01
llama 8B IQ2_XXS - 2.0625 bpw RTX 4090 16 pp2048 532.63 2266.50 4.26
llama 8B IQ2_XXS - 2.0625 bpw RTX 4090 32 pp2048 1045.01 3885.14 3.72
llama 8B IQ2_XXS - 2.0625 bpw RTX 4090 64 pp2048 1944.40 6300.56 3.24
llama 8B IQ2_XXS - 2.0625 bpw RTX 4090 128 pp2048 3657.66 8300.52 2.27
llama 8B IQ2_XXS - 2.0625 bpw RTX 4090 256 pp2048 5910.69 10773.95 1.82
llama 8B IQ2_XXS - 2.0625 bpw RTX 4090 512 pp2048 7885.28 11858.44 1.50
llama 8B IQ2_XXS - 2.0625 bpw RTX 4090 1024 pp2048 8996.48 11471.93 1.28
llama 8B IQ2_XXS - 2.0625 bpw RTX 4090 2048 pp2048 8946.50 10195.18 1.14
llama 8B IQ2_XXS - 2.0625 bpw P40 16 pp2048 37.49 363.76 9.70
llama 8B IQ2_XXS - 2.0625 bpw P40 32 pp2048 73.96 534.20 7.22
llama 8B IQ2_XXS - 2.0625 bpw P40 64 pp2048 116.84 592.01 5.07
llama 8B IQ2_XXS - 2.0625 bpw P40 128 pp2048 188.23 701.77 3.73
llama 8B IQ2_XXS - 2.0625 bpw P40 256 pp2048 295.12 792.59 2.69
llama 8B IQ2_XXS - 2.0625 bpw P40 512 pp2048 414.91 834.12 2.01
llama 8B IQ2_XXS - 2.0625 bpw P40 1024 pp2048 494.95 824.74 1.67
llama 8B IQ2_XXS - 2.0625 bpw P40 2048 pp2048 520.41 789.70 1.52
llama 8B IQ3_S - 3.4375 bpw RX 6800 16 pp2048 70.10 211.71 3.02
llama 8B IQ3_S - 3.4375 bpw RX 6800 32 pp2048 124.91 314.11 2.51
llama 8B IQ3_S - 3.4375 bpw RX 6800 64 pp2048 195.43 391.09 2.00
llama 8B IQ3_S - 3.4375 bpw RX 6800 128 pp2048 346.08 493.11 1.42
llama 8B IQ3_S - 3.4375 bpw RX 6800 256 pp2048 530.73 584.64 1.10
llama 8B IQ3_S - 3.4375 bpw RX 6800 512 pp2048 584.42 590.85 1.01
llama 8B IQ3_S - 3.4375 bpw RX 6800 1024 pp2048 590.78 657.73 1.11
llama 8B IQ3_S - 3.4375 bpw RX 6800 2048 pp2048 581.40 601.84 1.04
llama 8B IQ3_S - 3.4375 bpw RTX 3090 16 pp2048 328.90 1003.01 3.05
llama 8B IQ3_S - 3.4375 bpw RTX 3090 32 pp2048 625.84 1685.31 2.69
llama 8B IQ3_S - 3.4375 bpw RTX 3090 64 pp2048 1198.81 2729.18 2.28
llama 8B IQ3_S - 3.4375 bpw RTX 3090 128 pp2048 2030.01 3472.59 1.71
llama 8B IQ3_S - 3.4375 bpw RTX 3090 256 pp2048 2994.96 3966.49 1.32
llama 8B IQ3_S - 3.4375 bpw RTX 3090 512 pp2048 3525.21 4151.76 1.18
llama 8B IQ3_S - 3.4375 bpw RTX 3090 1024 pp2048 4083.91 4201.95 1.03
llama 8B IQ3_S - 3.4375 bpw RTX 3090 2048 pp2048 4072.81 4026.83 0.99
llama 8B IQ3_S - 3.4375 bpw RTX 4090 16 pp2048 501.48 1707.30 3.40
llama 8B IQ3_S - 3.4375 bpw RTX 4090 32 pp2048 989.87 3069.61 3.10
llama 8B IQ3_S - 3.4375 bpw RTX 4090 64 pp2048 1846.74 5703.12 3.09
llama 8B IQ3_S - 3.4375 bpw RTX 4090 128 pp2048 3479.72 7787.52 2.24
llama 8B IQ3_S - 3.4375 bpw RTX 4090 256 pp2048 5721.47 10473.88 1.83
llama 8B IQ3_S - 3.4375 bpw RTX 4090 512 pp2048 7494.42 11184.52 1.49
llama 8B IQ3_S - 3.4375 bpw RTX 4090 1024 pp2048 8488.27 10522.71 1.24
llama 8B IQ3_S - 3.4375 bpw RTX 4090 2048 pp2048 7954.46 8944.74 1.12
llama 8B IQ3_S - 3.4375 bpw P40 16 pp2048 36.40 335.57 9.22
llama 8B IQ3_S - 3.4375 bpw P40 32 pp2048 71.98 521.85 7.25
llama 8B IQ3_S - 3.4375 bpw P40 64 pp2048 112.97 575.59 5.10
llama 8B IQ3_S - 3.4375 bpw P40 128 pp2048 180.55 688.69 3.81
llama 8B IQ3_S - 3.4375 bpw P40 256 pp2048 285.55 780.64 2.73
llama 8B IQ3_S - 3.4375 bpw P40 512 pp2048 406.24 822.90 2.03
llama 8B IQ3_S - 3.4375 bpw P40 1024 pp2048 486.12 810.37 1.67
llama 8B IQ3_S - 3.4375 bpw P40 2048 pp2048 510.18 773.40 1.52
llama 8B IQ3_S mix - 3.66 bpw RX 6800 16 pp2048 73.51 215.02 2.93
llama 8B IQ3_S mix - 3.66 bpw RX 6800 32 pp2048 130.61 311.82 2.39
llama 8B IQ3_S mix - 3.66 bpw RX 6800 64 pp2048 203.35 382.47 1.88
llama 8B IQ3_S mix - 3.66 bpw RX 6800 128 pp2048 346.54 481.26 1.39
llama 8B IQ3_S mix - 3.66 bpw RX 6800 256 pp2048 528.16 571.78 1.08
llama 8B IQ3_S mix - 3.66 bpw RX 6800 512 pp2048 582.26 580.56 1.00
llama 8B IQ3_S mix - 3.66 bpw RX 6800 1024 pp2048 601.59 646.13 1.07
llama 8B IQ3_S mix - 3.66 bpw RX 6800 2048 pp2048 579.66 591.26 1.02
llama 8B IQ3_S mix - 3.66 bpw RTX 3090 16 pp2048 361.53 1031.70 2.85
llama 8B IQ3_S mix - 3.66 bpw RTX 3090 32 pp2048 677.51 1720.06 2.54
llama 8B IQ3_S mix - 3.66 bpw RTX 3090 64 pp2048 1275.19 2740.30 2.15
llama 8B IQ3_S mix - 3.66 bpw RTX 3090 128 pp2048 2104.50 3433.70 1.63
llama 8B IQ3_S mix - 3.66 bpw RTX 3090 256 pp2048 3046.42 3947.68 1.30
llama 8B IQ3_S mix - 3.66 bpw RTX 3090 512 pp2048 3560.99 4182.03 1.17
llama 8B IQ3_S mix - 3.66 bpw RTX 3090 1024 pp2048 4031.65 4195.09 1.04
llama 8B IQ3_S mix - 3.66 bpw RTX 3090 2048 pp2048 4044.28 4033.87 1.00
llama 8B IQ3_S mix - 3.66 bpw RTX 4090 16 pp2048 534.11 1722.54 3.23
llama 8B IQ3_S mix - 3.66 bpw RTX 4090 32 pp2048 1053.03 3116.69 2.96
llama 8B IQ3_S mix - 3.66 bpw RTX 4090 64 pp2048 1952.03 5680.05 2.91
llama 8B IQ3_S mix - 3.66 bpw RTX 4090 128 pp2048 3632.26 7832.42 2.16
llama 8B IQ3_S mix - 3.66 bpw RTX 4090 256 pp2048 5902.53 10448.62 1.77
llama 8B IQ3_S mix - 3.66 bpw RTX 4090 512 pp2048 7710.43 11132.02 1.44
llama 8B IQ3_S mix - 3.66 bpw RTX 4090 1024 pp2048 8591.07 10513.94 1.22
llama 8B IQ3_S mix - 3.66 bpw RTX 4090 2048 pp2048 8046.85 8909.71 1.11
llama 8B IQ3_S mix - 3.66 bpw P40 16 pp2048 40.72 341.19 8.38
llama 8B IQ3_S mix - 3.66 bpw P40 32 pp2048 79.85 509.35 6.38
llama 8B IQ3_S mix - 3.66 bpw P40 64 pp2048 125.93 584.10 4.64
llama 8B IQ3_S mix - 3.66 bpw P40 128 pp2048 199.23 697.92 3.50
llama 8B IQ3_S mix - 3.66 bpw P40 256 pp2048 311.26 787.90 2.53
llama 8B IQ3_S mix - 3.66 bpw P40 512 pp2048 435.21 827.64 1.90
llama 8B IQ3_S mix - 3.66 bpw P40 1024 pp2048 515.06 815.44 1.58
llama 8B IQ3_S mix - 3.66 bpw P40 2048 pp2048 533.73 777.64 1.46
llama 8B IQ3_XS - 3.3 bpw RX 6800 16 pp2048 69.90 216.13 3.09
llama 8B IQ3_XS - 3.3 bpw RX 6800 32 pp2048 124.60 318.86 2.56
llama 8B IQ3_XS - 3.3 bpw RX 6800 64 pp2048 195.19 394.77 2.02
llama 8B IQ3_XS - 3.3 bpw RX 6800 128 pp2048 346.86 500.19 1.44
llama 8B IQ3_XS - 3.3 bpw RX 6800 256 pp2048 532.27 593.68 1.12
llama 8B IQ3_XS - 3.3 bpw RX 6800 512 pp2048 585.22 599.23 1.02
llama 8B IQ3_XS - 3.3 bpw RX 6800 1024 pp2048 593.39 669.32 1.13
llama 8B IQ3_XS - 3.3 bpw RX 6800 2048 pp2048 579.46 611.80 1.06
llama 8B IQ3_XS - 3.3 bpw RTX 3090 16 pp2048 334.22 1046.57 3.13
llama 8B IQ3_XS - 3.3 bpw RTX 3090 32 pp2048 630.74 1727.08 2.74
llama 8B IQ3_XS - 3.3 bpw RTX 3090 64 pp2048 1207.12 2698.07 2.24
llama 8B IQ3_XS - 3.3 bpw RTX 3090 128 pp2048 2040.94 3407.47 1.67
llama 8B IQ3_XS - 3.3 bpw RTX 3090 256 pp2048 3008.15 3913.21 1.30
llama 8B IQ3_XS - 3.3 bpw RTX 3090 512 pp2048 3530.04 4090.38 1.16
llama 8B IQ3_XS - 3.3 bpw RTX 3090 1024 pp2048 4075.24 4123.09 1.01
llama 8B IQ3_XS - 3.3 bpw RTX 3090 2048 pp2048 4050.25 3971.37 0.98
llama 8B IQ3_XS - 3.3 bpw RTX 4090 16 pp2048 503.68 1876.94 3.73
llama 8B IQ3_XS - 3.3 bpw RTX 4090 32 pp2048 993.23 3305.06 3.33
llama 8B IQ3_XS - 3.3 bpw RTX 4090 64 pp2048 1847.53 5785.24 3.13
llama 8B IQ3_XS - 3.3 bpw RTX 4090 128 pp2048 3508.00 7749.75 2.21
llama 8B IQ3_XS - 3.3 bpw RTX 4090 256 pp2048 5738.05 10416.00 1.82
llama 8B IQ3_XS - 3.3 bpw RTX 4090 512 pp2048 7516.13 11222.96 1.49
llama 8B IQ3_XS - 3.3 bpw RTX 4090 1024 pp2048 8479.89 10588.00 1.25
llama 8B IQ3_XS - 3.3 bpw RTX 4090 2048 pp2048 8036.04 8970.67 1.12
llama 8B IQ3_XS - 3.3 bpw P40 16 pp2048 36.22 330.14 9.11
llama 8B IQ3_XS - 3.3 bpw P40 32 pp2048 71.61 486.97 6.80
llama 8B IQ3_XS - 3.3 bpw P40 64 pp2048 112.34 571.37 5.09
llama 8B IQ3_XS - 3.3 bpw P40 128 pp2048 180.55 690.60 3.82
llama 8B IQ3_XS - 3.3 bpw P40 256 pp2048 286.03 781.44 2.73
llama 8B IQ3_XS - 3.3 bpw P40 512 pp2048 403.96 822.79 2.04
llama 8B IQ3_XS - 3.3 bpw P40 1024 pp2048 483.11 812.77 1.68
llama 8B IQ3_XS - 3.3 bpw P40 2048 pp2048 502.66 775.80 1.54
llama 8B IQ3_XXS - 3.0625 bpw RX 6800 16 pp2048 70.02 219.99 3.14
llama 8B IQ3_XXS - 3.0625 bpw RX 6800 32 pp2048 124.66 320.94 2.57
llama 8B IQ3_XXS - 3.0625 bpw RX 6800 64 pp2048 195.18 395.17 2.02
llama 8B IQ3_XXS - 3.0625 bpw RX 6800 128 pp2048 347.01 502.08 1.45
llama 8B IQ3_XXS - 3.0625 bpw RX 6800 256 pp2048 532.19 595.38 1.12
llama 8B IQ3_XXS - 3.0625 bpw RX 6800 512 pp2048 584.08 602.09 1.03
llama 8B IQ3_XXS - 3.0625 bpw RX 6800 1024 pp2048 593.12 673.22 1.14
llama 8B IQ3_XXS - 3.0625 bpw RX 6800 2048 pp2048 579.57 615.21 1.06
llama 8B IQ3_XXS - 3.0625 bpw RTX 3090 16 pp2048 331.44 1102.09 3.33
llama 8B IQ3_XXS - 3.0625 bpw RTX 3090 32 pp2048 627.19 1794.34 2.86
llama 8B IQ3_XXS - 3.0625 bpw RTX 3090 64 pp2048 1197.81 2713.58 2.27
llama 8B IQ3_XXS - 3.0625 bpw RTX 3090 128 pp2048 2034.09 3407.33 1.68
llama 8B IQ3_XXS - 3.0625 bpw RTX 3090 256 pp2048 2997.43 3869.58 1.29
llama 8B IQ3_XXS - 3.0625 bpw RTX 3090 512 pp2048 3528.42 4077.04 1.16
llama 8B IQ3_XXS - 3.0625 bpw RTX 3090 1024 pp2048 4084.68 4125.49 1.01
llama 8B IQ3_XXS - 3.0625 bpw RTX 3090 2048 pp2048 4059.76 3948.29 0.97
llama 8B IQ3_XXS - 3.0625 bpw RTX 4090 16 pp2048 501.63 2063.41 4.11
llama 8B IQ3_XXS - 3.0625 bpw RTX 4090 32 pp2048 991.54 3558.68 3.59
llama 8B IQ3_XXS - 3.0625 bpw RTX 4090 64 pp2048 1845.96 5825.96 3.16
llama 8B IQ3_XXS - 3.0625 bpw RTX 4090 128 pp2048 3495.96 7700.75 2.20
llama 8B IQ3_XXS - 3.0625 bpw RTX 4090 256 pp2048 5744.89 10339.57 1.80
llama 8B IQ3_XXS - 3.0625 bpw RTX 4090 512 pp2048 7539.49 11201.61 1.49
llama 8B IQ3_XXS - 3.0625 bpw RTX 4090 1024 pp2048 8485.51 10478.22 1.23
llama 8B IQ3_XXS - 3.0625 bpw RTX 4090 2048 pp2048 8033.42 8833.12 1.10
llama 8B IQ3_XXS - 3.0625 bpw P40 16 pp2048 36.13 332.45 9.20
llama 8B IQ3_XXS - 3.0625 bpw P40 32 pp2048 71.44 433.22 6.06
llama 8B IQ3_XXS - 3.0625 bpw P40 64 pp2048 112.15 580.91 5.18
llama 8B IQ3_XXS - 3.0625 bpw P40 128 pp2048 179.08 698.56 3.90
llama 8B IQ3_XXS - 3.0625 bpw P40 256 pp2048 284.98 788.90 2.77
llama 8B IQ3_XXS - 3.0625 bpw P40 512 pp2048 404.95 829.93 2.05
llama 8B IQ3_XXS - 3.0625 bpw P40 1024 pp2048 485.28 818.65 1.69
llama 8B IQ3_XXS - 3.0625 bpw P40 2048 pp2048 511.11 780.55 1.53
llama 8B IQ4_NL - 4.5 bpw RX 6800 16 pp2048 237.62 237.62 1.00
llama 8B IQ4_NL - 4.5 bpw RX 6800 32 pp2048 333.15 332.78 1.00
llama 8B IQ4_NL - 4.5 bpw RX 6800 64 pp2048 411.14 411.52 1.00
llama 8B IQ4_NL - 4.5 bpw RX 6800 128 pp2048 519.85 519.22 1.00
llama 8B IQ4_NL - 4.5 bpw RX 6800 256 pp2048 619.13 619.33 1.00
llama 8B IQ4_NL - 4.5 bpw RX 6800 512 pp2048 624.69 624.30 1.00
llama 8B IQ4_NL - 4.5 bpw RX 6800 1024 pp2048 703.15 702.11 1.00
llama 8B IQ4_NL - 4.5 bpw RX 6800 2048 pp2048 640.50 639.97 1.00
llama 8B IQ4_NL - 4.5 bpw RTX 3090 16 pp2048 1039.39 1039.12 1.00
llama 8B IQ4_NL - 4.5 bpw RTX 3090 32 pp2048 1689.45 1704.30 1.01
llama 8B IQ4_NL - 4.5 bpw RTX 3090 64 pp2048 2593.34 2647.04 1.02
llama 8B IQ4_NL - 4.5 bpw RTX 3090 128 pp2048 3352.62 3417.83 1.02
llama 8B IQ4_NL - 4.5 bpw RTX 3090 256 pp2048 3835.73 3904.69 1.02
llama 8B IQ4_NL - 4.5 bpw RTX 3090 512 pp2048 4007.75 4063.57 1.01
llama 8B IQ4_NL - 4.5 bpw RTX 3090 1024 pp2048 4064.58 4131.09 1.02
llama 8B IQ4_NL - 4.5 bpw RTX 3090 2048 pp2048 3929.10 3996.20 1.02
llama 8B IQ4_NL - 4.5 bpw RTX 4090 16 pp2048 1934.68 1948.50 1.01
llama 8B IQ4_NL - 4.5 bpw RTX 4090 32 pp2048 3418.36 3430.55 1.00
llama 8B IQ4_NL - 4.5 bpw RTX 4090 64 pp2048 5660.17 5708.00 1.01
llama 8B IQ4_NL - 4.5 bpw RTX 4090 128 pp2048 7948.73 8007.61 1.01
llama 8B IQ4_NL - 4.5 bpw RTX 4090 256 pp2048 10499.37 10590.01 1.01
llama 8B IQ4_NL - 4.5 bpw RTX 4090 512 pp2048 11350.47 11415.76 1.01
llama 8B IQ4_NL - 4.5 bpw RTX 4090 1024 pp2048 10996.62 11086.95 1.01
llama 8B IQ4_NL - 4.5 bpw RTX 4090 2048 pp2048 9859.54 9939.39 1.01
llama 8B IQ4_NL - 4.5 bpw P40 16 pp2048 254.23 253.79 1.00
llama 8B IQ4_NL - 4.5 bpw P40 32 pp2048 418.40 417.51 1.00
llama 8B IQ4_NL - 4.5 bpw P40 64 pp2048 581.70 581.08 1.00
llama 8B IQ4_NL - 4.5 bpw P40 128 pp2048 689.91 689.88 1.00
llama 8B IQ4_NL - 4.5 bpw P40 256 pp2048 779.54 779.59 1.00
llama 8B IQ4_NL - 4.5 bpw P40 512 pp2048 819.75 819.67 1.00
llama 8B IQ4_NL - 4.5 bpw P40 1024 pp2048 807.31 807.65 1.00
llama 8B IQ4_NL - 4.5 bpw P40 2048 pp2048 774.01 773.86 1.00
llama 8B IQ4_XS - 4.25 bpw RX 6800 16 pp2048 235.89 236.12 1.00
llama 8B IQ4_XS - 4.25 bpw RX 6800 32 pp2048 334.04 333.88 1.00
llama 8B IQ4_XS - 4.25 bpw RX 6800 64 pp2048 411.76 412.05 1.00
llama 8B IQ4_XS - 4.25 bpw RX 6800 128 pp2048 520.89 521.56 1.00
llama 8B IQ4_XS - 4.25 bpw RX 6800 256 pp2048 621.58 620.01 1.00
llama 8B IQ4_XS - 4.25 bpw RX 6800 512 pp2048 625.26 626.74 1.00
llama 8B IQ4_XS - 4.25 bpw RX 6800 1024 pp2048 703.03 704.26 1.00
llama 8B IQ4_XS - 4.25 bpw RX 6800 2048 pp2048 640.33 640.59 1.00
llama 8B IQ4_XS - 4.25 bpw RTX 3090 16 pp2048 1056.39 1063.86 1.01
llama 8B IQ4_XS - 4.25 bpw RTX 3090 32 pp2048 1740.15 1733.07 1.00
llama 8B IQ4_XS - 4.25 bpw RTX 3090 64 pp2048 2647.35 2622.94 0.99
llama 8B IQ4_XS - 4.25 bpw RTX 3090 128 pp2048 3385.12 3389.70 1.00
llama 8B IQ4_XS - 4.25 bpw RTX 3090 256 pp2048 3862.03 3855.10 1.00
llama 8B IQ4_XS - 4.25 bpw RTX 3090 512 pp2048 4072.53 4013.54 0.99
llama 8B IQ4_XS - 4.25 bpw RTX 3090 1024 pp2048 4094.69 4073.66 0.99
llama 8B IQ4_XS - 4.25 bpw RTX 3090 2048 pp2048 3973.30 3958.53 1.00
llama 8B IQ4_XS - 4.25 bpw RTX 4090 16 pp2048 2017.53 2014.15 1.00
llama 8B IQ4_XS - 4.25 bpw RTX 4090 32 pp2048 3560.99 3552.70 1.00
llama 8B IQ4_XS - 4.25 bpw RTX 4090 64 pp2048 5817.20 5863.76 1.01
llama 8B IQ4_XS - 4.25 bpw RTX 4090 128 pp2048 8065.31 8114.48 1.01
llama 8B IQ4_XS - 4.25 bpw RTX 4090 256 pp2048 10600.25 10661.71 1.01
llama 8B IQ4_XS - 4.25 bpw RTX 4090 512 pp2048 11437.48 11499.79 1.01
llama 8B IQ4_XS - 4.25 bpw RTX 4090 1024 pp2048 11087.14 11138.35 1.00
llama 8B IQ4_XS - 4.25 bpw RTX 4090 2048 pp2048 9857.00 9977.37 1.01
llama 8B IQ4_XS - 4.25 bpw P40 16 pp2048 257.30 256.98 1.00
llama 8B IQ4_XS - 4.25 bpw P40 32 pp2048 422.13 422.04 1.00
llama 8B IQ4_XS - 4.25 bpw P40 64 pp2048 596.35 596.41 1.00
llama 8B IQ4_XS - 4.25 bpw P40 128 pp2048 702.46 702.45 1.00
llama 8B IQ4_XS - 4.25 bpw P40 256 pp2048 793.89 793.97 1.00
llama 8B IQ4_XS - 4.25 bpw P40 512 pp2048 834.16 834.65 1.00
llama 8B IQ4_XS - 4.25 bpw P40 1024 pp2048 824.37 824.39 1.00
llama 8B IQ4_XS - 4.25 bpw P40 2048 pp2048 788.77 788.88 1.00
llama 8B Q2_K_M RX 6800 16 pp2048 184.39 184.49 1.00
llama 8B Q2_K_M RX 6800 32 pp2048 261.97 261.62 1.00
llama 8B Q2_K_M RX 6800 64 pp2048 286.71 286.75 1.00
llama 8B Q2_K_M RX 6800 128 pp2048 354.95 355.12 1.00
llama 8B Q2_K_M RX 6800 256 pp2048 426.57 426.49 1.00
llama 8B Q2_K_M RX 6800 512 pp2048 439.37 439.73 1.00
llama 8B Q2_K_M RX 6800 1024 pp2048 481.96 481.90 1.00
llama 8B Q2_K_M RX 6800 2048 pp2048 456.30 455.32 1.00
llama 8B Q2_K_M RTX 3090 16 pp2048 1166.52 1189.89 1.02
llama 8B Q2_K_M RTX 3090 32 pp2048 1789.54 1766.28 0.99
llama 8B Q2_K_M RTX 3090 64 pp2048 2404.13 2413.69 1.00
llama 8B Q2_K_M RTX 3090 128 pp2048 2466.32 2463.56 1.00
llama 8B Q2_K_M RTX 3090 256 pp2048 2936.99 2894.32 0.99
llama 8B Q2_K_M RTX 3090 512 pp2048 3185.71 3130.55 0.98
llama 8B Q2_K_M RTX 3090 1024 pp2048 3318.01 3267.26 0.98
llama 8B Q2_K_M RTX 3090 2048 pp2048 3298.97 3236.01 0.98
llama 8B Q2_K_M RTX 4090 16 pp2048 2132.63 2174.54 1.02
llama 8B Q2_K_M RTX 4090 32 pp2048 3637.75 3543.44 0.97
llama 8B Q2_K_M RTX 4090 64 pp2048 5384.80 5421.35 1.01
llama 8B Q2_K_M RTX 4090 128 pp2048 5463.76 5469.76 1.00
llama 8B Q2_K_M RTX 4090 256 pp2048 7514.18 7568.34 1.01
llama 8B Q2_K_M RTX 4090 512 pp2048 9046.31 9100.83 1.01
llama 8B Q2_K_M RTX 4090 1024 pp2048 9150.54 9203.38 1.01
llama 8B Q2_K_M RTX 4090 2048 pp2048 8351.25 8433.37 1.01
llama 8B Q2_K_M P40 16 pp2048 333.96 334.90 1.00
llama 8B Q2_K_M P40 32 pp2048 504.71 505.30 1.00
llama 8B Q2_K_M P40 64 pp2048 572.35 572.35 1.00
llama 8B Q2_K_M P40 128 pp2048 664.58 665.34 1.00
llama 8B Q2_K_M P40 256 pp2048 748.81 749.77 1.00
llama 8B Q2_K_M P40 512 pp2048 788.29 789.55 1.00
llama 8B Q2_K_M P40 1024 pp2048 780.47 781.05 1.00
llama 8B Q2_K_M P40 2048 pp2048 749.82 749.47 1.00
llama 8B Q3_K_S RX 6800 16 pp2048 219.02 219.20 1.00
llama 8B Q3_K_S RX 6800 32 pp2048 294.91 294.50 1.00
llama 8B Q3_K_S RX 6800 64 pp2048 319.29 319.20 1.00
llama 8B Q3_K_S RX 6800 128 pp2048 398.51 397.86 1.00
llama 8B Q3_K_S RX 6800 256 pp2048 473.04 472.50 1.00
llama 8B Q3_K_S RX 6800 512 pp2048 485.44 485.01 1.00
llama 8B Q3_K_S RX 6800 1024 pp2048 532.35 532.02 1.00
llama 8B Q3_K_S RX 6800 2048 pp2048 499.26 498.55 1.00
llama 8B Q3_K_S RTX 3090 16 pp2048 1120.99 1145.06 1.02
llama 8B Q3_K_S RTX 3090 32 pp2048 1821.46 1770.40 0.97
llama 8B Q3_K_S RTX 3090 64 pp2048 2630.97 2595.01 0.99
llama 8B Q3_K_S RTX 3090 128 pp2048 3090.43 3057.20 0.99
llama 8B Q3_K_S RTX 3090 256 pp2048 3528.44 3469.53 0.98
llama 8B Q3_K_S RTX 3090 512 pp2048 3750.13 3699.14 0.99
llama 8B Q3_K_S RTX 3090 1024 pp2048 3796.83 3789.13 1.00
llama 8B Q3_K_S RTX 3090 2048 pp2048 3722.22 3700.16 0.99
llama 8B Q3_K_S RTX 4090 16 pp2048 1915.67 1984.81 1.04
llama 8B Q3_K_S RTX 4090 32 pp2048 3616.89 3406.99 0.94
llama 8B Q3_K_S RTX 4090 64 pp2048 5631.25 5676.50 1.01
llama 8B Q3_K_S RTX 4090 128 pp2048 7662.68 7704.69 1.01
llama 8B Q3_K_S RTX 4090 256 pp2048 9701.83 9817.17 1.01
llama 8B Q3_K_S RTX 4090 512 pp2048 10207.07 10342.54 1.01
llama 8B Q3_K_S RTX 4090 1024 pp2048 10052.96 10195.86 1.01
llama 8B Q3_K_S RTX 4090 2048 pp2048 9032.52 9217.44 1.02
llama 8B Q3_K_S P40 16 pp2048 348.45 350.09 1.00
llama 8B Q3_K_S P40 32 pp2048 491.27 491.32 1.00
llama 8B Q3_K_S P40 64 pp2048 579.54 580.45 1.00
llama 8B Q3_K_S P40 128 pp2048 658.10 659.74 1.00
llama 8B Q3_K_S P40 256 pp2048 736.62 737.28 1.00
llama 8B Q3_K_S P40 512 pp2048 767.50 767.78 1.00
llama 8B Q3_K_S P40 1024 pp2048 757.83 757.35 1.00
llama 8B Q3_K_S P40 2048 pp2048 725.62 726.08 1.00
llama 8B Q4_0 RX 6800 16 pp2048 270.14 270.10 1.00
llama 8B Q4_0 RX 6800 32 pp2048 373.04 373.39 1.00
llama 8B Q4_0 RX 6800 64 pp2048 436.13 436.38 1.00
llama 8B Q4_0 RX 6800 128 pp2048 548.32 548.84 1.00
llama 8B Q4_0 RX 6800 256 pp2048 648.67 648.61 1.00
llama 8B Q4_0 RX 6800 512 pp2048 650.82 651.24 1.00
llama 8B Q4_0 RX 6800 1024 pp2048 733.00 733.43 1.00
llama 8B Q4_0 RX 6800 2048 pp2048 661.55 662.65 1.00
llama 8B Q4_0 RTX 3090 16 pp2048 1274.91 1198.24 0.94
llama 8B Q4_0 RTX 3090 32 pp2048 2061.82 1966.58 0.95
llama 8B Q4_0 RTX 3090 64 pp2048 2759.54 3032.14 1.10
llama 8B Q4_0 RTX 3090 128 pp2048 3581.53 3847.65 1.07
llama 8B Q4_0 RTX 3090 256 pp2048 4116.98 4425.31 1.07
llama 8B Q4_0 RTX 3090 512 pp2048 4252.31 4647.92 1.09
llama 8B Q4_0 RTX 3090 1024 pp2048 4304.76 4723.82 1.10
llama 8B Q4_0 RTX 3090 2048 pp2048 4127.38 4509.07 1.09
llama 8B Q4_0 RTX 4090 16 pp2048 1978.73 1876.44 0.95
llama 8B Q4_0 RTX 4090 32 pp2048 3554.63 3355.43 0.94
llama 8B Q4_0 RTX 4090 64 pp2048 5664.17 5970.01 1.05
llama 8B Q4_0 RTX 4090 128 pp2048 8084.12 8517.50 1.05
llama 8B Q4_0 RTX 4090 256 pp2048 10606.77 11339.83 1.07
llama 8B Q4_0 RTX 4090 512 pp2048 11450.96 12339.56 1.08
llama 8B Q4_0 RTX 4090 1024 pp2048 11159.17 12014.40 1.08
llama 8B Q4_0 RTX 4090 2048 pp2048 10045.09 10706.26 1.07
llama 8B Q4_0 P40 16 pp2048 445.03 445.17 1.00
llama 8B Q4_0 P40 32 pp2048 623.00 622.82 1.00
llama 8B Q4_0 P40 64 pp2048 682.61 686.37 1.01
llama 8B Q4_0 P40 128 pp2048 804.74 806.35 1.00
llama 8B Q4_0 P40 256 pp2048 889.47 891.68 1.00
llama 8B Q4_0 P40 512 pp2048 931.03 928.04 1.00
llama 8B Q4_0 P40 1024 pp2048 910.66 910.83 1.00
llama 8B Q4_0 P40 2048 pp2048 864.26 864.95 1.00
llama 8B Q4_1 RX 6800 16 pp2048 249.70 249.68 1.00
llama 8B Q4_1 RX 6800 32 pp2048 350.53 351.12 1.00
llama 8B Q4_1 RX 6800 64 pp2048 404.63 404.94 1.00
llama 8B Q4_1 RX 6800 128 pp2048 510.45 510.60 1.00
llama 8B Q4_1 RX 6800 256 pp2048 604.85 604.97 1.00
llama 8B Q4_1 RX 6800 512 pp2048 613.09 611.36 1.00
llama 8B Q4_1 RX 6800 1024 pp2048 686.28 686.47 1.00
llama 8B Q4_1 RX 6800 2048 pp2048 625.34 624.51 1.00
llama 8B Q4_1 RTX 3090 16 pp2048 1356.96 1189.59 0.88
llama 8B Q4_1 RTX 3090 32 pp2048 2078.89 2019.15 0.97
llama 8B Q4_1 RTX 3090 64 pp2048 2834.41 2898.86 1.02
llama 8B Q4_1 RTX 3090 128 pp2048 3200.15 3564.47 1.11
llama 8B Q4_1 RTX 3090 256 pp2048 3718.91 4087.10 1.10
llama 8B Q4_1 RTX 3090 512 pp2048 3933.27 4306.84 1.09
llama 8B Q4_1 RTX 3090 1024 pp2048 3994.44 4356.59 1.09
llama 8B Q4_1 RTX 3090 2048 pp2048 3865.66 4228.39 1.09
llama 8B Q4_1 RTX 4090 16 pp2048 1856.51 1794.18 0.97
llama 8B Q4_1 RTX 4090 32 pp2048 3363.47 3367.43 1.00
llama 8B Q4_1 RTX 4090 64 pp2048 5679.60 5787.69 1.02
llama 8B Q4_1 RTX 4090 128 pp2048 7515.86 8199.58 1.09
llama 8B Q4_1 RTX 4090 256 pp2048 9734.11 10777.79 1.11
llama 8B Q4_1 RTX 4090 512 pp2048 10500.00 11575.74 1.10
llama 8B Q4_1 RTX 4090 1024 pp2048 10381.85 11359.30 1.09
llama 8B Q4_1 RTX 4090 2048 pp2048 9315.52 10160.58 1.09
llama 8B Q4_1 P40 16 pp2048 452.04 451.54 1.00
llama 8B Q4_1 P40 32 pp2048 613.36 613.78 1.00
llama 8B Q4_1 P40 64 pp2048 672.80 674.49 1.00
llama 8B Q4_1 P40 128 pp2048 792.70 792.73 1.00
llama 8B Q4_1 P40 256 pp2048 874.01 874.59 1.00
llama 8B Q4_1 P40 512 pp2048 911.85 911.98 1.00
llama 8B Q4_1 P40 1024 pp2048 896.20 897.50 1.00
llama 8B Q4_1 P40 2048 pp2048 851.99 853.72 1.00
llama 8B Q4_K_S RX 6800 16 pp2048 232.12 235.11 1.01
llama 8B Q4_K_S RX 6800 32 pp2048 302.56 300.16 0.99
llama 8B Q4_K_S RX 6800 64 pp2048 338.34 337.82 1.00
llama 8B Q4_K_S RX 6800 128 pp2048 419.28 419.36 1.00
llama 8B Q4_K_S RX 6800 256 pp2048 505.52 505.79 1.00
llama 8B Q4_K_S RX 6800 512 pp2048 519.61 519.68 1.00
llama 8B Q4_K_S RX 6800 1024 pp2048 575.99 575.61 1.00
llama 8B Q4_K_S RX 6800 2048 pp2048 536.95 534.29 1.00
llama 8B Q4_K_S RTX 3090 16 pp2048 1343.17 1207.08 0.90
llama 8B Q4_K_S RTX 3090 32 pp2048 2074.67 2012.79 0.97
llama 8B Q4_K_S RTX 3090 64 pp2048 2793.45 2876.72 1.03
llama 8B Q4_K_S RTX 3090 128 pp2048 3307.29 3477.89 1.05
llama 8B Q4_K_S RTX 3090 256 pp2048 3753.45 3958.73 1.05
llama 8B Q4_K_S RTX 3090 512 pp2048 4008.01 4136.33 1.03
llama 8B Q4_K_S RTX 3090 1024 pp2048 4100.52 4240.41 1.03
llama 8B Q4_K_S RTX 3090 2048 pp2048 3983.15 4147.41 1.04
llama 8B Q4_K_S RTX 4090 16 pp2048 2034.63 1929.39 0.95
llama 8B Q4_K_S RTX 4090 32 pp2048 3635.47 3616.03 0.99
llama 8B Q4_K_S RTX 4090 64 pp2048 5903.51 6054.70 1.03
llama 8B Q4_K_S RTX 4090 128 pp2048 7881.96 8306.72 1.05
llama 8B Q4_K_S RTX 4090 256 pp2048 10203.10 10846.14 1.06
llama 8B Q4_K_S RTX 4090 512 pp2048 11018.34 11640.43 1.06
llama 8B Q4_K_S RTX 4090 1024 pp2048 10919.67 11394.61 1.04
llama 8B Q4_K_S RTX 4090 2048 pp2048 9948.30 10275.31 1.03
llama 8B Q4_K_S P40 16 pp2048 406.86 422.15 1.04
llama 8B Q4_K_S P40 32 pp2048 512.77 511.76 1.00
llama 8B Q4_K_S P40 64 pp2048 626.26 634.30 1.01
llama 8B Q4_K_S P40 128 pp2048 736.58 741.40 1.01
llama 8B Q4_K_S P40 256 pp2048 820.88 828.14 1.01
llama 8B Q4_K_S P40 512 pp2048 860.73 866.48 1.01
llama 8B Q4_K_S P40 1024 pp2048 851.24 856.98 1.01
llama 8B Q4_K_S P40 2048 pp2048 812.18 817.24 1.01
llama 8B Q5_0 RX 6800 16 pp2048 226.68 226.90 1.00
llama 8B Q5_0 RX 6800 32 pp2048 335.70 335.86 1.00
llama 8B Q5_0 RX 6800 64 pp2048 408.24 408.49 1.00
llama 8B Q5_0 RX 6800 128 pp2048 511.78 514.20 1.00
llama 8B Q5_0 RX 6800 256 pp2048 600.37 603.35 1.00
llama 8B Q5_0 RX 6800 512 pp2048 607.64 608.39 1.00
llama 8B Q5_0 RX 6800 1024 pp2048 680.12 680.72 1.00
llama 8B Q5_0 RX 6800 2048 pp2048 619.98 618.63 1.00
llama 8B Q5_0 RTX 3090 16 pp2048 1003.94 991.73 0.99
llama 8B Q5_0 RTX 3090 32 pp2048 1803.58 1795.18 1.00
llama 8B Q5_0 RTX 3090 64 pp2048 2805.45 2760.57 0.98
llama 8B Q5_0 RTX 3090 128 pp2048 3628.44 3539.29 0.98
llama 8B Q5_0 RTX 3090 256 pp2048 4144.18 4022.42 0.97
llama 8B Q5_0 RTX 3090 512 pp2048 4340.63 4209.74 0.97
llama 8B Q5_0 RTX 3090 1024 pp2048 4378.33 4285.99 0.98
llama 8B Q5_0 RTX 3090 2048 pp2048 4244.52 4165.29 0.98
llama 8B Q5_0 RTX 4090 16 pp2048 1628.64 1636.05 1.00
llama 8B Q5_0 RTX 4090 32 pp2048 3097.73 3108.27 1.00
llama 8B Q5_0 RTX 4090 64 pp2048 5350.79 5340.18 1.00
llama 8B Q5_0 RTX 4090 128 pp2048 7948.41 7952.44 1.00
llama 8B Q5_0 RTX 4090 256 pp2048 10636.91 10628.95 1.00
llama 8B Q5_0 RTX 4090 512 pp2048 11566.76 11533.54 1.00
llama 8B Q5_0 RTX 4090 1024 pp2048 11408.82 11393.14 1.00
llama 8B Q5_0 RTX 4090 2048 pp2048 10291.18 10284.40 1.00
llama 8B Q5_0 P40 16 pp2048 360.98 360.80 1.00
llama 8B Q5_0 P40 32 pp2048 534.01 534.56 1.00
llama 8B Q5_0 P40 64 pp2048 630.21 630.77 1.00
llama 8B Q5_0 P40 128 pp2048 732.21 731.99 1.00
llama 8B Q5_0 P40 256 pp2048 815.57 816.01 1.00
llama 8B Q5_0 P40 512 pp2048 853.57 854.14 1.00
llama 8B Q5_0 P40 1024 pp2048 841.69 844.12 1.00
llama 8B Q5_0 P40 2048 pp2048 802.68 804.13 1.00
llama 8B Q5_1 RX 6800 16 pp2048 223.95 224.19 1.00
llama 8B Q5_1 RX 6800 32 pp2048 323.90 323.99 1.00
llama 8B Q5_1 RX 6800 64 pp2048 396.55 395.99 1.00
llama 8B Q5_1 RX 6800 128 pp2048 501.64 501.07 1.00
llama 8B Q5_1 RX 6800 256 pp2048 593.43 593.05 1.00
llama 8B Q5_1 RX 6800 512 pp2048 600.86 602.06 1.00
llama 8B Q5_1 RX 6800 1024 pp2048 674.11 672.95 1.00
llama 8B Q5_1 RX 6800 2048 pp2048 615.90 617.75 1.00
llama 8B Q5_1 RTX 3090 16 pp2048 1064.64 1054.56 0.99
llama 8B Q5_1 RTX 3090 32 pp2048 1695.98 1862.09 1.10
llama 8B Q5_1 RTX 3090 64 pp2048 2615.37 2623.64 1.00
llama 8B Q5_1 RTX 3090 128 pp2048 2993.66 3331.39 1.11
llama 8B Q5_1 RTX 3090 256 pp2048 3469.54 3840.01 1.11
llama 8B Q5_1 RTX 3090 512 pp2048 3704.84 4084.30 1.10
llama 8B Q5_1 RTX 3090 1024 pp2048 3807.73 4188.58 1.10
llama 8B Q5_1 RTX 3090 2048 pp2048 3699.04 4057.33 1.10
llama 8B Q5_1 RTX 4090 16 pp2048 1598.73 1623.80 1.02
llama 8B Q5_1 RTX 4090 32 pp2048 2852.82 3086.21 1.08
llama 8B Q5_1 RTX 4090 64 pp2048 5268.74 5329.76 1.01
llama 8B Q5_1 RTX 4090 128 pp2048 7118.41 7879.82 1.11
llama 8B Q5_1 RTX 4090 256 pp2048 9170.04 10197.00 1.11
llama 8B Q5_1 RTX 4090 512 pp2048 10070.74 11135.93 1.11
llama 8B Q5_1 RTX 4090 1024 pp2048 10107.26 11048.77 1.09
llama 8B Q5_1 RTX 4090 2048 pp2048 9041.00 10022.34 1.11
llama 8B Q5_1 P40 16 pp2048 380.57 380.46 1.00
llama 8B Q5_1 P40 32 pp2048 560.59 560.44 1.00
llama 8B Q5_1 P40 64 pp2048 634.22 634.61 1.00
llama 8B Q5_1 P40 128 pp2048 735.78 735.44 1.00
llama 8B Q5_1 P40 256 pp2048 820.80 821.21 1.00
llama 8B Q5_1 P40 512 pp2048 857.00 858.50 1.00
llama 8B Q5_1 P40 1024 pp2048 845.17 846.75 1.00
llama 8B Q5_1 P40 2048 pp2048 806.42 806.36 1.00
llama 8B Q5_K_S RX 6800 16 pp2048 222.40 222.28 1.00
llama 8B Q5_K_S RX 6800 32 pp2048 306.34 304.92 1.00
llama 8B Q5_K_S RX 6800 64 pp2048 335.10 334.38 1.00
llama 8B Q5_K_S RX 6800 128 pp2048 412.64 411.12 1.00
llama 8B Q5_K_S RX 6800 256 pp2048 500.54 497.92 0.99
llama 8B Q5_K_S RX 6800 512 pp2048 514.36 512.32 1.00
llama 8B Q5_K_S RX 6800 1024 pp2048 570.09 567.52 1.00
llama 8B Q5_K_S RX 6800 2048 pp2048 531.02 528.01 0.99
llama 8B Q5_K_S RTX 3090 16 pp2048 1126.72 1153.65 1.02
llama 8B Q5_K_S RTX 3090 32 pp2048 1884.40 1921.77 1.02
llama 8B Q5_K_S RTX 3090 64 pp2048 2624.13 2738.05 1.04
llama 8B Q5_K_S RTX 3090 128 pp2048 3153.56 3375.75 1.07
llama 8B Q5_K_S RTX 3090 256 pp2048 3628.55 3880.81 1.07
llama 8B Q5_K_S RTX 3090 512 pp2048 3867.52 4088.63 1.06
llama 8B Q5_K_S RTX 3090 1024 pp2048 3973.44 4123.42 1.04
llama 8B Q5_K_S RTX 3090 2048 pp2048 3900.00 4038.70 1.04
llama 8B Q5_K_S RTX 4090 16 pp2048 1751.49 1763.33 1.01
llama 8B Q5_K_S RTX 4090 32 pp2048 3265.15 3264.97 1.00
llama 8B Q5_K_S RTX 4090 64 pp2048 5516.09 5619.49 1.02
llama 8B Q5_K_S RTX 4090 128 pp2048 7639.86 7994.63 1.05
llama 8B Q5_K_S RTX 4090 256 pp2048 9777.04 10356.43 1.06
llama 8B Q5_K_S RTX 4090 512 pp2048 10626.56 11217.82 1.06
llama 8B Q5_K_S RTX 4090 1024 pp2048 10590.88 11099.40 1.05
llama 8B Q5_K_S RTX 4090 2048 pp2048 9626.56 10062.93 1.05
llama 8B Q5_K_S P40 16 pp2048 362.35 354.90 0.98
llama 8B Q5_K_S P40 32 pp2048 470.32 467.13 0.99
llama 8B Q5_K_S P40 64 pp2048 610.47 607.58 1.00
llama 8B Q5_K_S P40 128 pp2048 714.37 709.86 0.99
llama 8B Q5_K_S P40 256 pp2048 787.10 785.50 1.00
llama 8B Q5_K_S P40 512 pp2048 823.34 821.45 1.00
llama 8B Q5_K_S P40 1024 pp2048 815.69 811.96 1.00
llama 8B Q5_K_S P40 2048 pp2048 779.32 778.34 1.00
llama 8B Q6_K RX 6800 16 pp2048 213.90 213.82 1.00
llama 8B Q6_K RX 6800 32 pp2048 277.78 277.62 1.00
llama 8B Q6_K RX 6800 64 pp2048 301.90 297.74 0.99
llama 8B Q6_K RX 6800 128 pp2048 372.97 367.26 0.98
llama 8B Q6_K RX 6800 256 pp2048 448.66 441.62 0.98
llama 8B Q6_K RX 6800 512 pp2048 464.03 457.99 0.99
llama 8B Q6_K RX 6800 1024 pp2048 509.01 501.30 0.98
llama 8B Q6_K RX 6800 2048 pp2048 477.33 471.29 0.99
llama 8B Q6_K RTX 3090 16 pp2048 988.81 983.49 0.99
llama 8B Q6_K RTX 3090 32 pp2048 1700.68 1697.83 1.00
llama 8B Q6_K RTX 3090 64 pp2048 2486.92 2507.75 1.01
llama 8B Q6_K RTX 3090 128 pp2048 3122.39 3105.40 0.99
llama 8B Q6_K RTX 3090 256 pp2048 3584.33 3589.22 1.00
llama 8B Q6_K RTX 3090 512 pp2048 3765.52 3724.15 0.99
llama 8B Q6_K RTX 3090 1024 pp2048 3851.45 3787.85 0.98
llama 8B Q6_K RTX 3090 2048 pp2048 3742.37 3709.72 0.99
llama 8B Q6_K RTX 4090 16 pp2048 1481.40 1476.74 1.00
llama 8B Q6_K RTX 4090 32 pp2048 2849.10 2837.26 1.00
llama 8B Q6_K RTX 4090 64 pp2048 4580.15 4813.37 1.05
llama 8B Q6_K RTX 4090 128 pp2048 7194.88 7194.80 1.00
llama 8B Q6_K RTX 4090 256 pp2048 9333.66 9336.11 1.00
llama 8B Q6_K RTX 4090 512 pp2048 10180.20 10198.62 1.00
llama 8B Q6_K RTX 4090 1024 pp2048 10101.86 10112.81 1.00
llama 8B Q6_K RTX 4090 2048 pp2048 9107.77 9114.95 1.00
llama 8B Q6_K P40 16 pp2048 296.65 339.03 1.14
llama 8B Q6_K P40 32 pp2048 466.47 465.42 1.00
llama 8B Q6_K P40 64 pp2048 591.20 588.15 0.99
llama 8B Q6_K P40 128 pp2048 694.93 688.96 0.99
llama 8B Q6_K P40 256 pp2048 767.21 759.30 0.99
llama 8B Q6_K P40 512 pp2048 796.60 790.63 0.99
llama 8B Q6_K P40 1024 pp2048 780.43 775.64 0.99
llama 8B Q6_K P40 2048 pp2048 737.30 732.64 0.99
llama 8B Q8_0 RX 6800 16 pp2048 248.94 248.89 1.00
llama 8B Q8_0 RX 6800 32 pp2048 352.44 352.37 1.00
llama 8B Q8_0 RX 6800 64 pp2048 436.19 436.36 1.00
llama 8B Q8_0 RX 6800 128 pp2048 549.32 548.83 1.00
llama 8B Q8_0 RX 6800 256 pp2048 651.72 650.96 1.00
llama 8B Q8_0 RX 6800 512 pp2048 658.94 657.47 1.00
llama 8B Q8_0 RX 6800 1024 pp2048 741.38 741.03 1.00
llama 8B Q8_0 RX 6800 2048 pp2048 672.28 671.85 1.00
llama 8B Q8_0 RTX 3090 16 pp2048 936.41 931.48 0.99
llama 8B Q8_0 RTX 3090 32 pp2048 1748.71 1744.51 1.00
llama 8B Q8_0 RTX 3090 64 pp2048 2784.09 2763.87 0.99
llama 8B Q8_0 RTX 3090 128 pp2048 3582.49 3596.13 1.00
llama 8B Q8_0 RTX 3090 256 pp2048 4203.38 4190.36 1.00
llama 8B Q8_0 RTX 3090 512 pp2048 4458.95 4404.63 0.99
llama 8B Q8_0 RTX 3090 1024 pp2048 4557.07 4526.82 0.99
llama 8B Q8_0 RTX 3090 2048 pp2048 4374.10 4356.00 1.00
llama 8B Q8_0 RTX 4090 16 pp2048 1363.12 1363.64 1.00
llama 8B Q8_0 RTX 4090 32 pp2048 2534.92 2534.32 1.00
llama 8B Q8_0 RTX 4090 64 pp2048 4456.73 4461.25 1.00
llama 8B Q8_0 RTX 4090 128 pp2048 6995.94 6999.66 1.00
llama 8B Q8_0 RTX 4090 256 pp2048 10512.92 10503.21 1.00
llama 8B Q8_0 RTX 4090 512 pp2048 11970.30 11965.43 1.00
llama 8B Q8_0 RTX 4090 1024 pp2048 11802.13 11826.15 1.00
llama 8B Q8_0 RTX 4090 2048 pp2048 10575.53 10690.68 1.01
llama 8B Q8_0 P40 16 pp2048 352.76 352.75 1.00
llama 8B Q8_0 P40 32 pp2048 559.77 559.66 1.00
llama 8B Q8_0 P40 64 pp2048 628.48 628.44 1.00
llama 8B Q8_0 P40 128 pp2048 750.48 750.42 1.00
llama 8B Q8_0 P40 256 pp2048 846.09 846.70 1.00
llama 8B Q8_0 P40 512 pp2048 889.12 889.82 1.00
llama 8B Q8_0 P40 1024 pp2048 874.63 875.86 1.00
llama 8B Q8_0 P40 2048 pp2048 835.58 835.01 1.00

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and python (python script changes) labels on Jul 15, 2024
@JohannesGaessler
Collaborator Author

GitHub doesn't let you create an OP with >= 65536 characters, but unless I'm hitting exactly that number there seems to be no such limit for comments. And if you hit the limit, the site just swallows your post and you get to write it a second time. Good design.

@JohannesGaessler JohannesGaessler added the Review Complexity : High (Generally require in-depth knowledge of LLMs or GPUs) label on Jul 15, 2024
@JohannesGaessler
Collaborator Author

One of the CI builds fails with

D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\template-instances\../mmq.cuh(154): fatal error C1060: compiler is out of heap space [D:\a\llama.cpp\llama.cpp\build\ggml\src\ggml.vcxproj]

I don't see why this PR would increase the amount of memory used per compilation job, so my assumption is that the problem has to do with the total number of compilation jobs increasing and the machine not having enough memory to run all of them in parallel.

@Nexesenex
Contributor

Nexesenex commented Jul 15, 2024

> One of the CI builds fails with
>
> D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\template-instances\../mmq.cuh(154): fatal error C1060: compiler is out of heap space [D:\a\llama.cpp\llama.cpp\build\ggml\src\ggml.vcxproj]
>
> I don't see why this PR would increase the amount of memory used per compilation job, so my assumption is that the problem has to do with the total number of compilation jobs increasing and the machine not having enough memory to run all of them in parallel.

I noticed it on GitHub Actions.
With a local CMake build in VS, it compiles properly.

Congrats on the MMQ performance boost for IQ Quants, it's bestial!

@JohannesGaessler
Collaborator Author

On my desktop machine with up to 32 parallel jobs I don't see a noticeable difference between master and this PR when just manually watching the memory use during the compilation with GGML_NO_CCACHE.

@Green-Sky
Collaborator

Green-Sky commented Jul 16, 2024

Seems like the CI flaked; I re-ran it manually and it just passed. I also noticed that the CUDA setup took way longer before.

@slaren
Member

slaren commented Jul 17, 2024

I think there are issues with iq4_nl with some row sizes. It doesn't fail with multiples of 256, but it does with some multiples of 32.

  MUL_MAT(type_a=iq4_nl,type_b=f32,m=50,n=23,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 7061.285508280 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=10,n=87,k=480,bs=[1,1],nr=[1,1]): [MUL_MAT] NaN at index 860 (CUDA0=4.644990 CPU=-nan) FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=36,n=119,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 6240433279092.440429688 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=54,n=76,k=256,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=121,n=43,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 2377551.812560449 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=26,n=52,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 106779.003917511 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=45,n=67,k=416,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 2251481308242.827636719 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=66,n=27,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 125.033319980 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=58,n=127,k=128,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=107,n=26,k=64,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=77,n=35,k=448,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=90,n=32,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 708726342.766369820 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=66,n=98,k=384,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=109,n=94,k=256,bs=[1,1],nr=[1,1]): OK
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=9,n=77,k=224,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 161051.221285613 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=67,n=66,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 3951379.951192479 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=78,n=30,k=288,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 56198.323614190 > 0.000500000 FAIL
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=88,n=58,k=448,bs=[1,1],nr=[1,1]): OK

@JohannesGaessler
Collaborator Author

Is this already happening on master?

@oldgithubman

> For iq1_m there is no support, and frankly I don't think it would be worthwhile to add since the quality degradation of that format is very high.

With Llama 3 405B imminent, I disagree.

@JohannesGaessler
Collaborator Author

I am not able to reproduce the iq4_nl issue by manually editing tests/test-backend-ops.cpp.

@slaren
Member

slaren commented Jul 17, 2024

I used this code to generate random cases. It fails very frequently with iq4_nl.

    for (int i = 0; i < 1000; i++)
    for (ggml_type type_a : all_types) {
        for (ggml_type type_b : {GGML_TYPE_F32}) {
            // m = a rows
            // n = b rows
            // k = cols
            std::uniform_int_distribution<> dist_m(1, 128);
            std::uniform_int_distribution<> dist_n(16, 128);
            std::uniform_int_distribution<> dist_k(1, 16);
            int m = dist_m(rng);
            int n = dist_n(rng);
            int k = dist_k(rng) * ggml_blck_size(type_a);
            test_cases.emplace_back(new test_mul_mat(type_a, type_b, m, n, k, { 1,  1}, {1, 1}));
        }
    }

@JohannesGaessler
Collaborator Author

With the provided code the tests already fail on master, and notably also on master with GGML_CUDA_FORCE_CUBLAS, so I think the issue is unrelated to my MMQ changes.

@slaren
Member

slaren commented Jul 17, 2024

With the reference C implementation of ggml_vec_dot_iq4_nl_q8_0 the test passes, so it is likely that the AVX and AVX2 implementations of iq4_nl are broken. Maybe it is because they process two blocks per iteration but don't check that the number of blocks is even.
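To illustrate the failure pattern being described (a toy C++ sketch, not the actual ggml SIMD code; the fix referenced later in this thread is #8549): if the main loop consumes two blocks per iteration and there is no scalar tail, an odd trailing block is silently dropped. That is consistent with the cases above: the failing k values (96, 224, 288, 352, 416, 480) all correspond to an odd number of 32-wide blocks, while the passing ones (64, 128, 256, 384, 448) are even.

    #include <cstdio>
    #include <vector>

    // Toy stand-in for a 32-wide quantized block contributing to a dot product.
    constexpr int BLOCK = 32;

    // Stand-in for the AVX/AVX2 path: consumes TWO blocks per call.
    static float dot_two_blocks(const float * x, const float * y) {
        float s = 0.0f;
        for (int i = 0; i < 2*BLOCK; ++i) s += x[i]*y[i];
        return s;
    }

    // Stand-in for the reference C path: consumes ONE block per call.
    static float dot_one_block(const float * x, const float * y) {
        float s = 0.0f;
        for (int i = 0; i < BLOCK; ++i) s += x[i]*y[i];
        return s;
    }

    // The suspected bug shape: a main loop that strides by two blocks needs a
    // scalar tail, otherwise an odd trailing block is dropped (or read OOB).
    static float vec_dot(int nb, const float * x, const float * y) {
        float sum = 0.0f;
        int ib = 0;
        for (; ib + 1 < nb; ib += 2) {
            sum += dot_two_blocks(x + ib*BLOCK, y + ib*BLOCK);
        }
        if (ib < nb) { // tail for the odd trailing block
            sum += dot_one_block(x + ib*BLOCK, y + ib*BLOCK);
        }
        return sum;
    }

    int main() {
        const int nb = 7; // 7 blocks of 32 -> k = 224, one of the failing sizes
        std::vector<float> x(nb*BLOCK, 1.0f), y(nb*BLOCK, 1.0f);
        std::printf("%.0f\n", vec_dot(nb, x.data(), y.data())); // 224 with the tail, 192 without it
    }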

@slaren
Member

slaren commented Jul 17, 2024

The editorconfig check should be fixed with a rebase to master.

@slaren
Member

slaren commented Jul 17, 2024

I have not tested with the changes in this PR, but I left the random tests running for a while with the fix in #8549 and found that the tests often fail with m=1 and n=1.

  MUL_MAT(type_a=q4_0,type_b=f32,m=1,n=1,k=96,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.060065945 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=1,n=1,k=352,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.039411830 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=1,n=1,k=192,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.001153448 > 0.000500000 FAIL
  MUL_MAT(type_a=q3_K,type_b=f32,m=1,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.010318764 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_K,type_b=f32,m=1,n=1,k=2816,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.008912032 > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=1,n=1,k=1280,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.009120895 > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=1,n=1,k=1024,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.001211336 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_1,type_b=f32,m=1,n=1,k=512,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.000830595 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_K,type_b=f32,m=1,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.001113781 > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=1,n=1,k=3840,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.000900035 > 0.000500000 FAIL
  MUL_MAT(type_a=q3_K,type_b=f32,m=1,n=1,k=1280,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.006345874 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_K,type_b=f32,m=1,n=1,k=3584,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.008740501 > 0.000500000 FAIL
  MUL_MAT(type_a=q6_K,type_b=f32,m=1,n=1,k=2816,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.001654993 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=1,n=1,k=64,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.005168594 > 0.000500000 FAIL
  MUL_MAT(type_a=q6_K,type_b=f32,m=1,n=1,k=2304,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.002289831 > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=1,n=1,k=3584,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.004978606 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=1,n=1,k=1280,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.005011020 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_1,type_b=f32,m=1,n=1,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.007109389 > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=1,n=1,k=2816,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.740677190 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_0,type_b=f32,m=1,n=1,k=160,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.887771174 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_0,type_b=f32,m=1,n=1,k=416,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 0.002317706 > 0.000500000 FAIL

@Green-Sky
Collaborator

Green-Sky commented Jul 18, 2024

Looks like the CI build failures will be a problem. We are clearly running out of memory when MSVC is combined with this amount of templatisation.
Either we reduce the number of parallel build jobs or the number of template parameters (bundling them in a struct might work around it), we self-host runners with more memory, or we optimistically rerun the build step single-threaded after it fails.
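To make the "bundle them in a struct" idea a bit more concrete, a hedged C++ sketch (hypothetical names and made-up values, not the actual mmq.cuh code): instead of instantiating the kernel template over several independent non-type parameters, the compile-time constants are grouped into one config type, so each configuration is named by a single template argument. Whether this actually reduces MSVC's heap usage would have to be measured.

    // Before: every combination of independent non-type parameters is its own
    // instantiation, and each added parameter multiplies the symbols the
    // compiler has to keep in memory.
    template <int mmq_x, int mmq_y, int nwarps, bool need_check>
    void mul_mat_q_before(const void * vx, const void * vy, float * dst, int ne);

    // After: the constants are bundled into a config type; a configuration is a
    // single template argument.
    struct mmq_config_q4_0 {
        static constexpr int  mmq_x      = 64;   // made-up values
        static constexpr int  mmq_y      = 128;
        static constexpr int  nwarps     = 8;
        static constexpr bool need_check = false;
    };

    template <typename cfg>
    void mul_mat_q_after(const void * vx, const void * vy, float * dst, int ne) {
        // cfg::mmq_x, cfg::mmq_y, ... remain compile-time constants here.
        (void) vx; (void) vy; (void) dst; (void) ne;
    }

    // usage: mul_mat_q_after<mmq_config_q4_0>(vx, vy, dst, ne);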

@JohannesGaessler
Collaborator Author

I can reproduce the tests sometimes failing with m=n=1. However, those test cases are definitely unrelated to any MMQ changes because MMQ is only used for batch sizes > 8.

What I think is happening: the CPU and GPU implementations are not 100% identical. Therefore you will get differences between the two sets of values that are compared via NMSE. If you assume that the difference per operation follows a Gaussian distribution, you can expect the NMSE to scale with 1.0/sqrt(m*n*k). So if you compare the NMSE against a fixed target of 5e-4, then m=n=1 will be much more likely to fail than e.g. m=n=128, where the differences are more likely to cancel each other out. If I modify max_nmse_err to scale with 1.0/sqrt(m*n*k), the random test failures seem to depend a lot less on the matrix dimensions.
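A minimal C++ sketch of what that scaling could look like (illustrative only: the NMSE definition below is a common one and the reference point is an assumption, not the exact code in tests/test-backend-ops.cpp):

    #include <cmath>
    #include <cstdint>

    // One common definition of normalized mean squared error between the
    // outputs of two backends: sum of squared differences over the squared
    // reference norm.
    static double nmse(const float * a, const float * b, int64_t n) {
        double err = 0.0, ref = 0.0;
        for (int64_t i = 0; i < n; ++i) {
            const double d = (double) a[i] - (double) b[i];
            err += d * d;
            ref += (double) b[i] * (double) b[i];
        }
        return ref > 0.0 ? err / ref : 0.0;
    }

    // Dimension-aware tolerance: if the per-operation differences are roughly
    // Gaussian, the expected NMSE scales like 1/sqrt(m*n*k), so the threshold
    // is scaled the same way instead of staying at a fixed 5e-4. The reference
    // point m = n = k = 256 is made up for illustration.
    static double max_nmse_err_scaled(int64_t m, int64_t n, int64_t k) {
        const double base    = 5e-4;
        const double ref_mnk = 256.0 * 256.0 * 256.0;
        return base * std::sqrt(ref_mnk / (double) (m * n * k));
    }

With a threshold like this, the m=n=1 cases get a proportionally looser bound, which matches the observation that the failures then stop depending strongly on the matrix dimensions.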

@JohannesGaessler JohannesGaessler force-pushed the cuda-mmq-deduplicate-4 branch from 5b17b99 to f0f71a5 Compare July 20, 2024 06:22
@JohannesGaessler
Collaborator Author

What do we do about the CI failing due to running out of memory? I am not familiar with the setup at all so I don't know how to fix it.

@Green-Sky
Collaborator

Green-Sky commented Jul 20, 2024

> What do we do about the CI failing due to running out of memory? I am not familiar with the setup at all so I don't know how to fix it.

Whatever I proposed. Additionally, I thought of a more sophisticated variant of "lower the number of parallel build jobs": we inject a lock that allows only one template-instance compilation at a time, using https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_COMPILER_LAUNCHER.html. We would detect the affected translation units by file name or something and use a temporary file as the lock. But that is about as far as build engineering goes.

Another thing would be to experiment with precompiled headers and see how far that gets us.

If you want something simple but slow, we can first invoke cmake to build only the ggml target single-threaded and then the rest (the all target) as usual. But this is a pain and will take ages to compile, since ggml is a large target in terms of translation units.

          cmake .. -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=ON
+         cmake --build . --config Release -j 1 -t ggml
          cmake --build . --config Release -j ${env:NUMBER_OF_PROCESSORS}

@ggerganov
Member

Apart from @Green-Sky's suggestions, maybe we can also try to reduce the parallel jobs by just 1 and see if it fits?

cmake --build . --config Release -j $((${env:NUMBER_OF_PROCESSORS}-1))

But we do need some longer-term solution; I'm not sure which of the options discussed so far is best.

@Green-Sky
Collaborator

Green-Sky commented Jul 20, 2024

Looks like "one less" did not cut it. Try my staged-build suggestion: build ggml with one job and the rest with the number of cores.

edit: or did it not update yet?

@ggerganov
Member

@github-actions github-actions bot added the devops (improvements to build systems and github actions) label on Jul 20, 2024
@Green-Sky
Collaborator

Green-Sky commented Jul 20, 2024

Looks like it passed. I'm rerunning one of them to get more samples, since rerunning did work before without modifications too.

update: successful again, and it takes "only" 5-10 min longer.

@JohannesGaessler
Collaborator Author

I will merge this PR in a few hours unless someone has an issue with it.

@slaren
Member

slaren commented Jul 20, 2024

> What I think is happening: the CPU and GPU implementations are not 100% identical.

I agree that this is likely the case. I ran some tests comparing the CPU, BLAS, dmmv, cuBLAS, and mmvq implementations. Basically:

* BLAS compared to dmmv or FP32 cuBLAS: zero error
* BLAS compared to mmvq or the CPU: some error with outliers
* CPU compared to mmvq: some error with outliers

By outliers here I mean some cases where the error is very high. This can be explained by the difference in the quantization format of src1 between the CPU and mmvq. Increasing m or n reduces the effect of outliers and produces more stable results. I would expect the error to disappear almost entirely if the CPU backend and mmvq quantized src1 to the same format.

@ggerganov
Member

ggerganov commented Jul 20, 2024

We might still want to do what @Green-Sky suggested, combined with the "one less" approach:

          cmake --build . --config Release -j $((${env:NUMBER_OF_PROCESSORS} - 1)) -t ggml
          cmake --build . --config Release -j ${env:NUMBER_OF_PROCESSORS}

Though I don't expect it to make a significant difference

@JohannesGaessler JohannesGaessler merged commit 69c487f into ggml-org:master Jul 20, 2024
54 checks passed
@JohannesGaessler
Collaborator Author

Sorry, I'm too tired right now to wait for another round of CI; let's do the fixup in a separate PR.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* CUDA: MMQ code deduplication + iquant support

* 1 less parallel job for CI build