
add basic tensor data validation function #6884

Merged: 8 commits merged into master from sl/check-tensor on Apr 26, 2024
Conversation

slaren (Collaborator) commented Apr 24, 2024

Adds ggml_validate_row_data to validate tensor data, and adds this validation to llama.cpp during model loading. For floating point tensors it checks all the data; for quant types it checks only the scales. The validation consists of checking for nan and inf values in the tensors.

Should help detect issues with models such as the one reported in #6841.

Hopefully it won't increase load time too much, so that it can be left enabled permanently, but more testing is necessary.
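
As a rough illustration of the check described above, a minimal sketch for plain FP32 rows might look like the following (`validate_f32_row` is a hypothetical name, not the actual ggml code; the real ggml_validate_row_data also handles FP16 and the per-block scales of quantized types):

```cpp
#include <cmath>
#include <cstddef>

// Scan a row of FP32 values and reject any nan/inf.
static bool validate_f32_row(const float * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (std::isnan(data[i]) || std::isinf(data[i])) {
            return false; // corrupted value found
        }
    }
    return true;
}
```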

github-actions bot (Contributor) commented Apr 25, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 218 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=22179.75ms p(95)=40557.1ms fails=, finish reason: stop=101 truncated=117
  • Prompt processing (pp): avg=255.47tk/s p(95)=751.5tk/s
  • Token generation (tg): avg=19.05tk/s p(95)=25.08tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sl/check-tensor commit=8ddd0228ff6c79db0580978d7559c356065f44b6

[chart] llamacpp:prompt_tokens_seconds — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 218 iterations

[chart] llamacpp:predicted_tokens_seconds — same run

[chart] llamacpp:kv_cache_usage_ratio — same run

[chart] llamacpp:requests_processing — same run

ggerganov (Owner) commented
I'm testing on M2 Ultra and the 70B model load times increase noticeably:

  • F16: 6s master -> 50s PR
  • Q8_0: 3s master -> 10s PR

There is no ARM SIMD optimization in the F16 branch, so adding one could help, but I'm also not sure whether we want to implement it for all instruction sets.

Maybe this validation should be opt-in? We could probably run it every time in the quantize tool to prevent the creation of corrupted models in the first place, but have it disabled by default when doing inference.

slaren (Collaborator, Author) commented Apr 25, 2024

It is not as bad for me on x86: with the pure CPU backend it doubles the load time, but the load time with mmap is always very fast, and when offloading the overhead becomes much less significant. The AVX2 implementation helps a lot; it makes checking FP16 models about as fast as quant models. Still, I agree that it is too slow to leave enabled unconditionally, so I will make it optional everywhere except in quantize.
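
One reason the SIMD versions can be so fast is that nan/inf detection on FP16 needs no conversion to float: an IEEE half-precision value is nan or inf exactly when its exponent bits (mask 0x7C00) are all set. A scalar sketch of that bit trick (illustrative only, not the llama.cpp implementation; the AVX2/NEON paths apply the same mask-and-compare to many values per instruction):

```cpp
#include <cstddef>
#include <cstdint>

// Reject any FP16 value whose exponent bits are all set (nan or inf).
static bool validate_f16_row(const uint16_t * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if ((data[i] & 0x7C00) == 0x7C00) {
            return false; // nan or inf
        }
    }
    return true;
}
```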

Commit: add --check-tensors command line argument

    tensor validation is disabled by default and can be enabled by adding
    `--check-tensors` to the command line arguments.

    quantize always validates tensors.
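
For example, an illustrative invocation with validation enabled (the flag name comes from the commit above; the binary and model names are placeholders): `./main -m model.gguf --check-tensors`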
Nekotekina (Contributor) commented

Hmm, quantize won't validate the tensors that are being written?

slaren (Collaborator, Author) commented Apr 26, 2024

This should be a lot faster now in most cases. There is a NEON implementation for FP16 validation, and the model validation is multithreaded with std::async when possible. Still, doing nothing is a lot faster than doing anything at all, so with mmap for CPU and Metal it can still double the load times.

If this proves to be useful, it may be good to enable it by default in the server: since the server is not a process that is meant to be restarted very often, the increase in load time should be less important.
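
A rough sketch of the std::async approach described above (the chunking scheme and helper names are illustrative assumptions, not the actual llama.cpp code):

```cpp
#include <cmath>
#include <cstddef>
#include <future>
#include <vector>

// Validate one chunk of FP32 data: reject any nan/inf.
static bool validate_chunk(const float * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (std::isnan(data[i]) || std::isinf(data[i])) {
            return false;
        }
    }
    return true;
}

// Split the data into n_tasks chunks and validate them concurrently.
static bool validate_parallel(const float * data, size_t n, size_t n_tasks) {
    std::vector<std::future<bool>> futures;
    const size_t chunk = (n + n_tasks - 1) / n_tasks;
    for (size_t begin = 0; begin < n; begin += chunk) {
        const size_t len = begin + chunk <= n ? chunk : n - begin;
        futures.push_back(std::async(std::launch::async, validate_chunk, data + begin, len));
    }
    bool ok = true;
    for (auto & f : futures) {
        ok = f.get() && ok; // wait on every future so none is left dangling
    }
    return ok;
}
```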

ggerganov (Owner) left a comment

The 70B F16 case is now down to 20s (from 50s).

Yes, we can add it to the server. Also, let's watch the ggml-ci time: we do a lot of quantizations there, so this might lead to a significant increase. If it does, we might want to add an option to disable validation during quantize in the CI, since the model data there is not changing.

slaren (Collaborator, Author) commented Apr 26, 2024

It doesn't seem to affect the CI times significantly; in fact, the cuda-v100 time was a bit lower than the latest time on master, so I think it is within the normal variation between runs.

slaren merged commit 017e699 into master on Apr 26, 2024 (64 of 67 checks passed).
slaren deleted the sl/check-tensor branch on April 26, 2024 at 16:40.
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request on May 5, 2024:
* add basic tensor data validation function

* add --check-tensors command line argument

tensor validation is disabled by default and can be enabled by adding
`--check-tensors` to the command line arguments.

quantize always validates tensors.
Successfully merging this pull request may close these issues.

Is it normal that ROCm+HIPBLAS produces different results than on CPU or breaks completely?
3 participants