
add basic tensor data validation function #6884

Merged: 8 commits merged into master from sl/check-tensor on Apr 26, 2024
Conversation

slaren (Collaborator) commented Apr 24, 2024

Adds ggml_validate_row_data to validate tensor data, and adds this validation to llama.cpp during model loading. For floating point tensors it checks all the data; for quant types it checks only the scales. The validation consists of checking for nan and inf values in the tensors.

Should help detect issues with models such as the one reported in #6841.

Hopefully it won't increase load time too much, so that it can be left enabled permanently, but more testing is necessary.
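
As a rough illustration of the check described above, a minimal sketch for plain FP32 rows might look like the following (`validate_f32_row` is a hypothetical name, not the actual ggml code; the real ggml_validate_row_data also handles FP16 and the per-block scales of quantized types):

```cpp
#include <cmath>
#include <cstddef>

// Scan a row of FP32 values and reject any nan/inf.
static bool validate_f32_row(const float * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (std::isnan(data[i]) || std::isinf(data[i])) {
            return false; // corrupted value found
        }
    }
    return true;
}
```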

github-actions bot (Contributor) commented Apr 25, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 218 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=22179.75ms p(95)=40557.1ms fails=, finish reason: stop=101 truncated=117
  • Prompt processing (pp): avg=255.47tk/s p(95)=751.5tk/s
  • Token generation (tg): avg=19.05tk/s p(95)=25.08tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=sl/check-tensor commit=8ddd0228ff6c79db0580978d7559c356065f44b6

[chart] llamacpp:prompt_tokens_seconds — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 218 iterations

[chart] llamacpp:predicted_tokens_seconds — same run

[chart] llamacpp:kv_cache_usage_ratio — same run

[chart] llamacpp:requests_processing — same run

ggerganov (Owner) commented
I'm testing on M2 Ultra and the 70B model load times increase noticeably:

  • F16: 6s master -> 50s PR
  • Q8_0: 3s master -> 10s PR

There is no ARM SIMD optimization in the F16 branch, so adding one could help, but I'm also not sure whether we want to implement it for all instruction sets.

Maybe this validation should be opt-in? We could probably run it every time in the quantize tool to prevent the creation of corrupted models in the first place, but have it disabled by default when doing inference.

slaren (Collaborator, Author) commented Apr 25, 2024

It is not as bad for me on x86: with the pure CPU backend it doubles the load time, but the load time with mmap is always very fast, and when offloading the overhead becomes much less significant. The AVX2 implementation helps a lot; it makes checking FP16 models about as fast as quant models. Still, I agree that it is too slow to leave enabled unconditionally, so I will make it optional everywhere except in quantize.
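
One reason the SIMD versions can be so fast is that nan/inf detection on FP16 needs no conversion to float: an IEEE half-precision value is nan or inf exactly when its exponent bits (mask 0x7C00) are all set. A scalar sketch of that bit trick (illustrative only, not the llama.cpp implementation; the AVX2/NEON paths apply the same mask-and-compare to many values per instruction):

```cpp
#include <cstddef>
#include <cstdint>

// Reject any FP16 value whose exponent bits are all set (nan or inf).
static bool validate_f16_row(const uint16_t * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if ((data[i] & 0x7C00) == 0x7C00) {
            return false; // nan or inf
        }
    }
    return true;
}
```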

Commit: add --check-tensors command line argument

    tensor validation is disabled by default and can be enabled by adding
    `--check-tensors` to the command line arguments.

    quantize always validates tensors.
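
For example, an illustrative invocation with validation enabled (the flag name comes from the commit above; the binary and model names are placeholders): `./main -m model.gguf --check-tensors`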
Nekotekina (Contributor) commented

Hmm, quantize won't validate the tensors that are being written?

slaren (Collaborator, Author) commented Apr 26, 2024

This should be a lot faster now in most cases. There is a NEON implementation for FP16 validation, and the model validation is multithreaded with std::async when possible. Still, doing nothing is a lot faster than doing anything at all, so with mmap for CPU and Metal it can still double the load times.

If this proves to be useful, it may be good to enable it by default in the server: since the server is not a process that is meant to be restarted very often, the increase in load time should be less important.
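
A rough sketch of the std::async approach described above (the chunking scheme and helper names are illustrative assumptions, not the actual llama.cpp code):

```cpp
#include <cmath>
#include <cstddef>
#include <future>
#include <vector>

// Validate one chunk of FP32 data: reject any nan/inf.
static bool validate_chunk(const float * data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (std::isnan(data[i]) || std::isinf(data[i])) {
            return false;
        }
    }
    return true;
}

// Split the data into n_tasks chunks and validate them concurrently.
static bool validate_parallel(const float * data, size_t n, size_t n_tasks) {
    std::vector<std::future<bool>> futures;
    const size_t chunk = (n + n_tasks - 1) / n_tasks;
    for (size_t begin = 0; begin < n; begin += chunk) {
        const size_t len = begin + chunk <= n ? chunk : n - begin;
        futures.push_back(std::async(std::launch::async, validate_chunk, data + begin, len));
    }
    bool ok = true;
    for (auto & f : futures) {
        ok = f.get() && ok; // wait on every future so none is left dangling
    }
    return ok;
}
```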

ggerganov (Owner) left a comment

The 70B F16 case is now down to 20s (from 50s).

Yes, we can add it to the server. Also, let's watch the ggml-ci time: we do a lot of quantizations there, so this might lead to a significant increase. If it does, we might want to add an option to disable validation during quantize in the CI, since the model data there is not changing.

slaren (Collaborator, Author) commented Apr 26, 2024

It doesn't seem to affect the CI times significantly; in fact, the cuda-v100 time was a bit lower than the latest time on master, so I think it is within the normal variation between runs.

slaren merged commit 017e699 into master on Apr 26, 2024 (64 of 67 checks passed).
slaren deleted the sl/check-tensor branch on April 26, 2024 at 16:40.
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request on May 5, 2024:
* add basic tensor data validation function

* add --check-tensors command line argument

tensor validation is disabled by default and can be enabled by adding
`--check-tensors` to the command line arguments.

quantize always validates tensors.
Successfully merging this pull request may close these issues.

Is it normal that ROCm+HIPBLAS produces different results than on CPU or breaks completely?
3 participants