
ggml : unify rope norm/neox #7634

Merged: 9 commits merged into master from gg/rope-refactor on Jun 5, 2024
Conversation

@ggerganov (Owner) commented May 30, 2024

The RoPE modes NORM and NEOX compute the same rotation; they simply operate on different pairs of dimensions within each head:

# norm
(x[2*i + 0], x[2*i + 1])

# neox
(x[i], x[i + n_dims/2])
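For concreteness, here is a minimal NumPy sketch of the rotation under both pairing schemes (an illustrative reference, not the ggml implementation; the function name `rope_ref` is hypothetical):

```python
import numpy as np

def rope_ref(x, pos, mode="norm", base=10000.0):
    # Rotate element pairs of one head vector x by the angles
    # theta_i = pos * base^(-2i / n_dims); only the pairing differs.
    n_dims = x.shape[-1]
    out = x.astype(np.float64)
    for i in range(n_dims // 2):
        theta = pos * base ** (-2.0 * i / n_dims)
        c, s = np.cos(theta), np.sin(theta)
        if mode == "norm":
            i0, i1 = 2 * i, 2 * i + 1      # adjacent pairs
        else:  # "neox"
            i0, i1 = i, i + n_dims // 2    # split-half pairs
        x0, x1 = out[i0], out[i1]
        out[i0] = x0 * c - x1 * s
        out[i1] = x0 * s + x1 * c
    return out
```

Both modes apply the same n_dims/2 two-dimensional rotations; they differ only by a fixed permutation of the elements, which is what makes a single unified kernel possible.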

However, on master the two implementations have diverged for legacy reasons:

  • NORM does not support partial rotation, while NEOX does
  • the CPU NORM path uses cached rope values, while NEOX does not
  • NEOX supports frequency factors, while NORM does not
  • etc.

This PR unifies the implementation of the two modes to make future changes easier.

We also remove support for xPos RoPE (ggerganov/ggml#442) since it does not appear to be used.

I've also considered removing the GLM mode, but it seems to be used by ChatGLM (ggerganov/ggml#477)
@li-plus Could you confirm if GLM RoPE is still relevant today?

TODO

  • Remove xPos mode
  • Remove GLM mode and the n_ctx argument
  • NORM RoPE: support freq_factors
  • NORM RoPE: support the n_dims argument for partial rotation (see the sketch after this list)
  • CPU
  • Metal
  • CUDA
  • Vulkan (see 814d57d)
  • Kompute (see 13c6267)
  • SYCL (see 814d57d)
  • n_orig_ctx -> n_ctx_orig
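For context on the two NORM checklist items, here is a hedged sketch of the intended semantics, extending rope_ref above: partial rotation rotates only the first n_dims elements of the head and passes the rest through unchanged, while freq_factors rescale each pair's angle (the exact scaling convention below is an assumption):

```python
def rope_ref_ext(x, pos, n_dims, mode="norm", freq_factors=None, base=10000.0):
    # Hedged sketch: rotate only the first n_dims elements of the head
    # (partial rotation); elements past n_dims pass through unchanged.
    out = x.astype(np.float64)
    for i in range(n_dims // 2):
        theta = pos * base ** (-2.0 * i / n_dims)
        if freq_factors is not None:
            # assumed convention: divide the angle by the per-pair factor
            theta /= freq_factors[i]
        c, s = np.cos(theta), np.sin(theta)
        if mode == "norm":
            i0, i1 = 2 * i, 2 * i + 1
        else:  # "neox"
            i0, i1 = i, i + n_dims // 2
        x0, x1 = out[i0], out[i1]
        out[i0] = x0 * c - x1 * s
        out[i1] = x0 * s + x1 * c
    return out
```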

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) May 30, 2024
@li-plus (Contributor) commented May 30, 2024

> Could you confirm if GLM RoPE is still relevant today?

No. ChatGLM now uses NEOX-style RoPE with explicitly specified position ids. The mode & 4 branch is no longer used and can be removed entirely. The n_ctx argument is also no longer needed.

github-actions bot commented May 30, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 521 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8953.42ms p(95)=22457.77ms fails=, finish reason: stop=470 truncated=51
  • Prompt processing (pp): avg=107.74tk/s p(95)=488.54tk/s
  • Token generation (tg): avg=33.71tk/s p(95)=46.88tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/rope-refactor commit=ddac1ef6813132eb9e817460ef389bf7fe3c12a3

[charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing; each titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 521 iterations"]

@ggerganov changed the title from "ggml : unify rope norm/neox (CPU)" to "ggml : unify rope norm/neox" May 30, 2024
@github-actions bot added the testing, Nvidia GPU, Vulkan, examples, and SYCL labels May 30, 2024
@ggerganov marked this pull request as ready for review May 30, 2024 11:30
@github-actions bot added the python and Kompute labels May 30, 2024
@mofosyne added the Review Complexity : High and refactoring labels May 30, 2024
@mofosyne requested review from slaren and xaedes May 30, 2024 12:21
Review threads on ggml-cuda/rope.cu (outdated) and ggml.c were resolved.
@ggerganov merged commit 2b33896 into master Jun 5, 2024
83 checks passed
@ggerganov deleted the gg/rope-refactor branch June 5, 2024 08:29
joeatodd added a commit that referenced this pull request Jun 13, 2024
As per: #7634

Signed-off-by: Joe Todd <joe.todd@codeplay.com>
@joeatodd mentioned this pull request Jun 13, 2024