
change markdown output in benchmark PR comments #2693

Merged
merged 1 commit on Feb 17, 2025

Conversation

@EuphoricThinking (Contributor) commented Feb 11, 2025

- add an option for limiting markdown content size
- calculate relative performance with different baselines
- calculate relative performance using only already saved data
- group results according to suite names and explicit groups
- add multiple data columns if multiple --compare options are specified
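For illustration, a minimal sketch of what the size-limit option could look like (hypothetical names; GitHub caps comment bodies at 65536 characters, which matches the "too many chars to display" notices further down):

```python
# Hypothetical sketch: cap generated markdown so the PR comment stays within
# GitHub's 65536-character comment body limit.
def limit_markdown_size(markdown: str, max_chars: int = 65536) -> str:
    notice = "\nBenchmark details contain too many chars to display"
    if len(markdown) <= max_chars:
        return markdown
    # Truncate, leaving room for the notice itself.
    return markdown[: max_chars - len(notice)] + notice
```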

An example of the previous output design

@EuphoricThinking EuphoricThinking requested a review from a team as a code owner February 11, 2025 20:31
@github-actions github-actions bot added the ci/cd Continuous integration/delivery label Feb 11, 2025
@bratpiorka (Contributor):

could you provide links/images to see the difference before and after this PR?


Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13282111433

@pbalcer (Contributor) commented Feb 12, 2025

> could you provide links/images to see the difference before and after this PR?

We can just run the benchmark to see. I just triggered it.


Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13282111433
Job status: failure. Test status: success.


Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13283071851


Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13283071851
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 10 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3516.390000 ns | 3830.530 ns | 8.93% |
| Velocity-Bench Bitcracker | 35.525100 s | 37.429 s | 5.36% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 704.395000 ns | 739.669 ns | 5.01% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy | 117.809000 ns | 123.101 ns | 4.49% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 204.816000 ns | 213.371 ns | 4.18% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15078.400000 ns | 15563.600 ns | 3.22% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4571.880000 ns | 4712.160 ns | 3.07% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 175.661000 ns | 180.705 ns | 2.87% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> | 595674.000000 ns | 611807.000 ns | 2.71% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 271.334000 ns | 278.282 ns | 2.56% |
Regressed 21 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2187.210 ns | 1997.540000 ns | -8.67% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy | 2698360.000 ns | 2543840.000000 ns | -5.73% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 556.946 ns | 525.156000 ns | -5.71% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy | 104.130 ns | 98.720200 ns | -5.20% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1064.810 ns | 1011.890000 ns | -4.97% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 33185.800 ns | 31815.400000 ns | -4.13% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2785.440 ns | 2682.620000 ns | -3.69% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 733.898 ns | 708.146000 ns | -3.51% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 56.137 μs | 54.383000 μs | -3.12% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3490.070 ns | 3381.380000 ns | -3.11% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 302.754 ns | 293.969000 ns | -2.90% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1895.690 ns | 1841.590000 ns | -2.85% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 280.195 ns | 272.569000 ns | -2.72% |
| api_overhead_benchmark_sycl SubmitKernel out of order | 23.760 μs | 23.120000 μs | -2.69% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 868.411 ns | 846.197000 ns | -2.56% |
| Velocity-Bench Sobel Filter | 606.957 ms | 591.724000 ms | -2.51% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17443.222 μs | 17019.124000 μs | -2.43% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | 63.528 μs | 62.093000 μs | -2.26% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 212.952 ns | 208.150000 ns | -2.25% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 714.715 ns | 699.118000 ns | -2.18% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1194150.000 ns | 1169220.000000 ns | -2.09% |

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7): 99.763%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_l0 SubmitKernel in order | 11.504000 μs | 11.682 μs | 1.55% |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.732000 μs | 15.870 μs | 0.88% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | 21.010000 μs | 21.193 μs | 0.87% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.453 μs | 16.386000 μs | -0.41% |
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.545 μs | 11.494000 μs | -0.44% |
| api_overhead_benchmark_sycl SubmitKernel in order | 24.468 μs | 24.138000 μs | -1.35% |
| api_overhead_benchmark_sycl SubmitKernel out of order | 23.760 μs | 23.120000 μs | -2.69% |

Relative perf in group (17): 99.671%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.103000 μs | 2.144 μs | 1.95% |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.764000 μs | 5.793 μs | 0.50% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | 48731.565000 μs | 48939.615 μs | 0.43% |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 133.266000 μs | 133.793 μs | 0.40% |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.186000 GB/s | 3.181 GB/s | 0.16% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.683000 μs | 1.684 μs | 0.06% |
| miscellaneous_benchmark_sycl VectorSum | 858.316000 bw GB/s | 858.609 bw GB/s | 0.03% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | 1205.576 μs | 1204.721000 μs | -0.07% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | 113099.997 μs | 112569.641000 μs | -0.47% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 8745.221 μs | 8695.080000 μs | -0.57% |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 255.633 μs | 254.094000 μs | -0.60% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | 40986.423 μs | 40721.892000 μs | -0.65% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | 6978.908 μs | 6931.801000 μs | -0.67% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | 7591.671 μs | 7522.288000 μs | -0.91% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | 2081.220 μs | 2059.364000 μs | -1.05% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | 26150.355 μs | 25730.535000 μs | -1.61% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17443.222 μs | 17019.124000 μs | -2.43% |

Relative perf in group SinKernelGraph (4): 100.088%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | 72591.316000 μs | 72725.709 μs | 0.19% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | 71750.955000 μs | 71861.794 μs | 0.15% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 353441.240000 μs | 353468.831 μs | 0.01% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 353349.861000 μs | 353366.397 μs | 0.00% |

Relative perf in group SubmitGraph (3): 98.015%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | 679.934 μs | 676.159000 μs | -0.56% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | 63.528 μs | 62.093000 μs | -2.26% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 56.137 μs | 54.383000 μs | -3.12% |

Relative perf in group ExecGraph (3): 100.280%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | 5592.763000 μs | 5622.422 μs | 0.53% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | 5602.257000 μs | 5626.650 μs | 0.44% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | 56512.337 μs | 56442.150000 μs | -0.12% |

Relative perf in group SubmitKernel CPU count (3): 100.000%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | 104723.000000 instr | 104723.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | 110066.000000 instr | 110066.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | 122936.000000 instr | 122936.000 instr | 0.00% |
Velocity Bench
Relative perf in group (8): 101.045%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| Velocity-Bench Bitcracker | 35.525100 s | 37.429 s | 5.36% |
| Velocity-Bench QuickSilver | 117.790000 MMS/CTT | 116.690 MMS/CTT | 0.94% |
| Velocity-Bench Hashtable | 362.656861 M keys/sec | 359.598 M keys/sec | 0.85% |
| Velocity-Bench dl-mnist | 2.720000 s | 2.740 s | 0.74% |
| Velocity-Bench Sobel Filter | 606.957 ms | 591.724000 ms | -2.51% |
| Velocity-Bench CudaSift | 202.355000 ms | - | |
| Velocity-Bench dl-cifar | 23.663800 s | - | |
| Velocity-Bench svm | 0.136800 s | - | |
SYCL-Bench
Relative perf in group (54): cannot calculate

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| Runtime_IndependentDAGTaskThroughput_SingleTask | 266.719000 ms | - | |
| Runtime_IndependentDAGTaskThroughput_BasicParallelFor | 287.965000 ms | - | |
| Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor | 273.464000 ms | - | |
| Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor | 273.761000 ms | - | |
| Runtime_DAGTaskThroughput_SingleTask | 1677.949000 ms | - | |
| Runtime_DAGTaskThroughput_BasicParallelFor | 1774.890000 ms | - | |
| Runtime_DAGTaskThroughput_HierarchicalParallelFor | 1731.006000 ms | - | |
| Runtime_DAGTaskThroughput_NDRangeParallelFor | 1705.253000 ms | - | |
| MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous | 4.689000 ms | - | |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | 4.735000 ms | - | |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | 4.598000 ms | - | |
| MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous | 4.671000 ms | - | |
| MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous | 617.469000 ms | - | |
| MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous | 617.478000 ms | - | |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | 4.768000 ms | - | |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | 4.997000 ms | - | |
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | 5.056000 ms | - | |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | 4.824000 ms | - | |
| MicroBench_HostDeviceBandwidth_2D_D2H_Strided | 616.926000 ms | - | |
| MicroBench_HostDeviceBandwidth_3D_D2H_Strided | 616.928000 ms | - | |
| MicroBench_LocalMem_int32_4096 | 29.820000 ms | - | |
| MicroBench_LocalMem_fp32_4096 | 29.910000 ms | - | |
| Pattern_Reduction_NDRange_int32 | 16.299000 ms | - | |
| Pattern_Reduction_Hierarchical_int32 | 16.343000 ms | - | |
| ScalarProduct_NDRange_int32 | 3.768000 ms | - | |
| ScalarProduct_NDRange_int64 | 5.423000 ms | - | |
| ScalarProduct_NDRange_fp32 | 3.804000 ms | - | |
| ScalarProduct_Hierarchical_int32 | 10.539000 ms | - | |
| ScalarProduct_Hierarchical_int64 | 11.494000 ms | - | |
| ScalarProduct_Hierarchical_fp32 | 10.158000 ms | - | |
| Pattern_SegmentedReduction_NDRange_int16 | 2.266000 ms | - | |
| Pattern_SegmentedReduction_NDRange_int32 | 2.163000 ms | - | |
| Pattern_SegmentedReduction_NDRange_int64 | 2.338000 ms | - | |
| Pattern_SegmentedReduction_NDRange_fp32 | 2.169000 ms | - | |
| Pattern_SegmentedReduction_Hierarchical_int16 | 11.803000 ms | - | |
| Pattern_SegmentedReduction_Hierarchical_int32 | 11.590000 ms | - | |
| Pattern_SegmentedReduction_Hierarchical_int64 | 11.771000 ms | - | |
| Pattern_SegmentedReduction_Hierarchical_fp32 | 11.588000 ms | - | |
| USM_Allocation_latency_fp32_device | 0.062000 ms | - | |
| USM_Allocation_latency_fp32_host | 37.576000 ms | - | |
| USM_Allocation_latency_fp32_shared | 0.067000 ms | - | |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.671000 ms | - | |
| USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch | 1.046000 ms | - | |
| USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch | 1.848000 ms | - | |
| USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch | 1.216000 ms | - | |
| VectorAddition_int32 | 1.468000 ms | - | |
| VectorAddition_int64 | 3.059000 ms | - | |
| VectorAddition_fp32 | 1.510000 ms | - | |
| Polybench_2mm | 1.055000 ms | - | |
| Polybench_3mm | 1.484000 ms | - | |
| Polybench_Atax | 6.460000 ms | - | |
| Kmeans_fp32 | 14.050000 ms | - | |
| LinearRegressionCoeff_fp32 | 890.405000 ms | - | |
| MolecularDynamics | 0.030000 ms | - | |
llama.cpp bench
Relative perf in group (6): cannot calculate

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| llama.cpp Prompt Processing Batched 128 | 827.038521 token/s | - | |
| llama.cpp Text Generation Batched 128 | 62.428237 token/s | - | |
| llama.cpp Prompt Processing Batched 256 | 870.024268 token/s | - | |
| llama.cpp Text Generation Batched 256 | 62.462323 token/s | - | |
| llama.cpp Prompt Processing Batched 512 | 426.594975 token/s | - | |
| llama.cpp Text Generation Batched 512 | 62.476457 token/s | - | |
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7): 98.691%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy | 117.809000 ns | 123.101 ns | 4.49% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4571.880000 ns | 4712.160 ns | 3.07% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3093.100000 ns | 3133.790 ns | 1.32% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 287.473 ns | 281.921000 ns | -1.93% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3490.070 ns | 3381.380000 ns | -3.11% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2785.440 ns | 2682.620000 ns | -3.69% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2187.210 ns | 1997.540000 ns | -8.67% |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7): 98.130%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 490.573000 ns | 492.253 ns | 0.34% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 119.409 ns | 119.362000 ns | -0.04% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 194.902 ns | 193.089000 ns | -0.93% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 714.715 ns | 699.118000 ns | -2.18% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 212.952 ns | 208.150000 ns | -2.25% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 280.195 ns | 272.569000 ns | -2.72% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy | 104.130 ns | 98.720200 ns | -5.20% |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7): 100.952%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3516.390000 ns | 3830.530 ns | 8.93% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 271.334000 ns | 278.282 ns | 2.56% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4502.950000 ns | 4561.320 ns | 1.30% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1228.000 ns | 1222.550000 ns | -0.44% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3272.730 ns | 3243.100000 ns | -0.91% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 116.401 ns | 114.675000 ns | -1.48% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1895.690 ns | 1841.590000 ns | -2.85% |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7): 99.236%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 204.816000 ns | 213.371 ns | 4.18% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 119.210000 ns | 119.600 ns | 0.33% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 190.194000 ns | 190.534 ns | 0.18% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 502.837 ns | 494.342000 ns | -1.69% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy | 198.251 ns | 194.817000 ns | -1.73% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 302.754 ns | 293.969000 ns | -2.90% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 733.898 ns | 708.146000 ns | -3.51% |

Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4): 96.922%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> | 4161.990000 ns | 4206.340 ns | 1.07% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 868.411 ns | 846.197000 ns | -2.56% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1064.810 ns | 1011.890000 ns | -4.97% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 556.946 ns | 525.156000 ns | -5.71% |

Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4): 102.096%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 704.395000 ns | 739.669 ns | 5.01% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 175.661000 ns | 180.705 ns | 2.87% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 941.239000 ns | 948.074 ns | 0.73% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> | 356.014 ns | 355.503000 ns | -0.14% |

Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7): 98.810%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy | 7713060.000000 ns | 7742690.000 ns | 0.38% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> | 1724620.000000 ns | 1725460.000 ns | 0.05% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1170800.000 ns | 1168520.000000 ns | -0.19% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 47296.700 ns | 46996.600000 ns | -0.63% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> | 519470.000 ns | 510965.000000 ns | -1.64% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1194150.000 ns | 1169220.000000 ns | -2.09% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 33185.800 ns | 31815.400000 ns | -4.13% |

Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7): 99.405%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15078.400000 ns | 15563.600 ns | 3.22% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4212.360000 ns | 4278.780 ns | 1.58% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> | 24163.000000 ns | 24207.200 ns | 0.18% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 140134.000 ns | 139266.000000 ns | -0.62% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 162180.000 ns | 160574.000000 ns | -0.99% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> | 208332.000 ns | 205074.000000 ns | -1.56% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy | 2698360.000 ns | 2543840.000000 ns | -5.73% |

Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4): 100.438%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> | 595674.000000 ns | 611807.000 ns | 2.71% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 74721.000000 ns | 75748.800 ns | 1.38% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 139431.000 ns | 138754.000000 ns | -0.49% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy | 10883300.000 ns | 10688900.000000 ns | -1.79% |

Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4): 99.735%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> | 58800.200000 ns | 59340.100 ns | 0.92% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy | 2607870.000000 ns | 2608650.000 ns | 0.03% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25671.200 ns | 25635.500000 ns | -0.14% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 31641.200 ns | 31056.100000 ns | -1.85% |

Details

Benchmark details contain too many chars to display

```python
# Generate the row with the best value highlighted
# Generate the row with all the results from saved runs specified by
# --compare,
# Highight the best value in the row with data
```
misspell Highight

still a misspell ;d
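The snippet above is about emphasizing the best value when building a row; as a minimal illustrative sketch (hypothetical helper, not the PR's actual code), assuming lower values are better:

```python
# Hypothetical sketch: build one markdown table row and emphasize the best
# (lowest, for time-like units such as ns or μs) measurement in bold.
def format_row(name: str, values: list[float], unit: str) -> str:
    best = min(values)  # assumes lower is better
    cells = [f"**{v} {unit}**" if v == best else f"{v} {unit}" for v in values]
    return "| " + " | ".join([name, *cells]) + " |"

# e.g. one result from this PR plus one column per --compare baseline:
print(format_row("alloc/size:10000 umfProxy", [117.809, 123.101], "ns"))
```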

@EuphoricThinking force-pushed the benchmark_markdown branch 2 times, most recently from 2e8d039 to 09ce9ac on February 13, 2025 14:08
@pbalcer (Contributor) left a comment:

grouping doesn't work for some of the benchmarks:

> Relative perf in group (17): 99.671%


```python
# If data is collected from already saved results,
# the content is parsed as strings
if isinstance(res.env, str):
```
how does this improve the existing way of printing env vars?

@EuphoricThinking (Contributor, Author) replied:
If you are asking about the introduced ifs: my OCD couldn't stand empty Environment variables sections.

If you are asking about ast.literal_eval, I added it for results that were not computed during the current script run but loaded from previously saved data. The dictionary of environment variables is parsed from JSON as a plain string, and ast.literal_eval lets us access its elements again. Maybe we could change something in Benchmark.from_json() instead.
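A minimal sketch of the conversion being described (ast.literal_eval is the standard-library call; the field name and value here are illustrative):

```python
import ast

# When results come from a saved JSON file rather than the current run,
# the environment-variable dict may arrive as its string representation.
saved_env = "{'LD_PRELOAD': '/usr/lib/libumf_proxy.so'}"  # illustrative value

env = ast.literal_eval(saved_env) if isinstance(saved_env, str) else saved_env
print(env["LD_PRELOAD"])  # dictionary elements are accessible again
```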



```python
def get_relative_perf_summary(group_size: int, diffs_product: int,
                              root_for_geometric_mean: int, group_name: str):
```
geomean is complicated to calculate. I'd just replace the : xx % change with (improved X, regressed Y)
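For reference, the group percentages in this thread (e.g. "99.671%") behave like a geometric mean of per-benchmark performance ratios; a minimal sketch under that assumption, with illustrative names (the signature above instead threads diffs_product and root_for_geometric_mean through explicitly):

```python
import math

# Hypothetical sketch: geometric mean of this-PR vs. baseline performance
# ratios for one group, as a percentage (100% means no overall change).
def relative_group_perf(ratios: list[float]) -> float:
    return 100.0 * math.prod(ratios) ** (1.0 / len(ratios))

# e.g. one benchmark 5% faster and one 5% slower roughly cancel out:
print(relative_group_perf([1.05, 0.95]))  # ~99.87
```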

@EuphoricThinking (Contributor, Author) commented Feb 14, 2025

> grouping doesn't work for some of the benchmarks:
>
> Relative perf in group (17): 99.671%

These benchmarks don't have an explicit group assigned (example without an assigned group, example with one); the default explicit_group is an empty string. I'm going to change the default group name to "Ungrouped".
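A minimal sketch of that fallback (illustrative names, assuming each result carries an explicit_group attribute):

```python
from collections import defaultdict

# Hypothetical sketch: bucket results by explicit_group, replacing the
# default empty string with a readable "Ungrouped" label.
def group_results(results):
    groups = defaultdict(list)
    for res in results:
        groups[res.explicit_group or "Ungrouped"].append(res)
    return groups
```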


Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13335358616


Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13335358616
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 23 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1164370.000000 ns | 1290770.000 ns | 10.86% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1192800.000000 ns | 1301090.000 ns | 9.08% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2608.040000 ns | 2813.690 ns | 7.89% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 290.805000 ns | 312.357 ns | 7.41% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 291.544000 ns | 312.314 ns | 7.12% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> | 496255.000000 ns | 529627.000 ns | 6.72% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25690.100000 ns | 27282.400 ns | 6.20% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 266.115000 ns | 279.926 ns | 5.19% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4286.200000 ns | 4505.450 ns | 5.12% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 580.222000 ns | 609.181 ns | 4.99% |
| VectorAddition_int32 | 1.448000 ms | 1.519 ms | 4.90% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 142261.000000 ns | 148026.000 ns | 4.05% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 119.939000 ns | 124.375 ns | 3.70% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 496.554000 ns | 513.605 ns | 3.43% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3409.200000 ns | 3515.560 ns | 3.12% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> | 206862.000000 ns | 212699.000 ns | 2.82% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> | 613468.000000 ns | 629497.000 ns | 2.61% |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | 4.826000 ms | 4.952 ms | 2.61% |
| ScalarProduct_NDRange_int32 | 3.766000 ms | 3.848 ms | 2.18% |
| Polybench_Atax | 6.259000 ms | 6.394 ms | 2.16% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 121.121000 ns | 123.713 ns | 2.14% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15362.200000 ns | 15689.000 ns | 2.13% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> | 1764240.000000 ns | 1801270.000 ns | 2.10% |
Regressed 27 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 979.017 ns | 743.029000 ns | -24.10% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1436.210 ns | 1258.420000 ns | -12.38% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1998.270 ns | 1783.030000 ns | -10.77% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 267.519 ns | 239.990000 ns | -10.29% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3422.840 ns | 3109.690000 ns | -9.15% |
| USM_Allocation_latency_fp32_shared | 0.070 ms | 0.064000 ms | -8.57% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 140.458 ns | 128.515000 ns | -8.50% |
| USM_Allocation_latency_fp32_device | 0.065 ms | 0.060000 ms | -7.69% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2106.500 ns | 1967.950000 ns | -6.58% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3803.790 ns | 3577.360000 ns | -5.95% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1052.890 ns | 1001.060000 ns | -4.92% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 873.835 ns | 832.341000 ns | -4.75% |
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.917 μs | 11.376000 μs | -4.54% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 267.323 ns | 255.713000 ns | -4.34% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4654.630 ns | 4467.570000 ns | -4.02% |
| VectorAddition_fp32 | 1.533 ms | 1.472000 ms | -3.98% |
| MolecularDynamics | 0.031 ms | 0.030000 ms | -3.23% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 872.718 ns | 846.087000 ns | -3.05% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 183.128 ns | 177.966000 ns | -2.82% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 196.708 ns | 191.184000 ns | -2.81% |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.980 μs | 15.587000 μs | -2.46% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> | 4335.340 ns | 4229.540000 ns | -2.44% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 8789.452 μs | 8581.558000 μs | -2.37% |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.713 ms | 1.674000 ms | -2.28% |
| LinearRegressionCoeff_fp32 | 941.013 ms | 920.731000 ms | -2.16% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 142682.000 ns | 139678.000000 ns | -2.11% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.631 μs | 16.293000 μs | -2.03% |

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | 20.950000 μs | 21.090 μs | 0.67% |
| api_overhead_benchmark_sycl SubmitKernel in order | 24.544 μs | 24.377000 μs | -0.68% |
| api_overhead_benchmark_sycl SubmitKernel out of order | 23.173 μs | 22.912000 μs | -1.13% |
| api_overhead_benchmark_l0 SubmitKernel in order | 11.842 μs | 11.696000 μs | -1.23% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.631 μs | 16.293000 μs | -2.03% |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.980 μs | 15.587000 μs | -2.46% |
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.917 μs | 11.376000 μs | -4.54% |

Relative perf in group (17)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | 7446.901000 μs | 7527.460 μs | 1.08% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | 40664.268000 μs | 41011.050 μs | 0.85% |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 133.498000 μs | 134.477 μs | 0.73% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | 25736.730000 μs | 25845.518 μs | 0.42% |
| miscellaneous_benchmark_sycl VectorSum | 858.902000 bw GB/s | 861.548 bw GB/s | 0.31% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | 6923.736 μs | 6913.059000 μs | -0.15% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.676 μs | 1.673000 μs | -0.18% |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.183 GB/s | 3.189000 GB/s | -0.19% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | 48059.615 μs | 47966.780000 μs | -0.19% |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 253.755 μs | 252.961000 μs | -0.31% |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.632 μs | 5.613000 μs | -0.34% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | 113003.736 μs | 112059.723000 μs | -0.84% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | 2094.790 μs | 2077.079000 μs | -0.85% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17323.738 μs | 17146.720000 μs | -1.02% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.147 μs | 2.114000 μs | -1.54% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | 1209.309 μs | 1186.615000 μs | -1.88% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 8789.452 μs | 8581.558000 μs | -2.37% |

Relative perf in group SinKernelGraph (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | 71721.562000 μs | 71737.327 μs | 0.02% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 353239.836 μs | 353216.125000 μs | -0.01% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 353560.841 μs | 353477.995000 μs | -0.02% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | 72664.228 μs | 72516.787000 μs | -0.20% |

Relative perf in group SubmitGraph (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | 61.738000 μs | 61.921 μs | 0.30% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | 672.665000 μs | 673.385 μs | 0.11% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 54.707 μs | 54.243000 μs | -0.85% |

Relative perf in group ExecGraph (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | 5593.804000 μs | 5622.149 μs | 0.51% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | 56521.162 μs | 56485.620000 μs | -0.06% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | 5584.778 μs | 5580.318000 μs | -0.08% |

Relative perf in group SubmitKernel CPU count (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | 104593.000000 instr | 104593.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | 109936.000000 instr | 109936.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | 122806.000000 instr | 122806.000 instr | 0.00% |
Velocity Bench
Relative perf in group (8)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| Velocity-Bench QuickSilver | 118.170000 MMS/CTT | 117.050 MMS/CTT | 0.96% |
| Velocity-Bench svm | 0.133600 s | 0.135 s | 0.82% |
| Velocity-Bench CudaSift | 202.781000 ms | 203.219 ms | 0.22% |
| Velocity-Bench Bitcracker | 35.578800 s | 35.584 s | 0.02% |
| Velocity-Bench dl-cifar | 24.066 s | 23.890200 s | -0.73% |
| Velocity-Bench Sobel Filter | 615.344 ms | 610.451000 ms | -0.80% |
| Velocity-Bench Hashtable | 355.727 M keys/sec | 359.331110 M keys/sec | -1.00% |
| Velocity-Bench dl-mnist | 2.760 s | 2.730000 s | -1.09% |
SYCL-Bench
Relative perf in group (54)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| VectorAddition_int32 | 1.448000 ms | 1.519 ms | 4.90% |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | 4.826000 ms | 4.952 ms | 2.61% |
| ScalarProduct_NDRange_int32 | 3.766000 ms | 3.848 ms | 2.18% |
| Polybench_Atax | 6.259000 ms | 6.394 ms | 2.16% |
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | 5.011000 ms | 5.087 ms | 1.52% |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | 4.792000 ms | 4.864 ms | 1.50% |
| Runtime_DAGTaskThroughput_NDRangeParallelFor | 1705.474000 ms | 1727.936 ms | 1.32% |
| Runtime_DAGTaskThroughput_SingleTask | 1693.933000 ms | 1715.450 ms | 1.27% |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | 4.797000 ms | 4.855 ms | 1.21% |
| Runtime_DAGTaskThroughput_HierarchicalParallelFor | 1750.190000 ms | 1770.145 ms | 1.14% |
| Runtime_DAGTaskThroughput_BasicParallelFor | 1761.610000 ms | 1775.221 ms | 0.77% |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | 5.122000 ms | 5.153 ms | 0.61% |
| ScalarProduct_Hierarchical_int64 | 11.474000 ms | 11.542 ms | 0.59% |
| Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor | 275.894000 ms | 277.203 ms | 0.47% |
| USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch | 1.214000 ms | 1.219 ms | 0.41% |
| MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous | 4.734000 ms | 4.749 ms | 0.32% |
| Runtime_IndependentDAGTaskThroughput_SingleTask | 267.999000 ms | 268.591 ms | 0.22% |
| Pattern_SegmentedReduction_NDRange_int16 | 2.262000 ms | 2.266 ms | 0.18% |
| ScalarProduct_NDRange_int64 | 5.465000 ms | 5.474 ms | 0.16% |
| USM_Allocation_latency_fp32_host | 37.324000 ms | 37.383 ms | 0.16% |
| MicroBench_LocalMem_fp32_4096 | 29.848000 ms | 29.891 ms | 0.14% |
| Pattern_SegmentedReduction_NDRange_int64 | 2.333000 ms | 2.336 ms | 0.13% |
| Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor | 276.243000 ms | 276.568 ms | 0.12% |
| Pattern_SegmentedReduction_NDRange_int32 | 2.163000 ms | 2.165 ms | 0.09% |
| ScalarProduct_Hierarchical_fp32 | 10.171000 ms | 10.176 ms | 0.05% |
| Pattern_SegmentedReduction_Hierarchical_fp32 | 11.582000 ms | 11.587 ms | 0.04% |
| Pattern_SegmentedReduction_Hierarchical_int16 | 11.798000 ms | 11.803 ms | 0.04% |
| ScalarProduct_Hierarchical_int32 | 10.517000 ms | 10.521 ms | 0.04% |
| MicroBench_HostDeviceBandwidth_2D_D2H_Strided | 616.909000 ms | 617.076 ms | 0.03% |
| Pattern_SegmentedReduction_Hierarchical_int64 | 11.778000 ms | 11.781 ms | 0.03% |
| MicroBench_HostDeviceBandwidth_3D_D2H_Strided | 616.905000 ms | 616.946 ms | 0.01% |
| MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous | 617.566000 ms | 617.585 ms | 0.00% |
| Polybench_2mm | 1.052000 ms | 1.052 ms | 0.00% |
| Polybench_3mm | 1.481000 ms | 1.481 ms | 0.00% |
| MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous | 617.589 ms | 617.560000 ms | -0.00% |
| Pattern_SegmentedReduction_Hierarchical_int32 | 11.589 ms | 11.587000 ms | -0.02% |
| Pattern_SegmentedReduction_NDRange_fp32 | 2.164 ms | 2.163000 ms | -0.05% |
| MicroBench_LocalMem_int32_4096 | 29.884 ms | 29.826000 ms | -0.19% |
| Kmeans_fp32 | 14.109 ms | 14.048000 ms | -0.43% |
| Pattern_Reduction_NDRange_int32 | 16.801 ms | 16.686000 ms | -0.68% |
| Pattern_Reduction_Hierarchical_int32 | 16.900 ms | 16.762000 ms | -0.82% |
| USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch | 1.841 ms | 1.825000 ms | -0.87% |
| ScalarProduct_NDRange_fp32 | 3.794 ms | 3.760000 ms | -0.90% |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | 4.718 ms | 4.674000 ms | -0.93% |
| VectorAddition_int64 | 3.095 ms | 3.064000 ms | -1.00% |
| Runtime_IndependentDAGTaskThroughput_BasicParallelFor | 293.181 ms | 289.887000 ms | -1.12% |
| USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch | 1.065 ms | 1.051000 ms | -1.31% |
| MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous | 4.852 ms | 4.775000 ms | -1.59% |
| LinearRegressionCoeff_fp32 | 941.013 ms | 920.731000 ms | -2.16% |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.713 ms | 1.674000 ms | -2.28% |
| MolecularDynamics | 0.031 ms | 0.030000 ms | -3.23% |
| VectorAddition_fp32 | 1.533 ms | 1.472000 ms | -3.98% |
| USM_Allocation_latency_fp32_device | 0.065 ms | 0.060000 ms | -7.69% |
| USM_Allocation_latency_fp32_shared | 0.070 ms | 0.064000 ms | -8.57% |
llama.cpp bench
Relative perf in group (6)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| llama.cpp Text Generation Batched 128 | 62.523001 token/s | 62.439 token/s | 0.13% |
| llama.cpp Text Generation Batched 512 | 62.506139 token/s | 62.433 token/s | 0.12% |
| llama.cpp Text Generation Batched 256 | 62.528739 token/s | 62.524 token/s | 0.01% |
| llama.cpp Prompt Processing Batched 512 | 420.439 token/s | 422.877525 token/s | -0.58% |
| llama.cpp Prompt Processing Batched 256 | 865.016 token/s | 871.296349 token/s | -0.72% |
| llama.cpp Prompt Processing Batched 128 | 825.488 token/s | 838.278943 token/s | -1.53% |
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2608.040000 ns | 2813.690 ns | 7.89% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 290.805000 ns | 312.357 ns | 7.41% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4286.200000 ns | 4505.450 ns | 5.12% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3132.190000 ns | 3179.010 ns | 1.49% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy | 133.185000 ns | 133.661 ns | 0.36% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3803.790 ns | 3577.360000 ns | -5.95% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2106.500 ns | 1967.950000 ns | -6.58% |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 266.115000 ns | 279.926 ns | 5.19% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 119.939000 ns | 124.375 ns | 3.70% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 496.554000 ns | 513.605 ns | 3.43% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 712.866000 ns | 719.821 ns | 0.98% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 209.903000 ns | 211.785 ns | 0.90% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 188.784 ns | 188.514000 ns | -0.14% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy | 96.764 ns | 96.221000 ns | -0.56% |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3409.200000 ns | 3515.560 ns | 3.12% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4654.630 ns | 4467.570000 ns | -4.02% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 267.323 ns | 255.713000 ns | -4.34% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 140.458 ns | 128.515000 ns | -8.50% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3422.840 ns | 3109.690000 ns | -9.15% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1998.270 ns | 1783.030000 ns | -10.77% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1436.210 ns | 1258.420000 ns | -12.38% |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 291.544000 ns | 312.314 ns | 7.12% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 121.121000 ns | 123.713 ns | 2.14% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 502.574 ns | 501.051000 ns | -0.30% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy | 203.301 ns | 201.403000 ns | -0.93% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 196.708 ns | 191.184000 ns | -2.81% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 267.519 ns | 239.990000 ns | -10.29% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 979.017 ns | 743.029000 ns | -24.10% |

Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> | 4335.340 ns | 4229.540000 ns | -2.44% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 872.718 ns | 846.087000 ns | -3.05% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 873.835 ns | 832.341000 ns | -4.75% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1052.890 ns | 1001.060000 ns | -4.92% |

Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 580.222000 ns | 609.181 ns | 4.99% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 962.359000 ns | 979.526 ns | 1.78% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> | 350.754000 ns | 353.953 ns | 0.91% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 183.128 ns | 177.966000 ns | -2.82% |

Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1164370.000000 ns | 1290770.000 ns | 10.86% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1192800.000000 ns | 1301090.000 ns | 9.08% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> | 496255.000000 ns | 529627.000 ns | 6.72% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> | 1764240.000000 ns | 1801270.000 ns | 2.10% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy | 701661.000000 ns | 709591.000 ns | 1.13% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 32465.300000 ns | 32614.100 ns | 0.46% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 47368.300 ns | 47151.600000 ns | -0.46% |

Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 142261.000000 ns | 148026.000 ns | 4.05% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> | 206862.000000 ns | 212699.000 ns | 2.82% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15362.200000 ns | 15689.000 ns | 2.13% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 162737.000000 ns | 165724.000 ns | 1.84% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy | 117142.000000 ns | 117400.000 ns | 0.22% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4301.930 ns | 4299.410000 ns | -0.06% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> | 24827.000 ns | 24506.300000 ns | -1.29% |

Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> | 613468.000000 ns | 629497.000 ns | 2.61% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 75662.800000 ns | 76304.400 ns | 0.85% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy | 670349.000 ns | 666074.000000 ns | -0.64% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 142682.000 ns | 139678.000000 ns | -2.11% |

Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25690.100000 ns | 27282.400 ns | 6.20% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 31316.900000 ns | 31631.300 ns | 1.00% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> | 59541.500000 ns | 59837.700 ns | 0.50% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy | 132633.000 ns | 131923.000000 ns | -0.54% |

Details

Benchmark details contain too many chars to display


Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13357456707


Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13357456707
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 16 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 709.347000 ns | 839.789 ns | 18.39% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3354.690000 ns | 3670.840 ns | 9.42% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 2969.050000 ns | 3145.700 ns | 5.95% |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | 4.924000 ms | 5.177 ms | 5.14% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3149.510000 ns | 3292.800 ns | 4.55% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> | 351.825000 ns | 367.152 ns | 4.36% |
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | 4.963000 ms | 5.173 ms | 4.23% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 986.221000 ns | 1025.920 ns | 4.03% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 175.313000 ns | 181.985 ns | 3.81% |
| VectorAddition_int32 | 1.459000 ms | 1.506 ms | 3.22% |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | 4.742000 ms | 4.884 ms | 2.99% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 128.073000 ns | 131.773 ns | 2.89% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17001.345000 μs | 17478.948 μs | 2.81% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1161930.000000 ns | 1192630.000 ns | 2.64% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15148.100000 ns | 15514.100 ns | 2.42% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 626.494000 ns | 639.395 ns | 2.06% |
Regressed 17 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| USM_Allocation_latency_fp32_shared | 0.066 ms | 0.055000 ms | -16.67% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1388.320 ns | 1202.580000 ns | -13.38% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 255.429 ns | 241.503000 ns | -5.45% |
| USM_Allocation_latency_fp32_device | 0.065 ms | 0.062000 ms | -4.62% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2242.230 ns | 2142.640000 ns | -4.44% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 261.679 ns | 250.231000 ns | -4.37% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 32992.200 ns | 31667.700000 ns | -4.01% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3767.700 ns | 3630.740000 ns | -3.64% |
| MolecularDynamics | 0.032 ms | 0.031000 ms | -3.12% |
| Velocity-Bench Sobel Filter | 611.555 ms | 593.513000 ms | -2.95% |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | 4.883 ms | 4.751000 ms | -2.70% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 995.454 ns | 969.826000 ns | -2.57% |
| api_overhead_benchmark_l0 SubmitKernel in order | 11.694 μs | 11.398000 μs | -2.53% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1193740.000 ns | 1165850.000000 ns | -2.34% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.530 μs | 16.180000 μs | -2.12% |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | 4.767 ms | 4.671000 ms | -2.01% |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.696 ms | 1.662000 ms | -2.00% |

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.524000 μs | 11.747 μs | 1.94% |
| api_overhead_benchmark_sycl SubmitKernel in order | 24.117000 μs | 24.218 μs | 0.42% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | 21.078 μs | 20.931000 μs | -0.70% |
| api_overhead_benchmark_sycl SubmitKernel out of order | 23.111 μs | 22.862000 μs | -1.08% |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.705 μs | 15.485000 μs | -1.40% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.530 μs | 16.180000 μs | -2.12% |
| api_overhead_benchmark_l0 SubmitKernel in order | 11.694 μs | 11.398000 μs | -2.53% |

Relative perf in group (17)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17001.345000 μs | 17478.948 μs | 2.81% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 8630.563000 μs | 8719.602 μs | 1.03% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | 25620.442000 μs | 25867.194 μs | 0.96% |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.186000 GB/s | 3.163 GB/s | 0.73% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | 40931.902000 μs | 41181.436 μs | 0.61% |
| miscellaneous_benchmark_sycl VectorSum | 857.147000 bw GB/s | 860.959 bw GB/s | 0.44% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | 112499.588000 μs | 112938.927 μs | 0.39% |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.534000 μs | 5.553 μs | 0.34% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.098000 μs | 2.105 μs | 0.33% |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 134.006000 μs | 134.388 μs | 0.29% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | 1188.326000 μs | 1191.684 μs | 0.28% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | 6951.370 μs | 6946.184000 μs | -0.07% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.687 μs | 1.679000 μs | -0.47% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | 2102.064 μs | 2084.496000 μs | -0.84% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | 7550.282 μs | 7486.409000 μs | -0.85% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | 48669.897 μs | 48250.437000 μs | -0.86% |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 253.401 μs | 251.023000 μs | -0.94% |

Relative perf in group SinKernelGraph (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 353374.079000 μs | 353443.490 μs | 0.02% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 353273.575 μs | 353238.347000 μs | -0.01% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | 72574.742 μs | 72555.811000 μs | -0.03% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | 71855.159 μs | 71737.858000 μs | -0.16% |

Relative perf in group SubmitGraph (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 54.401 μs | 54.334000 μs | -0.12% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | 674.742 μs | 673.309000 μs | -0.21% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | 62.371 μs | 61.787000 μs | -0.94% |

Relative perf in group ExecGraph (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | 56482.408000 μs | 56486.380 μs | 0.01% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | 5593.672 μs | 5586.543000 μs | -0.13% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | 5597.140 μs | 5588.567000 μs | -0.15% |

Relative perf in group SubmitKernel CPU count (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | 104593.000000 instr | 104593.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | 109936.000000 instr | 109936.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | 123120.000 instr | 122807.000000 instr | -0.25% |
Velocity Bench
Relative perf in group Ungrouped (8)
Benchmark This PR baseline Change
Velocity-Bench svm 0.134300 s 0.135 s 0.30%
Velocity-Bench QuickSilver 117.870000 MMS/CTT 117.560 MMS/CTT 0.26%
Velocity-Bench CudaSift 202.895000 ms 203.196 ms 0.15%
Velocity-Bench Bitcracker 35.528 s 35.501000 s -0.07%
Velocity-Bench Hashtable 357.425 M keys/sec 358.637420 M keys/sec -0.34%
Velocity-Bench dl-mnist 2.730 s 2.720000 s -0.37%
Velocity-Bench dl-cifar 24.054 s 23.806700 s -1.03%
Velocity-Bench Sobel Filter 611.555 ms 593.513000 ms -2.95%
SYCL-Bench
Relative perf in group Ungrouped (53)
Benchmark This PR baseline Change
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 4.924000 ms 5.177 ms 5.14%
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 4.963000 ms 5.173 ms 4.23%
VectorAddition_int32 1.459000 ms 1.506 ms 3.22%
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.742000 ms 4.884 ms 2.99%
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.766000 ms 4.831 ms 1.36%
VectorAddition_fp32 1.454000 ms 1.473 ms 1.31%
Pattern_Reduction_NDRange_int32 16.499000 ms 16.700 ms 1.22%
Polybench_2mm 1.041000 ms 1.051 ms 0.96%
ScalarProduct_Hierarchical_int64 11.470000 ms 11.516 ms 0.40%
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.827000 ms 4.835 ms 0.17%
ScalarProduct_Hierarchical_fp32 10.140000 ms 10.156 ms 0.16%
ScalarProduct_NDRange_fp32 3.741000 ms 3.745 ms 0.11%
Runtime_DAGTaskThroughput_NDRangeParallelFor 1694.603000 ms 1696.297 ms 0.10%
Pattern_SegmentedReduction_NDRange_fp32 2.164000 ms 2.165 ms 0.05%
Pattern_SegmentedReduction_Hierarchical_int16 11.802000 ms 11.805 ms 0.03%
Pattern_Reduction_Hierarchical_int32 16.932000 ms 16.935 ms 0.02%
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 616.911000 ms 616.960 ms 0.01%
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 616.915000 ms 616.936 ms 0.00%
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 617.638 ms 617.589000 ms -0.01%
Pattern_SegmentedReduction_Hierarchical_fp32 11.587 ms 11.586000 ms -0.01%
Pattern_SegmentedReduction_Hierarchical_int32 11.586 ms 11.584000 ms -0.02%
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 617.690 ms 617.554000 ms -0.02%
ScalarProduct_Hierarchical_int32 10.551 ms 10.548000 ms -0.03%
Pattern_SegmentedReduction_NDRange_int16 2.265 ms 2.264000 ms -0.04%
Pattern_SegmentedReduction_Hierarchical_int64 11.769 ms 11.763000 ms -0.05%
USM_Allocation_latency_fp32_host 37.543 ms 37.512000 ms -0.08%
Kmeans_fp32 14.046 ms 14.031000 ms -0.11%
ScalarProduct_NDRange_int64 5.431 ms 5.425000 ms -0.11%
MicroBench_LocalMem_int32_4096 29.859 ms 29.798000 ms -0.20%
Polybench_3mm 1.486 ms 1.482000 ms -0.27%
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1731.429 ms 1726.405000 ms -0.29%
ScalarProduct_NDRange_int32 3.770 ms 3.759000 ms -0.29%
Pattern_SegmentedReduction_NDRange_int64 2.343 ms 2.336000 ms -0.30%
Runtime_DAGTaskThroughput_BasicParallelFor 1754.051 ms 1748.761000 ms -0.30%
MicroBench_LocalMem_fp32_4096 29.926 ms 29.816000 ms -0.37%
Runtime_DAGTaskThroughput_SingleTask 1684.153 ms 1677.953000 ms -0.37%
Pattern_SegmentedReduction_NDRange_int32 2.172 ms 2.162000 ms -0.46%
Runtime_IndependentDAGTaskThroughput_SingleTask 262.293 ms 260.173000 ms -0.81%
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.805 ms 4.764000 ms -0.85%
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 277.956 ms 275.430000 ms -0.91%
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 279.673 ms 276.920000 ms -0.98%
Polybench_Atax 6.436 ms 6.372000 ms -0.99%
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.205 ms 1.193000 ms -1.00%
VectorAddition_int64 3.108 ms 3.067000 ms -1.32%
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.054 ms 1.039000 ms -1.42%
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.841 ms 1.814000 ms -1.47%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 292.626 ms 286.848000 ms -1.97%
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.696 ms 1.662000 ms -2.00%
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.767 ms 4.671000 ms -2.01%
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.883 ms 4.751000 ms -2.70%
MolecularDynamics 0.032 ms 0.031000 ms -3.12%
USM_Allocation_latency_fp32_device 0.065 ms 0.062000 ms -4.62%
USM_Allocation_latency_fp32_shared 0.066 ms 0.055000 ms -16.67%
llama.cpp bench
Relative perf in group Ungrouped (6)
Benchmark This PR baseline Change
llama.cpp Prompt Processing Batched 512 421.883415 token/s 417.717 token/s 1.00%
llama.cpp Prompt Processing Batched 128 825.849545 token/s 822.681 token/s 0.39%
llama.cpp Text Generation Batched 256 62.537986 token/s 62.449 token/s 0.14%
llama.cpp Text Generation Batched 512 62.481856 token/s 62.410 token/s 0.12%
llama.cpp Text Generation Batched 128 62.470 token/s 62.489279 token/s -0.03%
llama.cpp Prompt Processing Batched 256 863.426 token/s 869.125841 token/s -0.66%
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 2969.050000 ns 3145.700 ns 5.95%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 290.412000 ns 296.208 ns 2.00%
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2582.560000 ns 2595.190 ns 0.49%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4700.460 ns 4686.100000 ns -0.31%
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy 135.677 ns 133.691000 ns -1.46%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3767.700 ns 3630.740000 ns -3.64%
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2242.230 ns 2142.640000 ns -4.44%
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 210.848000 ns 213.403 ns 1.21%
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 499.323000 ns 500.448 ns 0.23%
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.587000 ns 119.758 ns 0.14%
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 189.676 ns 188.841000 ns -0.44%
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 98.056 ns 97.605900 ns -0.46%
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 274.047 ns 270.523000 ns -1.29%
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 713.179 ns 701.745000 ns -1.60%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3354.690000 ns 3670.840 ns 9.42%
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3149.510000 ns 3292.800 ns 4.55%
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy 128.073000 ns 131.773 ns 2.89%
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4724.390000 ns 4775.630 ns 1.08%
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1980.670 ns 1980.320000 ns -0.02%
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 261.679 ns 250.231000 ns -4.37%
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1388.320 ns 1202.580000 ns -13.38%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 294.962 ns 294.927000 ns -0.01%
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy 203.327 ns 203.225000 ns -0.05%
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.499 ns 119.377000 ns -0.10%
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 505.384 ns 503.757000 ns -0.32%
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 192.617 ns 191.825000 ns -0.41%
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 750.861 ns 743.102000 ns -1.03%
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 255.429 ns 241.503000 ns -5.45%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 709.347000 ns 839.789 ns 18.39%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 986.221000 ns 1025.920 ns 4.03%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 849.690000 ns 861.095 ns 1.34%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> 4298.920 ns 4282.990000 ns -0.37%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> 351.825000 ns 367.152 ns 4.36%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 175.313000 ns 181.985 ns 3.81%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 626.494000 ns 639.395 ns 2.06%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 995.454 ns 969.826000 ns -2.57%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1161930.000000 ns 1192630.000 ns 2.64%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy 706545.000000 ns 713122.000 ns 0.93%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 47358.700 ns 47332.200000 ns -0.06%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1754920.000 ns 1751760.000000 ns -0.18%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> 525143.000 ns 518073.000000 ns -1.35%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1193740.000 ns 1165850.000000 ns -2.34%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32992.200 ns 31667.700000 ns -4.01%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15148.100000 ns 15514.100 ns 2.42%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 162004.000000 ns 164859.000 ns 1.76%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy 117360.000000 ns 117587.000 ns 0.19%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 211739.000 ns 211217.000000 ns -0.25%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24508.400 ns 24403.500000 ns -0.43%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 143723.000 ns 141395.000000 ns -1.62%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4287.140 ns 4203.400000 ns -1.95%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 139385.000000 ns 141671.000 ns 1.64%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 616923.000000 ns 622760.000 ns 0.95%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 74303.300000 ns 74577.400 ns 0.37%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy 670951.000 ns 666557.000000 ns -0.65%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 31618.300000 ns 32027.100 ns 1.29%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25898.200 ns 25677.800000 ns -0.85%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> 60407.500 ns 59857.400000 ns -0.91%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy 132797.000 ns 131430.000000 ns -1.03%

Details

Benchmark details contain too many chars to display

@@ -18,7 +18,7 @@ class Result:
     stdout: str
     passed: bool = True
     unit: str = ""
-    explicit_group: str = ""
+    explicit_group: str = "Ungrouped"
Contributor

The HTML output interprets anything other than "" as a group (see https://github.com/oneapi-src/unified-runtime/blob/main/scripts/benchmarks/output_html.py#L117), and every explicit group is shown together on a bar chart.
So this needs to stay as "".
My suggestion is to use "Others" in the markdown output when "" is specified.
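For illustration, a minimal sketch of that suggestion (the helper name is hypothetical; it assumes the markdown writer receives the stored group name unchanged):

```python
def markdown_group_name(explicit_group: str) -> str:
    # The stored value must remain "" so the HTML output still treats the
    # result as ungrouped; only the markdown view substitutes a label.
    return explicit_group if explicit_group else "Others"
```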

@pbalcer pbalcer left a comment

lgtm, just a couple of nits...

parser.add_argument("--dry-run", help='Do not run any actual benchmarks', action="store_true", default=False)
parser.add_argument("--compute-runtime", nargs='?', const=options.compute_runtime_tag, help="Fetch and build compute runtime")
parser.add_argument("--iterations-stddev", type=int, help="Max number of iterations of the loop calculating stddev after completed benchmark runs", default=options.iterations_stddev)
parser.add_argument("--build-igc", help="Build IGC from source instead of using the OS-installed version", action="store_true", default=options.build_igc)
parser.add_argument("--relative-perf", type=str, help="The name of the results which should be used as a baseline for metrics calculation", default=options.current_run_name)
parser.add_argument("--new-base-name", help="New name of the default baseline to compare", type=str, default='')
Contributor

Hm, if we need this, let's just remove the default compare. E.g., when nothing is specified, we don't compare at all.
This will eliminate the need for this option.
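A sketch of what that could look like (an assumption about the intended behavior, not the PR's code: `--compare` collects zero or more saved result names, and comparison is simply skipped when none are given):

```python
import argparse

parser = argparse.ArgumentParser()
# No implicit default baseline: omitting --compare yields an empty list.
parser.add_argument("--compare", action="append", default=[],
                    help="Names of saved results to compare against")

args = parser.parse_args()
if not args.compare:
    print("No baseline specified; skipping comparison.")
```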

(x.diff is not None, x.diff), reverse=True)

# Geometric mean calculation
product = 1.0
Contributor

this appears to be unused?
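For context, a geometric mean over per-benchmark ratios is usually computed like this (a self-contained sketch, not the PR's code; `ratios` is a hypothetical list of baseline/current ratios):

```python
import math

def geometric_mean(ratios):
    # nth root of the product; summing logs avoids overflow with many terms.
    if not ratios:
        return None
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(geometric_mean([1.05, 0.98, 1.12]))  # ≈ 1.048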

Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13368284077

Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13368284077
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 20 (threshold 2.00%)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 207.541000 ns 254.262 ns 22.51%
USM_Allocation_latency_fp32_device 0.064000 ms 0.075 ms 17.19%
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 1918.930000 ns 2237.740 ns 16.61%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4427.100000 ns 4801.330 ns 8.45%
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2627.630000 ns 2828.790 ns 7.66%
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 96.979700 ns 103.858 ns 7.09%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24340.100000 ns 25904.100 ns 6.43%
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3413.760000 ns 3598.790 ns 5.42%
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.667000 ms 4.867 ms 4.29%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 295.763000 ns 308.311 ns 4.24%
api_overhead_benchmark_l0 SubmitKernel out of order 11.410000 μs 11.854 μs 3.89%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 142882.000000 ns 148394.000 ns 3.86%
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.531000 μs 5.739 μs 3.76%
MolecularDynamics 0.030000 ms 0.031 ms 3.33%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1747590.000000 ns 1805140.000 ns 3.29%
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.652000 ms 4.803 ms 3.25%
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 212.352000 ns 218.316 ns 2.81%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 754.576000 ns 775.151 ns 2.73%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25140.600000 ns 25814.900 ns 2.68%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 208239.000000 ns 213417.000 ns 2.49%
Regressed 14 (threshold 2.00%)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_shared 0.064 ms 0.053000 ms -17.19%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1316330.000 ns 1161880.000000 ns -11.73%
VectorAddition_fp32 1.558 ms 1.439000 ms -7.64%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3511.250 ns 3272.820000 ns -6.79%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1303400.000 ns 1229410.000000 ns -5.68%
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1279.670 ns 1211.730000 ns -5.31%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 287.453 ms 273.621000 ms -4.81%
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 272.439 ns 261.712000 ns -3.94%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 984.997 ns 948.150000 ns -3.74%
Runtime_IndependentDAGTaskThroughput_SingleTask 265.490 ms 257.665000 ms -2.95%
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 196.068 ns 190.897000 ns -2.64%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32410.400 ns 31644.300000 ns -2.36%
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 55.146 μs 53.871000 μs -2.31%
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.067 ms 1.044000 ms -2.16%

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)
Benchmark This PR baseline Change
api_overhead_benchmark_l0 SubmitKernel out of order 11.410000 μs 11.854 μs 3.89%
api_overhead_benchmark_l0 SubmitKernel in order 11.493000 μs 11.662 μs 1.47%
api_overhead_benchmark_ur SubmitKernel out of order 15.536000 μs 15.729 μs 1.24%
api_overhead_benchmark_ur SubmitKernel in order with measure completion 21.048000 μs 21.172 μs 0.59%
api_overhead_benchmark_ur SubmitKernel in order 16.135000 μs 16.210 μs 0.46%
api_overhead_benchmark_sycl SubmitKernel in order 24.223 μs 24.195000 μs -0.12%
api_overhead_benchmark_sycl SubmitKernel out of order 22.892 μs 22.788000 μs -0.45%
Relative perf in group Ungrouped (17)
Benchmark This PR baseline Change
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.531000 μs 5.739 μs 3.76%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 25560.605000 μs 25995.334 μs 1.70%
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events 40807.955000 μs 41326.322 μs 1.27%
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.124000 μs 2.150 μs 1.22%
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 2047.112000 μs 2070.294 μs 1.13%
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 133.418000 μs 134.462 μs 0.78%
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events 110757.927000 μs 111607.803 μs 0.77%
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 3.196000 GB/s 3.172 GB/s 0.76%
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 250.255000 μs 251.433 μs 0.47%
miscellaneous_benchmark_sycl VectorSum 858.609000 bw GB/s 861.843 bw GB/s 0.38%
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 17225.687000 μs 17286.218 μs 0.35%
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 8730.538000 μs 8738.599 μs 0.09%
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1190.110000 μs 1190.999 μs 0.07%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 6942.625 μs 6930.233000 μs -0.18%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 7494.870 μs 7473.513000 μs -0.28%
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.698 μs 1.684000 μs -0.82%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 48319.957 μs 47693.770000 μs -1.30%
Relative perf in group SinKernelGraph (4)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 72528.102000 μs 72693.457 μs 0.23%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 353403.030000 μs 353572.971 μs 0.05%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 71739.442 μs 71736.788000 μs -0.00%
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 353232.660 μs 353031.159000 μs -0.06%
Relative perf in group SubmitGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 61.877 μs 61.705000 μs -0.28%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 674.803 μs 672.596000 μs -0.33%
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 55.146 μs 53.871000 μs -2.31%
Relative perf in group ExecGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 5578.026000 μs 5585.971 μs 0.14%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 56490.231000 μs 56544.632 μs 0.10%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 5600.265 μs 5596.380000 μs -0.07%
Relative perf in group SubmitKernel CPU count (3)
Benchmark This PR baseline Change
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count 122806.000000 instr 123120.000 instr 0.26%
api_overhead_benchmark_ur SubmitKernel out of order CPU count 104593.000000 instr 104593.000 instr 0.00%
api_overhead_benchmark_ur SubmitKernel in order CPU count 109936.000000 instr 109936.000 instr 0.00%
Velocity Bench
Relative perf in group Ungrouped (8)
Benchmark This PR baseline Change
Velocity-Bench Hashtable 360.146407 M keys/sec 357.844 M keys/sec 0.64%
Velocity-Bench dl-cifar 23.751700 s 23.846 s 0.40%
Velocity-Bench CudaSift 202.817000 ms 203.111 ms 0.14%
Velocity-Bench Bitcracker 35.506100 s 35.521 s 0.04%
Velocity-Bench svm 0.135 s 0.134400 s -0.30%
Velocity-Bench QuickSilver 117.830 MMS/CTT 118.240000 MMS/CTT -0.35%
Velocity-Bench dl-mnist 2.740 s 2.730000 s -0.36%
Velocity-Bench Sobel Filter 607.475 ms 595.796000 ms -1.92%
SYCL-Bench
Relative perf in group Ungrouped (54)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_device 0.064000 ms 0.075 ms 17.19%
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.667000 ms 4.867 ms 4.29%
MolecularDynamics 0.030000 ms 0.031 ms 3.33%
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.652000 ms 4.803 ms 3.25%
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.760000 ms 4.844 ms 1.76%
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.909000 ms 4.992 ms 1.69%
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.824000 ms 4.881 ms 1.18%
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 272.333000 ms 275.524 ms 1.17%
Pattern_Reduction_NDRange_int32 16.495000 ms 16.656 ms 0.98%
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 271.866000 ms 273.979 ms 0.78%
ScalarProduct_NDRange_int32 3.766000 ms 3.792 ms 0.69%
ScalarProduct_NDRange_int64 5.435000 ms 5.462 ms 0.50%
VectorAddition_int64 3.057000 ms 3.069 ms 0.39%
Pattern_Reduction_Hierarchical_int32 16.898000 ms 16.961 ms 0.37%
LinearRegressionCoeff_fp32 912.509000 ms 915.761 ms 0.36%
ScalarProduct_Hierarchical_int64 11.493000 ms 11.518 ms 0.22%
Pattern_SegmentedReduction_NDRange_fp32 2.164000 ms 2.166 ms 0.09%
Pattern_SegmentedReduction_NDRange_int64 2.335000 ms 2.337 ms 0.09%
Polybench_3mm 1.479000 ms 1.480 ms 0.07%
MicroBench_LocalMem_fp32_4096 29.856000 ms 29.866 ms 0.03%
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 617.505000 ms 617.565 ms 0.01%
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 617.523000 ms 617.571 ms 0.01%
Pattern_SegmentedReduction_NDRange_int16 2.265000 ms 2.265 ms 0.00%
Pattern_SegmentedReduction_NDRange_int32 2.163000 ms 2.163 ms 0.00%
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 616.937 ms 616.925000 ms -0.00%
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.078 ms 617.060000 ms -0.00%
Pattern_SegmentedReduction_Hierarchical_fp32 11.590 ms 11.589000 ms -0.01%
Pattern_SegmentedReduction_Hierarchical_int32 11.588 ms 11.584000 ms -0.03%
Pattern_SegmentedReduction_Hierarchical_int16 11.808 ms 11.800000 ms -0.07%
ScalarProduct_Hierarchical_fp32 10.176 ms 10.169000 ms -0.07%
USM_Allocation_latency_fp32_host 37.526 ms 37.499000 ms -0.07%
Pattern_SegmentedReduction_Hierarchical_int64 11.778 ms 11.767000 ms -0.09%
Polybench_Atax 6.426 ms 6.418000 ms -0.12%
ScalarProduct_Hierarchical_int32 10.528 ms 10.513000 ms -0.14%
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.740 ms 4.733000 ms -0.15%
MicroBench_LocalMem_int32_4096 29.924 ms 29.875000 ms -0.16%
Runtime_DAGTaskThroughput_BasicParallelFor 1757.745 ms 1751.649000 ms -0.35%
Kmeans_fp32 14.141 ms 14.089000 ms -0.37%
Polybench_2mm 1.047 ms 1.043000 ms -0.38%
ScalarProduct_NDRange_fp32 3.765 ms 3.750000 ms -0.40%
Runtime_DAGTaskThroughput_NDRangeParallelFor 1700.697 ms 1692.563000 ms -0.48%
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 5.113 ms 5.087000 ms -0.51%
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 5.223 ms 5.195000 ms -0.54%
Runtime_DAGTaskThroughput_SingleTask 1688.159 ms 1676.869000 ms -0.67%
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1743.627 ms 1731.836000 ms -0.68%
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.836 ms 1.821000 ms -0.82%
VectorAddition_int32 1.491 ms 1.471000 ms -1.34%
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.217 ms 1.197000 ms -1.64%
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.712 ms 1.678000 ms -1.99%
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.067 ms 1.044000 ms -2.16%
Runtime_IndependentDAGTaskThroughput_SingleTask 265.490 ms 257.665000 ms -2.95%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 287.453 ms 273.621000 ms -4.81%
VectorAddition_fp32 1.558 ms 1.439000 ms -7.64%
USM_Allocation_latency_fp32_shared 0.064 ms 0.053000 ms -17.19%
llama.cpp bench
Relative perf in group Ungrouped (6)
Benchmark This PR baseline Change
llama.cpp Prompt Processing Batched 512 423.309840 token/s 420.526 token/s 0.66%
llama.cpp Prompt Processing Batched 256 874.095863 token/s 870.669 token/s 0.39%
llama.cpp Text Generation Batched 128 62.565562 token/s 62.536 token/s 0.05%
llama.cpp Text Generation Batched 256 62.560031 token/s 62.540 token/s 0.03%
llama.cpp Text Generation Batched 512 62.487 token/s 62.526848 token/s -0.06%
llama.cpp Prompt Processing Batched 128 830.491 token/s 837.632259 token/s -0.85%
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 1918.930000 ns 2237.740 ns 16.61%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4427.100000 ns 4801.330 ns 8.45%
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2627.630000 ns 2828.790 ns 7.66%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 295.763000 ns 308.311 ns 4.24%
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy 134.742000 ns 136.220 ns 1.10%
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3133.630000 ns 3157.440 ns 0.76%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3511.250 ns 3272.820000 ns -6.79%
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 96.979700 ns 103.858 ns 7.09%
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 212.352000 ns 218.316 ns 2.81%
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 274.244000 ns 277.043 ns 1.02%
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 499.595000 ns 500.397 ns 0.16%
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 709.164 ns 705.745000 ns -0.48%
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 120.133 ns 118.361000 ns -1.48%
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 196.068 ns 190.897000 ns -2.64%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3413.760000 ns 3598.790 ns 5.42%
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4657.030000 ns 4669.200 ns 0.26%
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 2001.790000 ns 2005.880 ns 0.20%
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy 132.027 ns 131.939000 ns -0.07%
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3375.770 ns 3334.700000 ns -1.22%
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 272.439 ns 261.712000 ns -3.94%
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1279.670 ns 1211.730000 ns -5.31%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 207.541000 ns 254.262 ns 22.51%
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 499.779000 ns 506.641 ns 1.37%
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 300.482000 ns 302.371 ns 0.63%
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 191.603000 ns 192.581 ns 0.51%
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy 202.400000 ns 203.050 ns 0.32%
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.393 ns 119.332000 ns -0.05%
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 756.415 ns 752.618000 ns -0.50%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 754.576000 ns 775.151 ns 2.73%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 855.214000 ns 869.820 ns 1.71%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> 4114.110000 ns 4136.040 ns 0.53%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 984.997 ns 948.150000 ns -3.74%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 174.742000 ns 176.480 ns 0.99%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> 348.183000 ns 351.644 ns 0.99%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 609.229000 ns 613.537 ns 0.71%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 966.426000 ns 966.457 ns 0.00%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1747590.000000 ns 1805140.000 ns 3.29%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy 708107.000000 ns 712838.000 ns 0.67%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> 517037.000000 ns 518305.000 ns 0.25%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 46936.600 ns 46813.200000 ns -0.26%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32410.400 ns 31644.300000 ns -2.36%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1303400.000 ns 1229410.000000 ns -5.68%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1316330.000 ns 1161880.000000 ns -11.73%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24340.100000 ns 25904.100 ns 6.43%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 142882.000000 ns 148394.000 ns 3.86%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 208239.000000 ns 213417.000 ns 2.49%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 164225.000000 ns 165995.000 ns 1.08%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15395.600000 ns 15536.500 ns 0.92%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4236.460000 ns 4259.720 ns 0.55%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy 117484.000000 ns 118110.000 ns 0.53%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 633790.000000 ns 641556.000 ns 1.23%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 139976.000000 ns 140363.000 ns 0.28%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy 666338.000000 ns 667489.000 ns 0.17%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 75331.300 ns 75247.200000 ns -0.11%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25140.600000 ns 25814.900 ns 2.68%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> 59600.900000 ns 60016.500 ns 0.70%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 31241.900 ns 31174.000000 ns -0.22%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy 132302.000 ns 131884.000000 ns -0.32%

Details

Benchmark details contain too many chars to display

Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13369553246

Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13369553246
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 21 (threshold 2.00%)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_device 0.051000 ms 0.075 ms 47.06%
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1746.340000 ns 2005.880 ns 14.86%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 682.180000 ns 775.151 ns 13.63%
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2495.030000 ns 2828.790 ns 13.38%
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3296.830000 ns 3598.790 ns 9.16%
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 2934.820000 ns 3157.440 ns 7.59%
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 97.025600 ns 103.858 ns 7.04%
MolecularDynamics 0.029000 ms 0.031 ms 6.90%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24528.300000 ns 25904.100 ns 5.61%
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2122.800000 ns 2237.740 ns 5.41%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4599.240000 ns 4801.330 ns 4.39%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1187100.000000 ns 1229410.000 ns 3.56%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15002.000000 ns 15536.500 ns 3.56%
Polybench_Atax 6.224000 ms 6.418 ms 3.12%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 595.246000 ns 613.537 ns 3.07%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 299.300000 ns 308.311 ns 3.01%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 623116.000000 ns 641556.000 ns 2.96%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 846.881000 ns 869.820 ns 2.71%
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.097000 μs 2.150 μs 2.53%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 73415.400000 ns 75247.200 ns 2.50%
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.771000 ms 4.881 ms 2.31%
Regressed 11 (threshold 2.00%)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_shared 0.064 ms 0.053000 ms -17.19%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 1114.640 ns 948.150000 ns -14.94%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3478.340 ns 3272.820000 ns -5.91%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 286.523 ms 273.621000 ms -4.50%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 33106.700 ns 31644.300000 ns -4.42%
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 264.756 ns 254.262000 ns -3.96%
Runtime_IndependentDAGTaskThroughput_SingleTask 267.557 ms 257.665000 ms -3.70%
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 284.400 ns 277.043000 ns -2.59%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1190810.000 ns 1161880.000000 ns -2.43%
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 309.859 ns 302.371000 ns -2.42%
VectorAddition_fp32 1.470 ms 1.439000 ms -2.11%

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)
Benchmark This PR baseline Change
api_overhead_benchmark_ur SubmitKernel out of order 15.459000 μs 15.729 μs 1.75%
api_overhead_benchmark_ur SubmitKernel in order with measure completion 20.962000 μs 21.172 μs 1.00%
api_overhead_benchmark_l0 SubmitKernel out of order 11.781000 μs 11.854 μs 0.62%
api_overhead_benchmark_sycl SubmitKernel in order 24.135000 μs 24.195 μs 0.25%
api_overhead_benchmark_sycl SubmitKernel out of order 22.870 μs 22.788000 μs -0.36%
api_overhead_benchmark_ur SubmitKernel in order 16.323 μs 16.210000 μs -0.69%
api_overhead_benchmark_l0 SubmitKernel in order 11.845 μs 11.662000 μs -1.54%
Relative perf in group Other (17)
Benchmark This PR baseline Change
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.097000 μs 2.150 μs 2.53%
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events 40751.379000 μs 41326.322 μs 1.41%
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events 110280.410000 μs 111607.803 μs 1.20%
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 3.208000 GB/s 3.172 GB/s 1.13%
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.683000 μs 5.739 μs 0.99%
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 250.160000 μs 251.433 μs 0.51%
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 133.862000 μs 134.462 μs 0.45%
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 8711.600000 μs 8738.599 μs 0.31%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 47595.884000 μs 47693.770 μs 0.21%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 7464.570000 μs 7473.513 μs 0.12%
miscellaneous_benchmark_sycl VectorSum 861.253000 bw GB/s 861.843 bw GB/s 0.07%
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1190.553000 μs 1190.999 μs 0.04%
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 17281.431000 μs 17286.218 μs 0.03%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 26030.833 μs 25995.334000 μs -0.14%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 6962.616 μs 6930.233000 μs -0.47%
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 2101.985 μs 2070.294000 μs -1.51%
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.713 μs 1.684000 μs -1.69%
Relative perf in group SinKernelGraph (4)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 353373.123000 μs 353572.971 μs 0.06%
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 72659.797000 μs 72693.457 μs 0.05%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 71747.951 μs 71736.788000 μs -0.02%
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 353253.507 μs 353031.159000 μs -0.06%
Relative perf in group SubmitGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 673.919 μs 672.596000 μs -0.20%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 62.121 μs 61.705000 μs -0.67%
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 54.589 μs 53.871000 μs -1.32%
Relative perf in group ExecGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 56454.829000 μs 56544.632 μs 0.16%
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 5583.459000 μs 5585.971 μs 0.04%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 5597.457 μs 5596.380000 μs -0.02%
Relative perf in group SubmitKernel CPU count (3)
Benchmark This PR baseline Change
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count 122806.000000 instr 123120.000 instr 0.26%
api_overhead_benchmark_ur SubmitKernel out of order CPU count 104593.000000 instr 104593.000 instr 0.00%
api_overhead_benchmark_ur SubmitKernel in order CPU count 109936.000000 instr 109936.000 instr 0.00%
Velocity Bench
Relative perf in group Other (8)
Benchmark This PR baseline Change
Velocity-Bench dl-mnist 2.700000 s 2.730 s 1.11%
Velocity-Bench Hashtable 359.868959 M keys/sec 357.844 M keys/sec 0.57%
Velocity-Bench QuickSilver 118.440000 MMS/CTT 118.240 MMS/CTT 0.17%
Velocity-Bench svm 0.134200 s 0.134 s 0.15%
Velocity-Bench dl-cifar 23.818200 s 23.846 s 0.12%
Velocity-Bench CudaSift 203.002000 ms 203.111 ms 0.05%
Velocity-Bench Bitcracker 35.512700 s 35.521 s 0.02%
Velocity-Bench Sobel Filter 606.108 ms 595.796000 ms -1.70%
SYCL-Bench
Relative perf in group Other (54)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_device 0.051000 ms 0.075 ms 47.06%
MolecularDynamics 0.029000 ms 0.031 ms 6.90%
Polybench_Atax 6.224000 ms 6.418 ms 3.12%
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.771000 ms 4.881 ms 2.31%
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.773000 ms 4.867 ms 1.97%
VectorAddition_int32 1.447000 ms 1.471 ms 1.66%
ScalarProduct_NDRange_int32 3.750000 ms 3.792 ms 1.12%
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.797000 ms 4.844 ms 0.98%
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 5.039000 ms 5.087 ms 0.95%
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 5.150000 ms 5.195 ms 0.87%
Pattern_Reduction_Hierarchical_int32 16.828000 ms 16.961 ms 0.79%
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.968000 ms 4.992 ms 0.48%
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.712000 ms 4.733 ms 0.45%
ScalarProduct_NDRange_int64 5.439000 ms 5.462 ms 0.42%
Polybench_2mm 1.039000 ms 1.043 ms 0.38%
ScalarProduct_Hierarchical_int64 11.474000 ms 11.518 ms 0.38%
VectorAddition_int64 3.061000 ms 3.069 ms 0.26%
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.791000 ms 4.803 ms 0.25%
ScalarProduct_Hierarchical_fp32 10.145000 ms 10.169 ms 0.24%
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.675000 ms 1.678 ms 0.18%
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 273.506000 ms 273.979 ms 0.17%
MicroBench_LocalMem_fp32_4096 29.819000 ms 29.866 ms 0.16%
Pattern_SegmentedReduction_NDRange_int16 2.262000 ms 2.265 ms 0.13%
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 275.267000 ms 275.524 ms 0.09%
Pattern_SegmentedReduction_NDRange_fp32 2.164000 ms 2.166 ms 0.09%
Pattern_SegmentedReduction_NDRange_int64 2.335000 ms 2.337 ms 0.09%
Kmeans_fp32 14.078000 ms 14.089 ms 0.08%
Pattern_SegmentedReduction_Hierarchical_fp32 11.583000 ms 11.589 ms 0.05%
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 616.892000 ms 616.925 ms 0.01%
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 617.538000 ms 617.571 ms 0.01%
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 617.549000 ms 617.565 ms 0.00%
Pattern_SegmentedReduction_NDRange_int32 2.163000 ms 2.163 ms 0.00%
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.065 ms 617.060000 ms -0.00%
Pattern_SegmentedReduction_Hierarchical_int16 11.801 ms 11.800000 ms -0.01%
USM_Allocation_latency_fp32_host 37.509 ms 37.499000 ms -0.03%
Pattern_SegmentedReduction_Hierarchical_int32 11.590 ms 11.584000 ms -0.05%
Runtime_DAGTaskThroughput_NDRangeParallelFor 1693.687 ms 1692.563000 ms -0.07%
Pattern_SegmentedReduction_Hierarchical_int64 11.775 ms 11.767000 ms -0.07%
MicroBench_LocalMem_int32_4096 29.915 ms 29.875000 ms -0.13%
Polybench_3mm 1.482 ms 1.480000 ms -0.13%
ScalarProduct_Hierarchical_int32 10.533 ms 10.513000 ms -0.19%
Runtime_DAGTaskThroughput_BasicParallelFor 1756.541 ms 1751.649000 ms -0.28%
LinearRegressionCoeff_fp32 918.741 ms 915.761000 ms -0.32%
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.827 ms 1.821000 ms -0.33%
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1738.041 ms 1731.836000 ms -0.36%
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.048 ms 1.044000 ms -0.38%
ScalarProduct_NDRange_fp32 3.765 ms 3.750000 ms -0.40%
Runtime_DAGTaskThroughput_SingleTask 1686.891 ms 1676.869000 ms -0.59%
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.210 ms 1.197000 ms -1.07%
Pattern_Reduction_NDRange_int32 16.878 ms 16.656000 ms -1.32%
VectorAddition_fp32 1.470 ms 1.439000 ms -2.11%
Runtime_IndependentDAGTaskThroughput_SingleTask 267.557 ms 257.665000 ms -3.70%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 286.523 ms 273.621000 ms -4.50%
USM_Allocation_latency_fp32_shared 0.064 ms 0.053000 ms -17.19%
llama.cpp bench
Relative perf in group Other (6)
Benchmark This PR baseline Change
llama.cpp Prompt Processing Batched 512 421.813576 token/s 420.526 token/s 0.31%
llama.cpp Text Generation Batched 128 62.502 token/s 62.536385 token/s -0.05%
llama.cpp Text Generation Batched 256 62.480 token/s 62.539887 token/s -0.10%
llama.cpp Text Generation Batched 512 62.460 token/s 62.526848 token/s -0.11%
llama.cpp Prompt Processing Batched 128 831.944 token/s 837.632259 token/s -0.68%
llama.cpp Prompt Processing Batched 256 862.763 token/s 870.668720 token/s -0.91%
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2495.030000 ns 2828.790 ns 13.38%
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 2934.820000 ns 3157.440 ns 7.59%
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2122.800000 ns 2237.740 ns 5.41%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4599.240000 ns 4801.330 ns 4.39%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 299.300000 ns 308.311 ns 3.01%
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy 137.299 ns 136.220000 ns -0.79%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3478.340 ns 3272.820000 ns -5.91%
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 97.025600 ns 103.858 ns 7.04%
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 493.060000 ns 500.397 ns 1.49%
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 701.873000 ns 705.745 ns 0.55%
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 217.504000 ns 218.316 ns 0.37%
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 190.852000 ns 190.897 ns 0.02%
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.367 ns 118.361000 ns -0.84%
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 284.400 ns 277.043000 ns -2.59%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1746.340000 ns 2005.880 ns 14.86%
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3296.830000 ns 3598.790 ns 9.16%
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 258.644000 ns 261.712 ns 1.19%
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1198.800000 ns 1211.730 ns 1.08%
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4648.820000 ns 4669.200 ns 0.44%
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3371.340 ns 3334.700000 ns -1.09%
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy 134.341 ns 131.939000 ns -1.79%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy 201.647000 ns 203.050 ns 0.70%
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 118.890000 ns 119.332 ns 0.37%
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 753.010 ns 752.618000 ns -0.05%
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 507.982 ns 506.641000 ns -0.26%
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 196.427 ns 192.581000 ns -1.96%
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 309.859 ns 302.371000 ns -2.42%
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 264.756 ns 254.262000 ns -3.96%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 682.180000 ns 775.151 ns 13.63%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 846.881000 ns 869.820 ns 2.71%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> 4211.600 ns 4136.040000 ns -1.79%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 1114.640 ns 948.150000 ns -14.94%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 595.246000 ns 613.537 ns 3.07%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 959.568000 ns 966.457 ns 0.72%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> 351.671 ns 351.644000 ns -0.01%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 176.597 ns 176.480000 ns -0.07%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1187100.000000 ns 1229410.000 ns 3.56%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy 704649.000000 ns 712838.000 ns 1.16%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1794930.000000 ns 1805140.000 ns 0.57%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> 518163.000000 ns 518305.000 ns 0.03%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 46921.100 ns 46813.200000 ns -0.23%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1190810.000 ns 1161880.000000 ns -2.43%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 33106.700 ns 31644.300000 ns -4.42%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24528.300000 ns 25904.100 ns 5.61%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15002.000000 ns 15536.500 ns 3.56%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 163353.000000 ns 165995.000 ns 1.62%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 146586.000000 ns 148394.000 ns 1.23%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy 117076.000000 ns 118110.000 ns 0.88%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4288.090 ns 4259.720000 ns -0.66%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 215507.000 ns 213417.000000 ns -0.97%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 623116.000000 ns 641556.000 ns 2.96%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 73415.400000 ns 75247.200 ns 2.50%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy 659960.000000 ns 667489.000 ns 1.14%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 140482.000 ns 140363.000000 ns -0.08%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25632.900000 ns 25814.900 ns 0.71%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy 131839.000000 ns 131884.000 ns 0.03%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> 60016.500000 ns 60016.500 ns 0.00%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 31185.100 ns 31174.000000 ns -0.04%

Details

Benchmark details contain too many chars to display
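
For reference, the Change column in the tables above is consistent with a relative change computed against this PR's result, so positive values mean the PR improved on the saved run for lower-is-better metrics. A minimal sketch of that calculation (the function name and the lower-is-better assumption are mine, not taken from the PR's code):

```python
def relative_change_percent(this_pr: float, baseline: float) -> float:
    """Relative perf change, positive when this PR improves on the baseline.

    Illustrative only: assumes a lower-is-better metric such as time in ns,
    and is not the PR's actual implementation.
    """
    return (baseline / this_pr - 1.0) * 100.0

# Matches the multiple_malloc_free/threads:1 jemalloc_pool row above:
# relative_change_percent(24528.3, 25904.1) -> ~5.61
```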

@EuphoricThinking force-pushed the benchmark_markdown branch 3 times, most recently from 36ff0a5 to 7028a34 on February 17, 2025 at 15:21
@@ -37,11 +37,16 @@ By default, the benchmark results are not stored. To store them, use the option

To compare a benchmark run with a previously stored result, use the option `--compare <name>`. You can compare with more than one result.

Above there's a sentence, "By default, all benchmark runs are compared against baseline, which is a well-established set of the latest data.", which should now be gone, I believe.
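
As a usage illustration of the `--compare` option discussed here (the run names are hypothetical, and the script path and its other required arguments are assumed, not taken from this PR):

```bash
# Hypothetical invocations; other required arguments elided.
python main.py ... --save baseline-v1
python main.py ... --compare baseline-v1 --compare experiment-a
```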

## Output formats
You can display the results as an HTML file by using `--output-html` and as a markdown file by using `--output-markdown`. Due to character limits for posting PR comments, the final content of the markdown file might be reduced. In order to obtain the full markdown output, use `--output-markdown full`.


one redundant empty line
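
A quick sketch of the output modes described above (the script path and other required arguments are assumed here, not taken from this PR):

```bash
# Hypothetical invocations; other required arguments elided.
python main.py ... --output-html
python main.py ... --output-markdown        # size-limited, fits a PR comment
python main.py ... --output-markdown full   # complete, untruncated output
```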

# Generate the row with the best value highlighted
# Generate the row with all the results from saved runs specified by
# --compare,
# Highight the best value in the row with data
still a misspell ;d
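
For context on what the commented code does, here is a minimal sketch of building one markdown table row in which the best of several compared results is emphasized; all names are illustrative and this is not the PR's implementation:

```python
def build_row(benchmark: str, results_ns: list[float]) -> str:
    """Render one markdown table row, bolding the best (lowest) value.

    Illustrative only: assumes lower-is-better metrics and one column per
    saved run passed via --compare.
    """
    best = min(results_ns)
    cells = [
        f"**{value:.3f} ns**" if value == best else f"{value:.3f} ns"
        for value in results_ns
    ]
    return "| " + " | ".join([benchmark, *cells]) + " |"

# Example:
# build_row("multiple_malloc_free/.../threads:1 glibc", [4328.22, 4259.72])
```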

add an option for limiting markdown content size
calculate relative performance with different baselines
calculate relative performance using only already saved data
group results according to suite names and explicit groups
add multiple data columns if multiple --compare specified

Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13374078367

Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13374078367
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)
No diffs to calculate performance change

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)
Benchmark This PR
api_overhead_benchmark_l0 SubmitKernel out of order 11.776000 μs
api_overhead_benchmark_l0 SubmitKernel in order 11.922000 μs
api_overhead_benchmark_sycl SubmitKernel out of order 22.947000 μs
api_overhead_benchmark_sycl SubmitKernel in order 24.499000 μs
api_overhead_benchmark_ur SubmitKernel out of order 15.787000 μs
api_overhead_benchmark_ur SubmitKernel in order 16.459000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion 21.155000 μs
Relative perf in group Other (17)
Benchmark This PR
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 255.950000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 134.070000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.679000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 3.169000 GB/s
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.160000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.730000 μs
miscellaneous_benchmark_sycl VectorSum 856.272000 bw GB/s
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 6922.979000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 17208.941000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 48159.835000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 2087.698000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 7556.543000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 8684.467000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 25857.625000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1195.338000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events 40981.923000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events 111831.648000 μs
Relative perf in group SinKernelGraph (4)
Benchmark This PR
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 71729.094000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 72595.653000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 353444.654000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 353397.056000 μs
Relative perf in group SubmitGraph (3)
Benchmark This PR
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 54.338000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 62.270000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 677.416000 μs
Relative perf in group ExecGraph (3)
Benchmark This PR
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 5597.346000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 5611.676000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 56483.549000 μs
Relative perf in group SubmitKernel CPU count (3)
Benchmark This PR
api_overhead_benchmark_ur SubmitKernel out of order CPU count 104593.000000 instr
api_overhead_benchmark_ur SubmitKernel in order CPU count 109936.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count 122806.000000 instr
Velocity Bench
Relative perf in group Other (8)
Benchmark This PR
Velocity-Bench Hashtable 357.020483 M keys/sec
Velocity-Bench Bitcracker 35.410900 s
Velocity-Bench CudaSift 202.614000 ms
Velocity-Bench QuickSilver 117.350000 MMS/CTT
Velocity-Bench Sobel Filter 608.239000 ms
Velocity-Bench dl-cifar 23.907000 s
Velocity-Bench dl-mnist 2.740000 s
Velocity-Bench svm 0.135400 s
SYCL-Bench
Relative perf in group Other (53)
Benchmark This PR
Runtime_IndependentDAGTaskThroughput_SingleTask 268.524000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 293.891000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 276.167000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 278.026000 ms
Runtime_DAGTaskThroughput_SingleTask 1697.255000 ms
Runtime_DAGTaskThroughput_BasicParallelFor 1767.704000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1751.062000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor 1711.114000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.848000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.858000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.825000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.756000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 617.570000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 617.605000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.764000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 5.090000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 4.909000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.934000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.027000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 616.918000 ms
MicroBench_LocalMem_int32_4096 29.927000 ms
MicroBench_LocalMem_fp32_4096 29.881000 ms
Pattern_Reduction_NDRange_int32 16.968000 ms
Pattern_Reduction_Hierarchical_int32 16.685000 ms
ScalarProduct_NDRange_int32 3.760000 ms
ScalarProduct_NDRange_int64 5.458000 ms
ScalarProduct_NDRange_fp32 3.741000 ms
ScalarProduct_Hierarchical_int32 10.530000 ms
ScalarProduct_Hierarchical_int64 11.497000 ms
ScalarProduct_Hierarchical_fp32 10.164000 ms
Pattern_SegmentedReduction_NDRange_int16 2.265000 ms
Pattern_SegmentedReduction_NDRange_int32 2.161000 ms
Pattern_SegmentedReduction_NDRange_int64 2.333000 ms
Pattern_SegmentedReduction_NDRange_fp32 2.167000 ms
Pattern_SegmentedReduction_Hierarchical_int16 11.804000 ms
Pattern_SegmentedReduction_Hierarchical_int32 11.587000 ms
Pattern_SegmentedReduction_Hierarchical_int64 11.770000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 11.588000 ms
USM_Allocation_latency_fp32_device 0.064000 ms
USM_Allocation_latency_fp32_host 37.563000 ms
USM_Allocation_latency_fp32_shared 0.055000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.672000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.048000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.814000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.205000 ms
VectorAddition_int32 1.460000 ms
VectorAddition_int64 3.130000 ms
VectorAddition_fp32 1.486000 ms
Polybench_2mm 1.042000 ms
Polybench_3mm 1.480000 ms
Polybench_Atax 6.390000 ms
Kmeans_fp32 14.046000 ms
MolecularDynamics 0.030000 ms
llama.cpp bench
Relative perf in group Other (6)
Benchmark This PR
llama.cpp Prompt Processing Batched 128 827.954592 token/s
llama.cpp Text Generation Batched 128 62.510353 token/s
llama.cpp Prompt Processing Batched 256 868.599537 token/s
llama.cpp Text Generation Batched 256 62.527584 token/s
llama.cpp Prompt Processing Batched 512 420.065042 token/s
llama.cpp Text Generation Batched 512 62.487604 token/s
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)
Benchmark This PR
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2742.670000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2067.960000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3129.460000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4559.790000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3355.410000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 295.297000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy 135.108000 ns
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)
Benchmark This PR
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 707.619000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 188.767000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 275.108000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 495.126000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.030000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 214.015000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 97.362400 ns
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)
Benchmark This PR
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1444.760000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1980.130000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3256.670000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4472.580000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3525.540000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 262.952000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy 132.116000 ns
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)
Benchmark This PR
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 978.814000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 194.471000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 305.558000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 496.014000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 118.534000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 269.829000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy 201.943000 ns
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)
Benchmark This PR
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 854.820000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> 4153.800000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 1013.870000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 722.673000 ns
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)
Benchmark This PR
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 175.874000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> 345.899000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 969.503000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 614.514000 ns
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)
Benchmark This PR
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32389.300000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1183460.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1182330.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1740640.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> 512444.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 48513.600000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy 705291.000000 ns
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)
Benchmark This PR
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4328.220000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 164126.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 143840.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 213370.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24692.900000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15469.900000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy 117218.000000 ns
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)
Benchmark This PR
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 140408.000000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 623862.000000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 77464.100000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy 664466.000000 ns
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)
Benchmark This PR
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 30428.300000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> 60592.000000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25434.600000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy 131628.000000 ns

Details

Benchmark details contain too many chars to display

@pbalcer merged commit 8682bbc into oneapi-src:main on Feb 17, 2025
26 checks passed