
change markdown output in benchmark PR comments #2693

Merged
merged 1 commit on Feb 17, 2025

Conversation

@EuphoricThinking (Contributor) commented Feb 11, 2025

- add an option for limiting markdown content size
- calculate relative performance with different baselines
- calculate relative performance using only already saved data
- group results according to suite names and explicit groups
- add multiple data columns if multiple --compare options are specified
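For illustration, a minimal sketch of what the size-limit option could look like (hypothetical names; GitHub caps comment bodies at 65536 characters, which matches the "too many chars to display" notices further down):

```python
# Hypothetical sketch: cap generated markdown so the PR comment stays within
# GitHub's 65536-character comment body limit.
def limit_markdown_size(markdown: str, max_chars: int = 65536) -> str:
    notice = "\nBenchmark details contain too many chars to display"
    if len(markdown) <= max_chars:
        return markdown
    # Truncate, leaving room for the notice itself.
    return markdown[: max_chars - len(notice)] + notice
```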

An example of the previous output design

@EuphoricThinking EuphoricThinking requested a review from a team as a code owner February 11, 2025 20:31
@github-actions github-actions bot added the ci/cd Continuous integration/delivery label Feb 11, 2025
@bratpiorka (Contributor):

could you provide links/images to see the difference before and after this PR?


Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/13282111433

@pbalcer (Contributor) commented Feb 12, 2025

> could you provide links/images to see the difference before and after this PR?

We can just run the benchmark to see. I just triggered it.


Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/13282111433
Job status: failure. Test status: success.


Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13283071851


Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13283071851
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 10 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3516.390000 ns | 3830.530 ns | 8.93% |
| Velocity-Bench Bitcracker | 35.525100 s | 37.429 s | 5.36% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 704.395000 ns | 739.669 ns | 5.01% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy | 117.809000 ns | 123.101 ns | 4.49% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 204.816000 ns | 213.371 ns | 4.18% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15078.400000 ns | 15563.600 ns | 3.22% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4571.880000 ns | 4712.160 ns | 3.07% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 175.661000 ns | 180.705 ns | 2.87% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> | 595674.000000 ns | 611807.000 ns | 2.71% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 271.334000 ns | 278.282 ns | 2.56% |
Regressed 21 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2187.210 ns | 1997.540000 ns | -8.67% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy | 2698360.000 ns | 2543840.000000 ns | -5.73% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 556.946 ns | 525.156000 ns | -5.71% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy | 104.130 ns | 98.720200 ns | -5.20% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1064.810 ns | 1011.890000 ns | -4.97% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 33185.800 ns | 31815.400000 ns | -4.13% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2785.440 ns | 2682.620000 ns | -3.69% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 733.898 ns | 708.146000 ns | -3.51% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 56.137 μs | 54.383000 μs | -3.12% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3490.070 ns | 3381.380000 ns | -3.11% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 302.754 ns | 293.969000 ns | -2.90% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1895.690 ns | 1841.590000 ns | -2.85% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 280.195 ns | 272.569000 ns | -2.72% |
| api_overhead_benchmark_sycl SubmitKernel out of order | 23.760 μs | 23.120000 μs | -2.69% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 868.411 ns | 846.197000 ns | -2.56% |
| Velocity-Bench Sobel Filter | 606.957 ms | 591.724000 ms | -2.51% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17443.222 μs | 17019.124000 μs | -2.43% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | 63.528 μs | 62.093000 μs | -2.26% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 212.952 ns | 208.150000 ns | -2.25% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 714.715 ns | 699.118000 ns | -2.18% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1194150.000 ns | 1169220.000000 ns | -2.09% |

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7): 99.763%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_l0 SubmitKernel in order | 11.504000 μs | 11.682 μs | 1.55% |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.732000 μs | 15.870 μs | 0.88% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | 21.010000 μs | 21.193 μs | 0.87% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.453 μs | 16.386000 μs | -0.41% |
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.545 μs | 11.494000 μs | -0.44% |
| api_overhead_benchmark_sycl SubmitKernel in order | 24.468 μs | 24.138000 μs | -1.35% |
| api_overhead_benchmark_sycl SubmitKernel out of order | 23.760 μs | 23.120000 μs | -2.69% |

Relative perf in group (17): 99.671%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.103000 μs | 2.144 μs | 1.95% |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.764000 μs | 5.793 μs | 0.50% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | 48731.565000 μs | 48939.615 μs | 0.43% |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 133.266000 μs | 133.793 μs | 0.40% |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.186000 GB/s | 3.181 GB/s | 0.16% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.683000 μs | 1.684 μs | 0.06% |
| miscellaneous_benchmark_sycl VectorSum | 858.316000 bw GB/s | 858.609 bw GB/s | 0.03% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | 1205.576 μs | 1204.721000 μs | -0.07% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | 113099.997 μs | 112569.641000 μs | -0.47% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 8745.221 μs | 8695.080000 μs | -0.57% |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 255.633 μs | 254.094000 μs | -0.60% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | 40986.423 μs | 40721.892000 μs | -0.65% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | 6978.908 μs | 6931.801000 μs | -0.67% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | 7591.671 μs | 7522.288000 μs | -0.91% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | 2081.220 μs | 2059.364000 μs | -1.05% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | 26150.355 μs | 25730.535000 μs | -1.61% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17443.222 μs | 17019.124000 μs | -2.43% |

Relative perf in group SinKernelGraph (4): 100.088%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | 72591.316000 μs | 72725.709 μs | 0.19% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | 71750.955000 μs | 71861.794 μs | 0.15% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 353441.240000 μs | 353468.831 μs | 0.01% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 353349.861000 μs | 353366.397 μs | 0.00% |

Relative perf in group SubmitGraph (3): 98.015%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | 679.934 μs | 676.159000 μs | -0.56% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | 63.528 μs | 62.093000 μs | -2.26% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 56.137 μs | 54.383000 μs | -3.12% |

Relative perf in group ExecGraph (3): 100.280%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | 5592.763000 μs | 5622.422 μs | 0.53% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | 5602.257000 μs | 5626.650 μs | 0.44% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | 56512.337 μs | 56442.150000 μs | -0.12% |

Relative perf in group SubmitKernel CPU count (3): 100.000%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | 104723.000000 instr | 104723.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | 110066.000000 instr | 110066.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | 122936.000000 instr | 122936.000 instr | 0.00% |
Velocity Bench
Relative perf in group (8): 101.045%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| Velocity-Bench Bitcracker | 35.525100 s | 37.429 s | 5.36% |
| Velocity-Bench QuickSilver | 117.790000 MMS/CTT | 116.690 MMS/CTT | 0.94% |
| Velocity-Bench Hashtable | 362.656861 M keys/sec | 359.598 M keys/sec | 0.85% |
| Velocity-Bench dl-mnist | 2.720000 s | 2.740 s | 0.74% |
| Velocity-Bench Sobel Filter | 606.957 ms | 591.724000 ms | -2.51% |
| Velocity-Bench CudaSift | 202.355000 ms | - | |
| Velocity-Bench dl-cifar | 23.663800 s | - | |
| Velocity-Bench svm | 0.136800 s | - | |
SYCL-Bench
Relative perf in group (54): cannot calculate

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| Runtime_IndependentDAGTaskThroughput_SingleTask | 266.719000 ms | - | |
| Runtime_IndependentDAGTaskThroughput_BasicParallelFor | 287.965000 ms | - | |
| Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor | 273.464000 ms | - | |
| Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor | 273.761000 ms | - | |
| Runtime_DAGTaskThroughput_SingleTask | 1677.949000 ms | - | |
| Runtime_DAGTaskThroughput_BasicParallelFor | 1774.890000 ms | - | |
| Runtime_DAGTaskThroughput_HierarchicalParallelFor | 1731.006000 ms | - | |
| Runtime_DAGTaskThroughput_NDRangeParallelFor | 1705.253000 ms | - | |
| MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous | 4.689000 ms | - | |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | 4.735000 ms | - | |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | 4.598000 ms | - | |
| MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous | 4.671000 ms | - | |
| MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous | 617.469000 ms | - | |
| MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous | 617.478000 ms | - | |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | 4.768000 ms | - | |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | 4.997000 ms | - | |
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | 5.056000 ms | - | |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | 4.824000 ms | - | |
| MicroBench_HostDeviceBandwidth_2D_D2H_Strided | 616.926000 ms | - | |
| MicroBench_HostDeviceBandwidth_3D_D2H_Strided | 616.928000 ms | - | |
| MicroBench_LocalMem_int32_4096 | 29.820000 ms | - | |
| MicroBench_LocalMem_fp32_4096 | 29.910000 ms | - | |
| Pattern_Reduction_NDRange_int32 | 16.299000 ms | - | |
| Pattern_Reduction_Hierarchical_int32 | 16.343000 ms | - | |
| ScalarProduct_NDRange_int32 | 3.768000 ms | - | |
| ScalarProduct_NDRange_int64 | 5.423000 ms | - | |
| ScalarProduct_NDRange_fp32 | 3.804000 ms | - | |
| ScalarProduct_Hierarchical_int32 | 10.539000 ms | - | |
| ScalarProduct_Hierarchical_int64 | 11.494000 ms | - | |
| ScalarProduct_Hierarchical_fp32 | 10.158000 ms | - | |
| Pattern_SegmentedReduction_NDRange_int16 | 2.266000 ms | - | |
| Pattern_SegmentedReduction_NDRange_int32 | 2.163000 ms | - | |
| Pattern_SegmentedReduction_NDRange_int64 | 2.338000 ms | - | |
| Pattern_SegmentedReduction_NDRange_fp32 | 2.169000 ms | - | |
| Pattern_SegmentedReduction_Hierarchical_int16 | 11.803000 ms | - | |
| Pattern_SegmentedReduction_Hierarchical_int32 | 11.590000 ms | - | |
| Pattern_SegmentedReduction_Hierarchical_int64 | 11.771000 ms | - | |
| Pattern_SegmentedReduction_Hierarchical_fp32 | 11.588000 ms | - | |
| USM_Allocation_latency_fp32_device | 0.062000 ms | - | |
| USM_Allocation_latency_fp32_host | 37.576000 ms | - | |
| USM_Allocation_latency_fp32_shared | 0.067000 ms | - | |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.671000 ms | - | |
| USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch | 1.046000 ms | - | |
| USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch | 1.848000 ms | - | |
| USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch | 1.216000 ms | - | |
| VectorAddition_int32 | 1.468000 ms | - | |
| VectorAddition_int64 | 3.059000 ms | - | |
| VectorAddition_fp32 | 1.510000 ms | - | |
| Polybench_2mm | 1.055000 ms | - | |
| Polybench_3mm | 1.484000 ms | - | |
| Polybench_Atax | 6.460000 ms | - | |
| Kmeans_fp32 | 14.050000 ms | - | |
| LinearRegressionCoeff_fp32 | 890.405000 ms | - | |
| MolecularDynamics | 0.030000 ms | - | |
llama.cpp bench
Relative perf in group (6): cannot calculate

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| llama.cpp Prompt Processing Batched 128 | 827.038521 token/s | - | |
| llama.cpp Text Generation Batched 128 | 62.428237 token/s | - | |
| llama.cpp Prompt Processing Batched 256 | 870.024268 token/s | - | |
| llama.cpp Text Generation Batched 256 | 62.462323 token/s | - | |
| llama.cpp Prompt Processing Batched 512 | 426.594975 token/s | - | |
| llama.cpp Text Generation Batched 512 | 62.476457 token/s | - | |
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7): 98.691%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy | 117.809000 ns | 123.101 ns | 4.49% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4571.880000 ns | 4712.160 ns | 3.07% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3093.100000 ns | 3133.790 ns | 1.32% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 287.473 ns | 281.921000 ns | -1.93% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3490.070 ns | 3381.380000 ns | -3.11% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2785.440 ns | 2682.620000 ns | -3.69% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2187.210 ns | 1997.540000 ns | -8.67% |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7): 98.130%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 490.573000 ns | 492.253 ns | 0.34% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 119.409 ns | 119.362000 ns | -0.04% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 194.902 ns | 193.089000 ns | -0.93% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 714.715 ns | 699.118000 ns | -2.18% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 212.952 ns | 208.150000 ns | -2.25% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 280.195 ns | 272.569000 ns | -2.72% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy | 104.130 ns | 98.720200 ns | -5.20% |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7): 100.952%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3516.390000 ns | 3830.530 ns | 8.93% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 271.334000 ns | 278.282 ns | 2.56% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4502.950000 ns | 4561.320 ns | 1.30% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1228.000 ns | 1222.550000 ns | -0.44% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3272.730 ns | 3243.100000 ns | -0.91% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 116.401 ns | 114.675000 ns | -1.48% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1895.690 ns | 1841.590000 ns | -2.85% |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7): 99.236%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 204.816000 ns | 213.371 ns | 4.18% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 119.210000 ns | 119.600 ns | 0.33% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 190.194000 ns | 190.534 ns | 0.18% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 502.837 ns | 494.342000 ns | -1.69% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy | 198.251 ns | 194.817000 ns | -1.73% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 302.754 ns | 293.969000 ns | -2.90% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 733.898 ns | 708.146000 ns | -3.51% |

Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4): 96.922%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> | 4161.990000 ns | 4206.340 ns | 1.07% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 868.411 ns | 846.197000 ns | -2.56% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1064.810 ns | 1011.890000 ns | -4.97% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 556.946 ns | 525.156000 ns | -5.71% |

Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4): 102.096%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 704.395000 ns | 739.669 ns | 5.01% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 175.661000 ns | 180.705 ns | 2.87% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 941.239000 ns | 948.074 ns | 0.73% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> | 356.014 ns | 355.503000 ns | -0.14% |

Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7): 98.810%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy | 7713060.000000 ns | 7742690.000 ns | 0.38% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> | 1724620.000000 ns | 1725460.000 ns | 0.05% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1170800.000 ns | 1168520.000000 ns | -0.19% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 47296.700 ns | 46996.600000 ns | -0.63% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> | 519470.000 ns | 510965.000000 ns | -1.64% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1194150.000 ns | 1169220.000000 ns | -2.09% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 33185.800 ns | 31815.400000 ns | -4.13% |

Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7): 99.405%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15078.400000 ns | 15563.600 ns | 3.22% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4212.360000 ns | 4278.780 ns | 1.58% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> | 24163.000000 ns | 24207.200 ns | 0.18% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 140134.000 ns | 139266.000000 ns | -0.62% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 162180.000 ns | 160574.000000 ns | -0.99% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> | 208332.000 ns | 205074.000000 ns | -1.56% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy | 2698360.000 ns | 2543840.000000 ns | -5.73% |

Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4): 100.438%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> | 595674.000000 ns | 611807.000 ns | 2.71% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 74721.000000 ns | 75748.800 ns | 1.38% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 139431.000 ns | 138754.000000 ns | -0.49% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy | 10883300.000 ns | 10688900.000000 ns | -1.79% |

Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4): 99.735%

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> | 58800.200000 ns | 59340.100 ns | 0.92% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy | 2607870.000000 ns | 2608650.000 ns | 0.03% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25671.200 ns | 25635.500000 ns | -0.14% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 31641.200 ns | 31056.100000 ns | -1.85% |

Details

Benchmark details contain too many chars to display

```python
# Generate the row with the best value highlighted
# Generate the row with all the results from saved runs specified by
# --compare,
# Highight the best value in the row with data
```
misspell Highight

still a misspell ;d
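The snippet above is about emphasizing the best value when building a row; as a minimal illustrative sketch (hypothetical helper, not the PR's actual code), assuming lower values are better:

```python
# Hypothetical sketch: build one markdown table row and emphasize the best
# (lowest, for time-like units such as ns or μs) measurement in bold.
def format_row(name: str, values: list[float], unit: str) -> str:
    best = min(values)  # assumes lower is better
    cells = [f"**{v} {unit}**" if v == best else f"{v} {unit}" for v in values]
    return "| " + " | ".join([name, *cells]) + " |"

# e.g. one result from this PR plus one column per --compare baseline:
print(format_row("alloc/size:10000 umfProxy", [117.809, 123.101], "ns"))
```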

@EuphoricThinking force-pushed the benchmark_markdown branch 2 times, most recently from 2e8d039 to 09ce9ac on February 13, 2025 14:08
@pbalcer (Contributor) left a comment:

grouping doesn't work for some of the benchmarks:

> Relative perf in group (17): 99.671%


```python
# If data is collected from already saved results,
# the content is parsed as strings
if isinstance(res.env, str):
```
how does this improve the existing way of printing env vars?

@EuphoricThinking (Contributor, Author) replied:
If you are asking about the introduced ifs: my OCD couldn't stand empty Environment variables sections.

If you are asking about ast.literal_eval, I added it for results that were not computed during the current script run but loaded from previously saved data. The dictionary of environment variables is parsed from JSON as a plain string, and ast.literal_eval lets us access its elements again. Maybe we could change something in Benchmark.from_json() instead.
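A minimal sketch of the conversion being described (ast.literal_eval is the standard-library call; the field name and value here are illustrative):

```python
import ast

# When results come from a saved JSON file rather than the current run,
# the environment-variable dict may arrive as its string representation.
saved_env = "{'LD_PRELOAD': '/usr/lib/libumf_proxy.so'}"  # illustrative value

env = ast.literal_eval(saved_env) if isinstance(saved_env, str) else saved_env
print(env["LD_PRELOAD"])  # dictionary elements are accessible again
```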



```python
def get_relative_perf_summary(group_size: int, diffs_product: int,
                              root_for_geometric_mean: int, group_name: str):
```
geomean is complicated to calculate. I'd just replace the : xx % change with (improved X, regressed Y)
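For reference, the group percentages in this thread (e.g. "99.671%") behave like a geometric mean of per-benchmark performance ratios; a minimal sketch under that assumption, with illustrative names (the signature above instead threads diffs_product and root_for_geometric_mean through explicitly):

```python
import math

# Hypothetical sketch: geometric mean of this-PR vs. baseline performance
# ratios for one group, as a percentage (100% means no overall change).
def relative_group_perf(ratios: list[float]) -> float:
    return 100.0 * math.prod(ratios) ** (1.0 / len(ratios))

# e.g. one benchmark 5% faster and one 5% slower roughly cancel out:
print(relative_group_perf([1.05, 0.95]))  # ~99.87
```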

@EuphoricThinking (Contributor, Author) commented Feb 14, 2025

> grouping doesn't work for some of the benchmarks:
>
> Relative perf in group (17): 99.671%

These benchmarks don't have an explicit group assigned (example without an assigned group, example with one); the default explicit_group is an empty string. I'm going to change the default group name to "Ungrouped".
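A minimal sketch of that fallback (illustrative names, assuming each result carries an explicit_group attribute):

```python
from collections import defaultdict

# Hypothetical sketch: bucket results by explicit_group, replacing the
# default empty string with a readable "Ungrouped" label.
def group_results(results):
    groups = defaultdict(list)
    for res in results:
        groups[res.explicit_group or "Ungrouped"].append(res)
    return groups
```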


Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13335358616


Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13335358616
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 23 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1164370.000000 ns | 1290770.000 ns | 10.86% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1192800.000000 ns | 1301090.000 ns | 9.08% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2608.040000 ns | 2813.690 ns | 7.89% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 290.805000 ns | 312.357 ns | 7.41% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 291.544000 ns | 312.314 ns | 7.12% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> | 496255.000000 ns | 529627.000 ns | 6.72% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25690.100000 ns | 27282.400 ns | 6.20% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 266.115000 ns | 279.926 ns | 5.19% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4286.200000 ns | 4505.450 ns | 5.12% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 580.222000 ns | 609.181 ns | 4.99% |
| VectorAddition_int32 | 1.448000 ms | 1.519 ms | 4.90% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 142261.000000 ns | 148026.000 ns | 4.05% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 119.939000 ns | 124.375 ns | 3.70% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 496.554000 ns | 513.605 ns | 3.43% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3409.200000 ns | 3515.560 ns | 3.12% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> | 206862.000000 ns | 212699.000 ns | 2.82% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> | 613468.000000 ns | 629497.000 ns | 2.61% |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | 4.826000 ms | 4.952 ms | 2.61% |
| ScalarProduct_NDRange_int32 | 3.766000 ms | 3.848 ms | 2.18% |
| Polybench_Atax | 6.259000 ms | 6.394 ms | 2.16% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 121.121000 ns | 123.713 ns | 2.14% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15362.200000 ns | 15689.000 ns | 2.13% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> | 1764240.000000 ns | 1801270.000 ns | 2.10% |
Regressed 27 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 979.017 ns | 743.029000 ns | -24.10% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1436.210 ns | 1258.420000 ns | -12.38% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1998.270 ns | 1783.030000 ns | -10.77% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 267.519 ns | 239.990000 ns | -10.29% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3422.840 ns | 3109.690000 ns | -9.15% |
| USM_Allocation_latency_fp32_shared | 0.070 ms | 0.064000 ms | -8.57% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 140.458 ns | 128.515000 ns | -8.50% |
| USM_Allocation_latency_fp32_device | 0.065 ms | 0.060000 ms | -7.69% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2106.500 ns | 1967.950000 ns | -6.58% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3803.790 ns | 3577.360000 ns | -5.95% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1052.890 ns | 1001.060000 ns | -4.92% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 873.835 ns | 832.341000 ns | -4.75% |
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.917 μs | 11.376000 μs | -4.54% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 267.323 ns | 255.713000 ns | -4.34% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4654.630 ns | 4467.570000 ns | -4.02% |
| VectorAddition_fp32 | 1.533 ms | 1.472000 ms | -3.98% |
| MolecularDynamics | 0.031 ms | 0.030000 ms | -3.23% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 872.718 ns | 846.087000 ns | -3.05% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 183.128 ns | 177.966000 ns | -2.82% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 196.708 ns | 191.184000 ns | -2.81% |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.980 μs | 15.587000 μs | -2.46% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> | 4335.340 ns | 4229.540000 ns | -2.44% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 8789.452 μs | 8581.558000 μs | -2.37% |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.713 ms | 1.674000 ms | -2.28% |
| LinearRegressionCoeff_fp32 | 941.013 ms | 920.731000 ms | -2.16% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 142682.000 ns | 139678.000000 ns | -2.11% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.631 μs | 16.293000 μs | -2.03% |

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | 20.950000 μs | 21.090 μs | 0.67% |
| api_overhead_benchmark_sycl SubmitKernel in order | 24.544 μs | 24.377000 μs | -0.68% |
| api_overhead_benchmark_sycl SubmitKernel out of order | 23.173 μs | 22.912000 μs | -1.13% |
| api_overhead_benchmark_l0 SubmitKernel in order | 11.842 μs | 11.696000 μs | -1.23% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.631 μs | 16.293000 μs | -2.03% |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.980 μs | 15.587000 μs | -2.46% |
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.917 μs | 11.376000 μs | -4.54% |

Relative perf in group (17)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | 7446.901000 μs | 7527.460 μs | 1.08% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | 40664.268000 μs | 41011.050 μs | 0.85% |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 133.498000 μs | 134.477 μs | 0.73% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | 25736.730000 μs | 25845.518 μs | 0.42% |
| miscellaneous_benchmark_sycl VectorSum | 858.902000 bw GB/s | 861.548 bw GB/s | 0.31% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | 6923.736 μs | 6913.059000 μs | -0.15% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.676 μs | 1.673000 μs | -0.18% |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.183 GB/s | 3.189000 GB/s | -0.19% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | 48059.615 μs | 47966.780000 μs | -0.19% |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 253.755 μs | 252.961000 μs | -0.31% |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.632 μs | 5.613000 μs | -0.34% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | 113003.736 μs | 112059.723000 μs | -0.84% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | 2094.790 μs | 2077.079000 μs | -0.85% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17323.738 μs | 17146.720000 μs | -1.02% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.147 μs | 2.114000 μs | -1.54% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | 1209.309 μs | 1186.615000 μs | -1.88% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 8789.452 μs | 8581.558000 μs | -2.37% |

Relative perf in group SinKernelGraph (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | 71721.562000 μs | 71737.327 μs | 0.02% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 353239.836 μs | 353216.125000 μs | -0.01% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 353560.841 μs | 353477.995000 μs | -0.02% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | 72664.228 μs | 72516.787000 μs | -0.20% |

Relative perf in group SubmitGraph (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | 61.738000 μs | 61.921 μs | 0.30% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | 672.665000 μs | 673.385 μs | 0.11% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 54.707 μs | 54.243000 μs | -0.85% |

Relative perf in group ExecGraph (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | 5593.804000 μs | 5622.149 μs | 0.51% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | 56521.162 μs | 56485.620000 μs | -0.06% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | 5584.778 μs | 5580.318000 μs | -0.08% |

Relative perf in group SubmitKernel CPU count (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | 104593.000000 instr | 104593.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | 109936.000000 instr | 109936.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | 122806.000000 instr | 122806.000 instr | 0.00% |
Velocity Bench
Relative perf in group (8)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| Velocity-Bench QuickSilver | 118.170000 MMS/CTT | 117.050 MMS/CTT | 0.96% |
| Velocity-Bench svm | 0.133600 s | 0.135 s | 0.82% |
| Velocity-Bench CudaSift | 202.781000 ms | 203.219 ms | 0.22% |
| Velocity-Bench Bitcracker | 35.578800 s | 35.584 s | 0.02% |
| Velocity-Bench dl-cifar | 24.066 s | 23.890200 s | -0.73% |
| Velocity-Bench Sobel Filter | 615.344 ms | 610.451000 ms | -0.80% |
| Velocity-Bench Hashtable | 355.727 M keys/sec | 359.331110 M keys/sec | -1.00% |
| Velocity-Bench dl-mnist | 2.760 s | 2.730000 s | -1.09% |
SYCL-Bench
Relative perf in group (54)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| VectorAddition_int32 | 1.448000 ms | 1.519 ms | 4.90% |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | 4.826000 ms | 4.952 ms | 2.61% |
| ScalarProduct_NDRange_int32 | 3.766000 ms | 3.848 ms | 2.18% |
| Polybench_Atax | 6.259000 ms | 6.394 ms | 2.16% |
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | 5.011000 ms | 5.087 ms | 1.52% |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | 4.792000 ms | 4.864 ms | 1.50% |
| Runtime_DAGTaskThroughput_NDRangeParallelFor | 1705.474000 ms | 1727.936 ms | 1.32% |
| Runtime_DAGTaskThroughput_SingleTask | 1693.933000 ms | 1715.450 ms | 1.27% |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | 4.797000 ms | 4.855 ms | 1.21% |
| Runtime_DAGTaskThroughput_HierarchicalParallelFor | 1750.190000 ms | 1770.145 ms | 1.14% |
| Runtime_DAGTaskThroughput_BasicParallelFor | 1761.610000 ms | 1775.221 ms | 0.77% |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | 5.122000 ms | 5.153 ms | 0.61% |
| ScalarProduct_Hierarchical_int64 | 11.474000 ms | 11.542 ms | 0.59% |
| Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor | 275.894000 ms | 277.203 ms | 0.47% |
| USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch | 1.214000 ms | 1.219 ms | 0.41% |
| MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous | 4.734000 ms | 4.749 ms | 0.32% |
| Runtime_IndependentDAGTaskThroughput_SingleTask | 267.999000 ms | 268.591 ms | 0.22% |
| Pattern_SegmentedReduction_NDRange_int16 | 2.262000 ms | 2.266 ms | 0.18% |
| ScalarProduct_NDRange_int64 | 5.465000 ms | 5.474 ms | 0.16% |
| USM_Allocation_latency_fp32_host | 37.324000 ms | 37.383 ms | 0.16% |
| MicroBench_LocalMem_fp32_4096 | 29.848000 ms | 29.891 ms | 0.14% |
| Pattern_SegmentedReduction_NDRange_int64 | 2.333000 ms | 2.336 ms | 0.13% |
| Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor | 276.243000 ms | 276.568 ms | 0.12% |
| Pattern_SegmentedReduction_NDRange_int32 | 2.163000 ms | 2.165 ms | 0.09% |
| ScalarProduct_Hierarchical_fp32 | 10.171000 ms | 10.176 ms | 0.05% |
| Pattern_SegmentedReduction_Hierarchical_fp32 | 11.582000 ms | 11.587 ms | 0.04% |
| Pattern_SegmentedReduction_Hierarchical_int16 | 11.798000 ms | 11.803 ms | 0.04% |
| ScalarProduct_Hierarchical_int32 | 10.517000 ms | 10.521 ms | 0.04% |
| MicroBench_HostDeviceBandwidth_2D_D2H_Strided | 616.909000 ms | 617.076 ms | 0.03% |
| Pattern_SegmentedReduction_Hierarchical_int64 | 11.778000 ms | 11.781 ms | 0.03% |
| MicroBench_HostDeviceBandwidth_3D_D2H_Strided | 616.905000 ms | 616.946 ms | 0.01% |
| MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous | 617.566000 ms | 617.585 ms | 0.00% |
| Polybench_2mm | 1.052000 ms | 1.052 ms | 0.00% |
| Polybench_3mm | 1.481000 ms | 1.481 ms | 0.00% |
| MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous | 617.589 ms | 617.560000 ms | -0.00% |
| Pattern_SegmentedReduction_Hierarchical_int32 | 11.589 ms | 11.587000 ms | -0.02% |
| Pattern_SegmentedReduction_NDRange_fp32 | 2.164 ms | 2.163000 ms | -0.05% |
| MicroBench_LocalMem_int32_4096 | 29.884 ms | 29.826000 ms | -0.19% |
| Kmeans_fp32 | 14.109 ms | 14.048000 ms | -0.43% |
| Pattern_Reduction_NDRange_int32 | 16.801 ms | 16.686000 ms | -0.68% |
| Pattern_Reduction_Hierarchical_int32 | 16.900 ms | 16.762000 ms | -0.82% |
| USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch | 1.841 ms | 1.825000 ms | -0.87% |
| ScalarProduct_NDRange_fp32 | 3.794 ms | 3.760000 ms | -0.90% |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | 4.718 ms | 4.674000 ms | -0.93% |
| VectorAddition_int64 | 3.095 ms | 3.064000 ms | -1.00% |
| Runtime_IndependentDAGTaskThroughput_BasicParallelFor | 293.181 ms | 289.887000 ms | -1.12% |
| USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch | 1.065 ms | 1.051000 ms | -1.31% |
| MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous | 4.852 ms | 4.775000 ms | -1.59% |
| LinearRegressionCoeff_fp32 | 941.013 ms | 920.731000 ms | -2.16% |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.713 ms | 1.674000 ms | -2.28% |
| MolecularDynamics | 0.031 ms | 0.030000 ms | -3.23% |
| VectorAddition_fp32 | 1.533 ms | 1.472000 ms | -3.98% |
| USM_Allocation_latency_fp32_device | 0.065 ms | 0.060000 ms | -7.69% |
| USM_Allocation_latency_fp32_shared | 0.070 ms | 0.064000 ms | -8.57% |
llama.cpp bench
Relative perf in group (6)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| llama.cpp Text Generation Batched 128 | 62.523001 token/s | 62.439 token/s | 0.13% |
| llama.cpp Text Generation Batched 512 | 62.506139 token/s | 62.433 token/s | 0.12% |
| llama.cpp Text Generation Batched 256 | 62.528739 token/s | 62.524 token/s | 0.01% |
| llama.cpp Prompt Processing Batched 512 | 420.439 token/s | 422.877525 token/s | -0.58% |
| llama.cpp Prompt Processing Batched 256 | 865.016 token/s | 871.296349 token/s | -0.72% |
| llama.cpp Prompt Processing Batched 128 | 825.488 token/s | 838.278943 token/s | -1.53% |
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2608.040000 ns | 2813.690 ns | 7.89% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 290.805000 ns | 312.357 ns | 7.41% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4286.200000 ns | 4505.450 ns | 5.12% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3132.190000 ns | 3179.010 ns | 1.49% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy | 133.185000 ns | 133.661 ns | 0.36% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3803.790 ns | 3577.360000 ns | -5.95% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2106.500 ns | 1967.950000 ns | -6.58% |

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 266.115000 ns | 279.926 ns | 5.19% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 119.939000 ns | 124.375 ns | 3.70% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 496.554000 ns | 513.605 ns | 3.43% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 712.866000 ns | 719.821 ns | 0.98% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 209.903000 ns | 211.785 ns | 0.90% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 188.784 ns | 188.514000 ns | -0.14% |
| alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy | 96.764 ns | 96.221000 ns | -0.56% |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3409.200000 ns | 3515.560 ns | 3.12% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> | 4654.630 ns | 4467.570000 ns | -4.02% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 267.323 ns | 255.713000 ns | -4.34% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 140.458 ns | 128.515000 ns | -8.50% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3422.840 ns | 3109.690000 ns | -9.15% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1998.270 ns | 1783.030000 ns | -10.77% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1436.210 ns | 1258.420000 ns | -12.38% |

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 291.544000 ns | 312.314 ns | 7.12% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> | 121.121000 ns | 123.713 ns | 2.14% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> | 502.574 ns | 501.051000 ns | -0.30% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy | 203.301 ns | 201.403000 ns | -0.93% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 196.708 ns | 191.184000 ns | -2.81% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 267.519 ns | 239.990000 ns | -10.29% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 979.017 ns | 743.029000 ns | -24.10% |

Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> | 4335.340 ns | 4229.540000 ns | -2.44% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 872.718 ns | 846.087000 ns | -3.05% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 873.835 ns | 832.341000 ns | -4.75% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1052.890 ns | 1001.060000 ns | -4.92% |

Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 580.222000 ns | 609.181 ns | 4.99% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 962.359000 ns | 979.526 ns | 1.78% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> | 350.754000 ns | 353.953 ns | 0.91% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 183.128 ns | 177.966000 ns | -2.82% |

Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1164370.000000 ns | 1290770.000 ns | 10.86% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1192800.000000 ns | 1301090.000 ns | 9.08% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> | 496255.000000 ns | 529627.000 ns | 6.72% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> | 1764240.000000 ns | 1801270.000 ns | 2.10% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy | 701661.000000 ns | 709591.000 ns | 1.13% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 32465.300000 ns | 32614.100 ns | 0.46% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 47368.300 ns | 47151.600000 ns | -0.46% |

Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 142261.000000 ns | 148026.000 ns | 4.05% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> | 206862.000000 ns | 212699.000 ns | 2.82% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15362.200000 ns | 15689.000 ns | 2.13% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 162737.000000 ns | 165724.000 ns | 1.84% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy | 117142.000000 ns | 117400.000 ns | 0.22% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4301.930 ns | 4299.410000 ns | -0.06% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> | 24827.000 ns | 24506.300000 ns | -1.29% |

Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> | 613468.000000 ns | 629497.000 ns | 2.61% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 75662.800000 ns | 76304.400 ns | 0.85% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy | 670349.000 ns | 666074.000000 ns | -0.64% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 142682.000 ns | 139678.000000 ns | -2.11% |

Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25690.100000 ns | 27282.400 ns | 6.20% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 31316.900000 ns | 31631.300 ns | 1.00% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> | 59541.500000 ns | 59837.700 ns | 0.50% |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy | 132633.000 ns | 131923.000000 ns | -0.54% |

Details

Benchmark details contain too many chars to display


Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13357456707


Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13357456707
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 16 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy | 709.347000 ns | 839.789 ns | 18.39% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3354.690000 ns | 3670.840 ns | 9.42% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 2969.050000 ns | 3145.700 ns | 5.95% |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | 4.924000 ms | 5.177 ms | 5.14% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3149.510000 ns | 3292.800 ns | 4.55% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> | 351.825000 ns | 367.152 ns | 4.36% |
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | 4.963000 ms | 5.173 ms | 4.23% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 986.221000 ns | 1025.920 ns | 4.03% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 175.313000 ns | 181.985 ns | 3.81% |
| VectorAddition_int32 | 1.459000 ms | 1.506 ms | 3.22% |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | 4.742000 ms | 4.884 ms | 2.99% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy | 128.073000 ns | 131.773 ns | 2.89% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17001.345000 μs | 17478.948 μs | 2.81% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1161930.000000 ns | 1192630.000 ns | 2.64% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 15148.100000 ns | 15514.100 ns | 2.42% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy | 626.494000 ns | 639.395 ns | 2.06% |
Regressed 17 (threshold 2.00%)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| USM_Allocation_latency_fp32_shared | 0.066 ms | 0.055000 ms | -16.67% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1388.320 ns | 1202.580000 ns | -13.38% |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 255.429 ns | 241.503000 ns | -5.45% |
| USM_Allocation_latency_fp32_device | 0.065 ms | 0.062000 ms | -4.62% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2242.230 ns | 2142.640000 ns | -4.44% |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 261.679 ns | 250.231000 ns | -4.37% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 32992.200 ns | 31667.700000 ns | -4.01% |
| alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> | 3767.700 ns | 3630.740000 ns | -3.64% |
| MolecularDynamics | 0.032 ms | 0.031000 ms | -3.12% |
| Velocity-Bench Sobel Filter | 611.555 ms | 593.513000 ms | -2.95% |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | 4.883 ms | 4.751000 ms | -2.70% |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 995.454 ns | 969.826000 ns | -2.57% |
| api_overhead_benchmark_l0 SubmitKernel in order | 11.694 μs | 11.398000 μs | -2.53% |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1193740.000 ns | 1165850.000000 ns | -2.34% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.530 μs | 16.180000 μs | -2.12% |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | 4.767 ms | 4.671000 ms | -2.01% |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | 1.696 ms | 1.662000 ms | -2.00% |

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_l0 SubmitKernel out of order | 11.524000 μs | 11.747 μs | 1.94% |
| api_overhead_benchmark_sycl SubmitKernel in order | 24.117000 μs | 24.218 μs | 0.42% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | 21.078 μs | 20.931000 μs | -0.70% |
| api_overhead_benchmark_sycl SubmitKernel out of order | 23.111 μs | 22.862000 μs | -1.08% |
| api_overhead_benchmark_ur SubmitKernel out of order | 15.705 μs | 15.485000 μs | -1.40% |
| api_overhead_benchmark_ur SubmitKernel in order | 16.530 μs | 16.180000 μs | -2.12% |
| api_overhead_benchmark_l0 SubmitKernel in order | 11.694 μs | 11.398000 μs | -2.53% |

Relative perf in group (17)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | 17001.345000 μs | 17478.948 μs | 2.81% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | 8630.563000 μs | 8719.602 μs | 1.03% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | 25620.442000 μs | 25867.194 μs | 0.96% |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | 3.186000 GB/s | 3.163 GB/s | 0.73% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | 40931.902000 μs | 41181.436 μs | 0.61% |
| miscellaneous_benchmark_sycl VectorSum | 857.147000 bw GB/s | 860.959 bw GB/s | 0.44% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | 112499.588000 μs | 112938.927 μs | 0.39% |
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.534000 μs | 5.553 μs | 0.34% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.098000 μs | 2.105 μs | 0.33% |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 134.006000 μs | 134.388 μs | 0.29% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | 1188.326000 μs | 1191.684 μs | 0.28% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | 6951.370 μs | 6946.184000 μs | -0.07% |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.687 μs | 1.679000 μs | -0.47% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | 2102.064 μs | 2084.496000 μs | -0.84% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | 7550.282 μs | 7486.409000 μs | -0.85% |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | 48669.897 μs | 48250.437000 μs | -0.86% |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 253.401 μs | 251.023000 μs | -0.94% |

Relative perf in group SinKernelGraph (4)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | 353374.079000 μs | 353443.490 μs | 0.02% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | 353273.575 μs | 353238.347000 μs | -0.01% |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | 72574.742 μs | 72555.811000 μs | -0.03% |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | 71855.159 μs | 71737.858000 μs | -0.16% |

Relative perf in group SubmitGraph (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | 54.401 μs | 54.334000 μs | -0.12% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | 674.742 μs | 673.309000 μs | -0.21% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | 62.371 μs | 61.787000 μs | -0.94% |

Relative perf in group ExecGraph (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | 56482.408000 μs | 56486.380 μs | 0.01% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | 5593.672 μs | 5586.543000 μs | -0.13% |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | 5597.140 μs | 5588.567000 μs | -0.15% |

Relative perf in group SubmitKernel CPU count (3)

| Benchmark | This PR | baseline | Change |
|---|---|---|---|
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | 104593.000000 instr | 104593.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | 109936.000000 instr | 109936.000 instr | 0.00% |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | 123120.000 instr | 122807.000000 instr | -0.25% |
Velocity Bench
Relative perf in group Ungrouped (8)
Benchmark This PR baseline Change
Velocity-Bench svm 0.134300 s 0.135 s 0.30%
Velocity-Bench QuickSilver 117.870000 MMS/CTT 117.560 MMS/CTT 0.26%
Velocity-Bench CudaSift 202.895000 ms 203.196 ms 0.15%
Velocity-Bench Bitcracker 35.528 s 35.501000 s -0.07%
Velocity-Bench Hashtable 357.425 M keys/sec 358.637420 M keys/sec -0.34%
Velocity-Bench dl-mnist 2.730 s 2.720000 s -0.37%
Velocity-Bench dl-cifar 24.054 s 23.806700 s -1.03%
Velocity-Bench Sobel Filter 611.555 ms 593.513000 ms -2.95%
SYCL-Bench
Relative perf in group Ungrouped (53)
Benchmark This PR baseline Change
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 4.924000 ms 5.177 ms 5.14%
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 4.963000 ms 5.173 ms 4.23%
VectorAddition_int32 1.459000 ms 1.506 ms 3.22%
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.742000 ms 4.884 ms 2.99%
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.766000 ms 4.831 ms 1.36%
VectorAddition_fp32 1.454000 ms 1.473 ms 1.31%
Pattern_Reduction_NDRange_int32 16.499000 ms 16.700 ms 1.22%
Polybench_2mm 1.041000 ms 1.051 ms 0.96%
ScalarProduct_Hierarchical_int64 11.470000 ms 11.516 ms 0.40%
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.827000 ms 4.835 ms 0.17%
ScalarProduct_Hierarchical_fp32 10.140000 ms 10.156 ms 0.16%
ScalarProduct_NDRange_fp32 3.741000 ms 3.745 ms 0.11%
Runtime_DAGTaskThroughput_NDRangeParallelFor 1694.603000 ms 1696.297 ms 0.10%
Pattern_SegmentedReduction_NDRange_fp32 2.164000 ms 2.165 ms 0.05%
Pattern_SegmentedReduction_Hierarchical_int16 11.802000 ms 11.805 ms 0.03%
Pattern_Reduction_Hierarchical_int32 16.932000 ms 16.935 ms 0.02%
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 616.911000 ms 616.960 ms 0.01%
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 616.915000 ms 616.936 ms 0.00%
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 617.638 ms 617.589000 ms -0.01%
Pattern_SegmentedReduction_Hierarchical_fp32 11.587 ms 11.586000 ms -0.01%
Pattern_SegmentedReduction_Hierarchical_int32 11.586 ms 11.584000 ms -0.02%
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 617.690 ms 617.554000 ms -0.02%
ScalarProduct_Hierarchical_int32 10.551 ms 10.548000 ms -0.03%
Pattern_SegmentedReduction_NDRange_int16 2.265 ms 2.264000 ms -0.04%
Pattern_SegmentedReduction_Hierarchical_int64 11.769 ms 11.763000 ms -0.05%
USM_Allocation_latency_fp32_host 37.543 ms 37.512000 ms -0.08%
Kmeans_fp32 14.046 ms 14.031000 ms -0.11%
ScalarProduct_NDRange_int64 5.431 ms 5.425000 ms -0.11%
MicroBench_LocalMem_int32_4096 29.859 ms 29.798000 ms -0.20%
Polybench_3mm 1.486 ms 1.482000 ms -0.27%
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1731.429 ms 1726.405000 ms -0.29%
ScalarProduct_NDRange_int32 3.770 ms 3.759000 ms -0.29%
Pattern_SegmentedReduction_NDRange_int64 2.343 ms 2.336000 ms -0.30%
Runtime_DAGTaskThroughput_BasicParallelFor 1754.051 ms 1748.761000 ms -0.30%
MicroBench_LocalMem_fp32_4096 29.926 ms 29.816000 ms -0.37%
Runtime_DAGTaskThroughput_SingleTask 1684.153 ms 1677.953000 ms -0.37%
Pattern_SegmentedReduction_NDRange_int32 2.172 ms 2.162000 ms -0.46%
Runtime_IndependentDAGTaskThroughput_SingleTask 262.293 ms 260.173000 ms -0.81%
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.805 ms 4.764000 ms -0.85%
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 277.956 ms 275.430000 ms -0.91%
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 279.673 ms 276.920000 ms -0.98%
Polybench_Atax 6.436 ms 6.372000 ms -0.99%
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.205 ms 1.193000 ms -1.00%
VectorAddition_int64 3.108 ms 3.067000 ms -1.32%
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.054 ms 1.039000 ms -1.42%
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.841 ms 1.814000 ms -1.47%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 292.626 ms 286.848000 ms -1.97%
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.696 ms 1.662000 ms -2.00%
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.767 ms 4.671000 ms -2.01%
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.883 ms 4.751000 ms -2.70%
MolecularDynamics 0.032 ms 0.031000 ms -3.12%
USM_Allocation_latency_fp32_device 0.065 ms 0.062000 ms -4.62%
USM_Allocation_latency_fp32_shared 0.066 ms 0.055000 ms -16.67%
llama.cpp bench
Relative perf in group Ungrouped (6)
Benchmark This PR baseline Change
llama.cpp Prompt Processing Batched 512 421.883415 token/s 417.717 token/s 1.00%
llama.cpp Prompt Processing Batched 128 825.849545 token/s 822.681 token/s 0.39%
llama.cpp Text Generation Batched 256 62.537986 token/s 62.449 token/s 0.14%
llama.cpp Text Generation Batched 512 62.481856 token/s 62.410 token/s 0.12%
llama.cpp Text Generation Batched 128 62.470 token/s 62.489279 token/s -0.03%
llama.cpp Prompt Processing Batched 256 863.426 token/s 869.125841 token/s -0.66%
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 2969.050000 ns 3145.700 ns 5.95%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 290.412000 ns 296.208 ns 2.00%
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2582.560000 ns 2595.190 ns 0.49%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4700.460 ns 4686.100000 ns -0.31%
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy 135.677 ns 133.691000 ns -1.46%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3767.700 ns 3630.740000 ns -3.64%
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2242.230 ns 2142.640000 ns -4.44%
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 210.848000 ns 213.403 ns 1.21%
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 499.323000 ns 500.448 ns 0.23%
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.587000 ns 119.758 ns 0.14%
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 189.676 ns 188.841000 ns -0.44%
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 98.056 ns 97.605900 ns -0.46%
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 274.047 ns 270.523000 ns -1.29%
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 713.179 ns 701.745000 ns -1.60%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3354.690000 ns 3670.840 ns 9.42%
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3149.510000 ns 3292.800 ns 4.55%
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy 128.073000 ns 131.773 ns 2.89%
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4724.390000 ns 4775.630 ns 1.08%
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1980.670 ns 1980.320000 ns -0.02%
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 261.679 ns 250.231000 ns -4.37%
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1388.320 ns 1202.580000 ns -13.38%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 294.962 ns 294.927000 ns -0.01%
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy 203.327 ns 203.225000 ns -0.05%
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.499 ns 119.377000 ns -0.10%
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 505.384 ns 503.757000 ns -0.32%
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 192.617 ns 191.825000 ns -0.41%
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 750.861 ns 743.102000 ns -1.03%
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 255.429 ns 241.503000 ns -5.45%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 709.347000 ns 839.789 ns 18.39%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 986.221000 ns 1025.920 ns 4.03%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 849.690000 ns 861.095 ns 1.34%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> 4298.920 ns 4282.990000 ns -0.37%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> 351.825000 ns 367.152 ns 4.36%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 175.313000 ns 181.985 ns 3.81%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 626.494000 ns 639.395 ns 2.06%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 995.454 ns 969.826000 ns -2.57%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1161930.000000 ns 1192630.000 ns 2.64%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy 706545.000000 ns 713122.000 ns 0.93%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 47358.700 ns 47332.200000 ns -0.06%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1754920.000 ns 1751760.000000 ns -0.18%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> 525143.000 ns 518073.000000 ns -1.35%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1193740.000 ns 1165850.000000 ns -2.34%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32992.200 ns 31667.700000 ns -4.01%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15148.100000 ns 15514.100 ns 2.42%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 162004.000000 ns 164859.000 ns 1.76%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy 117360.000000 ns 117587.000 ns 0.19%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 211739.000 ns 211217.000000 ns -0.25%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24508.400 ns 24403.500000 ns -0.43%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 143723.000 ns 141395.000000 ns -1.62%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4287.140 ns 4203.400000 ns -1.95%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 139385.000000 ns 141671.000 ns 1.64%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 616923.000000 ns 622760.000 ns 0.95%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 74303.300000 ns 74577.400 ns 0.37%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy 670951.000 ns 666557.000000 ns -0.65%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 31618.300000 ns 32027.100 ns 1.29%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25898.200 ns 25677.800000 ns -0.85%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> 60407.500 ns 59857.400000 ns -0.91%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy 132797.000 ns 131430.000000 ns -1.03%

Details

Benchmark details contain too many chars to display

@@ -18,7 +18,7 @@ class Result:
     stdout: str
     passed: bool = True
     unit: str = ""
-    explicit_group: str = ""
+    explicit_group: str = "Ungrouped"
Contributor

The HTML output interprets anything other than "" as a group (see https://github.com/oneapi-src/unified-runtime/blob/main/scripts/benchmarks/output_html.py#L117), and every explicit group is shown together on a bar chart.
So this needs to stay as "".
My suggestion is to use "Others" in the markdown output when "" is specified.
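For illustration, a minimal sketch of that suggestion (the helper name is hypothetical; it assumes the markdown writer receives the stored group name unchanged):

```python
def markdown_group_name(explicit_group: str) -> str:
    # The stored value must remain "" so the HTML output still treats the
    # result as ungrouped; only the markdown view substitutes a label.
    return explicit_group if explicit_group else "Others"
```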

@pbalcer pbalcer left a comment

lgtm, just a couple of nits...

parser.add_argument("--dry-run", help='Do not run any actual benchmarks', action="store_true", default=False)
parser.add_argument("--compute-runtime", nargs='?', const=options.compute_runtime_tag, help="Fetch and build compute runtime")
parser.add_argument("--iterations-stddev", type=int, help="Max number of iterations of the loop calculating stddev after completed benchmark runs", default=options.iterations_stddev)
parser.add_argument("--build-igc", help="Build IGC from source instead of using the OS-installed version", action="store_true", default=options.build_igc)
parser.add_argument("--relative-perf", type=str, help="The name of the results which should be used as a baseline for metrics calculation", default=options.current_run_name)
parser.add_argument("--new-base-name", help="New name of the default baseline to compare", type=str, default='')
Contributor

Hm, if we need this, let's just remove the default compare. E.g., when nothing is specified, we don't compare at all.
This will eliminate the need for this option.
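A sketch of what that could look like (an assumption about the intended behavior, not the PR's code: `--compare` collects zero or more saved result names, and comparison is simply skipped when none are given):

```python
import argparse

parser = argparse.ArgumentParser()
# No implicit default baseline: omitting --compare yields an empty list.
parser.add_argument("--compare", action="append", default=[],
                    help="Names of saved results to compare against")

args = parser.parse_args()
if not args.compare:
    print("No baseline specified; skipping comparison.")
```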

(x.diff is not None, x.diff), reverse=True)

# Geometric mean calculation
product = 1.0
Contributor

this appears to be unused?
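For context, a geometric mean over per-benchmark ratios is usually computed like this (a self-contained sketch, not the PR's code; `ratios` is a hypothetical list of baseline/current ratios):

```python
import math

def geometric_mean(ratios):
    # nth root of the product; summing logs avoids overflow with many terms.
    if not ratios:
        return None
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(geometric_mean([1.05, 0.98, 1.12]))  # ≈ 1.048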

Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13368284077

Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13368284077
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 20 (threshold 2.00%)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 207.541000 ns 254.262 ns 22.51%
USM_Allocation_latency_fp32_device 0.064000 ms 0.075 ms 17.19%
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 1918.930000 ns 2237.740 ns 16.61%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4427.100000 ns 4801.330 ns 8.45%
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2627.630000 ns 2828.790 ns 7.66%
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 96.979700 ns 103.858 ns 7.09%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24340.100000 ns 25904.100 ns 6.43%
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3413.760000 ns 3598.790 ns 5.42%
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.667000 ms 4.867 ms 4.29%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 295.763000 ns 308.311 ns 4.24%
api_overhead_benchmark_l0 SubmitKernel out of order 11.410000 μs 11.854 μs 3.89%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 142882.000000 ns 148394.000 ns 3.86%
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.531000 μs 5.739 μs 3.76%
MolecularDynamics 0.030000 ms 0.031 ms 3.33%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1747590.000000 ns 1805140.000 ns 3.29%
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.652000 ms 4.803 ms 3.25%
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 212.352000 ns 218.316 ns 2.81%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 754.576000 ns 775.151 ns 2.73%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25140.600000 ns 25814.900 ns 2.68%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 208239.000000 ns 213417.000 ns 2.49%
Regressed 14 (threshold 2.00%)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_shared 0.064 ms 0.053000 ms -17.19%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1316330.000 ns 1161880.000000 ns -11.73%
VectorAddition_fp32 1.558 ms 1.439000 ms -7.64%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3511.250 ns 3272.820000 ns -6.79%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1303400.000 ns 1229410.000000 ns -5.68%
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1279.670 ns 1211.730000 ns -5.31%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 287.453 ms 273.621000 ms -4.81%
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 272.439 ns 261.712000 ns -3.94%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 984.997 ns 948.150000 ns -3.74%
Runtime_IndependentDAGTaskThroughput_SingleTask 265.490 ms 257.665000 ms -2.95%
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 196.068 ns 190.897000 ns -2.64%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32410.400 ns 31644.300000 ns -2.36%
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 55.146 μs 53.871000 μs -2.31%
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.067 ms 1.044000 ms -2.16%

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)
Benchmark This PR baseline Change
api_overhead_benchmark_l0 SubmitKernel out of order 11.410000 μs 11.854 μs 3.89%
api_overhead_benchmark_l0 SubmitKernel in order 11.493000 μs 11.662 μs 1.47%
api_overhead_benchmark_ur SubmitKernel out of order 15.536000 μs 15.729 μs 1.24%
api_overhead_benchmark_ur SubmitKernel in order with measure completion 21.048000 μs 21.172 μs 0.59%
api_overhead_benchmark_ur SubmitKernel in order 16.135000 μs 16.210 μs 0.46%
api_overhead_benchmark_sycl SubmitKernel in order 24.223 μs 24.195000 μs -0.12%
api_overhead_benchmark_sycl SubmitKernel out of order 22.892 μs 22.788000 μs -0.45%
Relative perf in group Ungrouped (17)
Benchmark This PR baseline Change
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.531000 μs 5.739 μs 3.76%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 25560.605000 μs 25995.334 μs 1.70%
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events 40807.955000 μs 41326.322 μs 1.27%
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.124000 μs 2.150 μs 1.22%
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 2047.112000 μs 2070.294 μs 1.13%
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 133.418000 μs 134.462 μs 0.78%
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events 110757.927000 μs 111607.803 μs 0.77%
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 3.196000 GB/s 3.172 GB/s 0.76%
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 250.255000 μs 251.433 μs 0.47%
miscellaneous_benchmark_sycl VectorSum 858.609000 bw GB/s 861.843 bw GB/s 0.38%
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 17225.687000 μs 17286.218 μs 0.35%
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 8730.538000 μs 8738.599 μs 0.09%
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1190.110000 μs 1190.999 μs 0.07%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 6942.625 μs 6930.233000 μs -0.18%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 7494.870 μs 7473.513000 μs -0.28%
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.698 μs 1.684000 μs -0.82%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 48319.957 μs 47693.770000 μs -1.30%
Relative perf in group SinKernelGraph (4)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 72528.102000 μs 72693.457 μs 0.23%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 353403.030000 μs 353572.971 μs 0.05%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 71739.442 μs 71736.788000 μs -0.00%
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 353232.660 μs 353031.159000 μs -0.06%
Relative perf in group SubmitGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 61.877 μs 61.705000 μs -0.28%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 674.803 μs 672.596000 μs -0.33%
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 55.146 μs 53.871000 μs -2.31%
Relative perf in group ExecGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 5578.026000 μs 5585.971 μs 0.14%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 56490.231000 μs 56544.632 μs 0.10%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 5600.265 μs 5596.380000 μs -0.07%
Relative perf in group SubmitKernel CPU count (3)
Benchmark This PR baseline Change
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count 122806.000000 instr 123120.000 instr 0.26%
api_overhead_benchmark_ur SubmitKernel out of order CPU count 104593.000000 instr 104593.000 instr 0.00%
api_overhead_benchmark_ur SubmitKernel in order CPU count 109936.000000 instr 109936.000 instr 0.00%
Velocity Bench
Relative perf in group Ungrouped (8)
Benchmark This PR baseline Change
Velocity-Bench Hashtable 360.146407 M keys/sec 357.844 M keys/sec 0.64%
Velocity-Bench dl-cifar 23.751700 s 23.846 s 0.40%
Velocity-Bench CudaSift 202.817000 ms 203.111 ms 0.14%
Velocity-Bench Bitcracker 35.506100 s 35.521 s 0.04%
Velocity-Bench svm 0.135 s 0.134400 s -0.30%
Velocity-Bench QuickSilver 117.830 MMS/CTT 118.240000 MMS/CTT -0.35%
Velocity-Bench dl-mnist 2.740 s 2.730000 s -0.36%
Velocity-Bench Sobel Filter 607.475 ms 595.796000 ms -1.92%
SYCL-Bench
Relative perf in group Ungrouped (54)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_device 0.064000 ms 0.075 ms 17.19%
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.667000 ms 4.867 ms 4.29%
MolecularDynamics 0.030000 ms 0.031 ms 3.33%
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.652000 ms 4.803 ms 3.25%
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.760000 ms 4.844 ms 1.76%
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.909000 ms 4.992 ms 1.69%
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.824000 ms 4.881 ms 1.18%
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 272.333000 ms 275.524 ms 1.17%
Pattern_Reduction_NDRange_int32 16.495000 ms 16.656 ms 0.98%
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 271.866000 ms 273.979 ms 0.78%
ScalarProduct_NDRange_int32 3.766000 ms 3.792 ms 0.69%
ScalarProduct_NDRange_int64 5.435000 ms 5.462 ms 0.50%
VectorAddition_int64 3.057000 ms 3.069 ms 0.39%
Pattern_Reduction_Hierarchical_int32 16.898000 ms 16.961 ms 0.37%
LinearRegressionCoeff_fp32 912.509000 ms 915.761 ms 0.36%
ScalarProduct_Hierarchical_int64 11.493000 ms 11.518 ms 0.22%
Pattern_SegmentedReduction_NDRange_fp32 2.164000 ms 2.166 ms 0.09%
Pattern_SegmentedReduction_NDRange_int64 2.335000 ms 2.337 ms 0.09%
Polybench_3mm 1.479000 ms 1.480 ms 0.07%
MicroBench_LocalMem_fp32_4096 29.856000 ms 29.866 ms 0.03%
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 617.505000 ms 617.565 ms 0.01%
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 617.523000 ms 617.571 ms 0.01%
Pattern_SegmentedReduction_NDRange_int16 2.265000 ms 2.265 ms 0.00%
Pattern_SegmentedReduction_NDRange_int32 2.163000 ms 2.163 ms 0.00%
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 616.937 ms 616.925000 ms -0.00%
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.078 ms 617.060000 ms -0.00%
Pattern_SegmentedReduction_Hierarchical_fp32 11.590 ms 11.589000 ms -0.01%
Pattern_SegmentedReduction_Hierarchical_int32 11.588 ms 11.584000 ms -0.03%
Pattern_SegmentedReduction_Hierarchical_int16 11.808 ms 11.800000 ms -0.07%
ScalarProduct_Hierarchical_fp32 10.176 ms 10.169000 ms -0.07%
USM_Allocation_latency_fp32_host 37.526 ms 37.499000 ms -0.07%
Pattern_SegmentedReduction_Hierarchical_int64 11.778 ms 11.767000 ms -0.09%
Polybench_Atax 6.426 ms 6.418000 ms -0.12%
ScalarProduct_Hierarchical_int32 10.528 ms 10.513000 ms -0.14%
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.740 ms 4.733000 ms -0.15%
MicroBench_LocalMem_int32_4096 29.924 ms 29.875000 ms -0.16%
Runtime_DAGTaskThroughput_BasicParallelFor 1757.745 ms 1751.649000 ms -0.35%
Kmeans_fp32 14.141 ms 14.089000 ms -0.37%
Polybench_2mm 1.047 ms 1.043000 ms -0.38%
ScalarProduct_NDRange_fp32 3.765 ms 3.750000 ms -0.40%
Runtime_DAGTaskThroughput_NDRangeParallelFor 1700.697 ms 1692.563000 ms -0.48%
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 5.113 ms 5.087000 ms -0.51%
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 5.223 ms 5.195000 ms -0.54%
Runtime_DAGTaskThroughput_SingleTask 1688.159 ms 1676.869000 ms -0.67%
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1743.627 ms 1731.836000 ms -0.68%
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.836 ms 1.821000 ms -0.82%
VectorAddition_int32 1.491 ms 1.471000 ms -1.34%
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.217 ms 1.197000 ms -1.64%
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.712 ms 1.678000 ms -1.99%
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.067 ms 1.044000 ms -2.16%
Runtime_IndependentDAGTaskThroughput_SingleTask 265.490 ms 257.665000 ms -2.95%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 287.453 ms 273.621000 ms -4.81%
VectorAddition_fp32 1.558 ms 1.439000 ms -7.64%
USM_Allocation_latency_fp32_shared 0.064 ms 0.053000 ms -17.19%
llama.cpp bench
Relative perf in group Ungrouped (6)
Benchmark This PR baseline Change
llama.cpp Prompt Processing Batched 512 423.309840 token/s 420.526 token/s 0.66%
llama.cpp Prompt Processing Batched 256 874.095863 token/s 870.669 token/s 0.39%
llama.cpp Text Generation Batched 128 62.565562 token/s 62.536 token/s 0.05%
llama.cpp Text Generation Batched 256 62.560031 token/s 62.540 token/s 0.03%
llama.cpp Text Generation Batched 512 62.487 token/s 62.526848 token/s -0.06%
llama.cpp Prompt Processing Batched 128 830.491 token/s 837.632259 token/s -0.85%
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 1918.930000 ns 2237.740 ns 16.61%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4427.100000 ns 4801.330 ns 8.45%
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2627.630000 ns 2828.790 ns 7.66%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 295.763000 ns 308.311 ns 4.24%
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy 134.742000 ns 136.220 ns 1.10%
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3133.630000 ns 3157.440 ns 0.76%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3511.250 ns 3272.820000 ns -6.79%
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 96.979700 ns 103.858 ns 7.09%
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 212.352000 ns 218.316 ns 2.81%
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 274.244000 ns 277.043 ns 1.02%
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 499.595000 ns 500.397 ns 0.16%
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 709.164 ns 705.745000 ns -0.48%
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 120.133 ns 118.361000 ns -1.48%
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 196.068 ns 190.897000 ns -2.64%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3413.760000 ns 3598.790 ns 5.42%
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4657.030000 ns 4669.200 ns 0.26%
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 2001.790000 ns 2005.880 ns 0.20%
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy 132.027 ns 131.939000 ns -0.07%
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3375.770 ns 3334.700000 ns -1.22%
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 272.439 ns 261.712000 ns -3.94%
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1279.670 ns 1211.730000 ns -5.31%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 207.541000 ns 254.262 ns 22.51%
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 499.779000 ns 506.641 ns 1.37%
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 300.482000 ns 302.371 ns 0.63%
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 191.603000 ns 192.581 ns 0.51%
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy 202.400000 ns 203.050 ns 0.32%
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.393 ns 119.332000 ns -0.05%
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 756.415 ns 752.618000 ns -0.50%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 754.576000 ns 775.151 ns 2.73%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 855.214000 ns 869.820 ns 1.71%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> 4114.110000 ns 4136.040 ns 0.53%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 984.997 ns 948.150000 ns -3.74%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 174.742000 ns 176.480 ns 0.99%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> 348.183000 ns 351.644 ns 0.99%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 609.229000 ns 613.537 ns 0.71%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 966.426000 ns 966.457 ns 0.00%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1747590.000000 ns 1805140.000 ns 3.29%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy 708107.000000 ns 712838.000 ns 0.67%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> 517037.000000 ns 518305.000 ns 0.25%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 46936.600 ns 46813.200000 ns -0.26%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32410.400 ns 31644.300000 ns -2.36%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1303400.000 ns 1229410.000000 ns -5.68%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1316330.000 ns 1161880.000000 ns -11.73%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24340.100000 ns 25904.100 ns 6.43%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 142882.000000 ns 148394.000 ns 3.86%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 208239.000000 ns 213417.000 ns 2.49%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 164225.000000 ns 165995.000 ns 1.08%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15395.600000 ns 15536.500 ns 0.92%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4236.460000 ns 4259.720 ns 0.55%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy 117484.000000 ns 118110.000 ns 0.53%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 633790.000000 ns 641556.000 ns 1.23%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 139976.000000 ns 140363.000 ns 0.28%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy 666338.000000 ns 667489.000 ns 0.17%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 75331.300 ns 75247.200000 ns -0.11%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25140.600000 ns 25814.900 ns 2.68%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> 59600.900000 ns 60016.500 ns 0.70%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 31241.900 ns 31174.000000 ns -0.22%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy 132302.000 ns 131884.000000 ns -0.32%

Details

Benchmark details contain too many chars to display

Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13369553246

Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13369553246
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)

Improved 21 (threshold 2.00%)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_device 0.051000 ms 0.075 ms 47.06%
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1746.340000 ns 2005.880 ns 14.86%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 682.180000 ns 775.151 ns 13.63%
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2495.030000 ns 2828.790 ns 13.38%
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3296.830000 ns 3598.790 ns 9.16%
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 2934.820000 ns 3157.440 ns 7.59%
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 97.025600 ns 103.858 ns 7.04%
MolecularDynamics 0.029000 ms 0.031 ms 6.90%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24528.300000 ns 25904.100 ns 5.61%
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2122.800000 ns 2237.740 ns 5.41%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4599.240000 ns 4801.330 ns 4.39%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1187100.000000 ns 1229410.000 ns 3.56%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15002.000000 ns 15536.500 ns 3.56%
Polybench_Atax 6.224000 ms 6.418 ms 3.12%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 595.246000 ns 613.537 ns 3.07%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 299.300000 ns 308.311 ns 3.01%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 623116.000000 ns 641556.000 ns 2.96%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 846.881000 ns 869.820 ns 2.71%
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.097000 μs 2.150 μs 2.53%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 73415.400000 ns 75247.200 ns 2.50%
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.771000 ms 4.881 ms 2.31%
Regressed 11 (threshold 2.00%)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_shared 0.064 ms 0.053000 ms -17.19%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 1114.640 ns 948.150000 ns -14.94%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3478.340 ns 3272.820000 ns -5.91%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 286.523 ms 273.621000 ms -4.50%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 33106.700 ns 31644.300000 ns -4.42%
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 264.756 ns 254.262000 ns -3.96%
Runtime_IndependentDAGTaskThroughput_SingleTask 267.557 ms 257.665000 ms -3.70%
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 284.400 ns 277.043000 ns -2.59%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1190810.000 ns 1161880.000000 ns -2.43%
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 309.859 ns 302.371000 ns -2.42%
VectorAddition_fp32 1.470 ms 1.439000 ms -2.11%

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)
Benchmark This PR baseline Change
api_overhead_benchmark_ur SubmitKernel out of order 15.459000 μs 15.729 μs 1.75%
api_overhead_benchmark_ur SubmitKernel in order with measure completion 20.962000 μs 21.172 μs 1.00%
api_overhead_benchmark_l0 SubmitKernel out of order 11.781000 μs 11.854 μs 0.62%
api_overhead_benchmark_sycl SubmitKernel in order 24.135000 μs 24.195 μs 0.25%
api_overhead_benchmark_sycl SubmitKernel out of order 22.870 μs 22.788000 μs -0.36%
api_overhead_benchmark_ur SubmitKernel in order 16.323 μs 16.210000 μs -0.69%
api_overhead_benchmark_l0 SubmitKernel in order 11.845 μs 11.662000 μs -1.54%
Relative perf in group Other (17)
Benchmark This PR baseline Change
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.097000 μs 2.150 μs 2.53%
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events 40751.379000 μs 41326.322 μs 1.41%
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events 110280.410000 μs 111607.803 μs 1.20%
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 3.208000 GB/s 3.172 GB/s 1.13%
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.683000 μs 5.739 μs 0.99%
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 250.160000 μs 251.433 μs 0.51%
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 133.862000 μs 134.462 μs 0.45%
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 8711.600000 μs 8738.599 μs 0.31%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 47595.884000 μs 47693.770 μs 0.21%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 7464.570000 μs 7473.513 μs 0.12%
miscellaneous_benchmark_sycl VectorSum 861.253000 bw GB/s 861.843 bw GB/s 0.07%
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1190.553000 μs 1190.999 μs 0.04%
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 17281.431000 μs 17286.218 μs 0.03%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 26030.833 μs 25995.334000 μs -0.14%
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 6962.616 μs 6930.233000 μs -0.47%
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 2101.985 μs 2070.294000 μs -1.51%
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.713 μs 1.684000 μs -1.69%
Relative perf in group SinKernelGraph (4)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 353373.123000 μs 353572.971 μs 0.06%
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 72659.797000 μs 72693.457 μs 0.05%
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 71747.951 μs 71736.788000 μs -0.02%
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 353253.507 μs 353031.159000 μs -0.06%
Relative perf in group SubmitGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 673.919 μs 672.596000 μs -0.20%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 62.121 μs 61.705000 μs -0.67%
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 54.589 μs 53.871000 μs -1.32%
Relative perf in group ExecGraph (3)
Benchmark This PR baseline Change
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 56454.829000 μs 56544.632 μs 0.16%
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 5583.459000 μs 5585.971 μs 0.04%
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 5597.457 μs 5596.380000 μs -0.02%
Relative perf in group SubmitKernel CPU count (3)
Benchmark This PR baseline Change
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count 122806.000000 instr 123120.000 instr 0.26%
api_overhead_benchmark_ur SubmitKernel out of order CPU count 104593.000000 instr 104593.000 instr 0.00%
api_overhead_benchmark_ur SubmitKernel in order CPU count 109936.000000 instr 109936.000 instr 0.00%
Velocity Bench
Relative perf in group Other (8)
Benchmark This PR baseline Change
Velocity-Bench dl-mnist 2.700000 s 2.730 s 1.11%
Velocity-Bench Hashtable 359.868959 M keys/sec 357.844 M keys/sec 0.57%
Velocity-Bench QuickSilver 118.440000 MMS/CTT 118.240 MMS/CTT 0.17%
Velocity-Bench svm 0.134200 s 0.134 s 0.15%
Velocity-Bench dl-cifar 23.818200 s 23.846 s 0.12%
Velocity-Bench CudaSift 203.002000 ms 203.111 ms 0.05%
Velocity-Bench Bitcracker 35.512700 s 35.521 s 0.02%
Velocity-Bench Sobel Filter 606.108 ms 595.796000 ms -1.70%
SYCL-Bench
Relative perf in group Other (54)
Benchmark This PR baseline Change
USM_Allocation_latency_fp32_device 0.051000 ms 0.075 ms 47.06%
MolecularDynamics 0.029000 ms 0.031 ms 6.90%
Polybench_Atax 6.224000 ms 6.418 ms 3.12%
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.771000 ms 4.881 ms 2.31%
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.773000 ms 4.867 ms 1.97%
VectorAddition_int32 1.447000 ms 1.471 ms 1.66%
ScalarProduct_NDRange_int32 3.750000 ms 3.792 ms 1.12%
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.797000 ms 4.844 ms 0.98%
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 5.039000 ms 5.087 ms 0.95%
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 5.150000 ms 5.195 ms 0.87%
Pattern_Reduction_Hierarchical_int32 16.828000 ms 16.961 ms 0.79%
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.968000 ms 4.992 ms 0.48%
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.712000 ms 4.733 ms 0.45%
ScalarProduct_NDRange_int64 5.439000 ms 5.462 ms 0.42%
Polybench_2mm 1.039000 ms 1.043 ms 0.38%
ScalarProduct_Hierarchical_int64 11.474000 ms 11.518 ms 0.38%
VectorAddition_int64 3.061000 ms 3.069 ms 0.26%
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.791000 ms 4.803 ms 0.25%
ScalarProduct_Hierarchical_fp32 10.145000 ms 10.169 ms 0.24%
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.675000 ms 1.678 ms 0.18%
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 273.506000 ms 273.979 ms 0.17%
MicroBench_LocalMem_fp32_4096 29.819000 ms 29.866 ms 0.16%
Pattern_SegmentedReduction_NDRange_int16 2.262000 ms 2.265 ms 0.13%
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 275.267000 ms 275.524 ms 0.09%
Pattern_SegmentedReduction_NDRange_fp32 2.164000 ms 2.166 ms 0.09%
Pattern_SegmentedReduction_NDRange_int64 2.335000 ms 2.337 ms 0.09%
Kmeans_fp32 14.078000 ms 14.089 ms 0.08%
Pattern_SegmentedReduction_Hierarchical_fp32 11.583000 ms 11.589 ms 0.05%
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 616.892000 ms 616.925 ms 0.01%
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 617.538000 ms 617.571 ms 0.01%
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 617.549000 ms 617.565 ms 0.00%
Pattern_SegmentedReduction_NDRange_int32 2.163000 ms 2.163 ms 0.00%
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.065 ms 617.060000 ms -0.00%
Pattern_SegmentedReduction_Hierarchical_int16 11.801 ms 11.800000 ms -0.01%
USM_Allocation_latency_fp32_host 37.509 ms 37.499000 ms -0.03%
Pattern_SegmentedReduction_Hierarchical_int32 11.590 ms 11.584000 ms -0.05%
Runtime_DAGTaskThroughput_NDRangeParallelFor 1693.687 ms 1692.563000 ms -0.07%
Pattern_SegmentedReduction_Hierarchical_int64 11.775 ms 11.767000 ms -0.07%
MicroBench_LocalMem_int32_4096 29.915 ms 29.875000 ms -0.13%
Polybench_3mm 1.482 ms 1.480000 ms -0.13%
ScalarProduct_Hierarchical_int32 10.533 ms 10.513000 ms -0.19%
Runtime_DAGTaskThroughput_BasicParallelFor 1756.541 ms 1751.649000 ms -0.28%
LinearRegressionCoeff_fp32 918.741 ms 915.761000 ms -0.32%
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.827 ms 1.821000 ms -0.33%
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1738.041 ms 1731.836000 ms -0.36%
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.048 ms 1.044000 ms -0.38%
ScalarProduct_NDRange_fp32 3.765 ms 3.750000 ms -0.40%
Runtime_DAGTaskThroughput_SingleTask 1686.891 ms 1676.869000 ms -0.59%
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.210 ms 1.197000 ms -1.07%
Pattern_Reduction_NDRange_int32 16.878 ms 16.656000 ms -1.32%
VectorAddition_fp32 1.470 ms 1.439000 ms -2.11%
Runtime_IndependentDAGTaskThroughput_SingleTask 267.557 ms 257.665000 ms -3.70%
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 286.523 ms 273.621000 ms -4.50%
USM_Allocation_latency_fp32_shared 0.064 ms 0.053000 ms -17.19%
llama.cpp bench
Relative perf in group Other (6)
Benchmark This PR baseline Change
llama.cpp Prompt Processing Batched 512 421.813576 token/s 420.526 token/s 0.31%
llama.cpp Text Generation Batched 128 62.502 token/s 62.536385 token/s -0.05%
llama.cpp Text Generation Batched 256 62.480 token/s 62.539887 token/s -0.10%
llama.cpp Text Generation Batched 512 62.460 token/s 62.526848 token/s -0.11%
llama.cpp Prompt Processing Batched 128 831.944 token/s 837.632259 token/s -0.68%
llama.cpp Prompt Processing Batched 256 862.763 token/s 870.668720 token/s -0.91%
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2495.030000 ns 2828.790 ns 13.38%
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 2934.820000 ns 3157.440 ns 7.59%
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2122.800000 ns 2237.740 ns 5.41%
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4599.240000 ns 4801.330 ns 4.39%
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 299.300000 ns 308.311 ns 3.01%
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy 137.299 ns 136.220000 ns -0.79%
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3478.340 ns 3272.820000 ns -5.91%
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 97.025600 ns 103.858 ns 7.04%
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 493.060000 ns 500.397 ns 1.49%
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 701.873000 ns 705.745 ns 0.55%
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 217.504000 ns 218.316 ns 0.37%
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 190.852000 ns 190.897 ns 0.02%
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.367 ns 118.361000 ns -0.84%
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 284.400 ns 277.043000 ns -2.59%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1746.340000 ns 2005.880 ns 14.86%
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3296.830000 ns 3598.790 ns 9.16%
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 258.644000 ns 261.712 ns 1.19%
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1198.800000 ns 1211.730 ns 1.08%
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4648.820000 ns 4669.200 ns 0.44%
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3371.340 ns 3334.700000 ns -1.09%
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy 134.341 ns 131.939000 ns -1.79%
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)
Benchmark This PR baseline Change
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy 201.647000 ns 203.050 ns 0.70%
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 118.890000 ns 119.332 ns 0.37%
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 753.010 ns 752.618000 ns -0.05%
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 507.982 ns 506.641000 ns -0.26%
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 196.427 ns 192.581000 ns -1.96%
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 309.859 ns 302.371000 ns -2.42%
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 264.756 ns 254.262000 ns -3.96%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 682.180000 ns 775.151 ns 13.63%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 846.881000 ns 869.820 ns 2.71%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> 4211.600 ns 4136.040000 ns -1.79%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 1114.640 ns 948.150000 ns -14.94%
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)
Benchmark This PR baseline Change
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 595.246000 ns 613.537 ns 3.07%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 959.568000 ns 966.457 ns 0.72%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> 351.671 ns 351.644000 ns -0.01%
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 176.597 ns 176.480000 ns -0.07%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1187100.000000 ns 1229410.000 ns 3.56%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy 704649.000000 ns 712838.000 ns 1.16%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1794930.000000 ns 1805140.000 ns 0.57%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> 518163.000000 ns 518305.000 ns 0.03%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 46921.100 ns 46813.200000 ns -0.23%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1190810.000 ns 1161880.000000 ns -2.43%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 33106.700 ns 31644.300000 ns -4.42%
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)
Benchmark This PR baseline Change
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24528.300000 ns 25904.100 ns 5.61%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15002.000000 ns 15536.500 ns 3.56%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 163353.000000 ns 165995.000 ns 1.62%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 146586.000000 ns 148394.000 ns 1.23%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy 117076.000000 ns 118110.000 ns 0.88%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4288.090 ns 4259.720000 ns -0.66%
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 215507.000 ns 213417.000000 ns -0.97%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 623116.000000 ns 641556.000 ns 2.96%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 73415.400000 ns 75247.200 ns 2.50%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy 659960.000000 ns 667489.000 ns 1.14%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 140482.000 ns 140363.000000 ns -0.08%
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)
Benchmark This PR baseline Change
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25632.900000 ns 25814.900 ns 0.71%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy 131839.000000 ns 131884.000 ns 0.03%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> 60016.500000 ns 60016.500 ns 0.00%
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 31185.100 ns 31174.000000 ns -0.04%

Details

Benchmark details contain too many chars to display
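
For reference, the Change column in the tables above is consistent with a relative change computed against this PR's result, so positive values mean the PR improved on the saved run for lower-is-better metrics. A minimal sketch of that calculation (the function name and the lower-is-better assumption are mine, not taken from the PR's code):

```python
def relative_change_percent(this_pr: float, baseline: float) -> float:
    """Relative perf change, positive when this PR improves on the baseline.

    Illustrative only: assumes a lower-is-better metric such as time in ns,
    and is not the PR's actual implementation.
    """
    return (baseline / this_pr - 1.0) * 100.0

# Matches the multiple_malloc_free/threads:1 jemalloc_pool row above:
# relative_change_percent(24528.3, 25904.1) -> ~5.61
```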

@EuphoricThinking force-pushed the benchmark_markdown branch 3 times, most recently from 36ff0a5 to 7028a34 on February 17, 2025 at 15:21
@@ -37,11 +37,16 @@ By default, the benchmark results are not stored. To store them, use the option

To compare a benchmark run with a previously stored result, use the option `--compare <name>`. You can compare with more than one result.

Above there's a sentence, "By default, all benchmark runs are compared against baseline, which is a well-established set of the latest data.", which should now be gone, I believe.
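
As a usage illustration of the `--compare` option discussed here (the run names are hypothetical, and the script path and its other required arguments are assumed, not taken from this PR):

```bash
# Hypothetical invocations; other required arguments elided.
python main.py ... --save baseline-v1
python main.py ... --compare baseline-v1 --compare experiment-a
```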

## Output formats
You can display the results as an HTML file by using `--output-html` and as a markdown file by using `--output-markdown`. Due to character limits for posting PR comments, the final content of the markdown file might be reduced. In order to obtain the full markdown output, use `--output-markdown full`.


one redundant empty line
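
A quick sketch of the output modes described above (the script path and other required arguments are assumed here, not taken from this PR):

```bash
# Hypothetical invocations; other required arguments elided.
python main.py ... --output-html
python main.py ... --output-markdown        # size-limited, fits a PR comment
python main.py ... --output-markdown full   # complete, untruncated output
```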

# Generate the row with the best value highlighted
# Generate the row with all the results from saved runs specified by
# --compare,
# Highight the best value in the row with data
still a misspell ;d
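
For context on what the commented code does, here is a minimal sketch of building one markdown table row in which the best of several compared results is emphasized; all names are illustrative and this is not the PR's implementation:

```python
def build_row(benchmark: str, results_ns: list[float]) -> str:
    """Render one markdown table row, bolding the best (lowest) value.

    Illustrative only: assumes lower-is-better metrics and one column per
    saved run passed via --compare.
    """
    best = min(results_ns)
    cells = [
        f"**{value:.3f} ns**" if value == best else f"{value:.3f} ns"
        for value in results_ns
    ]
    return "| " + " | ".join([benchmark, *cells]) + " |"

# Example:
# build_row("multiple_malloc_free/.../threads:1 glibc", [4328.22, 4259.72])
```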

add an option for limiting markdown content size
calculate relative performance with different baselines
calculate relative performance using only already saved data
group results according to suite names and explicit groups
add multiple data columns if multiple --compare specified

Compute Benchmarks level_zero run (with params: --output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13374078367

Compute Benchmarks level_zero run (--output-markdown):
https://github.com/oneapi-src/unified-runtime/actions/runs/13374078367
Job status: success. Test status: success.

Summary

(Emphasized values are the best results)
No diffs to calculate performance change

Performance change in benchmark groups

Compute Benchmarks
Relative perf in group SubmitKernel (7)
Benchmark This PR
api_overhead_benchmark_l0 SubmitKernel out of order 11.776000 μs
api_overhead_benchmark_l0 SubmitKernel in order 11.922000 μs
api_overhead_benchmark_sycl SubmitKernel out of order 22.947000 μs
api_overhead_benchmark_sycl SubmitKernel in order 24.499000 μs
api_overhead_benchmark_ur SubmitKernel out of order 15.787000 μs
api_overhead_benchmark_ur SubmitKernel in order 16.459000 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion 21.155000 μs
Relative perf in group Other (17)
Benchmark This PR
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 255.950000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 134.070000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.679000 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 3.169000 GB/s
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.160000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.730000 μs
miscellaneous_benchmark_sycl VectorSum 856.272000 bw GB/s
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 6922.979000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 17208.941000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 48159.835000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 2087.698000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 7556.543000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 8684.467000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 25857.625000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1195.338000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events 40981.923000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events 111831.648000 μs
Relative perf in group SinKernelGraph (4)
Benchmark This PR
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 71729.094000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 72595.653000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 353444.654000 μs
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 353397.056000 μs
Relative perf in group SubmitGraph (3)
Benchmark This PR
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 54.338000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 62.270000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 677.416000 μs
Relative perf in group ExecGraph (3)
Benchmark This PR
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 5597.346000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 5611.676000 μs
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 56483.549000 μs
Relative perf in group SubmitKernel CPU count (3)
Benchmark This PR
api_overhead_benchmark_ur SubmitKernel out of order CPU count 104593.000000 instr
api_overhead_benchmark_ur SubmitKernel in order CPU count 109936.000000 instr
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count 122806.000000 instr
Velocity Bench
Relative perf in group Other (8)
Benchmark This PR
Velocity-Bench Hashtable 357.020483 M keys/sec
Velocity-Bench Bitcracker 35.410900 s
Velocity-Bench CudaSift 202.614000 ms
Velocity-Bench QuickSilver 117.350000 MMS/CTT
Velocity-Bench Sobel Filter 608.239000 ms
Velocity-Bench dl-cifar 23.907000 s
Velocity-Bench dl-mnist 2.740000 s
Velocity-Bench svm 0.135400 s
SYCL-Bench
Relative perf in group Other (53)
Benchmark This PR
Runtime_IndependentDAGTaskThroughput_SingleTask 268.524000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 293.891000 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 276.167000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 278.026000 ms
Runtime_DAGTaskThroughput_SingleTask 1697.255000 ms
Runtime_DAGTaskThroughput_BasicParallelFor 1767.704000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1751.062000 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor 1711.114000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.848000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.858000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.825000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.756000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 617.570000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 617.605000 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.764000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 5.090000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 4.909000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.934000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.027000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 616.918000 ms
MicroBench_LocalMem_int32_4096 29.927000 ms
MicroBench_LocalMem_fp32_4096 29.881000 ms
Pattern_Reduction_NDRange_int32 16.968000 ms
Pattern_Reduction_Hierarchical_int32 16.685000 ms
ScalarProduct_NDRange_int32 3.760000 ms
ScalarProduct_NDRange_int64 5.458000 ms
ScalarProduct_NDRange_fp32 3.741000 ms
ScalarProduct_Hierarchical_int32 10.530000 ms
ScalarProduct_Hierarchical_int64 11.497000 ms
ScalarProduct_Hierarchical_fp32 10.164000 ms
Pattern_SegmentedReduction_NDRange_int16 2.265000 ms
Pattern_SegmentedReduction_NDRange_int32 2.161000 ms
Pattern_SegmentedReduction_NDRange_int64 2.333000 ms
Pattern_SegmentedReduction_NDRange_fp32 2.167000 ms
Pattern_SegmentedReduction_Hierarchical_int16 11.804000 ms
Pattern_SegmentedReduction_Hierarchical_int32 11.587000 ms
Pattern_SegmentedReduction_Hierarchical_int64 11.770000 ms
Pattern_SegmentedReduction_Hierarchical_fp32 11.588000 ms
USM_Allocation_latency_fp32_device 0.064000 ms
USM_Allocation_latency_fp32_host 37.563000 ms
USM_Allocation_latency_fp32_shared 0.055000 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.672000 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.048000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.814000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.205000 ms
VectorAddition_int32 1.460000 ms
VectorAddition_int64 3.130000 ms
VectorAddition_fp32 1.486000 ms
Polybench_2mm 1.042000 ms
Polybench_3mm 1.480000 ms
Polybench_Atax 6.390000 ms
Kmeans_fp32 14.046000 ms
MolecularDynamics 0.030000 ms
llama.cpp bench
Relative perf in group Other (6)
Benchmark This PR
llama.cpp Prompt Processing Batched 128 827.954592 token/s
llama.cpp Text Generation Batched 128 62.510353 token/s
llama.cpp Prompt Processing Batched 256 868.599537 token/s
llama.cpp Text Generation Batched 256 62.527584 token/s
llama.cpp Prompt Processing Batched 512 420.065042 token/s
llama.cpp Text Generation Batched 512 62.487604 token/s
UMF
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (7)
Benchmark This PR
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2742.670000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 2067.960000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3129.460000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4559.790000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3355.410000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 295.297000 ns
alloc/size:10000/0/4096/iterations:200000/threads:4 umfProxy 135.108000 ns
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (7)
Benchmark This PR
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 707.619000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 188.767000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 275.108000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 495.126000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 119.030000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 214.015000 ns
alloc/size:10000/0/4096/iterations:200000/threads:1 umfProxy 97.362400 ns
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (7)
Benchmark This PR
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1444.760000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 1980.130000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3256.670000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 disjoint_pool<os_provider> 4472.580000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 jemalloc_pool<os_provider> 3525.540000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 262.952000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:4 umfProxy 132.116000 ns
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (7)
Benchmark This PR
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 978.814000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 194.471000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 305.558000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 disjoint_pool<os_provider> 496.014000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 jemalloc_pool<os_provider> 118.534000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 269.829000 ns
alloc/size:10000/100000/4096/iterations:200000/threads:1 umfProxy 201.943000 ns
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 (4)
Benchmark This PR
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 854.820000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 jemalloc_pool<os_provider> 4153.800000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 1013.870000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 umfProxy 722.673000 ns
Relative perf in group alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 (4)
Benchmark This PR
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 175.874000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 jemalloc_pool<os_provider> 345.899000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 969.503000 ns
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 umfProxy 614.514000 ns
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 (7)
Benchmark This PR
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32389.300000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1183460.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1182330.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 disjoint_pool<os_provider> 1740640.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 jemalloc_pool<os_provider> 512444.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 48513.600000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 umfProxy 705291.000000 ns
Relative perf in group multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 (7)
Benchmark This PR
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4328.220000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 164126.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 143840.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 disjoint_pool<os_provider> 213370.000000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 jemalloc_pool<os_provider> 24692.900000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 15469.900000 ns
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 umfProxy 117218.000000 ns
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 (4)
Benchmark This PR
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 140408.000000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 jemalloc_pool<os_provider> 623862.000000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 77464.100000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 umfProxy 664466.000000 ns
Relative perf in group multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 (4)
Benchmark This PR
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 30428.300000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 jemalloc_pool<os_provider> 60592.000000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25434.600000 ns
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 umfProxy 131628.000000 ns

Details

Benchmark details contain too many chars to display

@pbalcer merged commit 8682bbc into oneapi-src:main on Feb 17, 2025
26 checks passed