Improve throughput for hot cache file reading on GH200 #629

GregoryKimball opened this issue Feb 8, 2025 · 2 comments

When reading hot cache files with KvikIO's threadpool, we see good utilization of the PCIe bandwidth on x86-H100 systems. However, we see poor utilization of the C2C bandwidth on GH200 systems.
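For reference, the KvikIO threadpool read path can be exercised in isolation (without cuDF's Parquet decode) with something along these lines. This is only a sketch; the kvikio Python API names used here are assumptions and may differ by KvikIO version.

import os
import time
import cupy
import kvikio

path = '/raid/gkimball/tmp.pq'
nbytes = os.path.getsize(path)
buf = cupy.empty(nbytes, dtype='u1')   # destination device buffer

for r in range(5):
    t0 = time.time()
    # hot cache file -> device memory via KvikIO's threadpool
    f = kvikio.CuFile(path, 'r')
    f.read(buf)
    f.close()
    t1 = time.time()
    print(f"kvikio read: {nbytes / (t1 - t0) / 1e9:.1f} GB/s")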

Here is an example that writes a 1.2 GB Parquet file, uncompressed and plain-encoded, and then reads it back both as a hot cache file and as an in-memory host buffer.

import cudf
import cupy
import rmm
import nvtx
import time
from io import BytesIO

# Use the CUDA async memory resource for device allocations
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

nrows = int(1.6 * 10**8)  # ~1.2 GB of float64 data
df = cudf.DataFrame({
    'a': cupy.random.rand(nrows)
})
df.to_parquet(
    '/raid/gkimball/tmp.pq',
    compression=None,
    column_encoding='PLAIN',
)


# Read the file from the OS page cache ("hot cache") via KvikIO's threadpool
for r in range(10):
    with nvtx.annotate("read hot cache file"):
        t0 = time.time()
        _ = cudf.read_parquet('/raid/gkimball/tmp.pq')
        t1 = time.time()
        print(f"read hot cache file: {t1-t0}")


# Serialize the same data to an in-memory host buffer
buf = BytesIO()
df.to_parquet(
    buf,
    compression=None,
    column_encoding='PLAIN',
)

# Read the same bytes from a pageable host buffer
for r in range(10):
    with nvtx.annotate("read host buffer"):
        buf.seek(0)
        t0 = time.time()
        _ = cudf.read_parquet(buf)
        t1 = time.time()
        print(f"read host buffer: {t1-t0}")

On x86, the hot cache file read takes 63 ms and the host buffer read takes 130 ms. This suggests that the KvikIO threadpool may be more efficient than the CUDA driver at moving pageable host data over the PCIe bus (so perhaps we should consider re-opening #456).

More importantly, on GH200 the hot cache file read takes 60 ms while the host buffer read takes only 13 ms, roughly 20 GB/s versus roughly 92 GB/s for the 1.2 GB file. This suggests that the KvikIO threadpool is much less efficient than the CUDA driver at moving pageable host data over the C2C interconnect. We should develop a new default setting for file reading on GH200 that gets closer to the throughput of pageable host buffer copying.
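As a rough upper bound for the host buffer path, one can time a plain pageable host-to-device copy of a similarly sized array with CuPy. This is just a sketch for comparison, not part of the reproducer above.

import time
import numpy
import cupy

# ~1.2 GB of pageable (not pinned) host memory, same size as the Parquet file above
host = numpy.random.rand(int(1.6 * 10**8))

for r in range(5):
    t0 = time.time()
    dev = cupy.asarray(host)               # pageable host -> device copy via the CUDA driver
    cupy.cuda.runtime.deviceSynchronize()
    t1 = time.time()
    print(f"pageable H2D copy: {host.nbytes / (t1 - t0) / 1e9:.1f} GB/s")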

@GregoryKimball (Author)

Attachment: 629 profiles.zip

@GregoryKimball (Author)

So far the fastest I could get was about 50 GiB/s on GH200. If you go wide and use all 72 threads, and increase the task size and bounce buffer size to 16 MB, you can push things a bit further.

In the profile, you see all the threads spin up, but it takes 9 ms before the first copy happens. Somehow the threads are getting serialized, perhaps somewhere in the OS memory management system.
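For reference, this is roughly how those settings can be applied. The KvikIO environment variable names below are assumptions from memory; they must be set before KvikIO is initialized (e.g., before importing cudf).

import os

# Assumed KvikIO tuning knobs; set before importing cudf so KvikIO picks them up
os.environ["KVIKIO_NTHREADS"] = "72"                              # go wide across all 72 CPU cores on GH200
os.environ["KVIKIO_TASK_SIZE"] = str(16 * 1024 * 1024)            # 16 MiB per task
os.environ["KVIKIO_BOUNCE_BUFFER_SIZE"] = str(16 * 1024 * 1024)   # 16 MiB bounce buffers

import cudf
_ = cudf.read_parquet("/raid/gkimball/tmp.pq")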

[Profiler screenshot: all threads spin up, but the first copy does not start for ~9 ms]
