Improve throughput for hot cache file reading on GH200 #629

GregoryKimball opened this issue Feb 8, 2025 · 2 comments

When reading hot cache files with KvikIO's threadpool, we see good utilization of the PCIe bandwidth on x86-H100 systems. However, we see poor utilization of the C2C bandwidth on GH200 systems.
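For reference, the KvikIO threadpool read path can be exercised in isolation (without cuDF's Parquet decode) with something along these lines. This is only a sketch; the kvikio Python API names used here are assumptions and may differ by KvikIO version.

import os
import time
import cupy
import kvikio

path = '/raid/gkimball/tmp.pq'
nbytes = os.path.getsize(path)
buf = cupy.empty(nbytes, dtype='u1')   # destination device buffer

for r in range(5):
    t0 = time.time()
    # hot cache file -> device memory via KvikIO's threadpool
    f = kvikio.CuFile(path, 'r')
    f.read(buf)
    f.close()
    t1 = time.time()
    print(f"kvikio read: {nbytes / (t1 - t0) / 1e9:.1f} GB/s")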

Here is an example that writes a 1.2 GB Parquet file, uncompressed and plain-encoded, and then reads it back both as a hot cache file and as an in-memory host buffer.

import cudf
import cupy
import rmm
import nvtx
import time
from io import BytesIO

# Use the CUDA async memory resource for device allocations
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

nrows = int(1.6 * 10**8)  # ~1.2 GB of float64 data
df = cudf.DataFrame({
    'a': cupy.random.rand(nrows)
})
df.to_parquet(
    '/raid/gkimball/tmp.pq',
    compression=None,
    column_encoding='PLAIN',
)


# Read the file from the OS page cache ("hot cache") via KvikIO's threadpool
for r in range(10):
    with nvtx.annotate("read hot cache file"):
        t0 = time.time()
        _ = cudf.read_parquet('/raid/gkimball/tmp.pq')
        t1 = time.time()
        print(f"read hot cache file: {t1-t0}")


# Serialize the same data to an in-memory host buffer
buf = BytesIO()
df.to_parquet(
    buf,
    compression=None,
    column_encoding='PLAIN',
)

# Read the same bytes from a pageable host buffer
for r in range(10):
    with nvtx.annotate("read host buffer"):
        buf.seek(0)
        t0 = time.time()
        _ = cudf.read_parquet(buf)
        t1 = time.time()
        print(f"read host buffer: {t1-t0}")

On x86, the hot cache file read takes 63 ms and the host buffer read takes 130 ms. This suggests that the KvikIO threadpool may be more efficient than the CUDA driver at moving pageable host data over the PCIe bus (so perhaps we should consider re-opening #456).

More importantly, on GH200 the hot cache file read takes 60 ms while the host buffer read takes only 13 ms, roughly 20 GB/s versus roughly 92 GB/s for the 1.2 GB file. This suggests that the KvikIO threadpool is much less efficient than the CUDA driver at moving pageable host data over the C2C interconnect. We should develop a new default setting for file reading on GH200 that gets closer to the throughput of pageable host buffer copying.
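As a rough upper bound for the host buffer path, one can time a plain pageable host-to-device copy of a similarly sized array with CuPy. This is just a sketch for comparison, not part of the reproducer above.

import time
import numpy
import cupy

# ~1.2 GB of pageable (not pinned) host memory, same size as the Parquet file above
host = numpy.random.rand(int(1.6 * 10**8))

for r in range(5):
    t0 = time.time()
    dev = cupy.asarray(host)               # pageable host -> device copy via the CUDA driver
    cupy.cuda.runtime.deviceSynchronize()
    t1 = time.time()
    print(f"pageable H2D copy: {host.nbytes / (t1 - t0) / 1e9:.1f} GB/s")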

@GregoryKimball (Author)

Attachment: 629 profiles.zip

@GregoryKimball (Author)

So far the fastest I could get was about 50 GiB/s on GH200. If you go wide and use all 72 threads, and increase the task size and bounce buffer size to 16 MB, you can push things a bit further.

In the profile, you see all the threads spin up, but it takes 9 ms before the first copy happens. Somehow the threads are getting serialized, perhaps somewhere in the OS memory management system.
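For reference, this is roughly how those settings can be applied. The KvikIO environment variable names below are assumptions from memory; they must be set before KvikIO is initialized (e.g., before importing cudf).

import os

# Assumed KvikIO tuning knobs; set before importing cudf so KvikIO picks them up
os.environ["KVIKIO_NTHREADS"] = "72"                              # go wide across all 72 CPU cores on GH200
os.environ["KVIKIO_TASK_SIZE"] = str(16 * 1024 * 1024)            # 16 MiB per task
os.environ["KVIKIO_BOUNCE_BUFFER_SIZE"] = str(16 * 1024 * 1024)   # 16 MiB bounce buffers

import cudf
_ = cudf.read_parquet("/raid/gkimball/tmp.pq")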

[Profiler screenshot: all threads spin up, but the first copy does not start for ~9 ms]
