When reading hot cache files with KvikIO's threadpool, we see good utilization of the PCIe bandwidth on x86-H100 systems. However, we see poor utilization of the C2C bandwidth on GH200 systems.
Here is an example that writes a 1.2 GB Parquet file, uncompressed and plain encoded, and then reads it back both as a hot cache file and as a host buffer.
import time
from io import BytesIO

import cudf
import cupy
import nvtx
import rmm

# Use the async pool so device allocations don't dominate the timings.
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# 1.6e8 float64 values is roughly 1.2 GB of data.
nrows = int(1.6 * 10**8)
df = cudf.DataFrame({
    'a': cupy.random.rand(nrows)
})

# Write an uncompressed, plain-encoded parquet file to disk.
df.to_parquet(
    '/raid/gkimball/tmp.pq',
    compression=None,
    column_encoding='PLAIN',
)

# Read the file repeatedly so it stays hot in the page cache.
for r in range(10):
    with nvtx.annotate("read hot cache file"):
        t0 = time.time()
        _ = cudf.read_parquet('/raid/gkimball/tmp.pq')
        t1 = time.time()
        print(f"read hot cache file: {t1 - t0}")

# Write the same table to an in-memory host buffer.
buf = BytesIO()
df.to_parquet(
    buf,
    compression=None,
    column_encoding='PLAIN',
)

# Read the host buffer repeatedly.
for r in range(10):
    with nvtx.annotate("read host buffer"):
        buf.seek(0)
        t0 = time.time()
        _ = cudf.read_parquet(buf)
        t1 = time.time()
        print(f"read host buffer: {t1 - t0}")
On x86, we see that the hot cache file takes 63 ms and the host buffer takes 130 ms. This suggests that the KvikIO threadpool may be more efficient than the CUDA driver at moving pageable host data over the PCIe bus (so perhaps we should consider re-opening #456).
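For reference, here is a minimal sketch of the baseline being compared against: a plain pageable host-to-device copy of a similarly sized (~1.2 GB) buffer through the CUDA driver. This measures only the raw copy path, not the full read_parquet path, and the array size is just chosen to match the example above.

import time

import cupy
import numpy

# ~1.2 GB of pageable (non-pinned) host memory, matching the parquet example.
host = numpy.random.rand(int(1.6 * 10**8))

for _ in range(5):
    t0 = time.time()
    dev = cupy.asarray(host)  # pageable H2D copy through the CUDA driver
    cupy.cuda.runtime.deviceSynchronize()
    t1 = time.time()
    print(f"pageable H2D copy: {t1 - t0:.4f} s, "
          f"{host.nbytes / (t1 - t0) / 1e9:.1f} GB/s")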
More importantly, on GH200 we see that the hot cache file takes 60 ms while the host buffer takes only 13 ms. This suggests that the KvikIO threadpool is much less efficient than the CUDA driver at moving pageable host data over the C2C interconnect. We should develop a new default setting for file reading on GH200 that gets closer to the throughput of pageable host-buffer copying.
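One way to experiment with such a default is through KvikIO's runtime settings, which can be set via environment variables before KvikIO (or cuDF) is imported. The variable names below come from KvikIO's documented settings; the specific values are only illustrative for GH200, not a recommendation from this issue.

import os

# Must be set before kvikio / cudf are imported; values are illustrative only.
os.environ["KVIKIO_NTHREADS"] = "72"                             # go wide on Grace's 72 cores
os.environ["KVIKIO_TASK_SIZE"] = str(16 * 1024 * 1024)           # 16 MiB tasks
os.environ["KVIKIO_BOUNCE_BUFFER_SIZE"] = str(16 * 1024 * 1024)  # 16 MiB bounce buffer

import cudf  # noqa: E402

df = cudf.read_parquet('/raid/gkimball/tmp.pq')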
So far the fastest I could get was about 50 GiB/s on GH200. If you go wide and use all 72 threads, and increase the task size and bounce buffer size to 16 MiB, you can push things a bit further.
You can see all the threads spin up, but it takes 9 ms before the first copy happens. Somehow the threads are getting serialized, maybe somewhere in the OS memory management.
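To poke at where that serialization happens, here is a minimal stand-alone probe (my own sketch, not the KvikIO code path): it issues pageable host-to-device copies from several threads, each on its own stream, and prints when each thread's first copy actually starts and finishes.

import time
from concurrent.futures import ThreadPoolExecutor

import cupy
import numpy

NTHREADS = 8
CHUNK_BYTES = 16 * 1024 * 1024  # 16 MiB per copy, matching the task size above

# One pageable host chunk per thread.
chunks = [numpy.random.rand(CHUNK_BYTES // 8) for _ in range(NTHREADS)]

start = time.perf_counter()

def copy_chunk(i):
    # Each thread copies on its own stream so the copies could, in principle, overlap.
    with cupy.cuda.Stream():
        t_first = time.perf_counter() - start   # when this thread starts its copy
        dev = cupy.asarray(chunks[i])           # pageable H2D copy
        cupy.cuda.get_current_stream().synchronize()
        t_done = time.perf_counter() - start
    return i, t_first, t_done

with ThreadPoolExecutor(max_workers=NTHREADS) as pool:
    for i, t_first, t_done in pool.map(copy_chunk, range(NTHREADS)):
        print(f"thread {i}: first copy at {t_first * 1e3:6.2f} ms, "
              f"done at {t_done * 1e3:6.2f} ms")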