Errors when running allgather with npkit_trace_generator.py #33

Open
zhuo121 opened this issue Nov 19, 2024 · 0 comments
zhuo121 commented Nov 19, 2024

Issue

When I use npkit_trace_generator.py to convert the trace files generated by NPKit to a JSON file, I get the following error.

Traceback (most recent call last):
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 232, in <module>
    convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 211, in convert_npkit_dump_to_trace
    gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhangshizhuo/msccl/tools/npkit_trace_generator.py", line 95, in parse_gpu_event_file
    'ts': curr_cpu_base_time + parsed_gpu_event['timestamp'] / gpu_clock_scale - curr_gpu_base_time,
          ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'

Specifically, I used msccl-tools/examples/mscclang/allgather_recursive_doubling.py to generate the XML file and ran the communication on the cluster. The same error also occurs when testing reduce_scatter, but not with allreduce or alltoall. Could you help me with this error? Looking forward to your reply.
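
For reference, the failing expression is the timestamp conversion at line 95, and the `NoneType` operand suggests that `curr_cpu_base_time` was never set for this rank/channel before the first GPU event was converted (i.e. no time-sync event was found in that part of the dump). Below is a minimal sketch of that reasoning and of a defensive guard I could try as a workaround; only the variable names come from the traceback, the function name and initialization details are my assumptions, not the script's actual logic:

```python
# Minimal sketch (assumption: in the real script both base times start as None and
# are only filled in once a clock-sync event is parsed; if that event is missing
# for a rank/channel, the addition raises the TypeError shown above).

def to_cpu_timebase(gpu_timestamp, gpu_clock_scale,
                    curr_cpu_base_time, curr_gpu_base_time):
    # Guard: skip events recorded before any time-sync event has set the base
    # times, instead of crashing on None + float.
    if curr_cpu_base_time is None or curr_gpu_base_time is None:
        return None
    return curr_cpu_base_time + gpu_timestamp / gpu_clock_scale - curr_gpu_base_time


# First call mimics the failing state (base times still None), second a healthy one.
print(to_cpu_timebase(1_000_000, 1.41, None, None))      # -> None (was: TypeError)
print(to_cpu_timebase(1_000_000, 1.41, 12.5, 700000.0))  # -> converted timestamp
```

Skipping such events would only hide the symptom, though; the question is why the time-sync information is missing for allgather and reduce_scatter in the first place.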

Details

Generate XML file:

python /home/zhangshizhuo/msccl-tools/examples/mscclang/allgather_recursive_doubling.py 4 1 --protocol='Simple'> /home/zhangshizhuo/xml2/Allgather_test.xml

mpirun test:

 mpirun --prefix /usr/local/openmpi \
        -np 4 \
        -H gpu1:4\
        -map-by slot \
        -mca btl_tcp_if_include 10.1.1.0/24 \
        -x NCCL_SOCKET_IFNAME=ens16f0,enp75s0f0np0,ens6f0 \
        -x LD_LIBRARY_PATH=/home/zhangshizhuo/msccl/build/lib/:$LD_LIBRARY_PATH \
        -x NCCL_NET_SHARED_BUFFERS=0 \
        -x NCCL_IGNORE_DISABLED_P2P=1 \
        -x NCCL_SHM_Disable=1 \
        -x NCCL_DEBUG=INFO \
        -x NCCL_ALGO=MSCCL,RING  \
        -x MSCCL_XML_FILES=/home/zhangshizhuo/xml2/Allgather_test.xml \
        -x NPKIT_DUMP_DIR=/home/zhangshizhuo/trace/trace_allgather/ \
        -x CUDA_VISIBLE_DEVICES=0,1,2,3 \
        bash -c ' cd /home/zhangshizhuo/nccl-tests/build/; \
        ./all_gather_perf -b 32M -e 32M -f 2 -g 1 -n 5 -w 3 -c 0 -z 1 '