Can not train an epoch #43

Open
Licolas opened this issue Jul 11, 2024 · 0 comments

Licolas commented Jul 11, 2024

Training always stops here, partway through the first epoch (about 67%, step 2184 of 3278).

(subdivnet) lm@lm:~/0-majorRevision/SubdivNet-master$ sh scripts/manifold40/train.sh
[i 0711 09:55:59.289227 96 compiler.py:956] Jittor(1.3.8.5) src: /home/lm/anaconda3/envs/subdivnet/lib/python3.7/site-packages/jittor
[i 0711 09:55:59.298060 96 compiler.py:957] g++ at /usr/bin/g++(11.4.0)
[i 0711 09:55:59.298134 96 compiler.py:958] cache_path: /home/lm/.cache/jittor/jt1.3.8/g++11.4.0/py3.7.16/Linux-6.5.0-41xc8/IntelRXeonRSilxdc/default
[i 0711 09:55:59.307678 96 init.py:411] Found nvcc(11.7.99) at /usr/local/cuda-11.7/bin/nvcc.
[i 0711 09:55:59.383568 96 init.py:411] Found gdb(22.04.2) at /usr/bin/gdb.
[i 0711 09:55:59.397309 96 init.py:411] Found addr2line(2.38) at /usr/bin/addr2line.
[i 0711 09:55:59.510948 96 compiler.py:1011] cuda key:cu11.7.99_sm_89
[i 0711 09:56:00.005350 96 init.py:227] Total mem: 62.44GB, using 16 procs for compiling.
Compiling jittor_core(151/151) used: 2.437s eta: 0.000s
[i 0711 09:56:02.815749 96 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0711 09:56:02.888116 96 init.cc:62] Found cuda archs: [89,]
[w 0711 09:56:02.903832 96 compiler.py:1384] CUDA arch(89)>86 will be backward-compatible
[w 0711 09:56:02.935237 96 compile_extern.py:203] CUDA related path found in LD_LIBRARY_PATH or PATH(['', '/usr/local/cuda-11.7/lib64', '/home/lm/anaconda3/envs/subdivnet/bin', '/home/lm/anaconda3/condabin', '/usr/local/sbin', '/usr/local/bin', '/usr/sbin', '/usr/bin', '/sbin', '/bin', '/usr/games', '/usr/local/games', '/snap/bin', '/snap/bin', '/usr/local/cuda-11.7/bin']), This path may cause jittor found the wrong libs, please unset LD_LIBRARY_PATH and remove cuda lib path in Path.
Or you can let jittor install cuda for you: python3.x -m jittor_utils.install_cuda
[i 0711 09:56:12.951927 96 cuda_flags.cc:49] CUDA enabled.
name: manifold40
Train 0: 0%|▍ | 12/3278 [00:06<20:55, 2.60it/s][w 0711 09:56:20.710701 96 cudnn_conv__Tx_float32__Ty_float32__Tw_float32__XFORMAT_abcd__WFORMAT_oihw__YFORMAT_abcd_____hash_4d5b3e2d24c769d3_op.cc:419] forward_ algorithm cache is full
Train 0: 0%|▍ | 13/3278 [00:06<21:05, 2.58it/s][w 0711 09:56:20.865463 96 cudnn_conv_backward_w__Tx_float32__Ty_float32__Tw_float32__XFORMAT_abcd__WFORMAT_oihw__YFO___hash_8e480e8564e59906_op.cc:418] backward w algorithm cache is full
Train 0: 0%|▍ | 15/3278 [00:07<19:45, 2.75it/s][w 0711 09:56:21.510013 96 cudnn_conv_backward_x__Tx_float32__Ty_float32__Tw_float32__XFORMAT_abcd__WFORMAT_oihw__YFO___hash_af8994a8aef53c1c_op.cc:410] backward x algorithm cache is full
Train 0: 67%|████████████████████████████████████████████████████████████████████▌ | 2184/3278 [10:19<05:21, 3.40it/s]

The log is as follows:


Async error was detected. To locate the async backtrace and get better error report, please rerun your code with two enviroment variables set:

export JT_SYNC=1
export trace_py_var=3
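
The message above asks for a rerun with two environment variables set. A minimal sketch of that rerun follows, assuming it is started from the repository root as in the original command. The log file name `train_sync.log` is a hypothetical choice, not something from this issue, and the existence check is only there so the snippet does nothing harmful when run from the wrong directory.

```shell
# Set the two variables named in the Jittor message for this shell session.
export JT_SYNC=1
export trace_py_var=3

# Rerun the same training script and keep a copy of the output.
# train_sync.log is a hypothetical file name chosen for this sketch.
if [ -f scripts/manifold40/train.sh ]; then
    sh scripts/manifold40/train.sh 2>&1 | tee train_sync.log
else
    echo "scripts/manifold40/train.sh not found; run this from the repository root" >&2
fi
```

According to the message, the rerun should produce a better error report with the backtrace located, which would be the useful part to attach to this issue.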
