
Training fails partway through: torch.distributed.elastic.multiprocessing.api.SignalException: Process 17871 got signal: 1 #73

Closed
Tian14267 opened this issue Apr 14, 2023 · 2 comments


@Tian14267

Hi, when I run training with finetune.sh, after 1000-plus epochs it throws the error below:

{'loss': 0.9042, 'learning_rate': 0.00029342138051333165, 'epoch': 0.24}
{'loss': 0.8922, 'learning_rate': 0.0002932718664340892, 'epoch': 0.24}
{'eval_loss': 1.1987414360046387, 'eval_runtime': 0.471, 'eval_samples_per_second': 2.123, 'eval_steps_per_second': 2.123, 'epoch': 0.24}
{'loss': 0.9076, 'learning_rate': 0.00029312235235484673, 'epoch': 0.25}
{'loss': 0.9039, 'learning_rate': 0.00029297283827560427, 'epoch': 0.25}
{'loss': 0.895, 'learning_rate': 0.0002928233241963618, 'epoch': 0.26}
{'loss': 0.8948, 'learning_rate': 0.0002926738101171193, 'epoch': 0.26}
{'loss': 0.892, 'learning_rate': 0.0002925242960378769, 'epoch': 0.27}
{'loss': 0.8835, 'learning_rate': 0.00029237478195863443, 'epoch': 0.27}
{'loss': 0.8846, 'learning_rate': 0.00029222526787939197, 'epoch': 0.28}
{'loss': 0.8739, 'learning_rate': 0.00029207575380014945, 'epoch': 0.28}
{'loss': 0.8855, 'learning_rate': 0.00029192623972090705, 'epoch': 0.29}
{'loss': 0.8815, 'learning_rate': 0.0002917767256416646, 'epoch': 0.29}
{'eval_loss': 1.2023004293441772, 'eval_runtime': 0.4713, 'eval_samples_per_second': 2.122, 'eval_steps_per_second': 2.122, 'epoch': 0.29}
{'loss': 0.8792, 'learning_rate': 0.00029162721156242207, 'epoch': 0.3}
{'loss': 0.8816, 'learning_rate': 0.00029147769748317967, 'epoch': 0.3}
{'loss': 0.8785, 'learning_rate': 0.0002913281834039372, 'epoch': 0.31}
  3%|▎         | 1278/40230 [2:09:51<65:48:22,  6.08s/it]WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17950 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17951 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17952 closing signal SIGHUP
Traceback (most recent call last):
  File "/root/anaconda3/envs/Belle/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 17871 got signal: 1

Training uses 3 V100s. What could be causing this error? It has happened more than once. The batch size isn't blowing up; GPU memory usage is normal.

@Facico
Owner

Facico commented Apr 14, 2023

That's 1000-plus steps, not epochs. The training log itself looks fine.
Is this the complete error output? In my experience this error usually comes down to one of the following:
1. The process was interrupted from outside (an accidental Ctrl-C, something killing it in the background, etc.).
2. The CPU got overloaded (e.g. hogged by other programs); check GPU and CPU utilization. A sketch of the relevant checks and a workaround follows below.
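Signal 1 is SIGHUP, which is delivered when the controlling terminal goes away, so a dropped SSH session would kill torchrun and all of its workers; that matches the "Sending process ... closing signal SIGHUP" lines in the log above. A minimal sketch, assuming finetune.sh is launched from an interactive shell (the log filename and tmux session name are illustrative, not from this repo):

# Detach the launcher from the terminal so a disconnect no longer stops it.
nohup bash finetune.sh > finetune.log 2>&1 &

# Alternatively, keep the session alive in a terminal multiplexer:
#   tmux new -s train        # session name "train" is illustrative
#   bash finetune.sh         # run training inside the session
#   detach with Ctrl-b d; reattach later with: tmux attach -t train

# Resource checks for cause 2 above:
nvidia-smi                   # GPU utilization and memory
top -b -n 1 | head -n 20     # CPU load and the heaviest processes

The explicit 2>&1 redirection matters because tqdm writes its progress bar to stderr; without it the bar would still target the closed terminal.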

@richhh520

Same question here.
