Hi, while training with finetune.sh, after 1000+ epochs it threw the error below:
```
{'loss': 0.9042, 'learning_rate': 0.00029342138051333165, 'epoch': 0.24}
{'loss': 0.8922, 'learning_rate': 0.0002932718664340892, 'epoch': 0.24}
{'eval_loss': 1.1987414360046387, 'eval_runtime': 0.471, 'eval_samples_per_second': 2.123, 'eval_steps_per_second': 2.123, 'epoch': 0.24}
{'loss': 0.9076, 'learning_rate': 0.00029312235235484673, 'epoch': 0.25}
{'loss': 0.9039, 'learning_rate': 0.00029297283827560427, 'epoch': 0.25}
{'loss': 0.895, 'learning_rate': 0.0002928233241963618, 'epoch': 0.26}
{'loss': 0.8948, 'learning_rate': 0.0002926738101171193, 'epoch': 0.26}
{'loss': 0.892, 'learning_rate': 0.0002925242960378769, 'epoch': 0.27}
{'loss': 0.8835, 'learning_rate': 0.00029237478195863443, 'epoch': 0.27}
{'loss': 0.8846, 'learning_rate': 0.00029222526787939197, 'epoch': 0.28}
{'loss': 0.8739, 'learning_rate': 0.00029207575380014945, 'epoch': 0.28}
{'loss': 0.8855, 'learning_rate': 0.00029192623972090705, 'epoch': 0.29}
{'loss': 0.8815, 'learning_rate': 0.0002917767256416646, 'epoch': 0.29}
{'eval_loss': 1.2023004293441772, 'eval_runtime': 0.4713, 'eval_samples_per_second': 2.122, 'eval_steps_per_second': 2.122, 'epoch': 0.29}
{'loss': 0.8792, 'learning_rate': 0.00029162721156242207, 'epoch': 0.3}
{'loss': 0.8816, 'learning_rate': 0.00029147769748317967, 'epoch': 0.3}
{'loss': 0.8785, 'learning_rate': 0.0002913281834039372, 'epoch': 0.31}
  3%|▎         | 1278/40230 [2:09:51<65:48:22,  6.08s/it]
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17950 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17951 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17952 closing signal SIGHUP
Traceback (most recent call last):
  File "/root/anaconda3/envs/Belle/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/root/anaconda3/envs/Belle/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 17871 got signal: 1
```
Training runs on 3 V100s. What could be causing this error? It has happened more than once. The batch size isn't overflowing memory; GPU memory usage looks normal.
That's 1000+ steps, not epochs. The training log itself looks fine. Is this the complete error output? In my experience this error usually comes down to one of the following:

1. The process was interrupted from outside (e.g. you accidentally hit Ctrl+C, or something in the background killed it).
2. The CPU was overloaded (e.g. other programs hogging it). Check both GPU and CPU utilization.
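One hedged observation on the traceback itself: signal 1 is SIGHUP, which the torchrun agent typically receives when its controlling terminal goes away (for example, the SSH session that launched the job is closed or times out). A minimal sketch of launching the job detached so a hangup can't kill it, assuming finetune.sh is your launch script and the log file name is arbitrary:

```bash
# Hypothetical launch wrapper: run finetune.sh detached so a closed SSH
# session (SIGHUP) cannot take the torchrun agent down with it.
# nohup ignores SIGHUP; output goes to a log file instead of the terminal.
nohup bash finetune.sh > finetune.log 2>&1 &
echo "training PID: $!"

# While it runs, check the utilization mentioned above:
#   GPU: nvidia-smi        CPU/memory: htop (or top)
tail -f finetune.log
```

Running inside tmux or screen achieves the same thing; the point is just that the torchrun process should not die with the interactive shell that started it.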
Same question here.