
torch.distributed.elastic and PyTorch instability


Error log:

Epoch: [229] Total time: 0:17:21
Test:   [ 0/49]  eta: 0:05:00  loss: 1.7994 (1.7994)  acc1: 78.0822 (78.0822)  acc5: 95.2055 (95.2055)  time: 6.1368  data: 5.9411  max mem: 10624
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44348 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44349 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44350 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44351 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44352 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44353 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44354 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/biometrics/miniconda3/envs/torch/bin/torchrun", line 33, in 
    sys.exit(load_entry_point('torch==1.12.0.dev20220502', 'console_scripts', 'torchrun')())
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 44343 got signal: 1
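
The last line of the traceback reports only a signal number. As a minimal sketch for reading such a line, the snippet below extracts the process ID and maps the number to its symbolic name with the standard `signal` module. The helper name `describe_signal_line` and the regular expression are illustrative assumptions, not part of PyTorch. On POSIX systems signal 1 is SIGHUP, which matches the "closing signal SIGHUP" warnings in the log; SIGHUP is typically delivered when the controlling terminal goes away, for example when an SSH session is closed.

```python
import re
import signal

# Message format copied from the SignalException line in the log above.
_SIGNAL_RE = re.compile(r"Process (\d+) got signal: (\d+)")


def describe_signal_line(line):
    """Return (pid, signal_name) for a SignalException message, or None.

    Hypothetical helper for illustration; it is not a PyTorch API.
    """
    match = _SIGNAL_RE.search(line)
    if match is None:
        return None
    pid = int(match.group(1))
    signum = int(match.group(2))
    try:
        # signal.Signals maps a platform signal number to its enum member.
        name = signal.Signals(signum).name
    except ValueError:
        name = f"unknown({signum})"
    return pid, name


if __name__ == "__main__":
    log_line = (
        "torch.distributed.elastic.multiprocessing.api.SignalException: "
        "Process 44343 got signal: 1"
    )
    print(describe_signal_line(log_line))
```

The number-to-name mapping is platform dependent, so the result is only meaningful when the script runs on the same kind of system that produced the log.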

The solution found online is:
