Cluster environment
Pay particular attention to the CUDA version.
【Step1】Try running the container:
docker run --name yen_v3 --runtime=nvidia --gpus=all -v ~/workspace/NeRF:/zj -tid 172.20.208.7/zhaojing_repo/nerf:yen37_v2 bash
Error:
docker: Error response from daemon: Unknown runtime specified nvidia.
Attempted fix 1:
Edit /etc/docker/daemon.json so that it contains the following:
{
    "registry-mirrors": ["https://f1z25q5p.mirror.aliyuncs.com"],
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Then run:
sudo systemctl daemon-reload
sudo systemctl restart docker
Running sudo systemctl restart docker failed with: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
Cause: daemon.json had been edited incorrectly in the previous step.
After correcting the file, retried 【Step1】; it ran successfully.
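Since the restart failure came from a mistake in daemon.json, it can help to check the file before restarting Docker. The following is a minimal sketch, not part of the original setup: the function name check_daemon_config is hypothetical, and the checks only cover JSON syntax and the presence of the "nvidia" runtime entry shown above.

```python
import json


def check_daemon_config(text):
    """Return a list of problems found in a daemon.json string (empty if none).

    Hypothetical helper: it only checks JSON syntax and the "nvidia" runtime
    entry, not every option that Docker itself validates.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    runtimes = config.get("runtimes")
    if not isinstance(runtimes, dict) or "nvidia" not in runtimes:
        problems.append('no "nvidia" entry under "runtimes"')
    elif not isinstance(runtimes["nvidia"], dict) or "path" not in runtimes["nvidia"]:
        problems.append('"nvidia" runtime has no "path"')
    return problems


# Example: a trailing comma, a common hand-editing mistake, is reported as invalid JSON.
sample = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime",}}}'
print(check_daemon_config(sample))
```

To check the real file, read /etc/docker/daemon.json and pass its contents to the function before running sudo systemctl restart docker.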
【Step2】The environment is configured with the following packages:
Package                 Version
----------------------- -----------
absl-py                 1.0.0
cachetools              5.0.0
certifi                 2021.10.8
charset-normalizer      2.0.12
ConfigArgParse          1.5.3
cycler                  0.11.0
dataclasses             0.6
fonttools               4.31.2
future                  0.18.2
google-auth             2.6.2
google-auth-oauthlib    0.4.6
grpcio                  1.44.0
idna                    3.3
imageio                 2.16.1
imageio-ffmpeg          0.4.5
importlib-metadata      4.11.3
kiwisolver              1.4.0
Markdown                3.3.6
matplotlib              3.5.1
numpy                   1.21.5
oauthlib                3.2.0
opencv-python           4.5.5.64
packaging               21.3
Pillow                  9.0.1
pip                     22.0.4
protobuf                3.19.4
pyasn1                  0.4.8
pyasn1-modules          0.2.8
pyparsing               3.0.7
python-dateutil         2.8.2
requests                2.27.1
requests-oauthlib       1.3.1
rsa                     4.8
setuptools              57.5.0
six                     1.16.0
tensorboard             2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
torch                   1.7.0+cu110
torchaudio              0.7.0
torchvision             0.8.1+cu110
tqdm                    4.63.1
typing_extensions       4.1.1
urllib3                 1.26.9
Werkzeug                2.0.3
wheel                   0.37.1
zipp                    3.7.0
Output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   42C    P8    12W / 250W |    240MiB / 11014MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
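The "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports, while the +cu110 suffix on the torch wheel names the CUDA runtime bundled with it. The sketch below is a rough compatibility check under those assumptions. The function names are hypothetical, and it assumes the wheel tag's last digit is the minor version (cu110 means 11.0). It only compares version numbers and does not prove that the GPU is usable.

```python
import re


def wheel_cuda_version(torch_version):
    """Extract (major, minor) from a version such as '1.7.0+cu110'.

    Returns None when there is no '+cuXYZ' suffix (for example a CPU-only build).
    """
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, e.g. 'cu110' -> (11, 0).
    return int(digits[:-1]), int(digits[-1])


def driver_cuda_version(smi_text):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports_wheel(torch_version, smi_text):
    """True if the wheel's CUDA version does not exceed the driver's; None if unknown."""
    wheel = wheel_cuda_version(torch_version)
    driver = driver_cuda_version(smi_text)
    if wheel is None or driver is None:
        return None
    return wheel <= driver


# Values taken from the package list and nvidia-smi header in this log.
header = "| NVIDIA-SMI 460.80 Driver Version: 460.80 CUDA Version: 11.2 |"
print(driver_supports_wheel("1.7.0+cu110", header))
```

For the local machine in this log the check passes (11.0 against 11.2). Running the same comparison against the nvidia-smi output of any other host that runs the image is one way to rule out a driver mismatch there.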
【Step3】Try running locally: python run_nerf.py --config configs/fern.txt
It ran successfully.
【Step4】Commit the environment as an image and push it:
docker commit -m "hh" yen_v3 172.20.208.7/zhaojing_repo/nerf:yen_cu110
docker push 172.20.208.7/zhaojing_repo/nerf:yen_cu110
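【Step5】Test run on the server
Failed.
Compared it carefully against a version that does run; all configuration parameters are identical.
No fix found so far; will investigate further when time allows.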
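One way to make the comparison against the working version repeatable is to diff the pip list output of the two environments mechanically instead of by eye. The sketch below assumes the two-column "Package Version" format shown in 【Step2】. The function names are hypothetical, and it covers only Python packages, not the driver, the CUDA runtime, or the Docker configuration.

```python
def parse_pip_list(text):
    """Parse `pip list` output into a {package_name: version} dictionary."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the "Package Version" header and the dashed separator.
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        packages[parts[0].lower()] = parts[1]
    return packages


def diff_environments(working_text, failing_text):
    """Return {name: (working_version, failing_version)} for every package that differs.

    A version of None means the package is missing from that environment.
    """
    working = parse_pip_list(working_text)
    failing = parse_pip_list(failing_text)
    return {
        name: (working.get(name), failing.get(name))
        for name in sorted(set(working) | set(failing))
        if working.get(name) != failing.get(name)
    }


# Illustrative inputs only; these are not the real environments from this log.
working = "Package Version\n------- -------\ntorch 1.7.0+cu110\nnumpy 1.21.5\n"
failing = "Package Version\n------- -------\ntorch 1.7.0\nnumpy 1.21.5\n"
print(diff_environments(working, failing))
```

An empty result would confirm that the Python packages match, which points the search toward differences outside the package list.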



