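When running a model on a Linux server with many GPUs, the following error can appear, reporting that the number of GPUs exceeds the maximum limit: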
RuntimeError: num_gpus <= 16 INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1603728993639/work/c10/cuda/CUDAStream.cpp":208, please report a bug to PyTorch. Number of CUDA devices on the machine is larger than the compiled max number of gpus expected (16). Increase that and recompile.
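Cause:
I am using PyTorch here, and the error message is explicit: the machine has more CUDA devices than the maximum (16) that this PyTorch build was compiled to support.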
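Solutions:
1. Restrict the script to specific GPUs by setting CUDA_VISIBLE_DEVICES at the top of the program
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '7,8'  # set this before importing torch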
2. Edit the shell configuration file (.bashrc) directly to specify which GPUs are visible
export CUDA_VISIBLE_DEVICES=0,1
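
Both fixes work the same way: they shrink the list of devices that CUDA exposes to the process so that it stays within the compiled limit. As a rough sketch of the first fix, the snippet below wraps the environment variable in a small helper. The names `limit_visible_gpus` and `MAX_COMPILED_GPUS` are made up for illustration and are not part of PyTorch; the value 16 is simply taken from the error message above and may differ in other builds.

```python
import os

# Limit quoted in the error message above; assumed here, other builds may differ.
MAX_COMPILED_GPUS = 16


def limit_visible_gpus(device_ids):
    """Expose only the given physical GPU indices to CUDA in this process.

    Hypothetical helper for illustration. Call it before CUDA is
    initialized, most safely before `import torch`.
    """
    ids = [int(i) for i in device_ids]
    if not ids:
        raise ValueError("at least one GPU index is required")
    if len(ids) > MAX_COMPILED_GPUS:
        raise ValueError(
            f"{len(ids)} GPUs requested, but the assumed limit is {MAX_COMPILED_GPUS}"
        )
    value = ",".join(str(i) for i in ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose physical GPUs 7 and 8 only, as in solution 1.
limit_visible_gpus([7, 8])

# import torch  # import the framework only after the variable has been set
```

Once the variable is set, the visible devices are renumbered from zero inside the process, so physical GPUs 7 and 8 are addressed as `cuda:0` and `cuda:1`.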



