Fixing YOLOv5 training results that are all nan and 0

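When training YOLOv5, the losses show up as nan and the validation metrics as 0.

Training on the CPU works fine; the nan and 0 values appear only on the GPU. The cause is usually the combination of CUDA version and graphics card.

Cards in the GTX 16XX series run into this problem when a newer CUDA version is used.

For example, my own setup: a 飞行堡垒7 (Flying Fortress 7) Ryzen edition laptop with a GTX 1650. The problem appears with CUDA 11.3 (I also tried CUDA 11.5) and PyTorch 1.11.0.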
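To see whether a machine matches the setup described above, the environment can be printed from Python. This is a minimal sketch, not part of the original post: the helper names `is_gtx_16xx` and `describe_environment` are made up for illustration, and the name check is only a rough regular expression on the string returned by `torch.cuda.get_device_name`.

```python
import re


def is_gtx_16xx(gpu_name: str) -> bool:
    """Return True if the GPU name looks like a GTX 16XX card, e.g. 'GTX 1650'."""
    return re.search(r"GTX\s*16\d\d", gpu_name, flags=re.IGNORECASE) is not None


def describe_environment() -> dict:
    """Collect PyTorch / CUDA / GPU details; returns {} if torch is not installed."""
    try:
        import torch
    except ImportError:
        return {}
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["gtx_16xx"] = is_gtx_16xx(info["gpu"])
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `gtx_16xx` is reported as True together with a CUDA 11.x build, the machine is in the same situation as the one in this post.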

AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset 
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp7
Starting training for 100 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
      0/99     1.88G       nan       nan       nan        10       640: 100%|██████████| 14/14 [00:35<00:00,  2.52s/it]
D:\19837\anaconda3\envs\pytorch\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:07<00:00,  1.09s/it]
                 all        106          0          0          0          0          0

     Epoch   gpu_mem       box       obj       cls    labels  img_size
      1/99     1.96G       nan       nan       nan       104       640:   7%|▋         | 1/14 [00:02<00:38,  2.92s/it]
Process finished with exit code -1
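The failure is easy to miss in a long run, so it can help to scan the training output for nan losses. The sketch below is an assumption-laden illustration: it relies on the column layout shown in the log above (epoch, gpu_mem, box, obj, cls, ...), and the function names are hypothetical, not part of YOLOv5.

```python
import math


def losses_from_progress_line(line: str):
    """Parse (box, obj, cls) from a progress line, or return None if it does not match."""
    fields = line.split()
    # Expected layout: epoch ("0/99"), gpu_mem, box, obj, cls, labels, img_size, ...
    if len(fields) < 5 or "/" not in fields[0]:
        return None
    try:
        return tuple(float(value) for value in fields[2:5])
    except ValueError:
        return None


def has_nan_loss(line: str) -> bool:
    """Return True if any of the three parsed losses is nan."""
    losses = losses_from_progress_line(line)
    return losses is not None and any(math.isnan(value) for value in losses)
```

Feeding each line of the console output to `has_nan_loss` flags the first epoch of the log above immediately, while header and validation lines are ignored.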

The fix is to switch CUDA to version 10.2. Download it directly from the link below:

CUDA Toolkit Archive | NVIDIA Developer: https://developer.nvidia.com/cuda-toolkit-archive

cuDNN download:

cuDNN Archive | NVIDIA Developer: https://developer.nvidia.com/rdp/cudnn-archive#a-collapse51b (select the matching version)

After installing CUDA, copy the contents of the cuDNN archive into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2. Adjust the path to match your own installation.
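The copy step can also be scripted. The following is a rough sketch under stated assumptions: it assumes the extracted cuDNN folder contains `bin`, `include` and `lib` subfolders, the function name `merge_cudnn_into_cuda` is hypothetical, and writing under Program Files normally needs an administrator prompt. It requires Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def merge_cudnn_into_cuda(cudnn_dir, cuda_dir, subdirs=("bin", "include", "lib")):
    """Copy each cuDNN subfolder that exists into the same-named folder under cuda_dir.

    Returns the list of subfolder names that were copied.
    """
    cudnn_dir, cuda_dir = Path(cudnn_dir), Path(cuda_dir)
    copied = []
    for name in subdirs:
        src = cudnn_dir / name
        if src.is_dir():
            # dirs_exist_ok=True merges into the existing CUDA folders (Python 3.8+)
            shutil.copytree(src, cuda_dir / name, dirs_exist_ok=True)
            copied.append(name)
    return copied
```

A call might look like `merge_cudnn_into_cuda(r"D:\Downloads\cudnn\cuda", r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2")`, where the first path is only an example location for the extracted archive.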

Then install the cu102 build of PyTorch:

pip install torch==1.10.1+cu102 torchvision==0.11.2+cu102 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
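After the install it is worth confirming that the cu102 build is the one Python actually imports. A minimal sketch, assuming the wheel comes from download.pytorch.org so that `torch.__version__` carries a `+cu102` suffix; the helper names are illustrative only.

```python
def cuda_tag(version: str):
    """Return the build tag after '+', e.g. 'cu102' for '1.10.1+cu102', else None."""
    _, sep, tag = version.partition("+")
    return tag if sep else None


def installed_torch_tag():
    """Return the build tag of the installed torch, or None if torch is missing or untagged."""
    try:
        import torch
    except ImportError:
        return None
    return cuda_tag(torch.__version__)


if __name__ == "__main__":
    print("installed torch build tag:", installed_torch_tag())
```

If the printed tag is not `cu102`, an older install is probably still on the path; uninstalling torch and repeating the pip command above is the usual next step.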

Next, go back and run the training again:

AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset 
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp10
Starting training for 100 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
      0/99     1.85G    0.1244    0.0515   0.06827        10       640: 100%|██████████| 14/14 [01:59<00:00,  8.53s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:11<00:00,  1.62s/it]
                 all        106        433    0.00107    0.00842   0.000487   0.000122

     Epoch   gpu_mem       box       obj       cls    labels  img_size
      1/99     1.96G    0.1171   0.06178   0.06603        63       640:  50%|█████     | 7/14 [00:30<00:30,  4.37s/it]

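With that, the problem is solved: the losses are now real numbers and the validation metrics are no longer 0.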
