Reading the PyTorch source: torch.cuda.amp, automatic mixed precision explained
Automatic Mixed Precision examples — PyTorch 1.9.1 documentation
torch.cuda.amp provides a fairly convenient mixed-precision training mechanism.
Users do not need to convert model parameter dtypes by hand; amp automatically selects a suitable numeric precision for each operator.
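To illustrate the per-operator precision selection, here is a minimal sketch (not from the original article; it assumes a CUDA device is available): inside an autocast region a matmul runs in FP16, while the FP32 inputs themselves are left untouched.

```python
import torch
from torch.cuda.amp import autocast

a = torch.randn(8, 8, device="cuda")  # created in the default FP32 precision
b = torch.randn(8, 8, device="cuda")

with autocast():
    c = a @ b                          # matmul is run in FP16 under autocast

print(a.dtype)  # torch.float32: the inputs are not converted in place
print(c.dtype)  # torch.float16: autocast chose the op's execution dtype
```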
To deal with FP16 gradients underflowing (flushing to zero) during the backward pass, amp provides gradient scaling. Before the optimizer updates the parameters, the gradients are automatically unscaled, so the hyperparameters used for model optimization are not affected at all.
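A small sketch of the scaling side (again just an illustration, assuming default GradScaler settings and an available GPU): scaler.scale() multiplies the loss by the current scale factor, which defaults to 2**16.

```python
import torch
from torch.cuda.amp import GradScaler

scaler = GradScaler()                   # init_scale defaults to 65536.0 (2**16)
loss = torch.tensor(2.0, device="cuda")

scaled_loss = scaler.scale(loss)        # multiplies the loss by the current scale
print(scaler.get_scale())               # 65536.0
print(scaled_loss.item())               # 131072.0
```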
These two features are implemented by amp.autocast and amp.GradScaler, respectively.
basic

```python
from torch.cuda.amp import autocast, GradScaler  # import added for completeness

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
```
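As a side note, a hedged sketch of the evaluation side (eval_loader is a hypothetical name; model and the imports from the basic example above are assumed): at inference time only autocast is needed, since no gradients are produced and GradScaler has nothing to do.

```python
# Evaluation / inference: autocast alone is enough, no GradScaler involved.
model.eval()
with torch.no_grad():
    for input, target in eval_loader:  # eval_loader is assumed to exist
        with autocast():
            output = model(input)
```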
gradient clipping
```python
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
```

gradient accumulation
```python
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / accumulate_steps  # normalize the loss over the accumulation steps

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % accumulate_steps == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients);
            # unscaling the gradients keeps the clipping threshold meaningful
            scaler.unscale_(optimizer)
            # clip the gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```

AMP in DDP
autocast is designed to be "thread local", so setting an autocast region only in the main thread does not work; the model's forward also needs to be decorated:
```python
class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...
```
Alternatively, set the autocast region inside forward:
```python
class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...
```
The first approach errors out when used with DDP, complaining that some of forward's arguments were not received properly. Unresolved...
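For the common DDP setup with one process per GPU, forward runs in each process's main thread, so the thread-local issue does not arise and the plain training-loop usage works. A hedged sketch (Net, data, loss_fn, local_rank and the process-group setup are assumed to exist elsewhere):

```python
import torch
from torch.cuda.amp import autocast, GradScaler
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes init_process_group() has already been called and local_rank is set.
model = Net().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()

for input, target in data:
    optimizer.zero_grad()
    with autocast():
        output = model(input)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```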



