
The Difference Between PyTorch grad and Optimizer(params)



A Tensor has an attribute requires_grad=True/False that controls whether gradients are computed for it, while Optimizer(params) specifies which parameters the optimizer should update.

So what exactly is the difference between the two? And when some parameters need to be frozen, what is the correct approach: set one of them, or both?

We answer this through the series of experiments below (for every experiment other than Experiment 1, the elided parts are the same as in Experiment 1):

Experiment 1: update only p2's parameters

import torch
import torch.nn as nn
import torch.optim as optim


n_dim = 3

p1 = nn.Linear(n_dim, 1)
p2 = nn.Linear(n_dim, 1)

optimizer = optim.SGD(list(p2.parameters()), lr=0.01)   # only p2's parameters are handed to the optimizer

for i in range(4):
    dummy_loss = (p1(torch.rand(n_dim)) + p2(torch.rand(n_dim))).squeeze()
    dummy_loss.backward()    # zero_grad() is never called here, so .grad accumulates across iterations
    optimizer.step()

    print('p1: requires_grad =', p1.weight.requires_grad, ', gradient:', p1.weight.grad, ', weight:', p1.weight.data)
    print('p2: requires_grad =', p2.weight.requires_grad, ', gradient:', p2.weight.grad, ', weight:', p2.weight.data)
    print()
p1: requires_grad = True , gradient: tensor([[0.7032, 0.1679, 0.6566]]) , weight: tensor([[-0.0106, -0.0158,  0.1727]])
p2: requires_grad = True , gradient: tensor([[0.5745, 0.8857, 0.5528]]) , weight: tensor([[-0.5515, -0.3085,  0.1407]])

p1: requires_grad = True , gradient: tensor([[0.8713, 0.6881, 0.7685]]) , weight: tensor([[-0.0106, -0.0158,  0.1727]])
p2: requires_grad = True , gradient: tensor([[0.7531, 1.0932, 1.2518]]) , weight: tensor([[-0.5591, -0.3194,  0.1282]])

p1: requires_grad = True , gradient: tensor([[1.0434, 1.6635, 1.2951]]) , weight: tensor([[-0.0106, -0.0158,  0.1727]])
p2: requires_grad = True , gradient: tensor([[1.3844, 1.3802, 2.1245]]) , weight: tensor([[-0.5729, -0.3332,  0.1069]])

p1: requires_grad = True , gradient: tensor([[1.7212, 2.0117, 2.2632]]) , weight: tensor([[-0.0106, -0.0158,  0.1727]])
p2: requires_grad = True , gradient: tensor([[2.0722, 1.4745, 2.5086]]) , weight: tensor([[-0.5936, -0.3479,  0.0818]])

As you can see, p1's weight never changes while p2's weight does, yet the gradients of both p1 and p2 are computed (and accumulated) on every iteration.

So if you only control which parameters get updated via Optimizer(params), you reach the goal in terms of results, but the amount of computation is not reduced at all: everything that would be computed for p1 is still computed, its values are simply never applied.


Experiment 2: set p1's parameter to requires_grad=False from the start

...
optimizer = optim.SGD(list(p2.parameters()), lr=0.01)
p1.weight.requires_grad = False    # freeze p1's weight before training starts
...
p1: requires_grad = False , gradient: None , weight: tensor([[-0.3768,  0.4092,  0.3842]])
p2: requires_grad = True , gradient: tensor([[0.3417, 0.3507, 0.7667]]) , weight: tensor([[-0.3647,  0.5510, -0.4988]])

p1: requires_grad = False , gradient: None , weight: tensor([[-0.3768,  0.4092,  0.3842]])
p2: requires_grad = True , gradient: tensor([[1.2031, 0.4654, 1.6340]]) , weight: tensor([[-0.3768,  0.5464, -0.5152]])

p1: requires_grad = False , gradient: None , weight: tensor([[-0.3768,  0.4092,  0.3842]])
p2: requires_grad = True , gradient: tensor([[2.0553, 0.6239, 2.2633]]) , weight: tensor([[-0.3973,  0.5401, -0.5378]])

p1: requires_grad = False , gradient: None , weight: tensor([[-0.3768,  0.4092,  0.3842]])
p2: requires_grad = True , gradient: tensor([[2.0692, 0.7120, 2.9475]]) , weight: tensor([[-0.4180,  0.5330, -0.5673]])

Now p1's gradient is None and its weight never changes; in other words, p1's parameters no longer take part in the backward pass at all, so the computation is genuinely reduced.
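
As a side note, freezing usually applies to a whole module rather than a single weight tensor; a minimal sketch, reusing p1 from above, would be:

# Sketch: freeze every parameter of p1 (weight and bias), not just p1.weight
for param in p1.parameters():
    param.requires_grad = False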


Experiment 3: after a few update steps, set p1's parameter to requires_grad=False

You would rarely do this in real development; it is done here purely to see what happens.

...
for i in range(4):
    ...
    if i == 1:
        p1.weight.requires_grad = False   # freeze p1's weight after the second iteration
p1: requires_grad = True , gradient: tensor([[0.6836, 0.3433, 0.7205]]) , weight: tensor([[0.1997, 0.3478, 0.0399]])
p2: requires_grad = True , gradient: tensor([[0.7296, 0.6570, 0.4011]]) , weight: tensor([[0.5543, 0.4755, 0.2041]])

p1: requires_grad = True , gradient: tensor([[0.7939, 0.5364, 0.8332]]) , weight: tensor([[0.1997, 0.3478, 0.0399]])
p2: requires_grad = True , gradient: tensor([[1.4348, 0.7973, 0.4244]]) , weight: tensor([[0.5400, 0.4676, 0.1999]])

p1: requires_grad = False , gradient: tensor([[0.7939, 0.5364, 0.8332]]) , weight: tensor([[0.1997, 0.3478, 0.0399]])
p2: requires_grad = True , gradient: tensor([[2.1993, 1.0440, 1.1380]]) , weight: tensor([[0.5180, 0.4571, 0.1885]])

p1: requires_grad = False , gradient: tensor([[0.7939, 0.5364, 0.8332]]) , weight: tensor([[0.1997, 0.3478, 0.0399]])
p2: requires_grad = True , gradient: tensor([[2.7767, 1.6092, 1.9681]]) , weight: tensor([[0.4902, 0.4410, 0.1688]])

As you can see, p1's gradient in iterations 3 and 4 is identical to that of iteration 2, which means the per-step gradient (current_grad) is no longer recomputed and the stored value simply sticks around. Since p1's parameter is not in the optimizer and never updates anyway, this experiment does not yet make the saving in computation very visible.


Experiment 4: same as Experiment 3, but set requires_grad=False on p2 instead of p1

...
for i in range(4):
    ...
    if i == 1:
        p2.weight.requires_grad = False   # this time freeze p2's weight instead
p1: requires_grad = True , gradient: tensor([[0.5496, 0.9729, 0.9772]]) , weight: tensor([[-0.5734, -0.0520, -0.3301]])
p2: requires_grad = True , gradient: tensor([[0.8374, 0.8170, 0.0988]]) , weight: tensor([[-0.4295, -0.0373, -0.1664]])

p1: requires_grad = True , gradient: tensor([[0.6601, 1.8183, 1.9499]]) , weight: tensor([[-0.5734, -0.0520, -0.3301]])
p2: requires_grad = True , gradient: tensor([[1.6356, 1.3443, 0.3622]]) , weight: tensor([[-0.4458, -0.0507, -0.1700]])

p1: requires_grad = True , gradient: tensor([[0.8481, 2.4562, 2.6540]]) , weight: tensor([[-0.5734, -0.0520, -0.3301]])
p2: requires_grad = False , gradient: tensor([[1.6356, 1.3443, 0.3622]]) , weight: tensor([[-0.4622, -0.0642, -0.1736]])

p1: requires_grad = True , gradient: tensor([[1.2119, 2.9074, 2.9041]]) , weight: tensor([[-0.5734, -0.0520, -0.3301]])
p2: requires_grad = False , gradient: tensor([[1.6356, 1.3443, 0.3622]]) , weight: tensor([[-0.4786, -0.0776, -0.1773]])

We can see that p2's gradient tensor stops changing from iteration 2 onward, but because a (stale) gradient value is still stored, the weight keeps changing.

So whether a parameter's value gets updated is governed by Optimizer(params): as long as the parameter is included there, the update step w = w' - lr * grad is always carried out.
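
For reference, for plain SGD (no momentum or weight decay) that update is essentially the following sketch; it applies whatever value currently sits in .grad, stale or not:

# Sketch of what optimizer.step() amounts to for vanilla SGD with lr=0.01:
# each parameter is moved by whatever is stored in its .grad, even a stale value.
with torch.no_grad():
    for param in p2.parameters():
        if param.grad is not None:
            param -= 0.01 * param.grad     # w = w' - lr * grad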

But has the grad in that formula actually been computed here? On the surface it looks as if it is simply carried over: the gradient tensor no longer changes, so apparently it is no longer computed. To probe this further, let us set the gradient of the frozen (requires_grad=False) parameter to 0.


Experiment 5: set the frozen parameter's gradient to 0

...
dummy_loss = (p1(torch.rand(n_dim)) + p2(torch.rand(n_dim))).squeeze()
optimizer.zero_grad()           # resets the stored gradients of the optimizer's parameters (here p2) to 0
dummy_loss.backward()
...
p1: requires_grad = True , gradient: tensor([[0.9447, 0.9727, 0.3478]]) , weight: tensor([[-0.1004,  0.0273, -0.0463]])
p2: requires_grad = True , gradient: tensor([[0.4189, 0.0154, 0.6112]]) , weight: tensor([[ 0.1703, -0.5230, -0.5777]])

p1: requires_grad = True , gradient: tensor([[1.3491, 1.5271, 0.9843]]) , weight: tensor([[-0.1004,  0.0273, -0.0463]])
p2: requires_grad = True , gradient: tensor([[0.2000, 0.5229, 0.6771]]) , weight: tensor([[ 0.1683, -0.5283, -0.5845]])

p1: requires_grad = True , gradient: tensor([[2.2633, 1.7631, 1.0647]]) , weight: tensor([[-0.1004,  0.0273, -0.0463]])
p2: requires_grad = False , gradient: tensor([[0., 0., 0.]]) , weight: tensor([[ 0.1683, -0.5283, -0.5845]])

p1: requires_grad = True , gradient: tensor([[2.4285, 1.9984, 1.3841]]) , weight: tensor([[-0.1004,  0.0273, -0.0463]])
p2: requires_grad = False , gradient: tensor([[0., 0., 0.]]) , weight: tensor([[ 0.1683, -0.5283, -0.5845]])

From iteration 3 onward, p2's stored gradient is now 0, and correspondingly its weight no longer changes.

So can we conclude that, once the frozen parameter's gradient has been set to 0, both the cost of computing grad and the cost of updating W disappear?

No. The grad that enters the update formula is still processed on every step, its value just happens to remain 0, and W is still pushed through the update, the result just does not change. The per-step cost is still there; the only thing that is no longer computed is current_grad (the backward pass for p2).

Many optimizers, Adam among them, use a grad in their final update formula that accumulates gradients from many past steps, and it can therefore differ from current_grad. Experiment 6 makes this visible.

Experiment 6: same as Experiment 5, but with the Adam optimizer

...
optimizer = optim.Adam(list(p2.parameters()))
# optimizer = optim.SGD(list(p2.parameters()), lr=0.01)

...
dummy_loss = (p1(torch.rand(n_dim)) + p2(torch.rand(n_dim))).squeeze()
optimizer.zero_grad()           # resets the stored gradients of the optimizer's parameters (here p2) to 0
dummy_loss.backward()
...
p1: requires_grad = True , gradient: tensor([[0.1740, 0.5654, 0.2655]]) , weight: tensor([[ 0.1911,  0.3606, -0.0334]])
p2: requires_grad = True , gradient: tensor([[0.4864, 0.8729, 0.4499]]) , weight: tensor([[0.2652, 0.1471, 0.1539]])

p1: requires_grad = True , gradient: tensor([[0.5266, 1.4552, 1.0486]]) , weight: tensor([[ 0.1911,  0.3606, -0.0334]])
p2: requires_grad = True , gradient: tensor([[0.9205, 0.3132, 0.6762]]) , weight: tensor([[0.2642, 0.1462, 0.1529]])

p1: requires_grad = True , gradient: tensor([[1.0443, 1.9516, 1.9927]]) , weight: tensor([[ 0.1911,  0.3606, -0.0334]])
p2: requires_grad = False , gradient: tensor([[0., 0., 0.]]) , weight: tensor([[0.2635, 0.1455, 0.1521]])

p1: requires_grad = True , gradient: tensor([[1.8763, 2.4474, 2.4814]]) , weight: tensor([[ 0.1911,  0.3606, -0.0334]])
p2: requires_grad = False , gradient: tensor([[0., 0., 0.]]) , weight: tensor([[0.2629, 0.1450, 0.1515]])

p1: requires_grad = True , gradient: tensor([[2.1703, 3.4469, 2.6281]]) , weight: tensor([[ 0.1911,  0.3606, -0.0334]])
p2: requires_grad = False , gradient: tensor([[0., 0., 0.]]) , weight: tensor([[0.2623, 0.1445, 0.1509]])

p1: requires_grad = True , gradient: tensor([[3.1643, 3.8317, 3.6267]]) , weight: tensor([[ 0.1911,  0.3606, -0.0334]])
p2: requires_grad = False , gradient: tensor([[0., 0., 0.]]) , weight: tensor([[0.2619, 0.1441, 0.1505]])

It is now clear that even though current_grad has been 0 since iteration 3, the grad used by Adam is not 0, so the weight keeps being updated; in this run it only stopped changing after roughly 36 iterations.
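
A standalone sketch of Adam's per-parameter rule (bias correction omitted for brevity; the hyperparameters are Adam's defaults) makes the reason visible: the moving averages m and v remember earlier gradients and only decay geometrically, so the step stays nonzero for a while even after the current gradient drops to 0.

import torch

beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8
w = torch.tensor(1.0)
m = torch.tensor(0.0)
v = torch.tensor(0.0)
for step, g in enumerate([0.5, 0.5, 0.0, 0.0, 0.0]):   # current gradient becomes 0 from step 2
    g = torch.tensor(g)
    m = beta1 * m + (1 - beta1) * g          # first moment: running mean of past gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment: running mean of squared gradients
    w = w - lr * m / (v.sqrt() + eps)
    print(step, float(w))                    # w keeps moving for a while after g hits 0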


Experiment 7: add one more argument, optimizer.zero_grad(set_to_none=True)

Same as Experiment 6, with one extra argument.

...
optimizer = optim.Adam(list(p2.parameters()))
# optimizer = optim.SGD(list(p2.parameters()), lr=0.01)

...
dummy_loss = (p1(torch.rand(n_dim)) + p2(torch.rand(n_dim))).squeeze()
optimizer.zero_grad(set_to_none=True)   # sets the stored gradients to None instead of 0
dummy_loss.backward()
...
p1: requires_grad = True , gradient: tensor([[0.9402, 0.4059, 0.8981]]) , weight: tensor([[-0.4066, -0.5065, -0.2141]])
p2: requires_grad = True , gradient: tensor([[0.6820, 0.4095, 0.5958]]) , weight: tensor([[ 0.3368, -0.5517, -0.5268]])

p1: requires_grad = True , gradient: tensor([[1.4824, 0.5806, 1.4759]]) , weight: tensor([[-0.4066, -0.5065, -0.2141]])
p2: requires_grad = True , gradient: tensor([[0.1375, 0.7506, 0.4249]]) , weight: tensor([[ 0.3360, -0.5527, -0.5278]])

p1: requires_grad = True , gradient: tensor([[2.0119, 0.7838, 2.4706]]) , weight: tensor([[-0.4066, -0.5065, -0.2141]])
p2: requires_grad = False , gradient: None , weight: tensor([[ 0.3360, -0.5527, -0.5278]])

p1: requires_grad = True , gradient: tensor([[2.6378, 1.1331, 3.1580]]) , weight: tensor([[-0.4066, -0.5065, -0.2141]])
p2: requires_grad = False , gradient: None , weight: tensor([[ 0.3360, -0.5527, -0.5278]])

This is where it gets interesting: p2's grad tensor is now simply None, and even though Optimizer(params) declares that p2 should be updated, in practice it is not updated at all.
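
This matches how the built-in optimizers behave: inside step() they skip any parameter whose grad is None. A simplified sketch of that skip (just the shape of the logic, not PyTorch's actual source):

# Simplified sketch of the skip performed inside a typical optimizer.step():
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is None:
            continue              # frozen / set_to_none parameters are skipped entirely
        # ... otherwise apply the update rule using p.grad ...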

So for p2 there is neither a backward computation nor a parameter-update computation. It is then natural to guess the converse: even if Optimizer(params) declares that a parameter should be updated, it cannot be updated when its requires_grad=False. See Experiment 8.

Experiment 8: requires_grad has the final say

...
optimizer = optim.Adam(list(p1.parameters()) + list(p2.parameters()))   # I want to update both p1 and p2
p2.weight.requires_grad = False                                          # sorry, p2 is off limits
...
p1: requires_grad = True , gradient: tensor([[0.7974, 0.6948, 0.1333]]) , weight: tensor([[ 0.0300,  0.4244, -0.3977]])
p2: requires_grad = False , gradient: None , weight: tensor([[ 0.2297,  0.5398, -0.2307]])

p1: requires_grad = True , gradient: tensor([[0.1905, 0.1488, 0.9019]]) , weight: tensor([[ 0.0291,  0.4236, -0.3986]])
p2: requires_grad = False , gradient: None , weight: tensor([[ 0.2297,  0.5398, -0.2307]])

p1: requires_grad = True , gradient: tensor([[0.6946, 0.0313, 0.6055]]) , weight: tensor([[ 0.0282,  0.4229, -0.3995]])
p2: requires_grad = False , gradient: None , weight: tensor([[ 0.2297,  0.5398, -0.2307]])

p1: requires_grad = True , gradient: tensor([[0.0347, 0.1364, 0.3199]]) , weight: tensor([[ 0.0275,  0.4223, -0.4003]])
p2: requires_grad = False , gradient: None , weight: tensor([[ 0.2297,  0.5398, -0.2307]])

Summary:

The statement made in Experiment 4 (that whether a parameter updates depends only on Optimizer(params)) needs correcting. The accurate version is: provided a parameter has requires_grad=True, whether it gets updated depends on whether Optimizer(params) includes it; if it is included, w = w' - lr * grad is always computed. And as long as the grad tensor has not been set to None, grad enters the optimizer's computation on every step, even when it is 0; how much that costs depends on the optimizer's own complexity.

Conversely, if requires_grad=False (so that its grad stays None, as it does when the parameter is frozen before any backward pass), then neither the backward gradient computation nor the parameter update ever happens, even if you add that parameter to Optimizer(params).

So when some parameters need to be frozen, the correct approach is to set requires_grad = False on the parameters that should not be updated, and pass only the parameters that should be updated to Optimizer(params), as in the sketch below.
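
Putting the recipe into code, a minimal sketch (reusing p1 and p2 from the experiments, with SGD and lr=0.01 as illustrative choices):

# Freeze p1, train only p2.
for param in p1.parameters():
    param.requires_grad = False             # no graph, no gradient, no update for p1

trainable = [p for p in list(p1.parameters()) + list(p2.parameters()) if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.01)   # only p2's parameters actually end up here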

There is also the with torch.no_grad(): construct, whose purpose is to run only the forward computation without building a computation graph; naturally there is then no backward computation and no parameter update, which makes it suitable for the inference stage.
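
For example, a minimal inference-only sketch with the layers defined above:

# Forward pass only: no graph is built, so no backward pass or update is possible.
with torch.no_grad():
    prediction = p2(torch.rand(n_dim))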

Reference: pytorch freeze weights and update param_groups
