Only one gradient appears in the update, the gradient of v. We can obtain it with autograd:

    v = model(s)
    v.backward()

This gives us the gradient of v, which has the same dimensions as your model parameters. Assuming we have already calculated the other terms of the update, we can compute the actual optimizer update:

    for i, p in enumerate(model.parameters()):
        z_theta[i][:] = gamma * lamda * z_theta[i] + l * p.grad
        p.grad[:] = alpha * delta * z_theta[i]
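As a minimal end-to-end sketch of the loop above: the model, the state `s`, and every numeric value here are placeholders chosen for illustration (the text does not give them), and `l` is treated as a plain scalar factor. It only shows how the traces and gradients are wired together, not a complete training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical value network and state; any module with parameters would do.
model = nn.Linear(4, 1)
s = torch.randn(4)

# Assumed hyperparameters and TD error (illustrative values only).
gamma, lamda, alpha = 0.99, 0.9, 0.1
delta, l = 0.5, 1.0

# One eligibility trace per parameter tensor, initialised to zero.
z_theta = [torch.zeros_like(p) for p in model.parameters()]

model.zero_grad()
v = model(s).squeeze()  # scalar value estimate
v.backward()            # fills p.grad with the gradient of v

with torch.no_grad():
    for i, p in enumerate(model.parameters()):
        # decay the old trace, then add the fresh gradient
        z_theta[i].mul_(gamma * lamda).add_(l * p.grad)
        # overwrite the gradient with the trace-based update
        p.grad.copy_(alpha * delta * z_theta[i])
```

A stock optimizer such as SGD subtracts `lr * p.grad` when `step()` is called, so the sign of `delta` has to be chosen to match; the text does not spell out which convention it uses.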
Setting requires_grad should be the main way you control which parts of the model are part of the gradient computation, for example, if you need to freeze parts of your pretrained model during fine-tuning.

To disable gradients across entire blocks of code, there are context managers like no-grad mode and inference mode. For more fine-grained exclusion of subgraphs from gradient computation, you can set the requires_grad field of a tensor.

Apart from setting requires_grad, there are also three modes selectable from Python that affect how computations in PyTorch are processed by autograd internally: default mode (grad mode), no-grad mode, and inference mode, all of which can be toggled via context managers and decorators.

Whether a saved tensor will be packed into a different tensor object depends on whether it is an output of its own grad_fn, which is an implementation detail subject to change and that users should not rely on.
    x = torch.randn(5, requires_grad=True)
    y = x.pow(2)
    print(x.equal(y.grad_fn._saved_self))  # True
    print(x is y.grad_fn._saved_self)  # True

    x = torch.randn(5, requires_grad=True)
    y = x.exp()
    print(y.equal(y.grad_fn._saved_result))  # True
    print(y is y.grad_fn._saved_result)  # False
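The requires_grad setting and the grad modes described earlier can be tried out with a small sketch like the following. The two-layer model is a hypothetical stand-in for a pretrained network, and the variable names are chosen for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model standing in for a pretrained network.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer by turning off requires_grad on its parameters.
for p in model[0].parameters():
    p.requires_grad_(False)

x = torch.randn(3, 4)
model(x).sum().backward()

# Record which layers received a gradient from the backward pass.
frozen_has_grad = model[0].weight.grad is not None
head_has_grad = model[2].weight.grad is not None

# no-grad mode: results computed here are not tracked by autograd.
with torch.no_grad():
    y_no_grad = model(x)

# inference mode: also untracked, and the outputs are marked as inference tensors.
with torch.inference_mode():
    y_inference = model(x)
```

Here only the second linear layer ends up with a populated `.grad`, while both context managers produce outputs with `requires_grad` set to False.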
    import threading

    import torch

    # Define a train function to be used in different threads
    def train_fn():
        x = torch.ones(5, 5, requires_grad=True)
        # forward
        y = (x + 3) * (x + 4) * 0.5
        # backward
        y.sum().backward()
        # potential optimizer update

    # Users write their own threading code to drive the train_fn
    threads = []
    for _ in range(10):
        p = threading.Thread(target=train_fn, args=())
        p.start()
        threads.append(p)

    for p in threads:
        p.join()
    import os
    import uuid

    import torch

    # tmp_dir is assumed to be defined elsewhere as an existing, writable directory

    class SelfDeletingTempFile():
        def __init__(self):
            self.name = os.path.join(tmp_dir, str(uuid.uuid4()))

        def __del__(self):
            os.remove(self.name)

    def pack_hook(tensor):
        temp_file = SelfDeletingTempFile()
        torch.save(tensor, temp_file.name)
        return temp_file

    def unpack_hook(temp_file):
        return torch.load(temp_file.name)
    x = torch.randn(5, requires_grad=True)
    y = x.pow(2)
    y.grad_fn._raw_saved_self.register_hooks(pack_hook, unpack_hook)
    import torch.nn as nn

    # Only save on disk tensors that have size >= 1000
    SAVE_ON_DISK_THRESHOLD = 1000

    def pack_hook(x):
        if x.numel() < SAVE_ON_DISK_THRESHOLD:
            return x
        temp_file = SelfDeletingTempFile()
        torch.save(x, temp_file.name)
        return temp_file

    def unpack_hook(tensor_or_sctf):
        if isinstance(tensor_or_sctf, torch.Tensor):
            return tensor_or_sctf
        return torch.load(tensor_or_sctf.name)

    class Model(nn.Module):
        def forward(self, x):
            with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
                # ... compute output
                output = x
            return output

    model = Model()
    net = nn.DataParallel(model)
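To see when the hooks actually fire, a smaller sketch of the same context manager may help. The hooks below only record the shapes of the tensors autograd saves and pass them through unchanged; `saved_shapes` is a name introduced here for illustration.

```python
import torch

saved_shapes = []  # filled in by pack_hook as autograd saves tensors

def pack_hook(t):
    # Called when autograd saves a tensor for the backward pass; we only log it.
    saved_shapes.append(tuple(t.shape))
    return t

def unpack_hook(t):
    # Called when the backward pass needs the saved tensor again.
    return t

x = torch.randn(5, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    y = x.pow(2)

# The hooks stay attached to the saved tensors after the context exits.
y.sum().backward()
```

Because the hooks return the tensor untouched, the gradient is the same as without hooks; swapping the pass-through for a save-to-disk pair as above changes where the data lives, not the result.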
You can get all the code in this post (and other posts as well) in the GitHub repo.

A lot of tutorial series on PyTorch begin with a rudimentary discussion of what the basic structures are. However, I'd like to start by discussing automatic differentiation instead. In this article, we dive into how PyTorch's Autograd engine performs automatic differentiation.
A Tensor is a data structure that is a fundamental building block of PyTorch. Tensors are pretty much like numpy arrays, except that unlike numpy arrays, tensors are designed to take advantage of the parallel computation capabilities of a GPU. A lot of Tensor syntax is similar to that of numpy arrays.
In[1]: import torch
In[2]: tsr = torch.Tensor(3, 5)
In[3]: tsr
Out[3]:
tensor([
[0.0000e+00, 0.0000e+00, 8.4452e-29, -1.0842e-19, 1.2413e-35],
[1.4013e-45, 1.2416e-35, 1.4013e-45, 2.3331e-35, 1.4013e-45],
[1.0108e-36, 1.4013e-45, 8.3641e-37, 1.4013e-45, 1.0040e-36]
])
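The values printed above look arbitrary because torch.Tensor(3, 5) allocates memory without initialising it. As a rough sketch of the numpy-like syntax mentioned earlier, here is a small example using a tensor with known contents; the variable names are illustrative.

```python
import torch

# A 2x3 tensor with known values: [[0, 1, 2], [3, 4, 5]]
t = torch.arange(6, dtype=torch.float32).reshape(2, 3)

first_row = t[0]         # indexing works as in numpy
col_sums = t.sum(dim=0)  # reduction along an axis (numpy calls this axis=0)
doubled = t * 2          # elementwise arithmetic with a scalar

# Move to the GPU only when one is actually available.
device = "cuda" if torch.cuda.is_available() else "cpu"
t_dev = t.to(device)
```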
The API can be a bit confusing here. There are multiple ways to initialise tensors in PyTorch. While some let you set requires_grad explicitly in the constructor itself, others require you to set it manually after creating the Tensor.
    >> t1 = torch.randn((3, 3), requires_grad=True)
    >> t2 = torch.FloatTensor(3, 3)  # No way to specify requires_grad while initiating
    >> t2.requires_grad = True
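A few more variations on the same idea, as a hedged sketch: factory functions such as torch.zeros accept requires_grad directly, the in-place method requires_grad_() switches it on afterwards, and integer tensors cannot require gradients at all. The variable names are illustrative.

```python
import torch

# requires_grad passed directly to a factory function
t1 = torch.zeros(3, 3, requires_grad=True)

# requires_grad switched on after creation, via the in-place method
t2 = torch.ones(3, 3)
t2.requires_grad_()

# Only floating point (and complex) tensors can require gradients,
# so this attempt on an integer tensor raises a RuntimeError.
try:
    torch.ones(3, 3, dtype=torch.int64).requires_grad_()
    int_error = None
except RuntimeError as err:
    int_error = str(err)
```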
In our example, where $d = f(w_3 b, w_4 c)$, d's grad function would be the addition operator, since f adds its two inputs together. Notice that the addition operator is also the node in our graph that outputs d. If our Tensor is a leaf node (initialised by the user), then its grad_fn is None.
import torch
a = torch.randn((3, 3), requires_grad = True)
w1 = torch.randn((3, 3), requires_grad = True)
w2 = torch.randn((3, 3), requires_grad = True)
w3 = torch.randn((3, 3), requires_grad = True)
w4 = torch.randn((3, 3), requires_grad = True)
b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d
print("The grad fn for a is", a.grad_fn)
print("The grad fn for d is", d.grad_fn)
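To look at the graph nodes programmatically rather than by printing them, something like the following sketch can be used. It is a trimmed-down version of the example (w3 and w4 are left out to keep it short), and the variable names on the left-hand side are illustrative.

```python
import torch

a = torch.randn((3, 3), requires_grad=True)
w1 = torch.randn((3, 3), requires_grad=True)
w2 = torch.randn((3, 3), requires_grad=True)

b = w1 * a
c = w2 * a
d = b + c

leaf_grad_fn = a.grad_fn              # None, since a was created by the user
d_fn_name = type(d.grad_fn).__name__  # class name of the node that produced d

# Each entry of next_functions is a (node, input index) pair pointing
# at the node that produced the corresponding input of d.
parent_names = [type(fn).__name__ for fn, _ in d.grad_fn.next_functions]
```

The node behind d is an addition node, and its two parents are the multiplication nodes that produced b and c.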
Algorithmically, here's how backpropagation happens with a computation graph. (Not the actual implementation, only representative)
    def backward(self, incoming_gradients):
        self.Tensor.grad = incoming_gradients

        for inp in self.inputs:
            if inp.grad_fn is not None:
                # chain rule: scale the incoming gradient by the local gradient
                new_incoming_gradients = (
                    incoming_gradients * local_grad(self.Tensor, inp)
                )
                inp.grad_fn.backward(new_incoming_gradients)
            else:
                pass
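For readers who want something runnable, here is a toy scalar version of the same recursion. It is not how PyTorch implements autograd: the `Value` class and its fields are hypothetical, it accumulates gradients with `+=` so that a node used twice receives the sum over both paths, and it recurses into leaf nodes as well.

```python
class Value:
    """Toy scalar graph node, for illustration only."""

    def __init__(self, data, inputs=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.inputs = inputs            # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(input) for each input

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self, incoming_gradient=1.0):
        # accumulate, since the same node can be reached along several paths
        self.grad += incoming_gradient
        for inp, local in zip(self.inputs, self.local_grads):
            # chain rule: incoming gradient times local gradient
            inp.backward(incoming_gradient * local)


a, w1, w2 = Value(2.0), Value(3.0), Value(4.0)
b = w1 * a
c = w2 * a
d = b + c
d.backward()
```

With these inputs, d is w1 * a + w2 * a, so the gradient reaching a is w1 + w2 and the gradient reaching each weight is a.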
One thing to note here is that PyTorch raises an error if you call backward() without arguments on a vector-valued Tensor. This means that, by default, you can only call backward on a scalar-valued Tensor. In our example, if we assume a to be a vector-valued Tensor, then L is vector-valued too, and calling backward on L throws an error.
import torch
a = torch.randn((3, 3), requires_grad = True)
w1 = torch.randn((3, 3), requires_grad = True)
w2 = torch.randn((3, 3), requires_grad = True)
w3 = torch.randn((3, 3), requires_grad = True)
w4 = torch.randn((3, 3), requires_grad = True)
b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = (10 - d)
L.backward()  # RuntimeError: grad can be implicitly created only for scalar outputs
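Two common ways around this error are sketched below, under the assumption that summing L is an acceptable scalar objective: either reduce L to a scalar before calling backward, or pass an explicit gradient tensor of the same shape as L. The `forward` helper and the result variable names are introduced here for illustration.

```python
import torch

a = torch.randn((3, 3), requires_grad=True)
w1 = torch.randn((3, 3), requires_grad=True)
w2 = torch.randn((3, 3), requires_grad=True)
w3 = torch.randn((3, 3), requires_grad=True)
w4 = torch.randn((3, 3), requires_grad=True)

def forward():
    # same computation as in the example above
    b = w1 * a
    c = w2 * a
    d = w3 * b + w4 * c
    return 10 - d

# Calling backward() on the 3x3 result fails.
try:
    forward().backward()
    raised = False
except RuntimeError:
    raised = True

# Option 1: reduce to a scalar first.
forward().sum().backward()
grad_from_sum = a.grad.clone()

# Option 2: pass an explicit gradient with the same shape as L.
a.grad = None  # reset so the two results can be compared
L = forward()
L.backward(gradient=torch.ones_like(L))
grad_from_arg = a.grad.clone()
```

Passing a tensor of ones as the gradient is equivalent to summing L first, so both options leave the same values in a.grad.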