pytorch: how to create an update rule that doesn't come from derivatives?


Suggestion : 1

We have one gradient appearing for v; we can compute this gradient by

v = model(s)
v.backward()

This gives us a gradient of v which has the dimension of your model parameters. Assuming we already calculated the other parameter updates, we can calculate the actual optimizer update:

for i, p in enumerate(model.parameters()):
    z_theta[i][:] = gamma * lamda * z_theta[i] + l * p.grad
    p.grad[:] = alpha * delta * z_theta[i]
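Putting it together, here is a minimal self-contained sketch of the whole update, assuming a toy linear model and placeholder values for gamma, lamda, alpha, delta and l (none of these names come from a particular library). Note that SGD performs p -= lr * p.grad, so the sign of what you write into p.grad must match the direction of your update rule:

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr = 1.0) # lr = 1.0: p.grad already holds the full step
z_theta = [torch.zeros_like(p) for p in model.parameters()]
gamma, lamda, alpha, delta, l = 0.99, 0.9, 0.01, 0.5, 1.0 # illustrative placeholder values

s = torch.randn(4)
v = model(s)
v.backward() # fills every p.grad with dv/dtheta

with torch.no_grad():
    for i, p in enumerate(model.parameters()):
        z_theta[i][:] = gamma * lamda * z_theta[i] + l * p.grad
        # negate so that optimizer.step(), which subtracts, applies theta += alpha * delta * z
        p.grad[:] = -alpha * delta * z_theta[i]

optimizer.step()
optimizer.zero_grad()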

Suggestion : 2

Setting requires_grad should be the main way you control which parts of the model are part of the gradient computation; for example, you may need to freeze parts of a pretrained model during fine-tuning. Whether a tensor will be packed into a different tensor object depends on whether it is an output of its own grad_fn, which is an implementation detail subject to change and that users should not rely on. To disable gradients across entire blocks of code, there are context managers like no-grad mode and inference mode; for more fine-grained exclusion of subgraphs from gradient computation, you can set the requires_grad field of a tensor. Apart from setting requires_grad, there are three modes that affect how computations in PyTorch are processed by autograd internally: default mode (grad mode), no-grad mode, and inference mode, all of which can be toggled via context managers and decorators.
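As a quick illustration of requires_grad and the two disabled-grad modes (a minimal sketch; the frozen-model loop is commented out because pretrained_model is a hypothetical placeholder):

import torch

# Fine-grained exclusion: freeze parameters by clearing requires_grad, e.g.
# for p in pretrained_model.parameters(): # pretrained_model is hypothetical
#     p.requires_grad = False

x = torch.randn(3, requires_grad = True)

# no-grad mode: operations inside are not recorded by autograd
with torch.no_grad():
    y = x * 2
print(y.requires_grad) # False

# inference mode: like no-grad mode, with further internal optimizations;
# tensors created here cannot later be used in computations recorded by autograd
with torch.inference_mode():
    z = x * 2
print(z.requires_grad) # False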

x = torch.randn(5, requires_grad = True)
y = x.pow(2)
print(x.equal(y.grad_fn._saved_self)) # True
print(x is y.grad_fn._saved_self) # True

x = torch.randn(5, requires_grad = True)
y = x.exp()
print(y.equal(y.grad_fn._saved_result)) # True
print(y is y.grad_fn._saved_result) # False
import threading

# Define a train function to be used in different threads
def train_fn():
    x = torch.ones(5, 5, requires_grad = True)
    # forward
    y = (x + 3) * (x + 4) * 0.5
    # backward
    y.sum().backward()
    # potential optimizer update

# Users write their own threading code to drive the train_fn
threads = []
for _ in range(10):
    p = threading.Thread(target = train_fn, args = ())
    p.start()
    threads.append(p)

for p in threads:
    p.join()
import os
import uuid

# tmp_dir is assumed to be the path of an existing temporary directory
class SelfDeletingTempFile():
    def __init__(self):
        self.name = os.path.join(tmp_dir, str(uuid.uuid4()))

    def __del__(self):
        os.remove(self.name)

def pack_hook(tensor):
    temp_file = SelfDeletingTempFile()
    torch.save(tensor, temp_file.name)
    return temp_file

def unpack_hook(temp_file):
    return torch.load(temp_file.name)

x = torch.randn(5, requires_grad = True)
y = x.pow(2)
y.grad_fn._raw_saved_self.register_hooks(pack_hook, unpack_hook)
# Only save on disk tensors that have size >= 1000
SAVE_ON_DISK_THRESHOLD = 1000

def pack_hook(x):
    if x.numel() < SAVE_ON_DISK_THRESHOLD:
        return x
    temp_file = SelfDeletingTempFile()
    torch.save(x, temp_file.name)
    return temp_file

def unpack_hook(tensor_or_sctf):
    if isinstance(tensor_or_sctf, torch.Tensor):
        return tensor_or_sctf
    return torch.load(tensor_or_sctf.name)

import torch.nn as nn

class Model(nn.Module):
    def forward(self, x):
        with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
            # ... compute output
            output = x
        return output

model = Model()
net = nn.DataParallel(model)

Suggestion : 3

You can get all the code in this post (and other posts as well) in the Github repo here. A lot of tutorial series on PyTorch begin with a rudimentary discussion of what the basic structures are; I'd like to instead start by discussing automatic differentiation first. In this article, we dive into how PyTorch's Autograd engine performs automatic differentiation.

A Tensor is a data structure that is the fundamental building block of PyTorch. Tensors are pretty much like numpy arrays, except that, unlike numpy arrays, tensors are designed to take advantage of the parallel computation capabilities of a GPU. A lot of Tensor syntax is similar to that of numpy arrays.

In[1]: import torch

In[2]: tsr = torch.Tensor(3, 5)

In[3]: tsr
Out[3]:
   tensor([
      [0.0000e+00, 0.0000e+00, 8.4452e-29, -1.0842e-19, 1.2413e-35],
      [1.4013e-45, 1.2416e-35, 1.4013e-45, 2.3331e-35, 1.4013e-45],
      [1.0108e-36, 1.4013e-45, 8.3641e-37, 1.4013e-45, 1.0040e-36]
   ])
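Unlike numpy arrays, a tensor can be moved to a GPU with a single call; a minimal sketch (guarded, since it assumes a CUDA device is available):

a = torch.randn(3, 5)
if torch.cuda.is_available():
    a = a.to("cuda") # same data, now stored on the GPU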

The API can be a bit confusing here. There are multiple ways to initialise tensors in PyTorch. While some ways can let you explicitly define that the requires_grad in the constructor itself, others require you to set it manually after creation of the Tensor.

>> t1 = torch.randn((3, 3), requires_grad = True)

>> t2 = torch.FloatTensor(3, 3) # No way to specify requires_grad while initiating
>> t2.requires_grad = True

In our example, where $d = f(w_3 b, w_4 c)$, d's grad function would be the addition operator, since f adds its two inputs together. Notice that the addition operator is also the node in our graph that outputs d. If our Tensor is a leaf node (initialised by the user), then its grad_fn is None.

import torch

a = torch.randn((3, 3), requires_grad = True)

w1 = torch.randn((3, 3), requires_grad = True)
w2 = torch.randn((3, 3), requires_grad = True)
w3 = torch.randn((3, 3), requires_grad = True)
w4 = torch.randn((3, 3), requires_grad = True)

b = w1 * a
c = w2 * a

d = w3 * b + w4 * c

L = 10 - d

print("The grad fn for a is", a.grad_fn)
print("The grad fn for d is", d.grad_fn)

Algorithmically, here's how backpropagation happens with a computation graph. (Not the actual implementation, only representative)

def backward(incoming_gradients):
    self.Tensor.grad = incoming_gradients

    for inp in self.inputs:
        if inp.grad_fn is not None:
            new_incoming_gradients = \
                incoming_gradients * local_grad(self.Tensor, inp)
            inp.grad_fn.backward(new_incoming_gradients)
        else:
            pass

One thing to note here is that PyTorch gives an error if you call backward() on a vector-valued Tensor. This means you can only call backward on a scalar-valued Tensor. In our example, if we assume a to be a vector-valued Tensor and call backward on L, it will throw an error.

import torch

a = torch.randn((3, 3), requires_grad = True)

w1 = torch.randn((3, 3), requires_grad = True)
w2 = torch.randn((3, 3), requires_grad = True)
w3 = torch.randn((3, 3), requires_grad = True)
w4 = torch.randn((3, 3), requires_grad = True)

b = w1 * a
c = w2 * a

d = w3 * b + w4 * c

L = (10 - d)

L.backward()
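This call fails with a RuntimeError along the lines of "grad can be implicitly created only for scalar outputs". Two standard workarounds, shown as a minimal sketch continuing from the L defined above:

# Option 1: reduce L to a scalar before calling backward
L.sum().backward()

# Option 2: seed the vector-Jacobian product with an explicit gradient tensor
# of the same shape as L (commented out here, since option 1 already freed the graph)
# L.backward(gradient = torch.ones_like(L))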