How can I disable gradient updates for some modules in autograd backpropagation?


This code avoids unnecessary graph traversals and gradient updates. I still need to refactor it for staggered updates, so that it accumulates gradients over several cycles before stepping the optimizers, but it works as intended.

# Step through all models in a chain to create gradient paths from the critic
# back through the world model to the actor.
def step(self):
    # Get the current state from the simulation
    state =
    # Fire the actor to select a softmax action.
    # Run the world simulation on that action.
    # Combine the action and starting state as input to the world model.
    action_state =[, state], dim = 0)
    # Push softmax action closer to 1.0
    action_state =[, state], dim = 0)
    # Run the model and then the critic on the action_state
    self.model.requires_grad = False
    self.critic.requires_grad = False
    # Pick loss for maximizing the value of the action choice
    loss = -self.critic.value *
    loss.backward(retain_graph=True)
    if self.model.calibrating:
        # Don't need to backpropagate through the actor again
        self.model.requires_grad = True
        # Reduce loss for ambiguous actions
        loss = self.model.get_loss() * ** 2
        loss.backward(retain_graph=True)
    if self.critic.calibrating:
        # Don't need to backpropagate through the model or actor again
        self.critic.requires_grad = True
        # Reduce loss for ambiguous actions
        loss = self.critic.get_loss(self.goal) * ** 2
        loss.backward(retain_graph=True)
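
The freezing step can be sketched on its own, under some assumptions. On a plain nn.Module, assigning `module.requires_grad = False` only sets a Python attribute and leaves the parameters untouched, whereas `nn.Module.requires_grad_(False)` flips the flag on every parameter. The three `nn.Linear` layers and the `set_trainable` helper are hypothetical stand-ins for the actor, world model, and critic, not the classes from the answer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the actor -> world model -> critic chain.
actor = nn.Linear(4, 3)
world_model = nn.Linear(3 + 4, 4)
critic = nn.Linear(4, 1)


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Toggle requires_grad on every parameter of a module (hypothetical helper)."""
    module.requires_grad_(flag)


# Freeze the downstream modules: gradients still flow *through* them to the
# actor, but their own parameters accumulate nothing.
set_trainable(world_model, False)
set_trainable(critic, False)

state = torch.randn(4)
action = torch.softmax(actor(state), dim=0)
action_state = torch.cat([action, state], dim=0)
value = critic(world_model(action_state))

# Maximize the critic's value by minimizing its negative.
loss = -value.sum()
loss.backward()

actor_has_grad = all(p.grad is not None for p in actor.parameters())
frozen_have_no_grad = all(
    p.grad is None
    for module in (world_model, critic)
    for p in module.parameters()
)
print(actor_has_grad, frozen_have_no_grad)
```

Calling `set_trainable(world_model, True)` before the world model's own loss would re-enable its updates, mirroring the `calibrating` branches.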

Suggestion : 2

There are several mechanisms available from Python to locally disable gradient computation.

To disable gradients across entire blocks of code, there are context managers like no-grad mode and inference mode. For more fine-grained exclusion of subgraphs from gradient computation, you can set the requires_grad field of a tensor.

Apart from setting requires_grad, three modes can be enabled from Python that affect how autograd processes computations in PyTorch internally: default mode (grad mode), no-grad mode, and inference mode, all of which can be toggled via context managers and decorators.

Below, in addition to discussing the mechanisms above, we also describe evaluation mode (nn.Module.eval()), a method that is not used to disable gradient computation but, because of its name, is often mixed up with the three.
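
The difference between these modes can be seen in a minimal sketch. The single `nn.Linear` layer and the variable names are assumptions chosen for illustration.

```python
import torch
from torch import nn

layer = nn.Linear(2, 2)
x = torch.ones(1, 2)

# Default (grad) mode: the output is attached to the autograd graph.
y_default = layer(x)

# no-grad mode: nothing computed inside the block is recorded.
with torch.no_grad():
    y_no_grad = layer(x)

# Inference mode: like no-grad, with extra restrictions on the results.
with torch.inference_mode():
    y_inference = layer(x)

# eval() only changes layer behaviour (dropout, batch norm); autograd still
# records operations.
layer.eval()
y_eval = layer(x)

flags = (
    y_default.requires_grad,
    y_no_grad.requires_grad,
    y_inference.requires_grad,
    y_eval.requires_grad,
)
print(flags)  # (True, False, False, True)
```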

import torch

# pow() saves its input for the backward pass
x = torch.randn(5, requires_grad=True)
y = x.pow(2)
print(x.equal(y.grad_fn._saved_self))  # True
print(x is y.grad_fn._saved_self)  # True

# exp() saves its result for the backward pass
x = torch.randn(5, requires_grad=True)
y = x.exp()
print(y.equal(y.grad_fn._saved_result))  # True
print(y is y.grad_fn._saved_result)  # False
import threading

# Define a train function to be used in different threads
def train_fn():
    x = torch.ones(5, 5, requires_grad=True)
    # forward
    y = (x + 3) * (x + 4) * 0.5
    # backward
    y.sum().backward()
    # potential optimizer update

# Users write their own threading code to drive train_fn
threads = []
for _ in range(10):
    p = threading.Thread(target=train_fn, args=())
    p.start()
    threads.append(p)

for p in threads:
    p.join()

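import os
import tempfile
import uuid

tmp_dir = tempfile.gettempdir()  # assumption: any writable directory works

class SelfDeletingTempFile():
    def __init__(self):
        self.name = os.path.join(tmp_dir, str(uuid.uuid4()))

    def __del__(self):
        os.remove(self.name)

def pack_hook(tensor):
    # Save the tensor to disk and keep only a handle to the file
    temp_file = SelfDeletingTempFile()
    torch.save(tensor, temp_file.name)
    return temp_file

def unpack_hook(temp_file):
    # Load the tensor back when the backward pass needs it
    return torch.load(temp_file.name)

x = torch.randn(5, requires_grad=True)
y = x.pow(2)
y.grad_fn._raw_saved_self.register_hooks(pack_hook, unpack_hook)
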
# Only save on disk tensors that have size >= 1000
SAVE_ON_DISK_THRESHOLD = 1000

def pack_hook(x):
    if x.numel() < SAVE_ON_DISK_THRESHOLD:
        return x
    temp_file = SelfDeletingTempFile()
    torch.save(x, temp_file.name)
    return temp_file

def unpack_hook(tensor_or_sctf):
    if isinstance(tensor_or_sctf, torch.Tensor):
        return tensor_or_sctf
    return torch.load(tensor_or_sctf.name)

from torch import nn

class Model(nn.Module):
    def forward(self, x):
        with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
            # ... compute output
            output = x
        return output

model = Model()
net = nn.DataParallel(model)

Suggestion : 3

In 1847, Augustin-Louis Cauchy used the negative of the gradient to develop the gradient descent algorithm, an iterative method to minimize a continuous and (ideally) differentiable function of many variables.

Hence, all we need is to calculate the gradient of the loss function with respect to the learnable parameters (i.e., weights).

By breaking the computation into simple operations on intermediate variables, we can use the chain rule to calculate any gradient.

Before introducing the gradient descent algorithm, let's review a very important property of gradients: the gradient of a function always points in the direction of steepest ascent. The following exercise will help clarify this.
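
Both properties can be checked numerically with a small sketch. The quadratic `f`, its hand-derived gradient `grad_f`, the learning rate, and the step count are assumptions chosen for illustration, not the function used in the tutorial's exercise.

```python
def f(x, y):
    """Hypothetical bowl-shaped function with its minimum at (0, 0)."""
    return x ** 2 + 3 * y ** 2


def grad_f(x, y):
    """Analytic gradient of f: (df/dx, df/dy)."""
    return 2 * x, 6 * y


def gradient_descent(x, y, lr=0.1, steps=50):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


x0, y0 = 2.0, -1.0
gx, gy = grad_f(x0, y0)
eps = 1e-3

# A small step along the gradient increases f; a step against it decreases f.
uphill = f(x0 + eps * gx, y0 + eps * gy)
downhill = f(x0 - eps * gx, y0 - eps * gy)
print(uphill > f(x0, y0) > downhill)

# Following the negative gradient converges towards the minimum.
x_min, y_min = gradient_descent(x0, y0)
print(f(x_min, y_min) < 1e-6)
```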

# Imports
import torch
import numpy as np
from torch import nn
from math import pi
import matplotlib.pyplot as plt

# @title Figure settings
import ipywidgets as widgets  # Interactive display
%config InlineBackend.figure_format = 'retina'
# @title Plotting functions

from mpl_toolkits.axes_grid1 import make_axes_locatable

def ex3_plot(model, x, y, ep, lss):
    """
    Plot the regression fit and the training loss

    Args:
        model: nn.Module
            Model implementing regression
        x: np.ndarray
            Training data
        y: np.ndarray
            Targets
        ep: int
            Number of epochs
        lss: sequence of float
            Training loss recorded at each epoch

    Returns:
        Nothing
    """
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(x, model(x).detach().numpy(), color='r', label='prediction')
    ax1.scatter(x, y, c='c', label='targets')

    ax2.set_title("Training loss")
    # Use the function arguments rather than the globals `epochs` and `losses`
    ax2.plot(np.linspace(1, ep, ep), lss, color='y')

def ex1_plot(fun_z, fun_dz):
    """
    Plots the function and gradient vectors

    Args:
        fun_z: f.__name__
            Function implementing sine function
        fun_dz: f.__name__
            Function implementing sine function as gradient vector

    Returns:
        Nothing
    """
    x, y = np.arange(-3, 3.01, 0.02), np.arange(-3, 3.01, 0.02)
    xx, yy = np.meshgrid(x, y, sparse=True)
    zz = fun_z(xx, yy)
    xg, yg = np.arange(-2.5, 2.6, 0.5), np.arange(-2.5, 2.6, 0.5)
    xxg, yyg = np.meshgrid(xg, yg, sparse=True)
    zxg, zyg = fun_dz(xxg, yyg)

    plt.figure(figsize=(8, 7))
    plt.title("Gradient vectors point towards steepest ascent")
    contplt = plt.contourf(x, y, zz, levels=20)
    plt.quiver(xxg, yyg, zxg, zyg, scale=50, color='r')
    ax = plt.gca()
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.05)
    cbar = plt.colorbar(contplt, cax=cax)
    cbar.set_label('$z = h(x, y)$')
# @title Set random seed

# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# For DL it's critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://

# Call the `set_seed` function in the exercises to ensure reproducibility.
import random
import torch

def set_seed(seed=None, seed_torch=True):
    """
    Function that controls randomness. NumPy and random modules must be imported.

    Args:
        seed: Integer
            A non-negative integer that defines the random state. Default is `None`.
        seed_torch: Boolean
            If `True`, sets the random seed for PyTorch tensors, so the torch
            module must be imported. Default is `True`.

    Returns:
        Nothing
    """
    if seed is None:
        seed = int(np.random.choice(2 ** 32))
    random.seed(seed)
    np.random.seed(seed)
    if seed_torch:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True

    print(f'Random seed {seed} has been set.')

# In case that `DataLoader` is used
def seed_worker(worker_id):
    """
    DataLoader will reseed workers following the randomness in
    multi-process data loading algorithm.

    Args:
        worker_id: integer
            ID of subprocess to seed. 0 means that
            the data will be loaded in the main process
            Refer: https:// for more details

    Returns:
        Nothing
    """
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# @title Set device (GPU or CPU). Execute `set_device()`
# especially if torch modules are used.

# Inform the user if the notebook uses GPU or CPU.

def set_device():
    """
    Set the device. CUDA if available, CPU otherwise

    Args:
        None

    Returns:
        device: str
            "cuda" if a GPU is available, "cpu" otherwise
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device != "cuda":
        print("GPU is not enabled in this notebook. \n"
              "If you want to enable it, in the menu under `Runtime` -> \n"
              "`Hardware accelerator.` and select `GPU` from the dropdown menu")
    else:
        print("GPU is enabled in this notebook. \n"
              "If you want to disable it, in the menu under `Runtime` -> \n"
              "`Hardware accelerator.` and select `None` from the dropdown menu")

    return device

Suggestion : 4

Now that we have the gradients, what’s the next step? We use our optimization algorithm to update our weights! These are our current weights:

Note: you might wonder why PyTorch behaves like this. Well, there are some cases where we might want to accumulate the gradient. For example, if we want to calculate the gradients over several batches before updating our weights. But don’t worry about that for now - most of the time, you’ll want to be “zeroing out” the gradients each iteration.

We “backpropagate” the error through the network to calculate gradients.

PyTorch is tracking the operations in our network and how to calculate the gradient (more on that a bit later), but it hasn’t calculated anything yet because we don’t have a loss function and we haven’t done a forward pass to calculate the loss, so there’s nothing to backpropagate yet!
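
The accumulation behaviour described in the note can be sketched as follows. The one-weight `nn.Linear` model, the two toy batches, and the SGD learning rate are assumptions chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# Two hypothetical single-sample batches.
batches = [
    (torch.tensor([[1.0]]), torch.tensor([[3.0]])),
    (torch.tensor([[2.0]]), torch.tensor([[5.0]])),
]

# Gradient of each batch computed in isolation.
separate = []
for xb, yb in batches:
    optimizer.zero_grad()
    criterion(model(xb), yb).backward()
    separate.append(model.weight.grad.clone())

# Without zeroing in between, backward() adds into .grad.
optimizer.zero_grad()
for xb, yb in batches:
    criterion(model(xb), yb).backward()
accumulated = model.weight.grad.clone()

matches = torch.allclose(accumulated, separate[0] + separate[1])
optimizer.step()       # one update from the accumulated gradient
optimizer.zero_grad()  # reset before the next cycle
print(matches)
```

Stepping the optimizer once per group of batches is the usual way to emulate a larger batch size when memory is limited.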

import numpy as np
import pandas as pd
import torch
from torch import nn
from torchvision import transforms, datasets, utils
from torch.utils.data import DataLoader, TensorDataset
from utils.plotting import *

class network(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()  # required before registering submodules
        self.hidden = torch.nn.Linear(input_size, hidden_size)
        self.output = torch.nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.hidden(x)
        x = torch.sigmoid(x)
        x = self.output(x)
        return x

model = network(1, 2, 1)  # make an instance of our network
model.state_dict()['hidden.weight'][:] = torch.tensor([
])  # fix the weights manually based on the earlier figure
model.state_dict()['hidden.bias'][:] = torch.tensor([1, 2])
model.state_dict()['output.weight'][:] = torch.tensor([[1, 2]])
model.state_dict()['output.bias'][:] = torch.tensor([-1])
x, y = torch.tensor([1.0]), torch.tensor([3.0])  # our x, y data
criterion = torch.nn.MSELoss()

Suggestion : 5

TensorFlow provides the tf.GradientTape API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse mode differentiation.

You can also request gradients of the output with respect to intermediate values computed inside the tf.GradientTape context.

Integers and strings are not differentiable. If a calculation path uses these data types, there will be no gradient.

To differentiate automatically, TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.


import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

Here is a simple example:

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
   y = x ** 2

Once you've recorded some operations, use GradientTape.gradient(target, sources) to calculate the gradient of some target (often a loss) relative to some source (often the model's variables):

# dy = 2x * dx
dy_dx = tape.gradient(y, x)

The above example uses scalars, but tf.GradientTape works as easily on any tensor:

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1., 2., 3.]]

with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    # The loss must be computed inside the context so that it is recorded
    loss = tf.reduce_mean(y ** 2)

To get the gradient of loss with respect to both variables, you can pass both as sources to the gradient method. The tape is flexible about how sources are passed and will accept any nested combination of lists or dictionaries and return the gradient structured the same way (see tf.nest).

[dl_dw, dl_db] = tape.gradient(loss, [w, b])
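
The record-then-replay idea behind a gradient tape can be illustrated with a toy version in plain Python. This is only a conceptual sketch: the `Var` and `Tape` classes and their methods are hypothetical and are not TensorFlow's implementation or API.

```python
class Var:
    """A scalar value tracked by the toy tape (hypothetical class)."""

    def __init__(self, value):
        self.value = value


class Tape:
    """Toy reverse-mode tape: records ops, then replays them backwards."""

    def __init__(self):
        # Each entry: (output, [(input, local_gradient), ...])
        self.ops = []

    def mul(self, a, b):
        out = Var(a.value * b.value)
        self.ops.append((out, [(a, b.value), (b, a.value)]))
        return out

    def add(self, a, b):
        out = Var(a.value + b.value)
        self.ops.append((out, [(a, 1.0), (b, 1.0)]))
        return out

    def gradient(self, target, sources):
        grads = {id(target): 1.0}
        # Traverse the recorded operations in reverse order (chain rule).
        for out, inputs in reversed(self.ops):
            upstream = grads.get(id(out), 0.0)
            for var, local in inputs:
                grads[id(var)] = grads.get(id(var), 0.0) + upstream * local
        return [grads.get(id(s), 0.0) for s in sources]


x = Var(3.0)
tape = Tape()
y = tape.mul(x, x)  # y = x ** 2
z = tape.add(y, x)  # z = x ** 2 + x

dy_dx = tape.gradient(y, [x])[0]
dz_dx, dz_dy = tape.gradient(z, [x, y])
print(dy_dx, dz_dx, dz_dy)  # 6.0 7.0 1.0
```

Asking for `dz_dy` corresponds to requesting the gradient with respect to an intermediate value recorded on the tape.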

Suggestion : 6

We differentiate. MXNet Gluon uses Reverse Mode Automatic Differentiation (autograd) to backpropagate gradients from the loss metric to the network parameters.

It’s flexible, automatic and efficient. You can use native Python control flow operators such as if conditions and while loops, and autograd will still be able to backpropagate the gradients correctly.

As a simple example, we’ll implement the regression model shown in the diagrams above, and later use autograd to automatically calculate the gradient of the loss with respect to each of the weight parameters.

Remember: if loss isn’t a single scalar value (e.g. it could be a loss for each sample, rather than for the whole batch), a sum operation will be applied implicitly before starting the backward propagation, and the gradients calculated will be of this sum with respect to the parameters.

from mxnet import autograd
import mxnet as mx
from mxnet.gluon.nn import HybridSequential, Dense
from mxnet.gluon.loss import L2Loss

# Define network
net = HybridSequential()
net.add(Dense(units=3))
net.add(Dense(units=1))
net.initialize()  # parameters must be initialized before the first forward pass

# Define loss
loss_fn = L2Loss()

# Create dummy data
x = mx.nd.array([[0.3, 0.5]])
y = mx.nd.array([

# Record the forward pass so that gradients can be computed later
with autograd.record():
    y_hat = net(x)
    loss = loss_fn(y_hat, y)

# Dropout is only active in training mode, i.e. inside autograd.record()
dropout = mx.gluon.nn.Dropout(rate=0.5)
data = mx.nd.ones(shape=(3, 3))

output = dropout(data)
is_training = autograd.is_training()
print('is_training:', is_training, output)