This code works to minimize the execution of unnecessary graph traversals and gradient updates. I still need to refactor it for staggered updates over time so that it can accumulate gradients for several cycles before stepping the optimizers, but this definitely works as intended.

    # Step through all models in a chain to create gradient paths from the critic
    # back through the world model, to the actor.
    def step(self):
        # Get the current state from the simulation.
        state = self.world.state
        # Fire the actor to select a softmax action.
        self.actor(state)
        # Run the world simulation on that action.
        self.world.step(self.actor.action)
        # Combine the action and starting state as input to the world model.
        if self.actor.calibrating:
            action_state = torch.cat([self.actor.value, state], dim=0)
        else:
            # Push softmax action closer to 1.0.
            action_state = torch.cat([self.actor.hard_value, state], dim=0)
        # Run the model and then the critic on the action_state.
        self.critic(self.model(action_state))

        if self.actor.calibrating:
            self.actor.optimizer.zero_grad()
            self.model.requires_grad = False
            self.critic.requires_grad = False
            # Pick loss for maximizing the value of the action choice.
            loss = -self.critic.value * self.actor.get_confidence()
            loss.backward(retain_graph=True)
            self.actor.optimizer.step()

        if self.model.calibrating:
            # Don't need to backpropagate through the actor again.
            self.actor.value.detach_()
            self.model.optimizer.zero_grad()
            self.model.requires_grad = True
            # Reduce loss for ambiguous actions.
            loss = self.model.get_loss() * self.actor.get_confidence() ** 2
            loss.backward(retain_graph=True)
            self.model.optimizer.step()

        if self.critic.calibrating:
            # Don't need to backpropagate through the model or actor again.
            self.model.value.detach_()
            self.critic.optimizer.zero_grad()
            self.critic.requires_grad = True
            # Reduce loss for ambiguous actions.
            loss = self.critic.get_loss(self.goal) * self.actor.get_confidence() ** 2
            loss.backward(retain_graph=True)
            self.critic.optimizer.step()

There are several mechanisms available from Python to locally disable gradient computation.

To disable gradients across entire blocks of code, there are context managers such as no-grad mode and inference mode. For more fine-grained exclusion of subgraphs from gradient computation, you can set the `requires_grad` field of a tensor.

Apart from setting `requires_grad`, there are three modes, selectable from Python, that affect how autograd processes computations internally: default mode (grad mode), no-grad mode, and inference mode. All three can be toggled via context managers and decorators.

Below, in addition to discussing the mechanisms above, we also describe evaluation mode (`nn.Module.eval()`), a method that is not used to disable gradient computation but, because of its name, is often mixed up with the three.
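As a rough illustration of the mechanisms described above, the following sketch compares grad mode, no-grad mode, inference mode, a frozen parameter, and `eval()`. The tensor `x`, the `torch.nn.Linear` layer, and all variable names are made up for this example and are not taken from the original text.

```python
import torch

x = torch.ones(3, requires_grad=True)

# Default (grad) mode: operations on x are recorded by autograd.
y = x * 2
grad_mode_tracks = y.requires_grad

# No-grad mode: results computed in the block are not attached to the graph.
with torch.no_grad():
    y_no_grad = x * 2
no_grad_tracks = y_no_grad.requires_grad

# Inference mode: results are also untracked, and are marked as inference tensors.
with torch.inference_mode():
    y_inf = x * 2
inference_tracks = y_inf.requires_grad
is_inference_tensor = y_inf.is_inference()

# Fine-grained exclusion: freeze a single parameter with requires_grad_(False).
layer = torch.nn.Linear(3, 1)  # hypothetical layer, for illustration only
layer.bias.requires_grad_(False)
layer(x).sum().backward()
weight_has_grad = layer.weight.grad is not None
bias_has_grad = layer.bias.grad is not None

# eval() changes layer behaviour (dropout, batch norm); it does not disable gradients.
layer.eval()
eval_out_tracks = layer(x).requires_grad

print(grad_mode_tracks, no_grad_tracks, inference_tracks, is_inference_tensor)
print(weight_has_grad, bias_has_grad, eval_out_tracks)
```

The last check is the point most often confused: after `layer.eval()` the output still requires gradients, because evaluation mode and gradient tracking are independent settings.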

    import torch

    x = torch.randn(5, requires_grad=True)
    y = x.pow(2)
    print(x.equal(y.grad_fn._saved_self))  # True
    print(x is y.grad_fn._saved_self)  # True

    x = torch.randn(5, requires_grad=True)
    y = x.exp()
    print(y.equal(y.grad_fn._saved_result))  # True
    print(y is y.grad_fn._saved_result)  # False

```
import threading
import torch

# Define a train function to be used in different threads
def train_fn():
    x = torch.ones(5, 5, requires_grad=True)
    # forward
    y = (x + 3) * (x + 4) * 0.5
    # backward
    y.sum().backward()
    # potential optimizer update

# Users write their own threading code to drive train_fn
threads = []
for _ in range(10):
    p = threading.Thread(target=train_fn, args=())
    p.start()
    threads.append(p)
for p in threads:
    p.join()
```

```
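import os
import tempfile
import uuid
import torch

# Assumption: any writable directory works here; the excerpt leaves tmp_dir undefined.
tmp_dir = tempfile.gettempdir()

class SelfDeletingTempFile():
    def __init__(self):
        self.name = os.path.join(tmp_dir, str(uuid.uuid4()))

    def __del__(self):
        os.remove(self.name)

def pack_hook(tensor):
    temp_file = SelfDeletingTempFile()
    torch.save(tensor, temp_file.name)
    return temp_file

def unpack_hook(temp_file):
    return torch.load(temp_file.name)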
```

    x = torch.randn(5, requires_grad=True)
    y = x.pow(2)
    y.grad_fn._raw_saved_self.register_hooks(pack_hook, unpack_hook)

    from torch import nn

    # Only save on disk tensors that have size >= 1000
    SAVE_ON_DISK_THRESHOLD = 1000

    def pack_hook(x):
        if x.numel() < SAVE_ON_DISK_THRESHOLD:
            return x
        temp_file = SelfDeletingTempFile()
        torch.save(x, temp_file.name)  # save the tensor passed to the hook
        return temp_file

    def unpack_hook(tensor_or_sctf):
        if isinstance(tensor_or_sctf, torch.Tensor):
            return tensor_or_sctf
        return torch.load(tensor_or_sctf.name)

    class Model(nn.Module):
        def forward(self, x):
            with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
                # ... compute output
                output = x
            return output

    model = Model()
    net = nn.DataParallel(model)

In 1847, Augustin-Louis Cauchy used the negative of the gradient to develop the gradient descent algorithm, an iterative method for minimizing a continuous and (ideally) differentiable function of many variables.

Hence, all we need is to calculate the gradient of the loss function with respect to the learnable parameters (i.e., the weights).

By breaking the computation into simple operations on intermediate variables, we can use the chain rule to calculate any gradient.

Before introducing the gradient descent algorithm, let’s review a very important property of gradients: the gradient of a function always points in the direction of steepest ascent. The following exercise will help clarify this.
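To make the chain rule and the steepest-ascent property concrete, here is a minimal sketch in plain Python. The function `sin(x**2 + y**2)`, the starting point, the learning rate, and all helper names are assumptions chosen for illustration; they are not prescribed by the text above.

```python
import math


def f(x, y):
    """Illustrative function z = sin(x^2 + y^2)."""
    return math.sin(x ** 2 + y ** 2)


def grad_f(x, y):
    """Gradient via the chain rule with the intermediate variable u = x^2 + y^2."""
    u = x ** 2 + y ** 2
    dz_du = math.cos(u)          # derivative of sin(u) with respect to u
    du_dx, du_dy = 2 * x, 2 * y  # derivatives of u with respect to x and y
    return dz_du * du_dx, dz_du * du_dy


def numerical_grad(x, y, eps=1e-6):
    """Central finite differences, used only to check grad_f."""
    dz_dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dz_dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dz_dx, dz_dy


def gradient_descent(x, y, lr=0.1, steps=200):
    """Repeatedly step against the gradient to reduce f."""
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


x0, y0 = 0.5, 0.5
gx, gy = grad_f(x0, y0)

# A small step along the gradient increases f; a step against it decreases f.
h = 0.01
uphill = f(x0 + h * gx, y0 + h * gy)
downhill = f(x0 - h * gx, y0 - h * gy)

x_min, y_min = gradient_descent(x0, y0)
print(uphill > f(x0, y0) > downhill)
print(x_min, y_min)
```

From this starting point the iterates shrink toward the origin, which is a local minimum of the chosen function.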

```
# @title Install dependencies
!pip install git+https://github.com/NeuromatchAcademy/evaltools --quiet
from evaltools.airtable import AirtableForm
atform = AirtableForm('appn7VdPRseSoMXEG', 'W1D2_T1', 'https://portal.neuromatchacademy.org/api/redirect/to/9c55f6cb-cdf9-4429-ac1c-ec44fe64c303')
```

```
# Imports
import torch
import numpy as np
from torch import nn
from math import pi
import matplotlib.pyplot as plt
```

```
# @title Figure settings
import ipywidgets as widgets # Interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/content-creation/main/nma.mplstyle")
```

    # @title Plotting functions
    from mpl_toolkits.axes_grid1 import make_axes_locatable


    def ex3_plot(model, x, y, ep, lss):
        """
        Plot training loss

        Args:
          model: nn.module
            Model implementing regression
          x: np.ndarray
            Training Data
          y: np.ndarray
            Targets
          ep: int
            Number of epochs
          lss: list
            Loss value recorded at each epoch

        Returns:
          Nothing
        """
        f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        ax1.set_title("Regression")
        ax1.plot(x, model(x).detach().numpy(), color='r', label='prediction')
        ax1.scatter(x, y, c='c', label='targets')
        ax1.set_xlabel('x')
        ax1.set_ylabel('y')
        ax1.legend()

        ax2.set_title("Training loss")
        # Use the function arguments rather than global `epochs` and `losses`.
        ax2.plot(np.linspace(1, ep, ep), lss, color='y')
        ax2.set_xlabel("Epoch")
        ax2.set_ylabel("MSE")

        plt.show()


    def ex1_plot(fun_z, fun_dz):
        """
        Plots the function and gradient vectors

        Args:
          fun_z: f.__name__
            Function implementing sine function
          fun_dz: f.__name__
            Function implementing sine function as gradient vector

        Returns:
          Nothing
        """
        x, y = np.arange(-3, 3.01, 0.02), np.arange(-3, 3.01, 0.02)
        xx, yy = np.meshgrid(x, y, sparse=True)
        zz = fun_z(xx, yy)
        xg, yg = np.arange(-2.5, 2.6, 0.5), np.arange(-2.5, 2.6, 0.5)
        xxg, yyg = np.meshgrid(xg, yg, sparse=True)
        zxg, zyg = fun_dz(xxg, yyg)

        plt.figure(figsize=(8, 7))
        plt.title("Gradient vectors point towards steepest ascent")
        contplt = plt.contourf(x, y, zz, levels=20)
        plt.quiver(xxg, yyg, zxg, zyg, scale=50, color='r')
        plt.xlabel('$x$')
        plt.ylabel('$y$')
        ax = plt.gca()
        divider = make_axes_locatable(ax)
        cax = divider.append_axes("right", size="5%", pad=0.05)
        cbar = plt.colorbar(contplt, cax=cax)
        cbar.set_label('$z = h(x, y)$')

        plt.show()

    # @title Set random seed

    # @markdown Executing `set_seed(seed=seed)` you are setting the seed

    # For DL it's critical to set the random seed so that students can have a
    # baseline to compare their results to expected results.
    # Read more here: https://pytorch.org/docs/stable/notes/randomness.html

    # Call `set_seed` function in the exercises to ensure reproducibility.
    import random
    import numpy as np
    import torch


    def set_seed(seed=None, seed_torch=True):
        """
        Function that controls randomness.
        NumPy and random modules must be imported.

        Args:
          seed: Integer
            A non-negative integer that defines the random state. Default is `None`.
          seed_torch: Boolean
            If `True` sets the random seed for pytorch tensors, so pytorch module
            must be imported. Default is `True`.

        Returns:
          Nothing.
        """
        if seed is None:
            seed = np.random.choice(2 ** 32)
        random.seed(seed)
        np.random.seed(seed)
        if seed_torch:
            torch.manual_seed(seed)
            torch.cuda.manual_seed_all(seed)
            torch.cuda.manual_seed(seed)
            torch.backends.cudnn.benchmark = False
            torch.backends.cudnn.deterministic = True

        print(f'Random seed {seed} has been set.')


    # In case that `DataLoader` is used
    def seed_worker(worker_id):
        """
        DataLoader will reseed workers following randomness in
        multi-process data loading algorithm.

        Args:
          worker_id: integer
            ID of subprocess to seed. 0 means that
            the data will be loaded in the main process
            Refer: https://pytorch.org/docs/stable/data.html#data-loading-randomness for more details

        Returns:
          Nothing
        """
        worker_seed = torch.initial_seed() % 2 ** 32
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    # @title Set device (GPU or CPU). Execute `set_device()`
    # especially if torch modules used.

    # Inform the user if the notebook uses GPU or CPU.

    def set_device():
        """
        Set the device. CUDA if available, CPU otherwise

        Args:
          None

        Returns:
          device: str
            "cuda" if a GPU is available, "cpu" otherwise
        """
        device = "cuda" if torch.cuda.is_available() else "cpu"
        if device != "cuda":
            print("GPU is not enabled in this notebook. \n"
                  "If you want to enable it, in the menu under `Runtime` -> \n"
                  "`Hardware accelerator.` and select `GPU` from the dropdown menu")
        else:
            print("GPU is enabled in this notebook. \n"
                  "If you want to disable it, in the menu under `Runtime` -> \n"
                  "`Hardware accelerator.` and select `None` from the dropdown menu")

        return device

Now that we have the gradients, what’s the next step? We use our optimization algorithm to update our weights! These are our current weights:

Note: you might wonder why PyTorch accumulates gradients instead of overwriting them. Well, there are some cases where we might want to accumulate the gradient, for example if we want to calculate the gradients over several batches before updating our weights. But don’t worry about that for now: most of the time, you’ll want to be “zeroing out” the gradients each iteration.

We “backpropagate” the error through the network to calculate gradients.

PyTorch is tracking the operations in our network and how to calculate the gradient (more on that a bit later), but it hasn’t calculated anything yet because we don’t have a loss function and we haven’t done a forward pass to calculate the loss, so there’s nothing to backpropagate yet!
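The accumulation behaviour mentioned in the note can be seen in a small sketch. The parameter `w`, the toy loss `(w ** 2).sum()`, and the learning rate are assumptions made up for this example; only `backward()`, `zero_grad()`, and `step()` come from PyTorch itself.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)


def loss_fn(w):
    # Toy loss whose gradient with respect to w is 2 * w.
    return (w ** 2).sum()


# First backward pass: w.grad holds the gradient 2 * w.
loss_fn(w).backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
loss_fn(w).backward()
accumulated = w.grad.clone()

# Zeroing out clears the stored gradient, so the next backward pass starts fresh.
optimizer.zero_grad()
loss_fn(w).backward()
fresh = w.grad.clone()

# The optimizer step applies w <- w - lr * w.grad.
optimizer.step()

print(first, accumulated, fresh, w)
```

Skipping `zero_grad()` for a few iterations and then calling `step()` once is the usual way to accumulate gradients over several batches on purpose.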

```
import numpy as np
import pandas as pd
import torch
from torch import nn
from torchvision import transforms, datasets, utils
from torch.utils.data import DataLoader, TensorDataset
from utils.plotting import *
```

```
class network(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden = torch.nn.Linear(input_size, hidden_size)
        self.output = torch.nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.hidden(x)
        x = torch.sigmoid(x)
        x = self.output(x)
        return x
```

    model = network(1, 2, 1)  # make an instance of our network
    # fix the weights manually based on the earlier figure
    model.state_dict()['hidden.weight'][:] = torch.tensor([[1], [-1]])
    model.state_dict()['hidden.bias'][:] = torch.tensor([1, 2])
    model.state_dict()['output.weight'][:] = torch.tensor([[1, 2]])
    model.state_dict()['output.bias'][:] = torch.tensor([-1])
    x, y = torch.tensor([1.0]), torch.tensor([3.0])  # our x, y data

print(model.output.bias.grad)

`None`

criterion = torch.nn.MSELoss()
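With a loss function defined, one way the remaining steps might look is sketched below: a forward pass, the loss, and a backward pass, after which the gradient is no longer `None`. The sketch redefines the same small network so that it runs on its own; setting the weights with `copy_` under `torch.no_grad()` is a substitution made here for self-containment, not necessarily how the original continues.

```python
import torch


class Network(torch.nn.Module):
    """Same two-layer architecture as above, repeated so the sketch is self-contained."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden = torch.nn.Linear(input_size, hidden_size)
        self.output = torch.nn.Linear(hidden_size, output_size)

    def forward(self, x):
        return self.output(torch.sigmoid(self.hidden(x)))


model = Network(1, 2, 1)

# Fix the weights to the same values used above.
with torch.no_grad():
    model.hidden.weight.copy_(torch.tensor([[1.0], [-1.0]]))
    model.hidden.bias.copy_(torch.tensor([1.0, 2.0]))
    model.output.weight.copy_(torch.tensor([[1.0, 2.0]]))
    model.output.bias.copy_(torch.tensor([-1.0]))

x, y = torch.tensor([1.0]), torch.tensor([3.0])
criterion = torch.nn.MSELoss()

grad_before = model.output.bias.grad  # nothing has been backpropagated yet

y_hat = model(x)            # forward pass
loss = criterion(y_hat, y)  # scalar loss
loss.backward()             # backward pass fills in .grad for every parameter

grad_after = model.output.bias.grad
print(grad_before, grad_after)
```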

TensorFlow provides the `tf.GradientTape` API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually `tf.Variable`s. TensorFlow "records" relevant operations executed inside the context of a `tf.GradientTape` onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation using reverse mode differentiation.

You can also request gradients of the output with respect to intermediate values computed inside the `tf.GradientTape` context.

Integers and strings are not differentiable. If a calculation path uses these data types, there will be no gradient.

To differentiate automatically, TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.

## Setup

```
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
```

Here is a simple example:

```
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2
```

Once you've recorded some operations, use `GradientTape.gradient(target, sources)` to calculate the gradient of some target (often a loss) relative to some source (often the model's variables):

    # dy = 2x * dx
    dy_dx = tape.gradient(y, x)
    dy_dx.numpy()

The above example uses scalars, but `tf.GradientTape` works as easily on any tensor:

```
w = tf.Variable(tf.random.normal((3, 2)), name = 'w')
b = tf.Variable(tf.zeros(2, dtype = tf.float32), name = 'b')
x = [[1., 2., 3.]]

with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y ** 2)
```

To get the gradient of `loss` with respect to both variables, you can pass both as sources to the `gradient` method. The tape is flexible about how sources are passed and will accept any nested combination of lists or dictionaries and return the gradient structured the same way (see `tf.nest`).

[dl_dw, dl_db] = tape.gradient(loss, [w, b])
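Two remarks made earlier in this section, that gradients can be requested with respect to intermediate values and that integer inputs yield no gradient, can be checked with a short sketch. The constants and variable names below are chosen for illustration only.

```python
import tensorflow as tf

# Gradient with respect to an intermediate value computed inside the tape.
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)  # constants are not watched automatically
    y = x * x
    z = y * y
dz_dy = tape.gradient(z, y)  # dz/dy = 2 * y

# Integers are not differentiable, so the tape returns None for them.
n = tf.constant(10)
with tf.GradientTape() as tape:
    tape.watch(n)
    m = n * n
dm_dn = tape.gradient(m, n)

print(dz_dy.numpy(), dm_dn)
```

A `None` result is how the tape signals a missing gradient path, so it is worth checking for before passing gradients to an optimizer.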

We differentiate. MXNet Gluon uses reverse-mode automatic differentiation (autograd) to backpropagate gradients from the loss metric to the network parameters.

It’s flexible, automatic and efficient. You can use native Python control flow such as `if` conditions and `while` loops, and autograd will still be able to backpropagate the gradients correctly.

As a simple example, we’ll implement the regression model shown in the diagrams above, and later use autograd to automatically calculate the gradient of the loss with respect to each of the weight parameters.

Remember: if the loss isn’t a single scalar value (e.g. it could be a loss for each sample rather than for the whole batch), a sum operation will be applied implicitly before starting the backward propagation, and the gradients calculated will be of this sum with respect to the parameters.

```
from mxnet import autograd
```

    import mxnet as mx
    from mxnet.gluon.nn import HybridSequential, Dense
    from mxnet.gluon.loss import L2Loss

    # Define network
    net = HybridSequential()
    net.add(Dense(units=3))
    net.add(Dense(units=1))
    net.initialize()

    # Define loss
    loss_fn = L2Loss()

    # Create dummy data
    x = mx.nd.array([[0.3, 0.5]])
    y = mx.nd.array([[1.5]])

```
with autograd.record():
    y_hat = net(x)
    loss = loss_fn(y_hat, y)
```

loss.backward()

`net[0].weight.grad()`

```
dropout = mx.gluon.nn.Dropout(rate = 0.5)
data = mx.nd.ones(shape = (3, 3))
output = dropout(data)
is_training = autograd.is_training()
print('is_training:', is_training, output)
```