PyTorch torch.optim.Adam() Method

The PyTorch torch.optim.Adam() method implements the Adam algorithm, which optimizes the parameters of a neural network model.


torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), 
   eps=1e-08, weight_decay=0, amsgrad=False, *, 
   foreach=None, maximize=False, capturable=False, 
   differentiable=False, fused=None)


  1. params (iterable): Iterable of parameters to optimize or dicts defining parameter groups

  2. lr (float, optional): Learning rate (default: 1e-3)

  3. betas (Tuple[float, float], optional): Coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  4. eps (float, optional): Term added to the denominator to improve numerical stability (default: 1e-8).
  5. weight_decay (float, optional): Weight decay (L2 penalty) (default: 0)
  6. amsgrad (bool, optional): Whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
  7. foreach (bool, optional): Whether the foreach implementation of the optimizer is used. If unspecified by the user (so foreach is None), we will try to use the foreach over the for-loop implementation on CUDA since it is usually significantly more performant. (default: None)
  8. maximize (bool, optional): Maximize the params based on the objective instead of minimizing (default: False)
  9. capturable (bool, optional): Whether this instance is safe to capture in a CUDA graph. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)
  10. differentiable (bool, optional): Whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair the performance, so leave it False if you don’t intend to run autograd through this instance (default: False)
  11. fused (bool, optional): Whether the fused implementation (CUDA only) is used. Currently, torch.float64, torch.float32, torch.float16, and torch.bfloat16 are supported. (default: None)
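Because params also accepts dicts defining parameter groups (parameter 1 above), different parts of a model can get their own hyperparameters. A minimal sketch (the model layout and the per-group values here are just for illustration):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))

# Two parameter groups: the first layer falls back to the default lr,
# while the last layer gets a smaller lr and some weight decay.
optimizer = optim.Adam([
    {"params": model[0].parameters()},
    {"params": model[2].parameters(), "lr": 1e-4, "weight_decay": 1e-5},
], lr=1e-3)

# Each group keeps its own hyperparameters in optimizer.param_groups.
for group in optimizer.param_groups:
    print(group["lr"], group["weight_decay"])
```

Options not set in a group (like weight_decay in the first group) fall back to the keyword defaults passed to the constructor.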


Here’s a simple example of how to use torch.optim.Adam() in PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim

# Suppose we have a simple model
model = nn.Sequential(
  nn.Linear(10, 5),
  nn.ReLU(),
  nn.Linear(5, 2),
)

# Suppose our data is a tensor of size (1, 10) 
# and target is a tensor of size (1, 2)
data = torch.randn(1, 10)
target = torch.randn(1, 2)

# Define the criterion (loss function)
criterion = nn.MSELoss()

# Initialize the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# A single optimization step would look like this:

# Zero the gradients
optimizer.zero_grad()

# Forward pass
output = model(data)

# Calculate the loss
loss = criterion(output, target)

# Backward pass
loss.backward()

# Update the weights
optimizer.step()


  1. We first define a simple model, which could be any PyTorch model. The nn.Linear layers are just fully connected layers, and nn.ReLU is a common activation function.

  2. We create some dummy data and target.
  3. We define a loss function, which is used to measure how far the model’s predictions are from the target. We use Mean Squared Error (MSE) loss in this case, but this could be any PyTorch loss function.
  4. We initialize the optimizer. The first argument is the model parameters that should be optimized. The lr argument is the learning rate, which determines how much the weights are updated in each optimization step.
  5. We perform a single optimization step, which consists of:

    • Zeroing the gradients is necessary because PyTorch accumulates gradients on subsequent backward passes.
    • Performing a forward pass involves passing the data through the model and getting the output.
    • Calculating the loss, which measures how far the output is from the target.
    • Performing a backward pass involves calculating the loss’s gradients with respect to the model parameters.
    • Updating the weights, which is done by the optimizer.
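Putting the five steps above into a complete training loop might look like the sketch below (the model, dummy data, learning rate, and number of iterations are illustrative choices, not part of the Adam API):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # for reproducibility of this sketch

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# A small dummy batch of 8 samples
data = torch.randn(8, 10)
target = torch.randn(8, 2)

losses = []
for epoch in range(100):
    optimizer.zero_grad()            # zero the gradients
    output = model(data)             # forward pass
    loss = criterion(output, target) # calculate the loss
    loss.backward()                  # backward pass
    optimizer.step()                 # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Since the loop repeatedly fits the same small batch, the loss should drop steadily, which is an easy sanity check that the optimizer is wired up correctly.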

That’s it.
