Optimize PyTorch training with the autograd engine

How PyTorch builds and executes the backward graph

March 3, 2026
Vishal Goyal
Related topics:
Artificial intelligence, Data science, Python
Related products:
Red Hat AI Inference Server, Red Hat AI

    Note

    This article was originally published on Medium and has been adapted for Red Hat Developer.

    How does PyTorch calculate gradients when you run a backward() call? The answer is the autograd engine. This tool provides automatic differentiation, which calculates the gradients required for training deep learning models in PyTorch.

    This article examines how the autograd engine builds and executes the computational graph, manages memory during backpropagation, and optimizes performance. Whether you are a researcher debugging gradient flows or an engineer optimizing training pipelines, understanding the internals of the autograd engine will improve your PyTorch skills.

    What is autograd?

    The autograd (automatic gradient) engine is PyTorch's automatic differentiation tool that:

    • Records operations during the forward pass to build a computational graph.
    • Automatically calculates gradients using the chain rule during the backward pass.
    • Manages memory efficiently by saving only necessary intermediate values.
    • Optimizes the graph by pruning unnecessary computations.

    The autograd engine tracks every mathematical operation on tensors and reverses these operations to compute gradients.

    import torch
     
    # Simple example
    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 2 + 3 * x + 1
    y.backward()
     
    print(x.grad)  # Output: tensor(7.0)  # dy/dx = 2*x + 3 = 2*2 + 3 = 7

    Behind this simple API lies a sophisticated system that we'll unpack piece by piece.

    Computational graphs

    A computational graph is a directed acyclic graph (DAG). In this graph, nodes represent operations or functions, and edges represent the flow of data via tensors. Leaves are the input tensors, while roots are the output tensors.

    Example: y = (x + 2) * 3
     
    Forward Graph:
    x (leaf) → [+2] → temp → [*3] → y (root)
     
    Backward Graph (reversed):
      y.grad = 1.0 (dy/dy)
      → [*3]' (y = temp * 3, so dy/d(temp) = 3)
      → temp.grad = 3.0
      → [+2]' (temp = x + 2, so d(temp)/dx = 1)
      → x.grad = 3.0 (dy/dx)
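The trace above can be reproduced with a tiny pure-Python sketch (illustrative only, not PyTorch's implementation): each operation records its parent and its local derivative, and the backward walk multiplies them together via the chain rule.

```python
# Minimal illustrative sketch of the y = (x + 2) * 3 trace above.
# Each node stores its parent and the local derivative d(out)/d(in).
class Node:
    def __init__(self, value, parent=None, local_grad=1.0):
        self.value = value            # forward result
        self.parent = parent          # previous node in the chain
        self.local_grad = local_grad  # d(self)/d(parent)
        self.grad = 0.0

    def add(self, c):  # out = self + c  ->  d(out)/d(self) = 1
        return Node(self.value + c, parent=self, local_grad=1.0)

    def mul(self, c):  # out = self * c  ->  d(out)/d(self) = c
        return Node(self.value * c, parent=self, local_grad=c)

    def backward(self):
        # Seed with dy/dy = 1, then apply the chain rule node by node.
        node, grad = self, 1.0
        while node is not None:
            node.grad = grad
            grad *= node.local_grad
            node = node.parent

x = Node(2.0)
temp = x.add(2)  # temp = x + 2 = 4
y = temp.mul(3)  # y = temp * 3 = 12
y.backward()
print(x.grad)    # 3.0, matching the trace above
```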

    Forward pass vs. backward pass

    During the forward pass, the engine executes the computation y = f(x), builds the computational graph dynamically, and saves the intermediate values required for the backward pass.

    During the backward pass, the engine traverses the graph in reverse order. It applies the chain rule to compute gradients—defined as dx = dy * df/dx—and accumulates those gradients at the leaf nodes.

    # Forward: build graph + compute output
    x = torch.tensor(3.0, requires_grad=True)
    y = x * 2      # grad_fn: MulBackward0
    z = y + 1      # grad_fn: AddBackward0
    loss = z ** 2  # grad_fn: PowBackward0
     
    # Backward: traverse graph + compute gradients
    loss.backward()
    print(x.grad)  # tensor(28.0)
    # Computation: dloss/dx = dloss/dz * dz/dy * dy/dx = 2*z * 1 * 2 = 2*7*2 = 28

    Static graphs vs. define-by-run

    There are two main paradigms for building the graph that automatic differentiation operates on: static graphs (define-and-run) and dynamic graphs (define-by-run).

    In the traditional static-graph approach, the graph structure is predefined and must be compiled before execution. While this allows the engine to optimize the entire graph ahead of time, it makes debugging difficult because of the separation between graph definition and runtime. This approach also makes dynamic control flow, such as if or while statements, challenging to implement.

    # Pseudo-code for the static-graph (define-and-run) approach
    # Define the graph structure FIRST
    graph = ComputationalGraph()
    x = graph.placeholder()
    y = graph.multiply(x, 2)
    z = graph.add(y, 1)
    loss = graph.power(z, 2)
     
    # Compile the graph
    session = graph.compile()
     
    # THEN execute with actual data
    result = session.run(loss, feed_dict={x: 3.0})
    gradients = session.run(grad(loss, x), feed_dict={x: 3.0})

    PyTorch uses the define-by-run approach, where the graph is built dynamically during the forward pass. This means the engine differentiates exactly what you run, supporting arbitrary Python control flow naturally. 

    This method is easy to debug because of eager execution, and the graph structure can change between iterations. However, dynamic graph construction does introduce a slight performance overhead.

    # PyTorch: Define and run simultaneously
    x = torch.tensor(3.0, requires_grad=True)
    y = x * 2       # Graph is built HERE during execution
    z = y + 1       # Graph continues to build
    loss = z ** 2   # Still building...
     
    # The graph exists NOW and can be inspected
    print(loss.grad_fn)  # <PowBackward0>
    print(loss.grad_fn.next_functions)  # ((<AddBackward0>, 0),)
     
    # Execute backward pass
    loss.backward() 

    The define-by-run approach allows you to use standard Python features, such as loops and conditional statements, to build a model.

    def forward(x, n):
        result = x
        for i in range(n):  # n can change at runtime!
            if result.sum() > 0:
                result = result * 2
            else:
                result = result + 1
        return result.sum()
     
    # Different values of n create different graphs
    x = torch.randn(3, requires_grad=True)
     
    # Iteration 1: n=3 creates a 3-layer graph
    loss1 = forward(x, n=3)
    loss1.backward()
     
    # Iteration 2: n=5 creates a 5-layer graph
    loss2 = forward(x.detach().requires_grad_(), n=5)
    loss2.backward()

    In this example, the forward function uses a for loop and an if statement to determine the path of the data. Because PyTorch builds the graph as the code runs, the autograd engine can differentiate through these dynamic structures. You do not need to predefine every possible path before running the model.

    This flexibility supports complex architectures, such as recurrent networks with variable sequence lengths and recursive networks with tree structures. It also simplifies neural architecture search and experimental research by allowing the model structure to change during execution.

    How PyTorch builds the backward graph 

    PyTorch constructs the backward graph by tracking mathematical operations through specific data structures. When you perform a calculation on a tensor, the engine records the operation to ensure it can calculate gradients later.

    The Node class

    Every operation in the computational graph is represented by a Node object, which PyTorch often refers to as a grad_fn. This object acts as a functional unit that knows how to compute the local gradient for a specific operation.

    // Simplified from torch/csrc/autograd/function.h
    struct Node {
      std::vector<Edge> next_edges_;     // Connections to previous nodes
      uint64_t sequence_nr_;              // For topological sorting
      
      // The backward function: converts output gradients to input gradients
      virtual variable_list apply(variable_list&& inputs) = 0;
    };
     
    // Edge connects nodes in the graph
    struct Edge {
      std::shared_ptr<Node> function;    // The node to call
      uint32_t input_nr;                  // Which input of that node
    };

    In this C++ structure, the next_edges_ vector contains Edge objects that link the current Node to the previous operations in the sequence. The apply method is the core function that executes the actual gradient calculation during the backward pass.

    Graph structure and visualization

    You can inspect the graph structure in Python by accessing the grad_fn attribute of a tensor. This attribute provides a handle to the Node at the root of the graph. By following the next_functions attribute, you can trace the computation history back to the leaf tensors.

    x = torch.tensor(2.0, requires_grad=True)
    y = x * 3
    z = y + 2
    loss = z ** 2
     
    print(f"loss.grad_fn: {loss.grad_fn}")
    # <PowBackward0>
     
    print(f"Next: {loss.grad_fn.next_functions}")
    # ((<AddBackward0>, 0), (None, 0))
     
    print(f"Next-Next: {loss.grad_fn.next_functions[0][0].next_functions}")
    # ((<MulBackward0>, 0), (None, 0))

    The resulting graph is a directed acyclic graph where the loss tensor is the root. The engine traverses this structure in reverse to move from the output back to the input tensors.

            x (leaf, requires_grad=True)
            |
            | [MulBackward0: y = x * 3]
            |
            y (intermediate)
            |
            | [AddBackward0: z = y + 2]
            |
            z (intermediate)
            |
            | [PowBackward0: loss = z ** 2]
            |
          loss (root)

    In this structure, the intermediate tensors y and z connect the operations. Each time you call an operation like multiplication or addition, PyTorch creates a new Node and updates the links between them.

    Forward pass: Recording operations

    When you perform operations on tensors with requires_grad=True, PyTorch executes the operation, such as matrix multiplication. The engine then creates a Node for the backward pass, such as MmBackward0, and attaches it to the output tensor using the grad_fn property. Finally, the engine saves the necessary inputs for the backward computation.

    a = torch.randn(3, 3, requires_grad=True)
    b = torch.randn(3, 3, requires_grad=True)
    c = torch.matmul(a, b)
     
    print(c.grad_fn)  # <MmBackward0>
    # This node knows:
    # - How to compute gradients: d(matmul)/da and d(matmul)/db
    # - What it needs: saved_a (for grad_b) and saved_b (for grad_a)
    # - Where to send gradients: edges to 'a' and 'b'

    The following simplified C++ logic shows how PyTorch records these operations internally:

    // When you call torch.matmul(a, b)
    Tensor matmul(const Tensor& a, const Tensor& b) {
      // 1. Compute the actual result
      Tensor result = at::matmul(a, b);  // ATen operation
      
      // 2. If any input requires grad, create backward node
      if (a.requires_grad() || b.requires_grad()) {
        auto grad_fn = std::make_shared<MmBackward0>();
        
        // 3. Save inputs needed for backward
        grad_fn->save_for_backward({a, b});
        
        // 4. Connect to inputs via edges
        grad_fn->set_next_edges({
          Edge(a.grad_fn(), a.output_nr()),
          Edge(b.grad_fn(), b.output_nr())
        });
        
        // 5. Attach to output
        result.set_grad_fn(grad_fn);
      }
      
      return result;
    }

    Backward pass: Traversing the graph

    When you call .backward(), PyTorch starts at the root output tensor and performs a topological sort to determine the execution order. The engine then calls the backward function for each node in reverse order and accumulates the gradients at the leaf nodes.

    # Example: Detailed backward pass
    x = torch.tensor(2.0, requires_grad=True)
    y = x * 3        # MulBackward0
    z = y + 2        # AddBackward0  
    loss = z ** 2    # PowBackward0
     
    loss.backward()
     
    # What happens:
    # Step 1: loss.backward() starts with grad_output = 1.0
    #
    # Step 2: PowBackward0.apply(grad_output=1.0)
    #   Formula: d(z^2)/dz = 2*z
    #   z.grad = 1.0 * 2 * z = 1.0 * 2 * 8 = 16.0
    #
    # Step 3: AddBackward0.apply(grad_output=16.0)
    #   Formula: d(y+2)/dy = 1
    #   y.grad = 16.0 * 1 = 16.0
    #
    # Step 4: MulBackward0.apply(grad_output=16.0)
    #   Formula: d(x*3)/dx = 3
    #   x.grad = 16.0 * 3 = 48.0
     
    print(x.grad)  # tensor(48.0) ✓

    The engine: Orchestrating backward pass

    The engine (defined in torch/csrc/autograd/engine.cpp) is responsible for executing the backward pass:

    // Simplified engine logic
    class Engine {
      variable_list execute(
          const edge_list& roots,          // Starting nodes
          const variable_list& grad_inputs // Initial gradients
      ) {
        // 1. Topological sort of the graph
        auto sorted_nodes = topological_sort(roots);
        
        // 2. Initialize gradient buffers
        std::unordered_map<Node*, variable_list> grad_map;
        
        // 3. Execute nodes in reverse topological order
        for (auto& node : sorted_nodes) {
          // Get accumulated gradient for this node
          auto grad_inputs = grad_map[node];
          
          // Call the node's backward function
          auto grad_outputs = node->apply(std::move(grad_inputs));
          
          // Send gradients to next nodes
          for (size_t i = 0; i < node->next_edges_.size(); ++i) {
            auto& edge = node->next_edges_[i];
            if (edge.function) {
              grad_map[edge.function.get()].push_back(grad_outputs[i]);
            }
          }
        }
        
        return collect_leaf_grads();
      }
    };
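The engine loop above can be sketched in a few lines of pure Python (an illustrative toy, not PyTorch code): backward nodes expose apply() and their next_edges, and the engine runs them in reverse topological order while summing incoming gradients per node, like grad_map does.

```python
# Toy engine: reverse-topological execution with per-node gradient sums.
from collections import defaultdict

class MulBackward:
    """Backward node for y = x * c: grad_x = grad_y * c."""
    def __init__(self, c, next_edges):
        self.c = c
        self.next_edges = next_edges  # downstream backward nodes

    def apply(self, grad_output):
        return [grad_output * self.c]

class AccumulateGrad:
    """Leaf node: accumulates the final gradient for a 'parameter'."""
    def __init__(self):
        self.grad = 0.0
        self.next_edges = []

    def apply(self, grad_output):
        self.grad += grad_output
        return []

def execute(root, initial_grad=1.0):
    # Reverse post-order DFS gives a valid topological order, so every
    # node receives all of its incoming gradients before it runs.
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for nxt in n.next_edges:
            visit(nxt)
        order.append(n)
    visit(root)

    grad_map = defaultdict(float)   # accumulated gradient per node
    grad_map[id(root)] = initial_grad
    for node in reversed(order):
        grads_out = node.apply(grad_map[id(node)])
        for nxt, g in zip(node.next_edges, grads_out):
            grad_map[id(nxt)] += g

# Backward graph for loss = (x * 3) * 2, with x a leaf:
leaf = AccumulateGrad()
mul3 = MulBackward(3.0, [leaf])
mul2 = MulBackward(2.0, [mul3])
execute(mul2)
print(leaf.grad)  # 6.0 = d(6x)/dx
```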

    SavedTensor mechanism

    The autograd engine must efficiently manage saved tensors, which are the intermediate values required for backward computation.

    Why save tensors? Many operations require specific inputs or intermediate values to compute gradients. The following table describes what the engine saves for common operations.

    Operation       | Forward pass  | Backward pass          | What to save
    ----------------|---------------|------------------------|---------------------
    Square          | y = x * x     | dx = 2 * x * dy        | Save x
    Exponential     | y = e^x       | dx = e^x * dy = y * dy | Save y (the output)
    Matrix multiply | y = x @ w     | dx = dy @ w.T          | Save both x and w
                    |               | dw = x.T @ dy          |
    ReLU            | y = max(0, x) | dx = (x > 0) * dy      | Save the x > 0 mask
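The exponential row can be checked numerically in a few lines (a sketch of the idea, not PyTorch internals): the backward pass reuses the saved forward output y rather than recomputing e^x.

```python
import math

def exp_forward(x):
    y = math.exp(x)
    saved = y          # save the OUTPUT, as in the table above
    return y, saved

def exp_backward(saved_y, grad_output):
    # d(e^x)/dx = e^x = y, so the saved output is all we need
    return saved_y * grad_output

y, saved = exp_forward(1.5)
grad = exp_backward(saved, 1.0)

# Compare against a finite-difference estimate of d(e^x)/dx at x = 1.5
eps = 1e-6
numeric = (math.exp(1.5 + eps) - math.exp(1.5 - eps)) / (2 * eps)
print(abs(grad - numeric) < 1e-4)  # True
```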

    The SavedVariable class

    PyTorch uses the SavedVariable class to store tensors:

    // From torch/csrc/autograd/saved_variable.h
    class SavedVariable {
     private:
      // The actual tensor data
      at::Tensor data_;
      
      // Metadata for reconstruction
      std::shared_ptr<Node> grad_fn_;
      uint32_t output_nr_;
      uint32_t version_counter_;  // To detect in-place modifications
      
      // Optional hooks for custom save/load behavior
      std::unique_ptr<SavedVariableHooks> hooks_;
      
     public:
      // Constructor: saves a variable during forward pass
      SavedVariable(const Variable& variable, bool is_output);
      
      // Unpack: reconstructs the variable during backward pass
      Variable unpack(std::shared_ptr<Node> saved_for = nullptr) const;
    };

    Memory optimization: What gets saved?

    PyTorch optimizes memory by saving outputs instead of inputs when possible:

    # ReLU: can reconstruct gradient from output
    # Instead of saving input x, save output y
    def relu_backward(y, grad_output):
        return (y > 0).float() * grad_output  # Uses output, not input!

    It also saves views instead of copies:

    # For operations like transpose, save metadata not data
    x = torch.randn(1000, 1000, requires_grad=True)
    y = x.t()  # Transpose

    # Internally: saves stride/offset info, NOT a copy
    # Memory: ~0 bytes extra (just metadata)

    Finally, it reduces memory usage by recomputing inexpensive operations during the backward pass:

    # For cheap element-wise operations like GELU,
    # it is sometimes better to recompute in backward than to save
    def gelu_backward(x, grad_output):
        # Could save intermediate values, but recomputing is cheaper
        # than memory transfer in many cases
        return grad_output * gelu_gradient(x)  # Recomputes parts of forward

    SavedTensor hooks: Custom memory management

    PyTorch lets you customize how tensors are saved:

    # Example: Save tensors to CPU to reduce GPU memory
    def pack_hook(tensor):
        return tensor.cpu()  # Move to CPU
     
    def unpack_hook(packed):
        return packed.cuda()  # Move back to GPU
     
    with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
        # Your forward pass here
        x = torch.randn(10000, 10000, device='cuda', requires_grad=True)
        y = expensive_operation(x)
        loss = y.sum()
        
    loss.backward()  # Tensors are moved back from CPU for backward

    Activation checkpointing

    Activation checkpointing reduces memory usage during training by trading computation time for memory space. Instead of storing all intermediate activations during the forward pass, PyTorch recomputes them during the backward pass.

    # Trade compute for memory by recomputing activations
    def checkpoint_function(function, *args):
        """
        During forward: Don't save intermediate activations
        During backward: Rerun forward to recompute them
        """
        class CheckpointFunction(torch.autograd.Function):
            @staticmethod
            def forward(ctx, run_function, *args):
                ctx.run_function = run_function
                ctx.save_for_backward(*args)  # Only save inputs
                
                with torch.no_grad():
                    outputs = run_function(*args)
                return outputs
            
            @staticmethod
            def backward(ctx, *grad_outputs):
                # Recompute forward to get intermediate values
                inputs = [inp.detach().requires_grad_() for inp in ctx.saved_tensors]
                with torch.enable_grad():
                    outputs = ctx.run_function(*inputs)
                
                # Now compute gradients with recomputed values
                torch.autograd.backward(outputs, grad_outputs)
                return (None,) + tuple(inp.grad for inp in inputs)
        
        return CheckpointFunction.apply(function, *args)
     
    # Usage: Train huge models with limited memory
    for layer in huge_model.layers:
        x = checkpoint_function(layer, x)  # Recompute activations in backward

    In this example, the CheckpointFunction uses ctx.save_for_backward to store only the input tensors. During the backward pass, the engine calls the forward function again within a torch.enable_grad context to recreate the intermediate values. This approach is effective for training large models, such as deep neural networks with hundreds of layers, on hardware with limited memory.

    Version counters: Detecting invalid in-place operations

    PyTorch uses version counters to detect when an in-place modification interferes with a gradient calculation:

    # Problem: in-place modification after saving
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = x * 2  # y's backward will need x
     
    x.add_(1)  # DANGER: modifying x after it was saved
     
    # This will raise an error during backward:
    # "RuntimeError: one of the variables needed for gradient computation
    #  has been modified by an inplace operation"
     
    try:
        y.backward(torch.ones_like(y))
    except RuntimeError as e:
        print(e)

    The engine increments a version counter whenever an in-place operation occurs to ensure data integrity during backpropagation. The following C++ structures demonstrate how the engine validates these versions:

    // Each tensor has a version counter
    struct TensorImpl {
      uint32_t version_counter_ = 0;
      
      void bump_version() {
        version_counter_++;
      }
    };
     
    // SavedVariable stores the version at save time
    class SavedVariable {
      uint32_t saved_version_;
      
      Variable unpack() {
        TORCH_CHECK(
          tensor.version_counter() == saved_version_,
          "Variable has been modified by an in-place operation"
        );
        return tensor;
      }
    };
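The same validation can be sketched in Python with a toy tensor class (illustrative only): every in-place operation bumps the version, and unpack() compares the current version against the one recorded at save time.

```python
class Tensor:
    """Toy tensor with a version counter (illustrative sketch)."""
    def __init__(self, data):
        self.data = data
        self.version = 0

    def add_(self, v):
        # In-place op: mutate the data and bump the version
        self.data += v
        self.version += 1

class SavedVariable:
    def __init__(self, tensor):
        self.tensor = tensor
        self.saved_version = tensor.version  # version at save time

    def unpack(self):
        if self.tensor.version != self.saved_version:
            raise RuntimeError(
                "variable needed for gradient computation "
                "has been modified by an inplace operation")
        return self.tensor

x = Tensor(1.0)
saved = SavedVariable(x)   # forward pass saves x
x.add_(1)                  # in-place modification after saving
try:
    saved.unpack()         # backward pass detects the version mismatch
except RuntimeError as e:
    print("caught:", e)
```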

    Gradient accumulation

    Gradient accumulation adds multiple gradients to the same parameter. This process occurs when a variable is used in multiple paths of a graph or when you accumulate gradients across several batches using multiple backward passes.

    Multiple paths: Automatic accumulation

    Automatic accumulation occurs when a variable is used in multiple operations within the same computational graph.

    # Example: x is used twice
    x = torch.tensor(2.0, requires_grad=True)
    y = x * 2      # First use
    z = x + 3      # Second use
    loss = y + z   # Both paths contribute to x.grad
     
    loss.backward()
     
    # x.grad = dy/dx + dz/dx = 2 + 1 = 3
    print(x.grad)  # tensor(3.0)

    Graph visualization:

             x (requires_grad=True)
            / \
           /   \
         [*2]  [+3]    <- Two different operations using x
          |     |
          y     z
           \   /
            \ /
            [+]
             |
           loss

    During the backward pass, gradients from both paths are accumulated at x.
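The accumulation can be verified with plain arithmetic (a sketch, not autograd itself): each path contributes its own local gradient, and the leaf receives their sum.

```python
# loss = y + z with y = 2x and z = x + 3, so dloss/dx = 2 + 1 = 3.
def loss_fn(x):
    y = x * 2   # path 1
    z = x + 3   # path 2
    return y + z

grad_from_y = 2.0   # d(y)/dx
grad_from_z = 1.0   # d(z)/dx
x_grad = grad_from_y + grad_from_z   # leaf accumulates both paths
print(x_grad)  # 3.0

# Finite-difference confirmation at x = 2.0:
eps = 1e-6
numeric = (loss_fn(2.0 + eps) - loss_fn(2.0 - eps)) / (2 * eps)
print(abs(x_grad - numeric) < 1e-4)  # True
```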

    AccumulateGrad node

    Leaf variables, such as model parameters, use a specific node called AccumulateGrad to handle incoming gradients.

    // From torch/csrc/autograd/functions/accumulate_grad.h
    class AccumulateGrad : public Node {
      Variable variable;  // Reference to the parameter
      
      variable_list apply(variable_list&& grads) override {
        auto& grad = grads[0];
        
        // First gradient: just assign
        if (!variable.grad().defined()) {
          variable.mutable_grad() = grad;
        }
        // Subsequent gradients: accumulate (add)
        else {
          variable.mutable_grad() += grad;  // THIS IS THE KEY!
        }
        
        return {};  // Leaf nodes don't propagate further
      }
    };

    Multi-batch gradient accumulation

    You can use gradient accumulation to simulate a larger batch size when hardware memory is limited.

    # Simulate batch_size=128 with 4 batches of size 32
    model = MyModel()
    optimizer = torch.optim.Adam(model.parameters())
     
    accumulation_steps = 4
    optimizer.zero_grad()
     
    for i, (x, y) in enumerate(dataloader):  # batch_size=32
        # Forward pass
        output = model(x)
        loss = criterion(output, y) / accumulation_steps  # Scale the loss!
        
        # Backward pass: gradients accumulate in parameter.grad
        loss.backward()
        
        # Only update every 4 batches
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

    Scaling the loss for consistency

    Why scale the loss? When you accumulate gradients across multiple batches, you must scale the loss to maintain a consistent gradient average.

    # Without scaling:
    # total_grad = grad_batch1 + grad_batch2 + grad_batch3 + grad_batch4
    # This is 4x larger than a single batch!
     
    # With scaling (loss / 4):
    # total_grad = grad_batch1/4 + grad_batch2/4 + grad_batch3/4 + grad_batch4/4
    # This equals the average, same as a single batch of size 128
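The scaling argument above can be checked numerically with hypothetical per-batch gradients (an illustrative sketch, not part of the training loop):

```python
# Four micro-batch gradients (hypothetical values for illustration).
micro_grads = [4.0, 8.0, 2.0, 6.0]
steps = len(micro_grads)

unscaled = sum(micro_grads)                   # 20.0: 4x too large
scaled = sum(g / steps for g in micro_grads)  # 5.0: matches the mean

print(unscaled)                            # 20.0
print(scaled)                              # 5.0
print(scaled == sum(micro_grads) / steps)  # True: same as one big batch
```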

    Thread safety in gradient accumulation

    The autograd engine uses synchronization primitives to ensure data integrity during multithreaded execution.

    // From torch/csrc/autograd/functions/accumulate_grad.cpp (simplified)
    void AccumulateGrad_apply_impl(
        variable_list&& grads,
        at::Tensor& variable,
        at::Tensor& variable_grad,
        std::mutex* mutex = nullptr
    ) {
      at::Tensor& new_grad = grads[0];
      
      // Acquire lock for thread safety
      std::optional<std::lock_guard<std::mutex>> lock;
      if (mutex != nullptr) {
        lock.emplace(*mutex);
      }
      
      // Safely accumulate gradient
      if (variable_grad.defined()) {
        variable_grad += new_grad;
      } else {
        variable_grad = new_grad;
      }
    }

    Distributed gradient accumulation

    In a distributed environment, the engine coordinates gradient summation across multiple devices.

    # PyTorch DDP (DistributedDataParallel) example
    model = torch.nn.parallel.DistributedDataParallel(model)
     
    # During backward:
    # 1. Each GPU computes local gradients
    # 2. All-reduce operation sums gradients across GPUs
    # 3. Each GPU gets the averaged gradient
     
    loss.backward()  # Triggers all-reduce internally
    optimizer.step()
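The all-reduce step can be sketched in pure Python (an illustration of the averaging, not DDP's actual collective implementation): every "GPU" holds a local gradient, and the reduction leaves each replica with the same averaged value.

```python
# Toy all-reduce: sum the local gradients and average by world size.
def all_reduce_mean(local_grads):
    total = sum(local_grads)
    world_size = len(local_grads)
    return [total / world_size for _ in local_grads]

per_gpu_grads = [1.0, 3.0, 5.0, 7.0]   # local gradients on 4 "GPUs"
synced = all_reduce_mean(per_gpu_grads)
print(synced)  # [4.0, 4.0, 4.0, 4.0] -- identical on every replica
```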

    Graph pruning and optimization

    The autograd engine uses several optimization techniques to improve performance and reduce memory usage. These features allow PyTorch to handle large models while maintaining high execution speeds.

    Pruning unused paths

    When a tensor is created with requires_grad=False, operations on it don't create backward nodes, which saves memory.

    # Example: Pruning demonstration
    x = torch.randn(5, requires_grad=True)
    y = torch.randn(5, requires_grad=True)
    z = torch.randn(5, requires_grad=False)  # No gradient needed
     
    a = x * 2
    b = y + 1
    c = z * 3  # This path won't have grad_fn!
    loss = a.sum() + b.sum()
     
    print(a.grad_fn)  # <MulBackward0>
    print(b.grad_fn)  # <AddBackward0>
    print(c.grad_fn)  # None <- Pruned!
     
    loss.backward()
    print(x.grad)  # tensor([2., 2., 2., 2., 2.])
    print(y.grad)  # tensor([1., 1., 1., 1., 1.])
    print(z.grad)  # None <- Never computed

    In this example, the engine ignores the path for the variable z. This saves memory by excluding unnecessary Node objects from the backward graph.

    Fusing operations

    Operation fusion combines multiple mathematical operations into a single execution step, which reduces the size of the graph. This process minimizes the overhead of managing many small backward nodes and can significantly improve execution speed.

    # Multiple element-wise operations can be fused
    x = torch.randn(1000, 1000, requires_grad=True)
     
    # Without fusion: 3 separate backward nodes
    y = x + 1      # AddBackward0
    z = y * 2      # MulBackward0
    w = z.relu()   # ReluBackward0
     
    # With TorchScript or JIT: single fused kernel
    @torch.jit.script
    def fused_ops(x):
        return ((x + 1) * 2).relu()  # Single optimized operation
     
    w = fused_ops(x)  # More efficient!

    In-place operations

    In-place operations modify a tensor directly instead of creating a new one. This saves memory because the engine does not need to allocate extra space for an output tensor. However, you must use in-place operations carefully because they can overwrite data that the engine needs for the backward pass.

    x = torch.randn(1000, 1000, requires_grad=True)
     
    # Out-of-place: creates a new tensor
    y = x + 1  # Memory: 2x (x and y)
     
    # In-place: modifies x directly (with care!)
    x.add_(1)  # Memory: 1x (only x)
     
    # But be careful: in-place on computational graph nodes causes errors!
    x = torch.randn(5, requires_grad=True)
    y = x * 2
    x.add_(1)  # RuntimeError: modified by in-place after being saved!

    While in-place operations are efficient, modifying a tensor after it has been recorded in the graph triggers a RuntimeError. In-place operations are generally safe when applied to leaf variables between training steps or within a torch.no_grad context.

    # Safe: in-place on leaf variables (parameters) between forward passes
    model = nn.Linear(10, 10)
    model.weight.add_(0.01)  # OK: modifying parameter directly
     
    # Safe: in-place in no_grad context
    with torch.no_grad():
        model.weight.add_(0.01)  # OK: autograd is disabled

    Graph cleanup and retention

    By default, PyTorch frees the graph after the backward pass to save memory. If you try to call .backward() a second time on the same output without intervention, the engine will raise an error because the graph no longer exists.

    x = torch.randn(3, requires_grad=True)
    y = x * 2
    loss = y.sum()
     
    # First backward: graph is freed by default
    loss.backward()
     
    # Second backward: ERROR!
    try:
        loss.backward()
    except RuntimeError as e:
        print(e)  # "Trying to backward through the graph a second time..."

    To run multiple backward passes from the same output or to calculate higher-order gradients, you must set the retain_graph parameter to True. This keeps the intermediate nodes in memory for subsequent calculations.

    # Use case 1: Multiple backward passes through the same graph
    x = torch.randn(3, requires_grad=True)
    y = x * 2
    loss1 = y.sum()
    loss1.backward(retain_graph=True)  # Keep graph
    loss2 = (y * 3).sum()
    loss2.backward()  # The graph through y can be traversed again
     
    # Use case 2: Computing Hessian (second derivatives)
    x = torch.randn(3, requires_grad=True)
    y = (x ** 2).sum()
    grad_x = torch.autograd.grad(y, x, create_graph=True)[0]
    # Need to backward through grad_x, so graph must be retained
    hessian = torch.autograd.grad(grad_x.sum(), x)[0]

    Gradient checkpointing (rematerialization)

    Gradient checkpointing, also known as rematerialization, allows you to trade computation for memory. Instead of saving every intermediate activation during the forward pass, the engine only saves a subset of them. It recomputes the missing activations as needed during the backward pass.

    from torch.utils.checkpoint import checkpoint
     
    class DeepModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.Linear(1000, 1000) for _ in range(100)  # 100 layers!
            ])
        
        def forward(self, x):
            # Without checkpointing: all 100 intermediate activations stay
            # alive for the backward pass.
            # Memory (batch of 1000): 100 * 1000 * 1000 * 4 bytes = ~400 MB
            
            # With checkpointing: only the 10 segment-boundary activations
            # are stored; everything inside a segment is recomputed during
            # backward. Checkpointing single layers saves little, so we
            # checkpoint segments of 10 layers at a time.
            for start in range(0, len(self.layers), 10):
                segment = self.layers[start:start + 10]
                
                def run_segment(x, segment=segment):
                    for layer in segment:
                        x = layer(x)
                    return x
                
                x = checkpoint(run_segment, x, use_reentrant=False)
            
            return x
     
    # Memory usage: ~40 MB of activations instead of ~400 MB
    # Compute cost: roughly one extra forward pass, because segment
    # interiors are recomputed during backward
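    For a plain stack of layers, torch.utils.checkpoint also provides a checkpoint_sequential helper that handles the segmenting for you. A minimal sketch (layer sizes here are illustrative, not from the model above):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# checkpoint_sequential splits a sequential stack into N segments and
# checkpoints each one, so only segment boundaries are kept in memory.
layers = nn.Sequential(*[nn.Linear(32, 32) for _ in range(20)])
x = torch.randn(4, 32, requires_grad=True)

out = checkpoint_sequential(layers, segments=4, input=x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # torch.Size([4, 32])
```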

    Optimization through compilation

    PyTorch offers graph-level optimizations through TorchScript (torch.jit.script) and, since PyTorch 2.0, through the torch.compile function backed by TorchInductor.

    # TorchScript: ahead-of-time graph optimization
    @torch.jit.script
    def optimized_function(x, y):
        a = x + y
        b = a * 2
        c = b.relu()
        return c
     
    # Benefits:
    # - Dead code elimination
    # - Constant folding
    # - Operation fusion
    # - Memory planning
     
    # TorchInductor (PyTorch 2.0+): even more aggressive optimization
    model = torch.compile(model)
    # - Kernel fusion
    # - Auto-tuning
    # - GPU-specific optimizations

    Example: A complete forward and backward pass

    Let's trace a complete neural network training step:

    import torch
    import torch.nn as nn
     
    # Simple network
    class SimpleNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(3, 4)
            self.fc2 = nn.Linear(4, 2)
        
        def forward(self, x):
            x = self.fc1(x)
            x = torch.relu(x)
            x = self.fc2(x)
            return x
     
    # Training setup
    model = SimpleNet()
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
     
    # Training step
    x = torch.randn(1, 3)
    target = torch.randn(1, 2)
     
    # ===== FORWARD PASS =====
    # Graph is built dynamically during execution
     
    # Layer 1: Linear
    # x -> [AddmmBackward0: h1 = x @ W1.T + b1]
    h1 = model.fc1(x)
    print(f"h1.grad_fn: {h1.grad_fn}")
    # Output: <AddmmBackward0>
    # Saved: x, W1 (needed for backward)
     
    # Activation: ReLU
    # h1 -> [ReluBackward0: h2 = max(0, h1)]
    h2 = torch.relu(h1)
    print(f"h2.grad_fn: {h2.grad_fn}")
    # Output: <ReluBackward0>
    # Saved: h2 (its nonzero positions give the backward mask h1 > 0)
     
    # Layer 2: Linear
    # h2 -> [AddmmBackward0: output = h2 @ W2.T + b2]
    output = model.fc2(h2)
    print(f"output.grad_fn: {output.grad_fn}")
    # Output: <AddmmBackward0>
    # Saved: h2, W2
     
    # Loss: MSE
    # output, target -> [MseLossBackward0: loss = mean((output - target)^2)]
    loss = criterion(output, target)
    print(f"loss.grad_fn: {loss.grad_fn}")
    # Output: <MseLossBackward0>
    # Saved: output - target
     
    # ===== BACKWARD PASS =====
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Traverse graph in reverse
     
    # What happens:
    # 1. MseLossBackward0.apply(grad_output=1.0)
    #    -> Computes: grad_output = 2 * (output - target) / output.numel()
    #
    # 2. AddmmBackward0.apply (fc2)
    #    -> Computes: grad_h2 = grad_output @ W2
    #    -> Computes: grad_W2 = grad_output.T @ h2
    #    -> Computes: grad_b2 = grad_output.sum(0)
    #    -> Accumulates in W2.grad and b2.grad
    #
    # 3. ReluBackward0.apply
    #    -> Computes: grad_h1 = grad_h2 * (h2 > 0)
    #
    # 4. AddmmBackward0.apply (fc1)
    #    -> Computes: grad_x = grad_h1 @ W1
    #    -> Computes: grad_W1 = grad_h1.T @ x
    #    -> Computes: grad_b1 = grad_h1.sum(0)
    #    -> Accumulates in W1.grad and b1.grad
     
    # ===== OPTIMIZER STEP =====
    optimizer.step()  # Update parameters using accumulated gradients
    # W1 -= lr * W1.grad
    # b1 -= lr * b1.grad
    # W2 -= lr * W2.grad
    # b2 -= lr * b2.grad
     
    print("\nGradients computed:")
    print(f"fc1.weight.grad shape: {model.fc1.weight.grad.shape}")
    print(f"fc1.bias.grad shape: {model.fc1.bias.grad.shape}")
    print(f"fc2.weight.grad shape: {model.fc2.weight.grad.shape}")
    print(f"fc2.bias.grad shape: {model.fc2.bias.grad.shape}")
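    The optimizer step can be sanity-checked by snapshotting a parameter before step() and applying the SGD update rule by hand. A small sketch, assuming plain SGD with no momentum or weight decay:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(1, 3)).sum()
loss.backward()

w_before = model.weight.detach().clone()
optimizer.step()

# Plain SGD: W -= lr * W.grad (step() does not modify .grad itself)
expected = w_before - 0.1 * model.weight.grad
print(torch.allclose(model.weight, expected))  # True
```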

    Advanced topics

    Use these advanced features to extend the autograd engine with custom differentiation logic or to compute higher-order gradients for specialized mathematical models.

    Custom autograd functions

    You can define custom operations with custom backward logic:

    class CustomExp(torch.autograd.Function):
        @staticmethod
        def forward(ctx, input):
            # Forward computation
            result = torch.exp(input)
            ctx.save_for_backward(result)  # Save for backward
            return result
        
        @staticmethod
        def backward(ctx, grad_output):
            # Custom backward computation
            result, = ctx.saved_tensors
            return grad_output * result  # d(exp(x))/dx = exp(x)
     
    # Usage
    x = torch.tensor(1.0, requires_grad=True)
    y = CustomExp.apply(x)
    y.backward()
    print(x.grad)  # tensor(2.7183) = e^1 
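    You can verify that a custom backward matches the numerical derivative with torch.autograd.gradcheck, which compares the analytical gradient against finite differences (double-precision inputs are needed for its tight tolerances):

```python
import torch

class CustomExp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        result = torch.exp(input)
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# gradcheck perturbs each input element and compares the numerical
# Jacobian with the one implied by our backward()
x = torch.randn(4, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(CustomExp.apply, (x,)))  # True
```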

    Higher-order gradients

    PyTorch supports gradients of gradients, such as Hessians:

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 3  # y = x^3
     
    # First derivative: dy/dx = 3x^2
    grad_x = torch.autograd.grad(y, x, create_graph=True)[0]
    print(grad_x)  # tensor(12.) = 3 * 2^2
     
    # Second derivative: d²y/dx² = 6x
    grad2_x = torch.autograd.grad(grad_x, x)[0]
    print(grad2_x)  # tensor(12.) = 6 * 2
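    The same second derivative can be obtained with the convenience wrapper torch.autograd.functional.hessian, which automates the grad-of-grad pattern above:

```python
import torch
from torch.autograd.functional import hessian

def f(x):
    return (x ** 3).sum()  # Hessian requires a scalar-valued function

x = torch.tensor([2.0])
H = hessian(f, x)
print(H)  # tensor([[12.]]) = 6 * x
```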

    Jacobian and Hessian computation

    The following example shows how to calculate a Jacobian matrix by calling the torch.autograd.grad function for each dimension of the output tensor:

    def compute_jacobian(func, x):
        """Compute Jacobian matrix of func at x."""
        x = x.requires_grad_()
        y = func(x)
        
        jacobian = torch.zeros(y.shape[0], x.shape[0])
        for i in range(y.shape[0]):
            grad_output = torch.zeros_like(y)
            grad_output[i] = 1
            jacobian[i] = torch.autograd.grad(y, x, grad_output, retain_graph=True)[0]
        
        return jacobian
     
    # Example
    def f(x):
        return torch.stack([x[0]**2 + x[1], x[0] * x[1]**2])
     
    x = torch.tensor([2.0, 3.0])
    J = compute_jacobian(f, x)
    print("Jacobian:")
    print(J)
    # tensor([[ 4.,  1.],  <- ∂f1/∂x1=2*x1, ∂f1/∂x2=1
    #         [ 9., 12.]])  <- ∂f2/∂x1=x2^2, ∂f2/∂x2=2*x1*x2
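    For comparison, torch.autograd.functional.jacobian computes the same matrix in a single call (internally it iterates over output rows much like the manual loop):

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    return torch.stack([x[0]**2 + x[1], x[0] * x[1]**2])

x = torch.tensor([2.0, 3.0])
J = jacobian(f, x)
print(J)
# tensor([[ 4.,  1.],
#         [ 9., 12.]])
```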

    Gradient hooks

    Use gradient hooks to intercept and modify gradients during the backward pass:

    # Global hook: called for every parameter
    def gradient_clipper(grad):
        return torch.clamp(grad, -1, 1)
     
    x = torch.tensor(5.0, requires_grad=True)
    y = x ** 2
     
    # Register hook
    handle = x.register_hook(gradient_clipper)
     
    y.backward()
    print(x.grad)  # tensor(1.0) instead of tensor(10.0)
    # Because d(x^2)/dx = 2x = 10, but clipped to [-1, 1]
     
    handle.remove()  # Clean up 
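    Hooks can also be attached at the module level. A small sketch using register_full_backward_hook, which fires once per backward pass through the module and receives the gradients flowing out of and into it:

```python
import torch
import torch.nn as nn

lin = nn.Linear(3, 2)
seen = []

def log_grad(module, grad_input, grad_output):
    # grad_output: gradients with respect to the module's outputs
    seen.append(grad_output[0].shape)

handle = lin.register_full_backward_hook(log_grad)
lin(torch.randn(1, 3)).sum().backward()
handle.remove()

print(seen)  # [torch.Size([1, 2])]
```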

    Graph visualization

    Visualize the computational graph:

    from torchviz import make_dot
     
    x = torch.randn(3, requires_grad=True)
    y = x * 2
    z = y + 1
    loss = z.sum()
     
    # Generate graph visualization
    dot = make_dot(loss, params={'x': x})
    dot.render('computation_graph', format='png')

    Performance tips and best practices

    Use these strategies to optimize your code for speed and memory efficiency when working with the autograd engine.

    Memory management

    Clearing gradients between steps prevents stale gradients from accumulating into the next update. You can set each parameter's .grad to None manually, but calling zero_grad on the optimizer is less error-prone and covers exactly the parameters the optimizer manages.

    # ✅ Good: Clear gradients through the optimizer
    optimizer.zero_grad()
     
    # Equivalent manual loop (verbose, and easy to miss parameters):
    for param in model.parameters():
        param.grad = None
     
    # Setting gradients to None (the default for zero_grad since
    # PyTorch 2.0) skips a memory write compared to zeroing in place:
    optimizer.zero_grad(set_to_none=True)

    Avoid unnecessary graph construction

    You can reduce computational overhead by disabling the autograd engine when you do not need to calculate gradients. This is common during model evaluation or when generating predictions.

    # When you don't need gradients:
    with torch.no_grad():
        predictions = model(test_data)  # No graph built!
        
    # Or use inference mode (even faster):
    with torch.inference_mode():
        predictions = model(test_data)
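    A quick check of both modes, showing that no graph is recorded. The model and input below are stand-ins for your own:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
test_data = torch.randn(1, 4)

with torch.no_grad():
    y = model(test_data)
print(y.requires_grad, y.grad_fn)  # False None

with torch.inference_mode():
    z = model(test_data)
print(z.is_inference())  # True
```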

    Detach tensors to stop gradient flow

    The detach method allows you to stop the flow of gradients through specific parts of your computational graph. This is useful for freezing certain layers or when you want to use a tensor as a constant value in a separate calculation.

    # Detach to stop gradient flow
    x = torch.randn(3, requires_grad=True)
    y = x * 2
    z = y.detach() * 3    # z won't propagate gradients back through y
     
    loss = (y + z).sum()  # y contributes through the live branch only
    loss.backward()
    print(x.grad)  # tensor([2., 2., 2.]): only the y branch reaches x

    Mixed precision training

    Mixed precision training performs most operations in a 16-bit floating-point format while keeping numerically sensitive steps in 32-bit. On compatible hardware, this speeds up calculations and reduces memory consumption with little to no loss of model accuracy; the GradScaler guards against underflow in the 16-bit gradients.

    from torch.amp import autocast, GradScaler
     
    model = MyModel().cuda()
    optimizer = torch.optim.Adam(model.parameters())
    scaler = GradScaler()
     
    for data, target in dataloader:
        optimizer.zero_grad()
        
        # Forward in mixed precision
        with autocast(device_type='cuda'):
            output = model(data)
            loss = criterion(output, target)
        
        # Backward with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    Debugging autograd

    Debugging the autograd engine requires specific tools to identify numerical instabilities and structural errors in the computational graph. Use the following methods to troubleshoot your gradient calculations.

    Detecting anomalies in gradients

    You can use the set_detect_anomaly function to locate the source of NaN or Inf values in your model. When you enable this mode, the engine tracks the forward pass operations and provides a traceback to the specific line of code where a numerical error first appeared.

    torch.autograd.set_detect_anomaly(True)
     
    try:
        loss = model(x)
        loss.backward()
    except RuntimeError as e:
        print(f"Anomaly detected: {e}")
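    A concrete trigger: taking the square root of a negative number produces NaN, and anomaly mode names the backward node responsible:

```python
import torch

torch.autograd.set_detect_anomaly(True)

x = torch.tensor(-1.0, requires_grad=True)
y = torch.sqrt(x)  # forward result is already NaN

try:
    y.backward()
    caught = False
except RuntimeError as e:
    caught = True
    print(f"Anomaly detected: {e}")  # names SqrtBackward0

# Anomaly mode adds significant overhead; disable it outside debugging
torch.autograd.set_detect_anomaly(False)
```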

    Visualizing gradient flow

    Monitoring the magnitude of gradients across different layers helps you identify issues such as vanishing or exploding gradients. The following function calculates the average gradient for each parameter to help you visualize how the information flows through the network during the backward pass.

    def plot_grad_flow(named_parameters):
        """Plot gradient flow through the network."""
        ave_grads = []
        layers = []
        for n, p in named_parameters:
            if p.requires_grad and p.grad is not None:
                layers.append(n)
                ave_grads.append(p.grad.abs().mean().item())
        
        import matplotlib.pyplot as plt
        plt.plot(ave_grads, alpha=0.3, color="b")
        plt.hlines(0, 0, len(ave_grads)+1, linewidth=1, color="k")
        plt.xticks(range(0,len(ave_grads)), layers, rotation="vertical")
        plt.xlabel("Layers")
        plt.ylabel("Average gradient")
        plt.title("Gradient flow")
        plt.grid(True)
        plt.show()
     
    # After backward pass:
    plot_grad_flow(model.named_parameters())

    Inspect the computational graph structure

    You can verify the hierarchy of your model by recursively printing the grad_fn attributes of a tensor. This technique allows you to confirm that the engine is recording operations in the correct order and that every path connects back to a leaf node.

    def print_graph(fn, level=0):
        """Recursively print the backward graph from a grad_fn node."""
        if fn is None:
            return
        print('  ' * level + str(fn))
        for next_fn, _ in getattr(fn, 'next_functions', ()):
            if next_fn is not None:
                print_graph(next_fn, level + 1)
     
    x = torch.randn(3, requires_grad=True)
    y = x * 2
    z = y + 1
    loss = z.sum()
     
    print_graph(loss.grad_fn)
    # Output:
    # <SumBackward0>
    #   <AddBackward0>
    #     <MulBackward0>
    #       <AccumulateGrad>

    Key takeaways for the autograd engine

    The PyTorch autograd engine balances several technical factors to support developer workflows. Its define-by-run approach provides flexibility by supporting standard Python control flow. The engine also ensures efficiency through memory management and graph optimizations.

    While the .backward() method provides a simple interface, it manages the complex internal operations required for differentiation. Additionally, eager execution improves debuggability by making problems easier to trace during development.

    When you call loss.backward(), PyTorch performs several operations in milliseconds. The engine traverses the dynamically built computational graph and unpacks saved tensors while performing version checks. It also accumulates gradients through multiple paths and manages edge cases, such as in-place operations. Finally, the process optimizes memory by freeing the graph.

    You now have a practical understanding of the internal functions of the autograd engine. To learn more, explore the following links:

    • PyTorch autograd documentation
    • Autograd mechanics
    • Custom autograd functions
    • Autograd engine source code
    • Saved Variable Implementation
    • Gradient checkpointing
