Gradient Checkpointing for Long DDE Simulations

Trading Recompute for Memory When Differentiating Through Brain Network Models

Introduction

Long simulations of delay-coupled brain network models are cheap to run forward but expensive to differentiate. Every jax.lax.scan step saves its carry for the backward pass, and for DDEs that carry includes the per-coupling history buffer of shape [history_length, n_states, n_nodes]. Backward memory therefore grows as

\[ \text{memory} \;\propto\; n_\text{steps} \times \text{history length} \times n_\text{states} \times n_\text{nodes} \]

For a BOLD/FC fit at dt = 1 ms, T = 60 s, 84 regions and a ~56 ms maximum delay (the workload benchmarked below), the history buffers alone reach hundreds of megabytes of activations, which can push a gradient over the RAM limit on a workstation even when the forward pass fits comfortably.

The standard remedy is gradient checkpointing: save only a sparse subset of activations and recompute the rest on demand during the backward pass. TVB-Optim exposes this on the native solver path through one optional knob, block_size:

solver = Heun(block_size=256)

With block_size=None (the default) the integration runs as a single jax.lax.scan, with no overhead and no change in behaviour. With an integer K, the scan splits into an outer scan over blocks of K steps wrapped in jax.checkpoint, each block running an inner scan of K steps. Backward memory then scales as O(n_steps/K + K) instead of O(n_steps), for a modest gradient overhead (one extra forward recompute; 17–34% in the benchmark below). The forward pass does the same arithmetic, although the benchmark below measures blocked forward runs 15–35% slower than None. Memory is minimised near K ≈ √n_steps.
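The blocked layout described above can be sketched in plain JAX. This is a minimal illustration under stated assumptions, not TVB-Optim's solver code: `step`, `loss_plain` and `loss_blocked` are hypothetical names, the toy update stands in for a real integration step, and the remainder handling is only one plausible way to treat a tail.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)


def step(carry, _):
    """Toy update standing in for one solver step (hypothetical)."""
    x, a = carry
    return (x + 0.01 * jnp.tanh(a * x), a), None


def loss_plain(a, x0, n_steps):
    """Single scan: residuals of all n_steps are kept for the backward pass."""
    (x, _), _ = jax.lax.scan(step, (x0, a), None, length=n_steps)
    return jnp.sum(x)


def loss_blocked(a, x0, n_steps, block_size):
    """Outer scan over checkpointed blocks, each an inner scan of block_size steps."""
    n_blocks, tail = divmod(n_steps, block_size)

    @jax.checkpoint
    def block(carry, _):
        # Only the block's input carry is saved; the inner steps are
        # recomputed when the backward pass reaches this block.
        carry, _ = jax.lax.scan(step, carry, None, length=block_size)
        return carry, None

    carry = (x0, a)
    if n_blocks:
        carry, _ = jax.lax.scan(block, carry, None, length=n_blocks)
    if tail:
        # Remainder steps run as an ordinary, uncheckpointed scan.
        carry, _ = jax.lax.scan(step, carry, None, length=tail)
    return jnp.sum(carry[0])


x0 = jnp.linspace(0.1, 1.0, 4)
a = jnp.asarray(1.5)
g_plain = jax.grad(loss_plain)(a, x0, 1000)
g_blocked = jax.grad(loss_blocked)(a, x0, 1000, 32)
print(float(g_plain), float(g_blocked))
```

Both functions apply the same step in the same order, so the two gradients should agree up to floating-point rounding; only what is stored for the backward pass differs.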

block_size is the solver’s single block unit for all streaming features. Two consequences matter here:

On a stochastic network it also switches noise to per-block generation, which reseeds the realization. To keep a clean checkpointing benchmark (common random numbers, the same noise path for every block_size), we inject one fixed noise tensor so the block path uses it verbatim. Streaming-noise memory is a separate axis; see Streaming Reductions.
It is also the grain for online reduce statistics (e.g. streamed FC), not covered here.

Scope and limitations

Native solvers only. DiffraxSolver is unaffected; Diffrax has its own RecursiveCheckpointAdjoint, which does not support delays.
No effect when block_size is None. The default falls through to the original jax.lax.scan and is bit-exact with prior versions.
Forward memory is unaffected. Forward sims never retain step activations; checkpointing only matters for gradients.
SDE noise is held fixed here. A fixed noise tensor is injected so every config integrates the same path and the checkpointed gradient can be compared directly with the uncheckpointed one.

Environment Setup and Imports

import time
import gc
import os
import threading
import numpy as np
import matplotlib.pyplot as plt
import jax
import jax.numpy as jnp
import equinox as eqx

try:
    import psutil

    _HAS_PSUTIL = True
except ImportError:
    _HAS_PSUTIL = False

# Enable float64 for numerically stable comparisons.
jax.config.update("jax_enable_x64", True)

from tvboptim.experimental.network_dynamics import Network, prepare
from tvboptim.experimental.network_dynamics.dynamics.tvb import ReducedWongWang
from tvboptim.experimental.network_dynamics.coupling import DelayedLinearCoupling
from tvboptim.experimental.network_dynamics.graph import DenseDelayGraph
from tvboptim.experimental.network_dynamics.noise import AdditiveNoise
from tvboptim.experimental.network_dynamics.solvers import Heun
from tvboptim.observations.tvb_monitors.bold import HRFBold
from tvboptim.observations.observation import compute_fc, rmse
from tvboptim.data import load_structural_connectivity, load_functional_connectivity
from tvboptim.utils import set_cache_path, cache

set_cache_path("./gradient_checkpointing_benchmark")

Workload: RWW + Delays + BOLD FC Fitting

We reuse the Reduced Wong-Wang / BOLD / FC workflow from RWW.qmd, swapping FastLinearCoupling for DelayedLinearCoupling: the configuration where gradient memory usually becomes the bottleneck for empirical fits. Structural connectivity is the dk_average parcellation (84 regions, matching the printed n_nodes), with tract lengths converted to delays at a conduction speed of 4 mm/ms.

DT = 1.0                  # Integration step (ms)
T1 = 60_000.0             # Total simulation length (ms) — 60 s
N_STEPS = int(T1 / DT)    # 60_000 integration steps
CONDUCTION_SPEED = 4.0   # mm/ms

# Load empirical structural and functional connectivity.
weights, lengths, region_labels = load_structural_connectivity(name="dk_average")
weights = weights / np.max(weights)
delays = jnp.asarray(lengths / CONDUCTION_SPEED)
n_nodes = weights.shape[0]

fc_target = load_functional_connectivity(name="dk_average")

# Build the network: RWW dynamics + delayed linear coupling + additive noise.
graph = DenseDelayGraph(
    weights=jnp.asarray(weights),
    delays=delays,
    region_labels=region_labels,
)
dynamics = ReducedWongWang(w=0.5, I_o=0.32, INITIAL_STATE=(0.3,))
coupling = DelayedLinearCoupling(
    incoming_states="S",
    G=0.5,
    buffer_strategy="roll",
)
noise = AdditiveNoise(sigma=0.00283, apply_to="S", key=jax.random.key(0))
network = Network(
    dynamics=dynamics,
    coupling={"delayed": coupling},
    graph=graph,
    noise=noise,
)

# BOLD monitor — TR = 1 s, intermediate downsample matches dt.
bold_monitor = HRFBold(period=1000.0, downsample_period=DT, voi=0)

max_delay = float(delays.max())
history_length = int(np.ceil(max_delay / DT)) + 1
print(f"n_nodes={n_nodes}  n_steps={N_STEPS}  history_length={history_length}")
print(f"max delay = {max_delay:.2f} ms")

A single coupling’s history buffer is roughly history_length × n_states × n_nodes × 8 bytes per step. Over ~60 000 steps the forward-saved coupling state alone runs into hundreds of megabytes, on top of the dynamics state, noise tensor, and auxiliary tape.
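As a rough worked example, the per-step product can be evaluated for this workload. The sketch below assumes the values printed by the run reported in the summary table (n_nodes=84, history_length=58) and one coupled state variable; `naive_history_tape_bytes` is a hypothetical helper, not part of TVB-Optim.

```python
def naive_history_tape_bytes(n_steps, history_length, n_states, n_nodes, itemsize=8):
    """Bytes needed if every step stored a full float64 copy of one history buffer."""
    per_step = history_length * n_states * n_nodes * itemsize
    return per_step, n_steps * per_step


per_step, total = naive_history_tape_bytes(
    n_steps=60_000, history_length=58, n_states=1, n_nodes=84
)
print(f"per step: {per_step / 1e3:.1f} kB, all steps: {total / 1e9:.2f} GB")
# prints "per step: 39.0 kB, all steps: 2.34 GB"
```

The measured None baseline in the summary table (about 763 MB) is well below this figure, so the product is better read as a scaling law and an upper bound than as a prediction of what the backward pass actually retains.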

Benchmark

We sweep block_size and measure forward time, gradient time, and peak memory (where the backend supports it). The grid spans:

None: the default single jax.lax.scan, the performance reference.
small K: frequent checkpoints, a short inner tape but many stored block boundaries.
K ≈ √n_steps: the theoretical memory minimum.
large K: sparse checkpoints, near no-checkpoint cost.
a non-divisor K: exercises the main-scan plus tail-scan path.
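Which grid values actually leave a tail is plain modular arithmetic. The short check below uses the block sizes swept in this benchmark and n_steps = 60 000; the variable names are illustrative only.

```python
N_STEPS = 60_000
GRID = [32, 128, 256, 512, 1024, 2048, 8192, 30_000, 60_000]

# Remainder steps that would fall outside the full blocks of size K.
tails = {k: N_STEPS % k for k in GRID}
for k, tail in tails.items():
    kind = "no tail" if tail == 0 else f"tail of {tail} steps"
    print(f"K={k:>6}: {N_STEPS // k:>5} full blocks, {kind}")
```

Only 32, 30 000 and 60 000 divide n_steps exactly; every other grid value exercises the tail path.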

Benchmark Setup

# K = None is the baseline. The dense middle (128, 256, 512, 1024, 2048)
# brackets sqrt(n_steps) so the U-shape near the minimum is well-resolved,
# while the wings (32, 8192, 30000) cover the asymptotic regimes. K = 32,
# K = 30000 and K = N_STEPS divide n_steps exactly (no tail). The other
# values leave a remainder and therefore exercise the main-scan + tail-scan
# path, which matters for the memory story; see "Reading the Memory Curve".
BLOCK_SIZE_VALUES = [None, 32, 128, 256, 512, 1024, 2048, 8192, 30000, N_STEPS]
N_FORWARD_RUNS = 3
N_GRADIENT_RUNS = 3
G_INIT = jnp.asarray(0.5)

# Fixed noise realization (common random numbers). Injecting this into the
# config makes `block_size` do pure gradient checkpointing rather than per-block
# streaming: every config integrates the same noise path, so the checkpointed
# gradient can be compared directly with the uncheckpointed one and the
# benchmark isolates the activation-tape effect. Shape is
# [n_steps, n_noise_states, n_nodes].
n_noise_states = len(network.noise._state_indices)
FIXED_NOISE = jax.random.normal(
    network.noise.key, (N_STEPS, n_noise_states, n_nodes)
)


class RSSPeakMonitor:
    """Context manager that records peak process RSS during the with-block.

    Background thread polls ``psutil.Process.memory_info().rss`` at
    ``sample_interval`` seconds and tracks the maximum observed. On exit
    ``peak_delta_bytes`` holds the peak minus the baseline RSS taken just
    before entry — i.e. the transient memory added by the block.

    This is a *pragmatic CPU proxy*, not an accelerator profile:

    - Linux RSS is process-resident memory and includes Python objects,
      JIT artifacts, XLA scratch, and pooled CPU allocations. JAX on CPU
      uses the system allocator, so transient activations show up here.
    - ~50 ms sampling can miss sub-50 ms peaks; gradient passes through
      tens of thousands of steps run for many seconds, so the sampler
      catches the activation peak comfortably.
    - **Pool effects matter.** XLA's CPU allocator pools pages and does
      not always release them between configs. The reported delta is the
      *additional* RSS the process had to allocate during the call —
      configs whose peak fits inside memory already pooled by a previous
      config will report a small or zero delta even though their
      absolute requirement is non-trivial. The sweep below runs the
      uncheckpointed ``None`` baseline *first*, so the checkpointed
      configs that follow are measured against an already-grown pool
      and their deltas are best read as marginal storage on top of it,
      not as absolute requirements.
    - On GPU/TPU the activation tape lives in device memory, not host
      RSS — use ``jax.devices()[0].memory_stats()['peak_bytes_in_use']``
      there instead. This monitor is the CPU fallback.
    """

    def __init__(self, sample_interval: float = 0.05):
        self.sample_interval = sample_interval
        self.peak_delta_bytes = None

    def __enter__(self):
        if not _HAS_PSUTIL:
            return self
        self._process = psutil.Process()
        self._baseline = self._process.memory_info().rss
        self._peak = self._baseline
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._sample, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if not _HAS_PSUTIL:
            return False
        self._stop.set()
        self._thread.join()
        self.peak_delta_bytes = max(0, self._peak - self._baseline)
        return False

    def _sample(self):
        while not self._stop.is_set():
            try:
                rss = self._process.memory_info().rss
                if rss > self._peak:
                    self._peak = rss
            except Exception:
                break
            self._stop.wait(self.sample_interval)


def benchmark_one(block_size, fc_target):
    """Time forward + gradient, capture peak RSS during gradient, and return
    the gradient value for cross-check."""
    solver = Heun(block_size=block_size)
    solve_fn, state = prepare(network, solver, t0=0.0, t1=T1, dt=DT)
    # Inject the fixed noise so block_size does pure checkpointing (no per-block
    # streaming / reseed); all configs then share the same realization.
    state._internal.noise_samples = FIXED_NOISE
    solve_fn = jax.jit(solve_fn)

    def loss(G):
        cfg = eqx.tree_at(lambda c: c.coupling.delayed.G, state, G)
        result = solve_fn(cfg)
        bold = bold_monitor(result)
        fc = compute_fc(bold, skip_t=20)
        return rmse(fc, jnp.asarray(fc_target))

    grad_fn = jax.jit(jax.value_and_grad(loss))

    # Warm up (JIT compile both paths) so allocations from compilation do
    # not contaminate the peak-RSS measurement below.
    jax.block_until_ready(solve_fn(state).ys)
    v0, g0 = grad_fn(G_INIT)
    jax.block_until_ready(g0)
    del g0

    # Capture peak RSS delta during one fresh gradient call. The activation
    # tape for the backward pass is the headline memory cost, so we measure
    # exactly that. gc.collect() drops any temporaries from the warmup so
    # the baseline is as flat as possible.
    gc.collect()
    monitor = RSSPeakMonitor(sample_interval=0.05)
    with monitor:
        v_mem, g_mem = grad_fn(G_INIT)
        jax.block_until_ready(g_mem)
    peak_delta = monitor.peak_delta_bytes
    g_value_for_check = float(g_mem)
    del v_mem, g_mem
    gc.collect()

    fwd_times = []
    for _ in range(N_FORWARD_RUNS):
        t = time.perf_counter()
        r = solve_fn(state)
        jax.block_until_ready(r.ys)
        fwd_times.append(time.perf_counter() - t)

    grad_times = []
    for _ in range(N_GRADIENT_RUNS):
        t = time.perf_counter()
        v, g = grad_fn(G_INIT)
        jax.block_until_ready(g)
        grad_times.append(time.perf_counter() - t)

    return {
        "fwd_mean": float(np.mean(fwd_times)),
        "fwd_std": float(np.std(fwd_times)),
        "grad_mean": float(np.mean(grad_times)),
        "grad_std": float(np.std(grad_times)),
        "loss": float(v0),
        "grad_value": g_value_for_check,
        "peak_bytes_delta": peak_delta,
    }


@cache("block_size_sweep")
def run_sweep():
    results = {}
    for k in BLOCK_SIZE_VALUES:
        label = "None" if k is None else str(k)
        print(f"block_size = {label} ...", flush=True)
        results[label] = benchmark_one(k, fc_target)
        gc.collect()
    return results


sweep_results = run_sweep()

Results

Plotting code

baseline = sweep_results["None"]
sqrt_n = np.sqrt(N_STEPS)

# K-axis panels drop "None" — it has no x-coordinate on a block_size
# axis, only a horizontal-reference role. The Pareto panel keeps it as a
# distinct star marker because its axes are (time, memory) and there is no
# overlap risk.

ck_labels = [l for l in sweep_results if l != "None"]
xs_raw = np.array([float(l) for l in ck_labels])
order = np.argsort(xs_raw)
ck_labels = [ck_labels[i] for i in order]
xs = xs_raw[order]
fwd = np.array([sweep_results[l]["fwd_mean"] for l in ck_labels])
fwd_err = np.array([sweep_results[l]["fwd_std"] for l in ck_labels])
grad = np.array([sweep_results[l]["grad_mean"] for l in ck_labels])
grad_err = np.array([sweep_results[l]["grad_std"] for l in ck_labels])

peaks_all = [sweep_results[l]["peak_bytes_delta"] for l in sweep_results]
has_memory = all(p is not None for p in peaks_all)
if has_memory:
    mem_ck_mb = np.array(
        [sweep_results[l]["peak_bytes_delta"] for l in ck_labels], dtype=float
    ) / 1e6


def _mark_sqrt_n(ax):
    """Vertical reference line + label at √n_steps, anchored near the top."""
    ax.axvline(sqrt_n, color="gray", linestyle="--", alpha=0.5, zorder=0)
    ymin, ymax = ax.get_ylim()
    y = ymax / ((ymax / ymin) ** 0.05) if ax.get_yscale() == "log" else ymax - 0.05 * (ymax - ymin)
    ax.text(sqrt_n, y, r"$\sqrt{n_\mathrm{steps}}$",
            color="gray", fontsize=12, ha="center", va="top",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.8, pad=2))


def _pareto_front(times, mems):
    """Return boolean mask of Pareto-optimal points (minimise time AND memory).

    A point is dominated if some other point has time<= and memory<= with at
    least one strict inequality. The remaining points form the Pareto front.
    """
    n = len(times)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if (times[j] <= times[i] and mems[j] <= mems[i]
                    and (times[j] < times[i] or mems[j] < mems[i])):
                keep[i] = False
                break
    return keep


# Bump default font sizes for the whole figure via a context manager so other
# notebook plots are not affected.
with plt.rc_context({
    "font.size": 12,
    "axes.titlesize": 14,
    "axes.labelsize": 13,
    "xtick.labelsize": 11,
    "ytick.labelsize": 11,
    "legend.fontsize": 11,
}):
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # === Top row: time ===

    # --- Top-left: time vs block_size (single log y-axis) ---
    # Forward (~0.4-0.5 s) and gradient (~2.5-3.3 s) differ by roughly 6x, so a
    # single log y-axis separates them cleanly and keeps each None baseline next
    # to its own curve without the two dashed references overlapping (which a
    # linear twin-axis layout did). The fine overhead detail lives in the panel
    # to the right (grad / forward ratio).
    ax = axes[0, 0]
    fwd_color = "steelblue"
    grad_color = "firebrick"

    ax.errorbar(xs, fwd, yerr=fwd_err, marker="o", color=fwd_color,
                label="forward", lw=1.8, markersize=7, capsize=3)
    ax.axhline(baseline["fwd_mean"], color=fwd_color, linestyle="dashed",
               alpha=0.7, label="forward (None)")
    ax.errorbar(xs, grad, yerr=grad_err, marker="s", color=grad_color,
                label="gradient", lw=1.8, markersize=7, capsize=3)
    ax.axhline(baseline["grad_mean"], color=grad_color, linestyle="dashed",
               alpha=0.7, label="gradient (None)")

    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("block_size")
    ax.set_ylabel("wall time (s)")
    ax.set_title("Time vs block_size")
    ax.legend(loc="best", framealpha=0.9, ncol=2)
    ax.grid(alpha=0.3, which="both")
    _mark_sqrt_n(ax)

    # --- Top-right: grad/forward ratio ---
    ax = axes[0, 1]
    ratio = grad / fwd
    baseline_ratio = baseline["grad_mean"] / baseline["fwd_mean"]
    ax.plot(xs, ratio, marker="^", color="darkgreen", lw=1.8, markersize=8,
            label="grad / forward")
    ax.axhline(baseline_ratio, color="darkgreen", linestyle="dashed", alpha=0.7,
               label=f"None baseline ({baseline_ratio:.2f}×)")
    ax.set_xscale("log")
    ax.set_xlabel("block_size")
    ax.set_ylabel("grad / forward")
    ax.set_title("Gradient overhead")
    ax.grid(alpha=0.3, which="both")
    ax.legend(loc="best", framealpha=0.9)
    _mark_sqrt_n(ax)

    # === Bottom row: memory ===

    # --- Bottom-left: memory vs block_size ---
    ax = axes[1, 0]
    if has_memory:
        ax.plot(xs, mem_ck_mb, marker="D", color="purple", lw=1.8,
                markersize=8, label="peak RSS delta during grad")
        ax.axhline(baseline["peak_bytes_delta"] / 1e6, color="purple",
                   linestyle="dashed", alpha=0.7, label="None baseline")
        ax.set_xscale("log")  # block_size spans decades; y stays linear so the
        # U-shape and the absolute MB differences read directly.
        ax.set_ylim(bottom=0)
        ax.set_xlabel("block_size")
        ax.set_ylabel("peak RSS delta during grad (MB)")
        ax.set_title("Memory vs block_size")
        ax.grid(alpha=0.3, which="both")
        ax.legend(loc="best", framealpha=0.9)
        _mark_sqrt_n(ax)
    else:
        ax.text(0.5, 0.5,
                "Peak memory unavailable\n(psutil not installed)",
                transform=ax.transAxes, ha="center", va="center",
                fontsize=12,
                bbox=dict(boxstyle="round,pad=0.5", facecolor="lightyellow"))
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title("Memory vs block_size (unavailable)")

    # --- Bottom-right: memory–time Pareto ---
    # Two cleanup ideas vs the old "connect-by-time" line, which crossed
    # itself wherever memory did not move monotonically with time:
    #   1. Drop the connecting line; the scatter alone carries the points.
    #   2. Compute the actual Pareto front (non-dominated points) and
    #      connect *only* those with a clean monotone curve.
    # We do both — the front is a thin solid line, dominated points are
    # plotted as scatter but not connected, and None is highlighted as a
    # red star because it lies on the front but represents the no-checkpoint
    # baseline.
    ax = axes[1, 1]
    if has_memory:
        grad_all = np.array([sweep_results[l]["grad_mean"]
                             for l in sweep_results])
        mem_all = np.array([sweep_results[l]["peak_bytes_delta"]
                            for l in sweep_results], dtype=float) / 1e6
        label_all = list(sweep_results.keys())
        pareto_mask = _pareto_front(grad_all, mem_all)

        # Pareto-front line: sort the kept points by time so the line is
        # monotone (memory decreases as time increases along a true front).
        kept = np.where(pareto_mask)[0]
        kept = kept[np.argsort(grad_all[kept])]
        ax.plot(grad_all[kept], mem_all[kept], color="gray", lw=2.0,
                alpha=0.6, zorder=1, label="Pareto front")

        # Scatter all points, distinguishing None and Pareto vs dominated.
        for i, l in enumerate(label_all):
            on_front = pareto_mask[i]
            x, y = grad_all[i], mem_all[i]
            if l == "None":
                ax.scatter([x], [y], s=240, marker="*", color="crimson",
                           edgecolor="black", linewidth=0.8, zorder=4,
                           label="None (baseline)")
            elif on_front:
                ax.scatter([x], [y], s=80, color="purple",
                           edgecolor="black", linewidth=0.5, zorder=3)
            else:
                ax.scatter([x], [y], s=55, facecolor="white",
                           edgecolor="purple", linewidth=1.3, zorder=2)
            # Label only the front points and None: the dominated points cluster
            # near the front and their labels collide. Alternate the vertical
            # offset to further reduce overlap among the labelled ones.
            if on_front or l == "None":
                dy = 8 if (i % 2 == 0) else -12
                ax.annotate(l, (x, y), textcoords="offset points",
                            xytext=(8, dy), fontsize=10)

        ax.set_xlabel("gradient time (s)")
        ax.set_ylabel("peak RSS delta during grad (MB)")
        ax.set_yscale("log")
        ax.set_title("Memory vs time Pareto")
        ax.grid(alpha=0.3, which="both")
        # Custom legend: front line + filled marker (on front) + hollow
        # marker (dominated) + None star.
        from matplotlib.lines import Line2D
        legend_elems = [
            Line2D([0], [0], color="gray", lw=2.0, alpha=0.6,
                   label="Pareto front"),
            Line2D([0], [0], marker="o", color="w",
                   markerfacecolor="purple", markeredgecolor="black",
                   markersize=9, label="on front"),
            Line2D([0], [0], marker="o", color="w",
                   markerfacecolor="white", markeredgecolor="purple",
                   markersize=8, markeredgewidth=1.3,
                   label="dominated"),
            Line2D([0], [0], marker="*", color="w",
                   markerfacecolor="crimson", markeredgecolor="black",
                   markersize=14, label="None"),
        ]
        ax.legend(handles=legend_elems, loc="best", framealpha=0.9)
    else:
        ax.text(0.5, 0.5,
                "Peak memory unavailable\n(psutil not installed)",
                transform=ax.transAxes, ha="center", va="center",
                fontsize=12,
                bbox=dict(boxstyle="round,pad=0.5", facecolor="lightyellow"))
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title("Memory vs time Pareto (unavailable)")

    plt.tight_layout()
    plt.show()

Figure 1: **Gradient checkpointing benchmark.** Top row, *time*. Top-left: forward and gradient wall time vs `block_size` on a shared log y-axis (roughly 6× apart); dashed horizontals mark each curve’s `None` baseline and the dashed vertical marks `√n_steps`. Top-right: per-call gradient-to-forward ratio. Bottom row, *memory*. Bottom-left: peak RSS delta during a gradient call vs `block_size` (linear y), showing the `O(n_steps/K + K)` minimum near `√n_steps`. Bottom-right: memory vs time Pareto (front and `None` labelled), with the `None` star at the low-time, high-memory extreme and checkpointed points tracing the front.

Reading the Memory Curve

The bottom-left panel follows the classical analysis. Peak gradient memory scales as

\[ \mathrm{peak\,memory} \;\approx\; \underbrace{\frac{n_\text{steps}}{K} \cdot c_\text{outer}}_{\text{block-boundary tape}} \;+\; \underbrace{K \cdot c_\text{inner}}_{\text{per-block inner tape during backward}} \]

with a minimum near $K \approx \sqrt{n_\text{steps} \cdot c_\text{outer} / c_\text{inner}}$, close to $\sqrt{n_\text{steps}}$ for this workload. Three effects bend the textbook curve:

Checkpoint boundaries inflate the inner tape. XLA cannot fuse across a jax.checkpoint boundary and must keep the per-step VJP tape for rematerialisation, so $c_\text{inner}$ exceeds the uncheckpointed per-step cost, most of all for short inner scans. The per-step cost inferred from the None baseline is thus an optimistic lower bound on $c_\text{inner}$.
A non-divisor K leaves a tail. When n_steps % K != 0 the remainder runs as a plain jax.lax.scan whose tape stays live through the backward pass, adding $\mathrm{remainder} \cdot c_\text{unchecked}$ to the peak. Prefer K that divides, or nearly divides, n_steps.
K = n_steps saves nothing. It still wraps one scan in jax.checkpoint, so backward rematerialises the full tape (peak near None) while paying an extra forward.

The result is a U-shape with a broad minimum between K = 256 and K = 1024, around $\sqrt{n_\text{steps}} \approx 245$. At the measured minimum (K = 512, 223 MB) gradient memory is about 3.4× lower than with None (763 MB).
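The location of the textbook minimum follows from a one-line calculation. Treating $K$ as continuous and ignoring the tail and any fixed costs (both simplifying assumptions), write $n = n_\text{steps}$ and

\[ M(K) \;=\; \frac{n \, c_\text{outer}}{K} + K \, c_\text{inner}, \qquad \frac{dM}{dK} \;=\; -\frac{n \, c_\text{outer}}{K^2} + c_\text{inner} \;=\; 0 \;\;\Longrightarrow\;\; K^\ast = \sqrt{\frac{n \, c_\text{outer}}{c_\text{inner}}}, \qquad M(K^\ast) = 2\sqrt{n \, c_\text{outer} \, c_\text{inner}}. \]

With $c_\text{outer} = c_\text{inner}$ and $n = 60\,000$ this gives $K^\ast \approx 245$. Because $M(K)$ is flat around $K^\ast$, moving $K$ by a factor of two in either direction raises the model value by only 25%, which is consistent with the broad measured minimum.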

Correctness Check

A checkpointed gradient should match the uncheckpointed one up to floating-point effects: the forward pass runs the same scan body with only the loop nesting changed, and the backward pass differs by recompute rounding. In the summary table below the loss differs only in the sixteenth decimal place, and the |Δgrad/grad| column lies between about 1e-10 and 6e-9 for every block size. That is above double-precision rounding (about 1e-16) but far too small to matter for an optimisation step, so checkpointing does not materially change the result.

Summary Table

All measured quantities in one self-contained table, copy-pasteable into an issue or back to an LLM. fwd_ratio and grad_ratio are normalised to the None baseline; peak_MB is the peak process-RSS delta during one gradient call (a CPU proxy via psutil), or NA when psutil is not installed. The code here does not read device memory, so on GPU/TPU this column does not reflect the activation tape.

Table code

baseline = sweep_results["None"]
header = (
    f"{'block_size':<12} "
    f"{'fwd_s':<14} "
    f"{'grad_s':<14} "
    f"{'grad/fwd':<10} "
    f"{'fwd_ratio':<11} "
    f"{'grad_ratio':<11} "
    f"{'peak_MB':<10} "
    f"{'loss':<22} "
    f"{'grad':<14} "
    f"{'|Δgrad/grad|':<14}"
)
print(header)
print("-" * len(header))
for label, r in sweep_results.items():
    fwd = f"{r['fwd_mean']:.4f}±{r['fwd_std']:.4f}"
    grd = f"{r['grad_mean']:.4f}±{r['grad_std']:.4f}"
    ratio = r["grad_mean"] / r["fwd_mean"]
    fwd_ratio = r["fwd_mean"] / baseline["fwd_mean"]
    grad_ratio = r["grad_mean"] / baseline["grad_mean"]
    peak = (
        f"{r['peak_bytes_delta'] / 1e6:.1f}"
        if r["peak_bytes_delta"] is not None
        else "NA"
    )
    rel = abs((r["grad_value"] - baseline["grad_value"]) / baseline["grad_value"])
    print(
        f"{label:<12} "
        f"{fwd:<14} "
        f"{grd:<14} "
        f"{ratio:<10.2f} "
        f"{fwd_ratio:<11.2f} "
        f"{grad_ratio:<11.2f} "
        f"{peak:<10} "
        f"{r['loss']:<22.16f} "
        f"{r['grad_value']:<14.6e} "
        f"{rel:<14.3e}"
    )

# Compact context block (helpful when sharing the table).
print()
print(
    f"# workload: n_nodes={n_nodes}, n_steps={N_STEPS}, dt={DT}, T={T1/1000:.0f}s, "
    f"max_delay={max_delay:.1f}ms, history_length={history_length}"
)
print(f"# sqrt(n_steps) ≈ {int(np.sqrt(N_STEPS))}  (memory-optimal block size)")
print(f"# device: {jax.devices()[0].platform}  jax {jax.__version__}")

block_size   fwd_s          grad_s         grad/fwd   fwd_ratio   grad_ratio  peak_MB    loss                   grad           |Δgrad/grad|  
---------------------------------------------------------------------------------------------------------------------------------------------
None         0.3921±0.0028  2.4834±0.0446  6.33       1.00        1.00        763.1      0.3293338539114030     1.138942e-02   0.000e+00     
32           0.4523±0.0090  3.0469±0.0278  6.74       1.15        1.23        290.6      0.3293338539114031     1.138942e-02   6.348e-09     
128          0.4658±0.0096  2.9021±0.1038  6.23       1.19        1.17        236.5      0.3293338539114031     1.138942e-02   6.053e-09     
256          0.5288±0.0103  3.1394±0.0649  5.94       1.35        1.26        227.5      0.3293338539114031     1.138942e-02   6.059e-09     
512          0.5077±0.0151  3.1904±0.0167  6.28       1.29        1.28        223.0      0.3293338539114031     1.138942e-02   6.362e-09     
1024         0.5184±0.0297  3.2772±0.0576  6.32       1.32        1.32        224.9      0.3293338539114031     1.138942e-02   6.368e-09     
2048         0.5260±0.0036  3.3347±0.0702  6.34       1.34        1.34        231.1      0.3293338539114031     1.138942e-02   1.312e-10     
8192         0.4982±0.0251  3.2269±0.0518  6.48       1.27        1.30        241.0      0.3293338539114031     1.138942e-02   1.408e-10     
30000        0.4918±0.0052  3.2525±0.0130  6.61       1.25        1.31        363.3      0.3293338539114031     1.138942e-02   1.397e-10     
60000        0.4509±0.0040  3.0381±0.0079  6.74       1.15        1.22        741.6      0.3293338539114031     1.138942e-02   1.373e-10     

# workload: n_nodes=84, n_steps=60000, dt=1.0, T=60s, max_delay=56.1ms, history_length=58
# sqrt(n_steps) ≈ 244  (memory-optimal block size)
# device: cpu  jax 0.9.2

No-Regression Check

Because block_size=None selects the original jax.lax.scan call site verbatim, the default’s forward and gradient times stay within timing noise of the non-checkpointed implementation. Note that K = n_steps is not the same as None: it still wraps the single inner scan in jax.checkpoint, so backward recomputes the whole forward once and costs more than None. Only block_size=None skips checkpointing entirely.

Practical Guidance

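import math
from tvboptim.experimental.network_dynamics.solvers import Heun

n_steps = 60_000  # e.g. T = 60 s at dt = 1 ms, as in the benchmark above

# Default: no checkpointing. Fastest gradient when memory is not the issue.
solver = Heun()

# Memory-oriented starting point when gradients no longer fit in memory.
solver = Heun(block_size=int(math.sqrt(n_steps)))

# Smaller blocks: shorter inner tape but more stored block boundaries. In the
# benchmark above K = 32 used more memory (291 MB) than K = 256-1024 (~225 MB),
# so measure before relying on this if the sqrt choice still OOMs.
solver = Heun(block_size=64)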

The same field works on Euler, Heun, RungeKutta4, and any BoundedSolver wrapping one of those; the setting is delegated through the wrapper to the base solver.
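Since a non-divisor K leaves an uncheckpointed tail, one option is to pick the divisor of n_steps closest to √n_steps. The helper below is a hypothetical convenience, not part of TVB-Optim; it is a brute-force sketch that is adequate for step counts in the tens of thousands.

```python
import math


def nearest_divisor_block_size(n_steps):
    """Return the divisor of n_steps closest to sqrt(n_steps)."""
    target = math.sqrt(n_steps)
    divisors = [k for k in range(1, n_steps + 1) if n_steps % k == 0]
    return min(divisors, key=lambda k: abs(k - target))


block_size = nearest_divisor_block_size(60_000)
print(block_size, 60_000 // block_size)
```

For n_steps = 60 000 this picks 240 (250 blocks, no tail). If n_steps is prime the only divisors are 1 and n_steps, in which case a non-divisor near √n_steps with a short tail is the better choice.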

--- title: "Gradient Checkpointing for Long DDE Simulations" subtitle: "Trading Recompute for Memory When Differentiating Through Brain Network Models" format: html: code-fold: false toc: true echo: false embed-resources: true fig-width: 8 out-width: "100%" jupyter: python3 execute: cache: true --- Try this notebook interactively: [Download .ipynb](https://github.com/virtual-twin/tvboptim/blob/main/docs/advanced/gradient_checkpointing.ipynb){.btn .btn-primary download="gradient_checkpointing.ipynb"} [Download .qmd](gradient_checkpointing.qmd){.btn .btn-secondary download="gradient_checkpointing.qmd"} [Open in Colab](https://colab.research.google.com/github/virtual-twin/tvboptim/blob/main/docs/advanced/gradient_checkpointing.ipynb){.btn .btn-warning target="_blank"} ## Introduction Long simulations of delay-coupled brain network models are cheap to run forward but expensive to differentiate. Every `jax.lax.scan` step saves its carry for the backward pass, and for DDEs that carry includes the per-coupling history buffer of shape `[history_length, n_states, n_nodes]`. Backward memory therefore grows as $$ \text{memory} \;\propto\; n_\text{steps} \times \text{history length} \times n_\text{states} \times n_\text{nodes} $$ For a BOLD/FC fit at `dt = 1 ms`, `T = 60 s`, ~80 regions and ~20 ms maximum delay, the history buffers alone reach hundreds of megabytes of activations, which can push a gradient over the RAM limit on a workstation even when the forward pass fits comfortably. ```{python} #| output: false #| echo: false try: import google.colab print("Running in Google Colab - installing dependencies...") !pip install -q tvboptim print("✓ Dependencies installed!") except ImportError: pass ``` The standard remedy is **gradient checkpointing**: save only a sparse subset of activations and recompute the rest on demand during the backward pass. 
TVB-Optim exposes this on the native solver path through one optional knob, `block_size`: ```python solver = Heun(block_size=256) ``` With `block_size=None` (the default) the integration runs as a single `jax.lax.scan`, with no overhead and no change in behaviour. With an integer `K`, the scan splits into an outer scan over blocks of `K` steps wrapped in `jax.checkpoint`, each block running an inner scan of `K` steps. Backward memory then scales as `O(n_steps/K + K)` instead of `O(n_steps)`, for a modest gradient overhead (one extra forward recompute, small next to an already heavier backward). Forward time is unchanged, and memory is minimised near `K ≈ √n_steps`. `block_size` is the solver's single block unit for all streaming features. Two consequences matter here: - On a **stochastic** network it also switches noise to per-block generation, which reseeds the realization. To keep a clean checkpointing benchmark (common random numbers, bit-exact across `block_size`), we inject one fixed noise tensor so the block path uses it verbatim. Streaming-noise memory is a separate axis; see [Streaming Reductions](streaming_reductions.qmd). - It is also the grain for online `reduce` statistics (e.g. streamed FC), not covered here. ::: {.callout-note} ## Scope and limitations - **Native solvers only.** `DiffraxSolver` is unaffected; Diffrax has its own `RecursiveCheckpointAdjoint`, which does not support delays. - **No effect when `block_size is None`.** The default falls through to the original `jax.lax.scan` and is bit-exact with prior versions. - **Forward is unaffected.** Forward sims never retain step activations; checkpointing only matters for gradients. - **SDE noise is held fixed here.** A fixed noise tensor is injected so every config integrates the same path and the checkpointed gradient stays bit-exact. 
:::

```{python}
#| output: false
#| code-fold: true
#| code-summary: "Environment Setup and Imports"
#| echo: true
import time
import gc
import os
import threading
import numpy as np
import matplotlib.pyplot as plt
import jax
import jax.numpy as jnp
import equinox as eqx

try:
    import psutil
    _HAS_PSUTIL = True
except ImportError:
    _HAS_PSUTIL = False

# Enable float64 for numerically stable comparisons.
jax.config.update("jax_enable_x64", True)

from tvboptim.experimental.network_dynamics import Network, prepare
from tvboptim.experimental.network_dynamics.dynamics.tvb import ReducedWongWang
from tvboptim.experimental.network_dynamics.coupling import DelayedLinearCoupling
from tvboptim.experimental.network_dynamics.graph import DenseDelayGraph
from tvboptim.experimental.network_dynamics.noise import AdditiveNoise
from tvboptim.experimental.network_dynamics.solvers import Heun
from tvboptim.observations.tvb_monitors.bold import HRFBold
from tvboptim.observations.observation import compute_fc, rmse
from tvboptim.data import load_structural_connectivity, load_functional_connectivity
from tvboptim.utils import set_cache_path, cache

set_cache_path("./gradient_checkpointing_benchmark")
```

## Workload: RWW + Delays + BOLD FC Fitting

We reuse the Reduced Wong-Wang / BOLD / FC workflow from [`RWW.qmd`](../workflows/RWW.qmd), swapping `FastLinearCoupling` for **`DelayedLinearCoupling`**: the configuration where gradient memory usually becomes the bottleneck for empirical fits. Structural connectivity is the `dk_average` parcellation (68 regions), with tract lengths converted to delays at a conduction speed of 4 mm/ms.

```{python}
#| echo: true
#| output: false
DT = 1.0                  # Integration step (ms)
T1 = 60_000.0             # Total simulation length (ms), i.e. 60 s
N_STEPS = int(T1 / DT)    # 60_000 integration steps
CONDUCTION_SPEED = 4.0    # mm/ms

# Load empirical structural and functional connectivity.
weights, lengths, region_labels = load_structural_connectivity(name="dk_average")
weights = weights / np.max(weights)
delays = jnp.asarray(lengths / CONDUCTION_SPEED)
n_nodes = weights.shape[0]
fc_target = load_functional_connectivity(name="dk_average")

# Build the network: RWW dynamics + delayed linear coupling + additive noise.
graph = DenseDelayGraph(
    weights=jnp.asarray(weights),
    delays=delays,
    region_labels=region_labels,
)
dynamics = ReducedWongWang(w=0.5, I_o=0.32, INITIAL_STATE=(0.3,))
coupling = DelayedLinearCoupling(
    incoming_states="S",
    G=0.5,
    buffer_strategy="roll",
)
noise = AdditiveNoise(sigma=0.00283, apply_to="S", key=jax.random.key(0))
network = Network(
    dynamics=dynamics,
    coupling={"delayed": coupling},
    graph=graph,
    noise=noise,
)

# BOLD monitor: TR = 1 s, intermediate downsample matches dt.
bold_monitor = HRFBold(period=1000.0, downsample_period=DT, voi=0)

max_delay = float(delays.max())
history_length = int(np.ceil(max_delay / DT)) + 1
print(f"n_nodes={n_nodes} n_steps={N_STEPS} history_length={history_length}")
print(f"max delay = {max_delay:.2f} ms")
```

A single coupling's history buffer is roughly `history_length × n_states × n_nodes × 8 bytes` per step. Over ~60 000 steps the forward-saved coupling state alone runs into hundreds of megabytes, on top of the dynamics state, noise tensor, and auxiliary tape.

## Benchmark

We sweep `block_size` and measure forward time, gradient time, and peak memory (where the backend supports it). The grid spans:

- `None`: the default single `jax.lax.scan`, the performance reference.
- small `K`: frequent checkpoints, maximal recompute, minimal saved memory.
- `K ≈ √n_steps`: the theoretical memory minimum.
- large `K`: sparse checkpoints, near no-checkpoint cost.
- a non-divisor `K`: exercises the main-scan plus tail-scan path.

```{python}
#| echo: true
#| output: false
#| code-fold: true
#| code-summary: "Benchmark Setup"
# K = None is the baseline.
# The dense middle (128, 256, 512, 1024, 2048)
# brackets sqrt(n_steps) so the U-shape near the minimum is well-resolved,
# while the wings (32, 8192, 30000) cover the asymptotic regimes. K = 32 and
# K = 30000 divide n_steps exactly (no tail). The other integer values below
# n_steps do not and therefore exercise the main-scan + tail-scan path,
# which matters for the memory story; see "Reading the memory curve".
BLOCK_SIZE_VALUES = [None, 32, 128, 256, 512, 1024, 2048, 8192, 30000, N_STEPS]
N_FORWARD_RUNS = 3
N_GRADIENT_RUNS = 3
G_INIT = jnp.asarray(0.5)

# Fixed noise realization (common random numbers). Injecting this into the
# config makes `block_size` do pure gradient checkpointing rather than per-block
# streaming: every config integrates the same noise path, so the checkpointed
# gradient stays bit-exact to the uncheckpointed one and the benchmark isolates
# the activation-tape effect. Shape is [n_steps, n_noise_states, n_nodes].
n_noise_states = len(network.noise._state_indices)
FIXED_NOISE = jax.random.normal(
    network.noise.key, (N_STEPS, n_noise_states, n_nodes)
)


class RSSPeakMonitor:
    """Context manager that records peak process RSS during the with-block.

    Background thread polls ``psutil.Process.memory_info().rss`` at
    ``sample_interval`` seconds and tracks the maximum observed. On exit
    ``peak_delta_bytes`` holds the peak minus the baseline RSS taken just
    before entry, i.e. the transient memory added by the block.

    This is a *pragmatic CPU proxy*, not an accelerator profile:

    - Linux RSS is process-resident memory and includes Python objects,
      JIT artifacts, XLA scratch, and pooled CPU allocations. JAX on CPU
      uses the system allocator, so transient activations show up here.
    - ~50 ms sampling can miss sub-50 ms peaks; gradient passes through
      tens of thousands of steps run for many seconds, so the sampler
      catches the activation peak comfortably.
    - **Pool effects matter.** XLA's CPU allocator pools pages and does
      not always release them between configs.
      The reported delta is the *additional* RSS the process had to
      allocate during the call: configs whose peak fits inside memory
      already pooled by a previous config will report a small or zero
      delta even though their absolute requirement is non-trivial. To get
      clean per-config peaks anyway, the sweep below is ordered with the
      most memory-hungry configs *first*, so subsequent smaller-K configs
      are measured against the already-grown pool and their deltas
      represent only the marginal storage they add (which is zero or small
      if they fit, i.e. exactly the success case for checkpointing).
    - On GPU/TPU the activation tape lives in device memory, not host RSS;
      use ``jax.devices()[0].memory_stats()['peak_bytes_in_use']`` there
      instead. This monitor is the CPU fallback.
    """

    def __init__(self, sample_interval: float = 0.05):
        self.sample_interval = sample_interval
        self.peak_delta_bytes = None

    def __enter__(self):
        if not _HAS_PSUTIL:
            return self
        self._process = psutil.Process()
        self._baseline = self._process.memory_info().rss
        self._peak = self._baseline
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._sample, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if not _HAS_PSUTIL:
            return False
        self._stop.set()
        self._thread.join()
        self.peak_delta_bytes = max(0, self._peak - self._baseline)
        return False

    def _sample(self):
        while not self._stop.is_set():
            try:
                rss = self._process.memory_info().rss
                if rss > self._peak:
                    self._peak = rss
            except Exception:
                break
            self._stop.wait(self.sample_interval)


def benchmark_one(block_size, fc_target):
    """Time forward + gradient, capture peak RSS during gradient, and return
    the gradient value for cross-check."""
    solver = Heun(block_size=block_size)
    solve_fn, state = prepare(network, solver, t0=0.0, t1=T1, dt=DT)

    # Inject the fixed noise so block_size does pure checkpointing (no per-block
    # streaming / reseed); all configs then share the same realization.
    state._internal.noise_samples = FIXED_NOISE
    solve_fn = jax.jit(solve_fn)

    def loss(G):
        cfg = eqx.tree_at(lambda c: c.coupling.delayed.G, state, G)
        result = solve_fn(cfg)
        bold = bold_monitor(result)
        fc = compute_fc(bold, skip_t=20)
        return rmse(fc, jnp.asarray(fc_target))

    grad_fn = jax.jit(jax.value_and_grad(loss))

    # Warm up (JIT compile both paths) so allocations from compilation do
    # not contaminate the peak-RSS measurement below.
    jax.block_until_ready(solve_fn(state).ys)
    v0, g0 = grad_fn(G_INIT)
    jax.block_until_ready(g0)
    del g0

    # Capture peak RSS delta during one fresh gradient call. The activation
    # tape for the backward pass is the headline memory cost, so we measure
    # exactly that. gc.collect() drops any temporaries from the warmup so
    # the baseline is as flat as possible.
    gc.collect()
    monitor = RSSPeakMonitor(sample_interval=0.05)
    with monitor:
        v_mem, g_mem = grad_fn(G_INIT)
        jax.block_until_ready(g_mem)
    peak_delta = monitor.peak_delta_bytes
    g_value_for_check = float(g_mem)
    del v_mem, g_mem
    gc.collect()

    fwd_times = []
    for _ in range(N_FORWARD_RUNS):
        t = time.perf_counter()
        r = solve_fn(state)
        jax.block_until_ready(r.ys)
        fwd_times.append(time.perf_counter() - t)

    grad_times = []
    for _ in range(N_GRADIENT_RUNS):
        t = time.perf_counter()
        v, g = grad_fn(G_INIT)
        jax.block_until_ready(g)
        grad_times.append(time.perf_counter() - t)

    return {
        "fwd_mean": float(np.mean(fwd_times)),
        "fwd_std": float(np.std(fwd_times)),
        "grad_mean": float(np.mean(grad_times)),
        "grad_std": float(np.std(grad_times)),
        "loss": float(v0),
        "grad_value": g_value_for_check,
        "peak_bytes_delta": peak_delta,
    }


@cache("block_size_sweep")
def run_sweep():
    results = {}
    for k in BLOCK_SIZE_VALUES:
        label = "None" if k is None else str(k)
        print(f"block_size = {label} ...", flush=True)
        results[label] = benchmark_one(k, fc_target)
        gc.collect()
    return results


sweep_results = run_sweep()
```

## Results

```{python}
#| label: fig-checkpoint-benchmark
#| fig-cap: "**Gradient checkpointing benchmark.** Top row,
#|   *time*. Top-left: forward and gradient wall time vs `block_size` on a shared log y-axis (about a decade apart); dashed horizontals mark each curve's `None` baseline and the dashed vertical marks `√n_steps`. Top-right: per-call gradient-to-forward ratio. Bottom row, *memory*. Bottom-left: peak RSS delta during a gradient call vs `block_size` (linear y), showing the `O(n_steps/K + K)` minimum near `√n_steps`. Bottom-right: memory vs time Pareto (front and `None` labelled), with the `None` star at the low-time, high-memory extreme and checkpointed points tracing the front."
#| echo: true
#| code-fold: true
#| code-summary: "Plotting code"
baseline = sweep_results["None"]
sqrt_n = np.sqrt(N_STEPS)

# K-axis panels drop "None": it has no x-coordinate on a block_size
# axis, only a horizontal-reference role. The Pareto panel keeps it as a
# distinct star marker because its axes are (time, memory) and there is no
# overlap risk.
ck_labels = [l for l in sweep_results if l != "None"]
xs_raw = np.array([float(l) for l in ck_labels])
order = np.argsort(xs_raw)
ck_labels = [ck_labels[i] for i in order]
xs = xs_raw[order]

fwd = np.array([sweep_results[l]["fwd_mean"] for l in ck_labels])
fwd_err = np.array([sweep_results[l]["fwd_std"] for l in ck_labels])
grad = np.array([sweep_results[l]["grad_mean"] for l in ck_labels])
grad_err = np.array([sweep_results[l]["grad_std"] for l in ck_labels])

peaks_all = [sweep_results[l]["peak_bytes_delta"] for l in sweep_results]
has_memory = all(p is not None for p in peaks_all)
if has_memory:
    mem_ck_mb = np.array(
        [sweep_results[l]["peak_bytes_delta"] for l in ck_labels], dtype=float
    ) / 1e6


def _mark_sqrt_n(ax):
    """Vertical reference line + label at √n_steps, anchored near the top."""
    ax.axvline(sqrt_n, color="gray", linestyle="--", alpha=0.5, zorder=0)
    ymin, ymax = ax.get_ylim()
    y = (
        ymax / ((ymax / ymin) ** 0.05)
        if ax.get_yscale() == "log"
        else ymax - 0.05 * (ymax - ymin)
    )
    ax.text(sqrt_n, y, r"$\sqrt{n_\mathrm{steps}}$", color="gray",
            fontsize=12, ha="center", va="top",
            bbox=dict(facecolor="white", edgecolor="none", alpha=0.8, pad=2))


def _pareto_front(times, mems):
    """Return boolean mask of Pareto-optimal points (minimise time AND memory).

    A point is dominated if some other point has time<= and memory<= with at
    least one strict inequality. The remaining points form the Pareto front.
    """
    n = len(times)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if (times[j] <= times[i] and mems[j] <= mems[i]
                    and (times[j] < times[i] or mems[j] < mems[i])):
                keep[i] = False
                break
    return keep


# Bump default font sizes for the whole figure via a context manager so other
# notebook plots are not affected.
with plt.rc_context({
    "font.size": 12,
    "axes.titlesize": 14,
    "axes.labelsize": 13,
    "xtick.labelsize": 11,
    "ytick.labelsize": 11,
    "legend.fontsize": 11,
}):
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # === Top row: time ===

    # --- Top-left: time vs block_size (single log y-axis) ---
    # Forward (~0.1 s) and gradient (~1 s) are about a decade apart, so a single
    # log y-axis separates them cleanly and keeps each None baseline next to its
    # own curve without the two dashed references overlapping (which a linear
    # twin-axis layout did). The fine overhead detail lives in the panel to the
    # right (grad / forward ratio).
    ax = axes[0, 0]
    fwd_color = "steelblue"
    grad_color = "firebrick"

    ax.errorbar(xs, fwd, yerr=fwd_err, marker="o", color=fwd_color,
                label="forward", lw=1.8, markersize=7, capsize=3)
    ax.axhline(baseline["fwd_mean"], color=fwd_color, linestyle="dashed",
               alpha=0.7, label="forward (None)")
    ax.errorbar(xs, grad, yerr=grad_err, marker="s", color=grad_color,
                label="gradient", lw=1.8, markersize=7, capsize=3)
    ax.axhline(baseline["grad_mean"], color=grad_color, linestyle="dashed",
               alpha=0.7, label="gradient (None)")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("block_size")
    ax.set_ylabel("wall time (s)")
    ax.set_title("Time vs block_size")
    ax.legend(loc="best", framealpha=0.9, ncol=2)
    ax.grid(alpha=0.3, which="both")
    _mark_sqrt_n(ax)

    # --- Top-right: grad/forward ratio ---
    ax = axes[0, 1]
    ratio = grad / fwd
    baseline_ratio = baseline["grad_mean"] / baseline["fwd_mean"]
    ax.plot(xs, ratio, marker="^", color="darkgreen", lw=1.8, markersize=8,
            label="grad / forward")
    ax.axhline(baseline_ratio, color="darkgreen", linestyle="dashed", alpha=0.7,
               label=f"None baseline ({baseline_ratio:.2f}×)")
    ax.set_xscale("log")
    ax.set_xlabel("block_size")
    ax.set_ylabel("grad / forward")
    ax.set_title("Gradient overhead")
    ax.grid(alpha=0.3, which="both")
    ax.legend(loc="best", framealpha=0.9)
    _mark_sqrt_n(ax)

    # === Bottom row: memory ===

    # --- Bottom-left: memory vs block_size ---
    ax = axes[1, 0]
    if has_memory:
        ax.plot(xs, mem_ck_mb, marker="D", color="purple", lw=1.8, markersize=8,
                label="peak RSS delta during grad")
        ax.axhline(baseline["peak_bytes_delta"] / 1e6, color="purple",
                   linestyle="dashed", alpha=0.7, label="None baseline")
        ax.set_xscale("log")  # block_size spans decades; y stays linear so the
        # U-shape and the absolute MB differences read directly.
        ax.set_ylim(bottom=0)
        ax.set_xlabel("block_size")
        ax.set_ylabel("peak RSS delta during grad (MB)")
        ax.set_title("Memory vs block_size")
        ax.grid(alpha=0.3, which="both")
        ax.legend(loc="best", framealpha=0.9)
        _mark_sqrt_n(ax)
    else:
        ax.text(0.5, 0.5, "Peak memory unavailable\n(psutil not installed)",
                transform=ax.transAxes, ha="center", va="center", fontsize=12,
                bbox=dict(boxstyle="round,pad=0.5", facecolor="lightyellow"))
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title("Memory vs block_size (unavailable)")

    # --- Bottom-right: memory-time Pareto ---
    # Two cleanup ideas vs the old "connect-by-time" line, which crossed
    # itself wherever memory did not move monotonically with time:
    #   1. Drop the connecting line; the scatter alone carries the points.
    #   2. Compute the actual Pareto front (non-dominated points) and
    #      connect *only* those with a clean monotone curve.
    # We do both: the front is a thin solid line, dominated points are
    # plotted as scatter but not connected, and None is highlighted as a
    # red star because it lies on the front but represents the no-checkpoint
    # baseline.
    ax = axes[1, 1]
    if has_memory:
        grad_all = np.array([sweep_results[l]["grad_mean"] for l in sweep_results])
        mem_all = np.array(
            [sweep_results[l]["peak_bytes_delta"] for l in sweep_results],
            dtype=float,
        ) / 1e6
        label_all = list(sweep_results.keys())
        pareto_mask = _pareto_front(grad_all, mem_all)

        # Pareto-front line: sort the kept points by time so the line is
        # monotone (memory decreases as time increases along a true front).
        kept = np.where(pareto_mask)[0]
        kept = kept[np.argsort(grad_all[kept])]
        ax.plot(grad_all[kept], mem_all[kept], color="gray", lw=2.0,
                alpha=0.6, zorder=1, label="Pareto front")

        # Scatter all points, distinguishing None and Pareto vs dominated.
        for i, l in enumerate(label_all):
            on_front = pareto_mask[i]
            x, y = grad_all[i], mem_all[i]
            if l == "None":
                ax.scatter([x], [y], s=240, marker="*", color="crimson",
                           edgecolor="black", linewidth=0.8, zorder=4,
                           label="None (baseline)")
            elif on_front:
                ax.scatter([x], [y], s=80, color="purple",
                           edgecolor="black", linewidth=0.5, zorder=3)
            else:
                ax.scatter([x], [y], s=55, facecolor="white",
                           edgecolor="purple", linewidth=1.3, zorder=2)
            # Label only the front points and None: the dominated points cluster
            # near the front and their labels collide. Alternate the vertical
            # offset to further reduce overlap among the labelled ones.
            if on_front or l == "None":
                dy = 8 if (i % 2 == 0) else -12
                ax.annotate(l, (x, y), textcoords="offset points",
                            xytext=(8, dy), fontsize=10)

        ax.set_xlabel("gradient time (s)")
        ax.set_ylabel("peak RSS delta during grad (MB)")
        ax.set_yscale("log")
        ax.set_title("Memory vs time Pareto")
        ax.grid(alpha=0.3, which="both")

        # Custom legend: front line + filled marker (on front) + hollow
        # marker (dominated) + None star.
        from matplotlib.lines import Line2D
        legend_elems = [
            Line2D([0], [0], color="gray", lw=2.0, alpha=0.6, label="Pareto front"),
            Line2D([0], [0], marker="o", color="w", markerfacecolor="purple",
                   markeredgecolor="black", markersize=9, label="on front"),
            Line2D([0], [0], marker="o", color="w", markerfacecolor="white",
                   markeredgecolor="purple", markersize=8, markeredgewidth=1.3,
                   label="dominated"),
            Line2D([0], [0], marker="*", color="w", markerfacecolor="crimson",
                   markeredgecolor="black", markersize=14, label="None"),
        ]
        ax.legend(handles=legend_elems, loc="best", framealpha=0.9)
    else:
        ax.text(0.5, 0.5, "Peak memory unavailable\n(psutil not installed)",
                transform=ax.transAxes, ha="center", va="center", fontsize=12,
                bbox=dict(boxstyle="round,pad=0.5", facecolor="lightyellow"))
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title("Memory vs time Pareto (unavailable)")

    plt.tight_layout()
    plt.show()
```

## Reading the Memory Curve

The bottom-left panel follows the classical analysis. Peak gradient memory scales as

$$
\mathrm{peak\,memory} \;\approx\;
\underbrace{\frac{n_\text{steps}}{K} \cdot c_\text{outer}}_{\text{block-boundary tape}}
\;+\;
\underbrace{K \cdot c_\text{inner}}_{\text{per-block inner tape during backward}}
$$

with a minimum near $K \approx \sqrt{n_\text{steps} \cdot c_\text{outer} / c_\text{inner}}$, close to $\sqrt{n_\text{steps}}$ for this workload. Three effects bend the textbook curve:

1. **Checkpoint boundaries inflate the inner tape.** XLA cannot fuse across a `jax.checkpoint` boundary and must keep the per-step VJP tape for rematerialisation, so $c_\text{inner}$ exceeds the uncheckpointed per-step cost, most of all for short inner scans. The per-step cost implied by the `None` baseline is thus an optimistic lower bound on $c_\text{inner}$.
2. **A non-divisor `K` leaves a tail.** When `n_steps % K != 0` the remainder runs as a plain `jax.lax.scan` whose tape stays live through the backward pass, adding $\mathrm{remainder} \cdot c_\text{unchecked}$ to the peak, where $\mathrm{remainder} = n_\text{steps} \bmod K$.
   Prefer `K` that divides, or nearly divides, `n_steps`.
3. **`K = n_steps` saves nothing.** It still wraps one scan in `jax.checkpoint`, so backward rematerialises the full tape (peak near `None`) while paying an extra forward.

The result is a U-shape with its minimum near $\sqrt{n_\text{steps}}$, cutting gradient memory by roughly an order of magnitude versus `None`.

## Correctness Check

A checkpointed gradient must match the uncheckpointed one to floating-point precision: the forward path is bit-exact (same scan body, only the loop nesting changes) and the backward path differs only by recompute rounding. The `|Δgrad/grad|` column of the summary table below stays at double-precision rounding (around 1e-15 to 1e-13) for every block size, confirming checkpointing does not change the result.

## Summary Table

All measured quantities in one self-contained table, copy-pasteable into an issue or back to an LLM. `fwd_ratio` and `grad_ratio` are normalised to the `None` baseline; `peak_MB` is the peak process-RSS delta during one gradient call (a CPU proxy via psutil), or `NA` when psutil is not installed.
```{python}
#| echo: true
#| code-fold: true
#| code-summary: "Table code"
baseline = sweep_results["None"]

header = (
    f"{'block_size':<12} "
    f"{'fwd_s':<14} "
    f"{'grad_s':<14} "
    f"{'grad/fwd':<10} "
    f"{'fwd_ratio':<11} "
    f"{'grad_ratio':<11} "
    f"{'peak_MB':<10} "
    f"{'loss':<22} "
    f"{'grad':<14} "
    f"{'|Δgrad/grad|':<14}"
)
print(header)
print("-" * len(header))

for label, r in sweep_results.items():
    fwd = f"{r['fwd_mean']:.4f}±{r['fwd_std']:.4f}"
    grd = f"{r['grad_mean']:.4f}±{r['grad_std']:.4f}"
    ratio = r["grad_mean"] / r["fwd_mean"]
    fwd_ratio = r["fwd_mean"] / baseline["fwd_mean"]
    grad_ratio = r["grad_mean"] / baseline["grad_mean"]
    peak = (
        f"{r['peak_bytes_delta'] / 1e6:.1f}"
        if r["peak_bytes_delta"] is not None
        else "NA"
    )
    rel = abs((r["grad_value"] - baseline["grad_value"]) / baseline["grad_value"])
    print(
        f"{label:<12} "
        f"{fwd:<14} "
        f"{grd:<14} "
        f"{ratio:<10.2f} "
        f"{fwd_ratio:<11.2f} "
        f"{grad_ratio:<11.2f} "
        f"{peak:<10} "
        f"{r['loss']:<22.16f} "
        f"{r['grad_value']:<14.6e} "
        f"{rel:<14.3e}"
    )

# Compact context block (helpful when sharing the table).
print()
print(
    f"# workload: n_nodes={n_nodes}, n_steps={N_STEPS}, dt={DT}, T={T1/1000:.0f}s, "
    f"max_delay={max_delay:.1f}ms, history_length={history_length}"
)
print(f"# sqrt(n_steps) ≈ {int(np.sqrt(N_STEPS))} (memory-optimal block size)")
print(f"# device: {jax.devices()[0].platform} jax {jax.__version__}")
```

## No-Regression Check

Because `block_size=None` selects the original `jax.lax.scan` call site verbatim, the default's forward and gradient times stay **within timing noise** of the non-checkpointed implementation.

Note that `K = n_steps` is **not** the same as `None`: it still wraps the single inner scan in `jax.checkpoint`, so backward recomputes the whole forward once and costs more than `None`. Only `block_size=None` skips checkpointing entirely.

## Practical Guidance

```python
import math
from tvboptim.experimental.network_dynamics.solvers import Heun

# Default: no checkpointing.
# Fastest gradient when memory is not the issue.
solver = Heun()

# Memory-optimal default when gradients no longer fit in memory.
n_steps = 60_000  # number of integration steps, e.g. T = 60 s at dt = 1 ms
solver = Heun(block_size=int(math.sqrt(n_steps)))

# Aggressive: minimal memory, maximal recompute. Use only if the sqrt
# default still OOMs.
solver = Heun(block_size=64)
```

The same field works on `Euler`, `Heun`, `RungeKutta4`, and any `BoundedSolver` wrapping one of those; the setting is delegated through the wrapper to the base solver.
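
Because a non-divisor `K` leaves an uncheckpointed tail (see "Reading the Memory Curve"), it can help to snap the square-root default to a nearby value with a small remainder. The helper below is a minimal sketch of one way to do that, not part of TVB-Optim: the name `pick_block_size` and its `spread` parameter (the relative search window around `√n_steps`) are hypothetical, and it assumes `n_steps` is a positive integer.

```python
import math


def pick_block_size(n_steps: int, spread: float = 0.25) -> int:
    """Pick a block size near sqrt(n_steps) with the smallest tail.

    Hypothetical helper, not a TVB-Optim function. Searches integers K in
    [sqrt(n_steps) * (1 - spread), sqrt(n_steps) * (1 + spread)] and returns
    the one with the smallest remainder n_steps % K, breaking ties by
    distance to sqrt(n_steps).
    """
    if n_steps < 1:
        raise ValueError("n_steps must be a positive integer")
    target = math.sqrt(n_steps)
    lo = max(1, math.floor(target * (1.0 - spread)))
    hi = max(lo, math.ceil(target * (1.0 + spread)))
    # Sort key: tail length first, then closeness to the sqrt optimum.
    return min(
        range(lo, hi + 1),
        key=lambda k: (n_steps % k, abs(k - target)),
    )


n_steps = 60_000
block_size = pick_block_size(n_steps)
print(block_size, n_steps % block_size)  # prints: 240 0
```

For the 60 000-step workload above this yields `K = 240`, an exact divisor about five steps below `√n_steps ≈ 245`, which could then be passed as `Heun(block_size=block_size)`. When `n_steps` has no divisor inside the window, the helper falls back to the value with the shortest tail rather than an exact divisor.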