The Optimizer State Bug: A Silent Failure in DCP Resume

Author: Robin, Kroonen AI Inc.

Tags: Genesis, postmortem, optimizer, AdamW, DCP, FSDP

⚠️ Postmortem, March 23, 2026

Fixing the checkpoint save deadlock was only half the story. The checkpoint load path introduced a subtler failure: one that didn't crash, didn't hang, and produced no errors. It just silently ruined the model.

This is Part 2 of the Genesis checkpoint saga. Part 1 covered the FSDP checkpoint deadlock. This post covers the silent optimizer state bug discovered five days later.

What Happened

At step 8,500, training was stopped for maintenance. When resumed, the DCP load path only restored model weights, not the AdamW optimizer state:

❌ Broken: model only, optimizer reset to zero
```python
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        # optimizer state NOT loaded — this is the bug
    }
    dcp.load(state_dict, checkpoint_id=dcp_latest)
    model.load_state_dict(state_dict["model"])
# optimizer starts from scratch — momentum and variance are zero
```

The save code was fine; it already saved both model and optimizer state. But the load path had been stripped down to model-only during an earlier debugging session to work around a RuntimeError: Missing key in checkpoint state_dict: optimizer.param_groups.0.decoupled_weight_decay error from older checkpoints that genuinely didn't contain optimizer state.

The workaround became the bug.

Why It's Silent

When AdamW's optimizer state is reset mid-training:

  1. The first moment (the running average of gradients) is zeroed, so the accumulated descent direction is lost.
  2. The second moment (the per-parameter gradient variance) is zeroed, so Adam's per-parameter learning rate scaling is miscalibrated.
  3. The model weights and learning rate schedule are intact, so the loss rises moderately instead of exploding.

So training doesn't explode. It doesn't crash. It just quietly drifts into a worse optimization basin over ~500 steps.
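The miscalibration is visible even in a toy, pure-Python Adam step (a sketch of the update rule, not the training code; the gradient stream and hyperparameters are illustrative). Immediately after a reset, bias correction makes the first step roughly lr times the sign of the gradient for every parameter, while a warmed-up optimizer damps the step by the accumulated signal-to-noise ratio:

```python
import math
import random

def adam_step(grads, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8):
    """Run bias-corrected Adam moments over a scalar gradient stream
    and return the final parameter update."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g        # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g    # second moment (variance)
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

random.seed(0)
noisy = [random.gauss(0.1, 1.0) for _ in range(500)]  # weak signal, high noise

warm = adam_step(noisy)        # moments encode signal-to-noise: a small step
reset = adam_step(noisy[-1:])  # one sample after a reset: |step| is ~lr
```

After a reset, v_hat equals the square of the single gradient sample, so the update magnitude is the full learning rate no matter how noisy that gradient is. The warmed-up optimizer, having seen the variance, takes a much smaller step in a better-averaged direction.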

The Diagnostic Signature

The telltale pattern looks backwards from normal instability:

| Metric | Before Reset | After Reset |
| --- | --- | --- |
| Loss | ~1.1–1.3 | ~2.0–2.5 |
| Grad norm | ~0.5–0.7 | ~0.2–0.3 |
| LR / tok/s | unchanged | unchanged |

If the optimizer were diverging, grad norm would spike up. Instead it drops, because without curvature information, Adam's per-parameter scaling is broken, and the model takes smaller effective steps in the wrong directions.
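This inverted signature is easy to turn into an automated check. A minimal sketch; the function name and threshold ratios are illustrative, not from the Genesis monitoring code:

```python
def optimizer_reset_suspected(loss_before, loss_after,
                              gnorm_before, gnorm_after,
                              loss_jump=1.5, gnorm_drop=0.6):
    """Flag the inverted signature: loss rises while grad norm falls.
    Ordinary divergence pushes both metrics up, so it does not trip this."""
    return (loss_after / loss_before >= loss_jump
            and gnorm_after / gnorm_before <= gnorm_drop)

# Values in the ballpark of the table above:
suspected = optimizer_reset_suspected(1.2, 2.2, 0.6, 0.25)  # reset signature
diverging = optimizer_reset_suspected(1.2, 2.2, 0.6, 1.5)   # plain divergence
```

Wiring a check like this into the training loop turns a ~500-step silent drift into an alert on the first post-resume logging interval.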

The Fix

Load optimizer state alongside model weights, with a try/except fallback for older checkpoints:

✅ Fixed: model + optimizer, with graceful fallback
```python
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    # 1. Load model weights
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, checkpoint_id=dcp_latest)
    model.load_state_dict(state_dict["model"])

    # 2. Load optimizer state (with fallback for old checkpoints)
    try:
        optim_sd = {
            "optimizer": FSDP.optim_state_dict(model, optimizer),
        }
        dcp.load(optim_sd, checkpoint_id=dcp_latest)
        optim_to_load = FSDP.optim_state_dict_to_load(
            model, optimizer, optim_sd["optimizer"]
        )
        optimizer.load_state_dict(optim_to_load)
    except Exception:
        print("Optimizer state missing — falling back to reset")
```

Recovery

The False Recovery (Poisoned Weights)

The first resume attempt loaded weights without ShardedStateDictConfig(offload_to_cpu=True). The PCIe bus scrambled the FSDP shards during load. The model found a fake local minimum, producing loss values that looked healthy:

| Step | Loss (misleading) | Grad Norm |
| --- | --- | --- |
| 8,501 | 0.92 | 0.59 |
| 8,505 | 1.36 | 0.62 |
| 8,509 | 1.42 | 0.51 |

These numbers looked like a successful recovery. They were not. The corrupted weights had settled into a garbage local minimum. Within 500 steps, loss and gradient norms both began rising, confirming the model was converging on poisoned weights and drifting away from the true loss trajectory.

The True Recovery (Run 2 Script with CPU Offload)

After rewriting the resume path with ShardedStateDictConfig(offload_to_cpu=True) to force shard reassembly through system RAM, the model resumed correctly. The optimizer state was reset, producing the expected "momentum tax": high initial loss with a sawtooth pattern as AdamW rebuilds its moment estimates.

| Step | Loss (real) | Grad Norm |
| --- | --- | --- |
| 8,500 | 2.68 | 0.68 |
| 8,505 | 2.64 | 0.23 |
| 8,510 | 2.60 | 0.19 |
| 8,517 | 2.14 | 0.18 |
| 8,522 | 2.22 | 0.16 |
| 8,545 | 2.09 | 0.16 |
| 8,550 | 1.92 | 0.18 |

The grad norm spike (0.68) at step 8,500 is the optimizer discovering the loss landscape from scratch. It collapsed to 0.16 within 22 steps, confirming the model weights were correctly loaded and training was on the true gradient path. The loss paid back the "optimizer momentum tax" over ~50 steps, settling into a real downward trajectory.
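For reference, the CPU-offload wiring described above looks roughly like this in the standard torch.distributed.fsdp API. Treat it as a sketch rather than the exact Run 2 script; the optimizer-side config is the library's companion to the state-dict config mentioned in the post:

```python
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    ShardedStateDictConfig,
    ShardedOptimStateDictConfig,
)

# Route shard reassembly through system RAM instead of the PCIe bus
with FSDP.state_dict_type(
    model,
    StateDictType.SHARDED_STATE_DICT,
    state_dict_config=ShardedStateDictConfig(offload_to_cpu=True),
    optim_state_dict_config=ShardedOptimStateDictConfig(offload_to_cpu=True),
):
    ...  # same load sequence as the fixed resume path above
```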

Lessons

  1. Always restore optimizer state on resume. Model weights alone are not enough for AdamW. The accumulated first and second moment estimates encode critical per-parameter learning rate scaling.
  2. Workarounds become bugs. The model-only load was a valid workaround for old checkpoints. But it was left as the default path, silently breaking all future resumes.
  3. Monitor the grad norm / loss ratio. A sudden drop in grad norm paired with a loss increase is the signature of optimizer state loss. It looks nothing like divergence.
  4. Test your resume path. Run 10 steps after resume and verify the metrics match the pre-checkpoint regime. Don't assume it's fine because it didn't crash.
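Lesson 4 can be automated. A hypothetical smoke-test helper; the function, the metric window, and the tolerances are all illustrative, not from the Genesis codebase:

```python
def resume_smoke_test(pre_metrics, post_metrics, loss_tol=0.3, gnorm_tol=0.3):
    """Check ~10 post-resume steps against the pre-checkpoint regime.

    Each metrics list holds (loss, grad_norm) tuples. Returns True when
    the post-resume averages stay within relative tolerance of pre-save."""
    def avg(xs):
        return sum(xs) / len(xs)

    pre_loss = avg([l for l, _ in pre_metrics])
    pre_gnorm = avg([g for _, g in pre_metrics])
    post_loss = avg([l for l, _ in post_metrics])
    post_gnorm = avg([g for _, g in post_metrics])
    return (abs(post_loss - pre_loss) <= loss_tol * pre_loss
            and abs(post_gnorm - pre_gnorm) <= gnorm_tol * pre_gnorm)

pre = [(1.20, 0.60), (1.15, 0.62), (1.25, 0.58)]
healthy = resume_smoke_test(pre, [(1.22, 0.59), (1.18, 0.61)])  # passes
broken = resume_smoke_test(pre, [(2.20, 0.25), (2.40, 0.22)])   # fails
```

Running this immediately after every resume, before committing to hours of training, catches both failure modes in this postmortem: the silent optimizer reset and the poisoned-weight false recovery.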

Lessons for Scale

The "sawtooth" loss pattern after an optimizer reset is a well-documented phenomenon in deep learning: optimizer momentum recovery. When AdamW loses its accumulated first moment (mean gradient direction) and second moment (per-parameter variance), it must re-estimate both from scratch. The first moment converges quickly (~30 steps, governed by beta1=0.9). The second moment is slower (~60-100 steps, governed by beta2=0.95). During this window, the effective per-parameter learning rates are miscalibrated, producing the characteristic loss spike followed by gradual recovery.
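Those recovery horizons follow directly from the exponential-moving-average algebra: starting from zero state, a moment with decay beta has rebuilt a 1 - beta**n fraction of its steady-state estimate after n steps. A back-of-the-envelope check, not a measurement:

```python
def ema_recovered(beta, n):
    """Fraction of an exponential moving average rebuilt n steps after a reset."""
    return 1 - beta ** n

first_30 = ema_recovered(0.90, 30)   # first moment, beta1=0.9:  ~96% rebuilt
second_30 = ema_recovered(0.95, 30)  # second moment, beta2=0.95: ~79% rebuilt
second_90 = ema_recovered(0.95, 90)  # second moment by step 90:  ~99% rebuilt
```

The first moment is essentially rebuilt within ~30 steps, while the second moment takes roughly three times as long, consistent with the ~50-step loss payback visible in the true-recovery table.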

Understanding this pattern matters at any scale. On a cloud cluster at $32/hour, an undiagnosed optimizer reset wastes thousands of dollars in compromised training steps before anyone notices the loss curve is wrong. Diagnosing it locally on consumer hardware, where the cost of failure is electricity, means the debugging is done before the expensive compute starts. By identifying both the silent optimizer divergence and the PCIe shard mapping corruption locally, the resume path is now telemetry-validated and bulletproof before it ever touches a production cluster.

Contact

If you are a founder, independent researcher, or small lab working on multi-GPU local training and have encountered similar checkpoint or synchronization failures on consumer hardware, reach out at [email protected].