Debugging Replication Lag at 3AM

A war story about a replica that fell an hour behind, the single long transaction that caused it, and the dashboard that should have caught it.

The page said a read replica was an hour behind the primary. Reads were being served stale, a downstream job was making decisions on old data, and nobody could see why: CPU was low, disk was idle, network was fine. It was a healthy-looking replica falling steadily further behind.

# apply is single-threaded until it isn't

The culprit was one enormous transaction on the primary: a batch job updating millions of rows in a single statement. On the primary it ran in parallel across cores. On the replica, the replay stream applied it as one indivisible unit, and everything queued behind it. The lag wasn't a symptom of a sick replica; it was the replica faithfully doing exactly what it was told, slowly.

Replication lag is rarely about the replica. It's about the shape of the writes you send it.
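To make the queueing effect concrete, here is a toy model of a serial applier. It is a sketch, not a model of any particular database: `Txn`, `replay_lag`, and the timings are all made up for illustration, and the only assumption is the one described above, that the replica replays transactions one at a time in commit order.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    commit_time: float  # seconds: when the primary committed it
    apply_cost: float   # seconds: how long the replica needs to replay it


def replay_lag(txns):
    """Per-transaction lag (replay finish time minus commit time), serial apply."""
    clock = 0.0
    lags = []
    for t in txns:
        # The replica cannot start a transaction before it exists,
        # and cannot start it before the previous one has finished.
        start = max(clock, t.commit_time)
        clock = start + t.apply_cost
        lags.append(clock - t.commit_time)
    return lags


# Hypothetical workload: one cheap write per second.
small = [Txn(commit_time=float(i), apply_cost=0.01) for i in range(10)]

# Same workload, except the write at t=3 takes five seconds to replay.
with_big = list(small)
with_big[3] = Txn(commit_time=3.0, apply_cost=5.0)

lags_small = replay_lag(small)
lags_big = replay_lag(with_big)

for i, (a, b) in enumerate(zip(lags_small, lags_big)):
    print(f"txn {i}: lag {a:.2f}s without the big write, {b:.2f}s with it")
```

Every cheap write committed while the big one is replaying inherits its delay, even though each of them costs almost nothing to apply. Nothing about the replica's hardware changed between the two runs.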

# the dashboard that should have existed

We were graphing lag in seconds, which told us that we were behind but nothing about why. The monitoring fix was a panel showing the largest in-flight transaction by row count. The fix for the job itself was chunking: split the batch into thousands of small commits and the replica keeps pace. The real lesson: monitor the cause, not just the effect.
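Chunking the batch job might look roughly like the sketch below. It uses Python's built-in `sqlite3` with an in-memory database purely so the example runs anywhere; the `events` table, the `processed` column, and the `update_in_chunks` helper are hypothetical stand-ins, not the job from this incident. The idea carries over to any database: walk the primary key in ranges and commit after each range.

```python
import sqlite3


def update_in_chunks(conn, chunk_size=1000):
    """Mark rows processed in primary-key ranges, committing after every chunk."""
    commits = 0
    last_id = 0
    while True:
        # Find the next chunk of unprocessed ids above the last one handled.
        rows = conn.execute(
            "SELECT id FROM events WHERE id > ? AND processed = 0 "
            "ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        upper = rows[-1][0]
        conn.execute(
            "UPDATE events SET processed = 1 "
            "WHERE id > ? AND id <= ? AND processed = 0",
            (last_id, upper),
        )
        conn.commit()  # one small transaction per chunk
        commits += 1
        last_id = upper
    return commits


# Hypothetical table with 5000 rows, standing in for the real batch target.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, processed INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany("INSERT INTO events (id) VALUES (?)", [(i,) for i in range(1, 5001)])
conn.commit()

commits = update_in_chunks(conn, chunk_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM events WHERE processed = 0"
).fetchone()[0]
print(f"{commits} commits, {remaining} rows left unprocessed")
```

The trade-off is that the job is no longer atomic: if it dies halfway, some rows are updated and some are not. Filtering on `processed = 0` makes this sketch safe to rerun, which is the property a real chunked job would need too.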