The MySQL Deadlock Incident
2025-01-04 • Resolved
A production replica froze mid-deploy, forcing us to peel back transaction logs and rethink how we batch writes.
database • deadlock • operations
We shipped a harmless-looking migration that added a new status column, but the deploy overlapped with an analytics job that backfilled historical rows. Within minutes, our write node started throwing ER_LOCK_WAIT_TIMEOUT errors.
Symptoms
- API latency spiked from 80ms → 1.6s
- Queue workers stalled because acknowledgements never committed
- SHOW ENGINE INNODB STATUS showed a growing list of waiting transactions (one way to query them directly is shown below)
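For completeness, the same lock waits can be read out of information_schema instead of eyeballed in the status dump. A minimal query (not necessarily what we ran during the incident) looks like this:

```sql
-- List transactions currently stuck waiting on a lock, longest waits first.
SELECT trx_id,
       trx_state,
       trx_started,
       trx_wait_started,
       trx_rows_locked,
       trx_query
FROM   information_schema.innodb_trx
WHERE  trx_state = 'LOCK WAIT'
ORDER  BY trx_wait_started;
```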
Diagnosis
We traced the problem to two long-lived transactions:
- Migration: batching updates in chunks of 5k
- Analytics job: scanning the same tables with FOR UPDATE
Because the migration's batches did not lock rows in a consistent order, the two transactions acquired overlapping row-level locks in different orders, which is the classic recipe for a deadlock cycle.
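A simplified two-session reconstruction of the collision. Table and column names (orders, status, updated_at, and so on) are placeholders, not our real schema:

```sql
-- Session A: one migration batch. The WHERE clause hits a secondary index on
-- updated_at and there is no ORDER BY, so rows are visited (and locked) in
-- secondary-index order, which differs from primary-key order.
UPDATE orders
SET    status = 'backfilled'
WHERE  updated_at < '2024-06-01'
LIMIT  5000;

-- Session B: the analytics job scanning the same rows in primary-key order
-- and holding every lock until it commits.
SELECT id, amount
FROM   orders
WHERE  created_at < '2024-06-01'
FOR UPDATE;

-- If A holds the lock on row 500 and now waits for row 120 while B holds 120
-- and waits for 500, the transactions form a cycle: InnoDB either rolls one
-- back (ER_LOCK_DEADLOCK) or the other side hits innodb_lock_wait_timeout.
```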
Fix
- Paused the analytics job
- Re-ran the migration with smaller batches and explicit ordering
- Added a READ COMMITTED isolation override for the analytics job (sketched together with the re-ordered migration below)
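A sketch of the reworked pair. Names and the 1,000-row batch size are illustrative, not the exact values we used:

```sql
-- Analytics session: drop from the default REPEATABLE READ to READ COMMITTED.
-- Under READ COMMITTED, InnoDB skips gap locks and releases record locks on
-- rows that don't match the WHERE clause, shrinking the overlap with the migration.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Migration: small batches over explicit primary-key ranges, each committed on
-- its own, so every batch locks rows in the same ascending id order.
UPDATE orders
SET    status = 'backfilled'
WHERE  id BETWEEN 1 AND 1000   -- the driver script advances this range per batch
ORDER  BY id;
```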
Lessons
- Long transactions are distributed systems in disguise
- Backfills should always order by primary key
- Observability saved us: slow query sampling handed us the exact transaction pair that collided (a minimal way to capture that is sketched below)
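The exact sampling setup isn't part of this write-up, but as a minimal sketch, stock MySQL's slow query log is enough to start catching colliding statements:

```sql
-- Enable the built-in slow query log and lower the threshold so contended
-- statements show up; tune long_query_time to your normal latency budget.
SET GLOBAL slow_query_log  = 'ON';
SET GLOBAL long_query_time = 0.5;    -- seconds; log anything slower than 500 ms
SET GLOBAL log_output      = 'FILE'; -- entries go to the file named by slow_query_log_file
```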