Recovering orphaned jobs after a worker restart
A backup service kept getting into a state where the worker queue showed "running" jobs that nobody was actually working on. The in-memory job queue was empty; the database said three jobs were in progress; the process had restarted in the middle of those jobs and forgotten about them. New work would queue behind the ghosts and stall for 20+ minutes until I manually cleared them.
What was happening
The architecture was the usual one. A persistent job table tracks state across restarts. An in-memory worker pulls "queued" jobs, sets them to "running," does the work, sets them to "done" or "failed." Restart the process mid-job and the row is left in "running" with no worker tracking it. The worker pool has a concurrency limit, so those ghost rows count against the limit and starve new jobs.
The bug isn't subtle. It's a class of bug that hits every distributed job system the first time someone deploys during a busy window. I just hadn't gotten around to handling it.
What I found
There are roughly three ways to solve it:
-
Lease-and-heartbeat. Workers acquire a job with a short lease and renew it; an unrenewed lease is up for grabs by another worker. Correct but invasive — requires changes to job schema and worker loop.
-
Restart-time sweep. On startup, every worker resets any jobs it was supposed to be processing back to "queued." Simple, works for the common case of "process restart while jobs were in flight." Doesn't handle the case of a worker that's still running but dead-stuck.
-
External reaper. A separate process scans for stale "running" jobs periodically and resets them.
For my failure shape — process restart caused by deploys, not stuck workers — option 2 is the right size. If I later see stuck-worker failures, I add the heartbeat. Don't pre-build for a failure you haven't seen.
The fix
A _resume_orphans_on_startup() hook that runs once at process start,
before the worker loop accepts any new jobs:
def _resume_orphans_on_startup(self):
"""Reset any 'running' jobs back to 'queued' on startup.
Called once at boot. Any job marked 'running' in the DB while the
process is not yet handling any jobs is by definition orphaned —
a previous worker process died mid-job.
"""
with self.db.transaction() as tx:
stale = tx.execute(
"UPDATE jobs SET status = 'queued', "
" attempts = attempts + 1, "
" last_error = 'orphaned by worker restart' "
"WHERE status = 'running' "
"RETURNING id"
).fetchall()
if stale:
log.warning("recovered %d orphaned jobs", len(stale))
Two small choices worth calling out:
attempts = attempts + 1so a job that keeps orphaning the worker doesn't loop forever — eventually it'll hit your max-attempts threshold and move to "failed."- The log message includes the count. If you ever see "recovered 47 orphaned jobs" in your logs, that's a signal to look at why you're restarting workers mid-job; the recovery is correct but the underlying churn might not be.
After this hook landed, the 20-minute stall pattern stopped showing up. A deploy mid-job now produces a worker that boots, resets a few rows, and immediately starts working them.
What I'd do differently
Build the orphan-recovery hook the same day you build the worker pool. It is one of those "I'll add it later" features that is fine until the day it isn't, and the failure mode is silent — your queue looks healthy from the outside because rows have valid statuses. They just happen to be statuses that mean "in flight" on a worker that doesn't exist.