Disabled the schedule. Got paged anyway.

FILE 0xCB·DISABLED THE SCHEDULE. GOT PAGED ANYWAY.

April 27, 2026 · debugging, aws, lambda, postmortem

I disabled a stuck check-in schedule at 09:05. At 09:33 my phone buzzed with "Chester Frazier hasn't responded to their check-in." The schedule was supposed to be dead. Walking the cron path turned up three separate bugs in the escalation code, all of them mine.

What was happening

A 09:00 daily check-in had been firing requests, expiring, and escalating through three SMS tiers. I disabled the schedule from the admin side and the next cron tick still escalated a stale request.

Worse: another request from the previous day was sitting at escalation_level=3 — it had been quietly bumping a tier per day for two days, only failing to flood because of a throttle I thought I had shipped a week earlier but actually hadn't.

What I found

Three things stacked.

The morning's check-in request was created at 08:58, before I flipped is_active=false on the parent schedule at 09:05. The cron's processExpiredCheckins() only consulted the parent schedule's escalation_enabled flag — not is_active. The orphan request kept marching through tiers.
A previous "per-user-per-cron-cycle escalation cap" commit was claimed-shipped in a memory note but was not actually present in the deployed Lambda. The fix had been written, partially reviewed, then somehow reverted or never merged. The note in my bug log said "deployed" and the bug log was wrong.
There was also a quiet key-schema bug in the same function. The table has a composite key (user_id, requested_at). The "expired but escalation disabled" branch was calling updateItem with ['id' => $requestId]. It would have thrown a ValidationException if anyone ever hit that branch. Nobody had until today.

The fix

I shipped three layered defenses so this can't pop again from a different angle:

// 1. Delete/PATCH on a schedule now cascades.
//    Any in-flight pending/sent/escalated requests for that schedule
//    are auto-resolved with resolved_reason = "schedule_disabled".

// 2. processExpiredCheckins now checks the parent.
$parent = getSchedule($req['schedule_id']);
if (!$parent || !$parent['is_active']) {
    resolveRequest($req, 'parent_schedule_inactive');
    continue;
}

// 3. Per-user-per-cron-cycle escalation cap (re-introduced for real this time).
if (isset($escalatedUsers[$req['user_id']])) {
    resolveRequest($req, 'sibling_request_already_escalated_this_cycle');
    continue;
}
$escalatedUsers[$req['user_id']] = true;

Also fixed the wrong-key updateItem in the dead branch so it won't silently fail the next time someone routes traffic through it.

What I'd do differently

The lesson isn't really about cron logic. It's about memory notes that document fictional state. I had marked the per-user cap as shipped a week earlier; the actual commit was never merged. Going forward, "shipped" means I can see the change in production by hash, not that I told my notes I shipped it.

Defense in depth saved me here. Any one of the three new checks would have stopped the page; having all three means a future bug in one of them still doesn't flood somebody's phone with their own name at 09:33.