
How NightDesk decides whether to wake up an engineer at 2am

June 10, 2026 · nightdesk, msp, ai, prompt-engineering, python, llm

The hardest part of building NightDesk wasn't the telephony. Amazon Connect handles call routing, Amazon Transcribe handles speech-to-text, and Lambda glues them together. The hard part was the decision layer: given a conversation in progress, when do you wake up the on-call engineer?

One wrong answer is "whenever there's any doubt." That recreates the problem you were solving: techs don't sleep. The other wrong answer is "only when you're sure." A customer whose server has been down for 20 minutes doesn't want to hear "we'll look at it in the morning."

The triage kernel resolves this with a constraint baked into the system prompt:

Bias toward ESCALATE when uncertain. False escalation = mild annoyance.
False non-escalation = fired client.

That asymmetry is the whole thing. Everything else is implementation.

The three actions

The kernel has exactly three output states: GATHER, RESOLVE, and ESCALATE.

The agent speaks in all three states. On GATHER, it asks the next diagnostic question. On RESOLVE, it tells the caller what it did. On ESCALATE, it tells the caller to stay on the line.
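
For readers who want to picture the data model, here is a minimal sketch of the three states and the decision object. The field names mirror the JSON and the Decision(...) call shown later in this post; the lowercase enum values and the ends_turn_loop helper are my assumptions for illustration, not necessarily what ships in NightDesk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    GATHER = "gather"      # ask the next diagnostic question
    RESOLVE = "resolve"    # tell the caller what was done
    ESCALATE = "escalate"  # tell the caller to stay on the line


@dataclass
class Decision:
    action: Action
    speech: str                        # spoken to the caller in every state
    ticket_body: Optional[str] = None
    page_target: Optional[str] = None  # None when nobody should be paged


def ends_turn_loop(decision: Decision) -> bool:
    """Hypothetical helper: assume only GATHER keeps the triage loop going."""
    return decision.action is not Action.GATHER
```

Keeping speech on every decision is what lets the agent talk in all three states without the telephony layer needing to know which one it is in.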

RESOLVE only triggers when the caller explicitly confirms the fix works. If they say "let me try that" and don't call back, the system writes a "pending verification" ticket and routes it for morning review. It doesn't mark the issue resolved until it hears confirmation.
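
The confirmation rule is simple enough to sketch. This is an illustration under assumptions: the CallOutcome type, the ticket_status function, and the "open" status are hypothetical names; only "resolved" and "pending verification" come from the behavior described above.

```python
from dataclasses import dataclass


@dataclass
class CallOutcome:
    fix_offered: bool       # the agent suggested a fix during the call
    caller_confirmed: bool  # the caller explicitly said the fix worked


def ticket_status(outcome: CallOutcome) -> str:
    """Map how a call ended to the status written on the ticket."""
    if outcome.fix_offered and outcome.caller_confirmed:
        return "resolved"
    if outcome.fix_offered:
        # "Let me try that" with no confirmation: route for morning review.
        return "pending verification"
    return "open"  # assumed default when no fix was offered
```

The point of the sketch is that silence never counts as confirmation.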

The runbook

The model doesn't make decisions in a vacuum. Each customer has a runbook — a YAML file the MSP uploads during onboarding:

customer_name: Acme Manufacturing
business_hours: Mon-Fri 8a-5p CT
known_systems:
  - ERP (NetSuite)
  - VPN (Cisco AnyConnect)
  - File server (\\acme-fs01)
  - Email (Microsoft 365)
escalation_rules:
  - trigger: "multiple users can't log in"
    page: "+1-555-0100"
    method: sms
  - trigger: "production outage or server down"
    page: "oncall@msp.com"
    method: email
  - trigger: "suspected security incident"
    page: "+1-555-0100"
    method: sms
out_of_scope_phrases:
  - "I want to add a new employee"
  - "what's the price for"
  - "I need to upgrade my plan"

The known_systems list gives the model context about what's "critical" versus "inconvenient" for this customer. A NetSuite outage during business hours is a different conversation than a NetSuite outage at 2am on a Saturday, but in both cases the system knows NetSuite is in scope.

The escalation_rules let the MSP encode what warrants a wakeup without doing any prompt engineering. "Multiple users can't log in" is a P1 rule. The model sees that rule and knows: if the transcript matches this trigger, escalate to this contact via this method.
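
One way the runbook could reach the model is as plain text appended to the prompt. The sketch below assumes the YAML has already been parsed into a dict (for example with PyYAML's yaml.safe_load); runbook_from_dict, render_for_prompt, and the exact wording of the rendered lines are hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EscalationRule:
    trigger: str
    page: str
    method: str


@dataclass
class Runbook:
    customer_name: str
    business_hours: str
    known_systems: List[str] = field(default_factory=list)
    escalation_rules: List[EscalationRule] = field(default_factory=list)
    out_of_scope_phrases: List[str] = field(default_factory=list)


def runbook_from_dict(data: dict) -> Runbook:
    """Build a Runbook from already-parsed YAML."""
    rules = [EscalationRule(**r) for r in data.get("escalation_rules", [])]
    return Runbook(
        customer_name=data["customer_name"],
        business_hours=data["business_hours"],
        known_systems=list(data.get("known_systems", [])),
        escalation_rules=rules,
        out_of_scope_phrases=list(data.get("out_of_scope_phrases", [])),
    )


def render_for_prompt(rb: Runbook) -> str:
    """Flatten the runbook into text the model can read (assumed format)."""
    lines = [
        f"Customer: {rb.customer_name}",
        f"Business hours: {rb.business_hours}",
        "Known systems:",
    ]
    lines += [f"- {s}" for s in rb.known_systems]
    lines.append("Escalation rules:")
    lines += [
        f'- if "{r.trigger}": page {r.page} via {r.method}'
        for r in rb.escalation_rules
    ]
    lines.append("Out-of-scope phrases:")
    lines += [f'- "{p}"' for p in rb.out_of_scope_phrases]
    return "\n".join(lines)
```

The MSP only ever edits YAML; whatever rendering step sits between the file and the prompt stays on our side.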

Two extra context injections

When a call comes in, the system does two CW lookups before the first turn:

Caller identity. The inbound phone number is matched against CW company contacts. If there's a match, the runbook gets caller_name: "Sarah Chen (Acme Manufacturing)". The model can greet the caller by name and can reference the company's specific systems without asking "which company are you calling from?"

Recent tickets. The five most recently closed tickets for the company are injected as runbook context. If someone called three days ago about the same VPN issue, the model knows. It might ask "you had a VPN ticket earlier this week — is this related?" — which is exactly what a good help desk tech would do.

Neither of these is required for the triage to work, but they raise the floor considerably on what "good" looks like from the caller's perspective.
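
Here is a rough sketch of the two lookups using in-memory stand-ins instead of real CW calls. Contact, Ticket, find_caller, and recent_closed are hypothetical names, and matching on the last ten digits is a US-centric assumption of mine, not a description of the production matcher.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Contact:
    name: str
    company: str
    phone: str


@dataclass
class Ticket:
    summary: str
    closed_on: date


def normalize_phone(number: str) -> str:
    """Strip non-digits and keep the last 10 so formatting differences don't matter."""
    return re.sub(r"\D", "", number)[-10:]


def find_caller(inbound: str, contacts: List[Contact]) -> Optional[str]:
    """Return 'Name (Company)' for a matching contact, or None for unknown callers."""
    key = normalize_phone(inbound)
    for c in contacts:
        if normalize_phone(c.phone) == key:
            return f"{c.name} ({c.company})"
    return None


def recent_closed(tickets: List[Ticket], limit: int = 5) -> List[Ticket]:
    """Newest closed tickets first, capped at `limit`."""
    return sorted(tickets, key=lambda t: t.closed_on, reverse=True)[:limit]
```

An unknown number simply returns None, so triage still runs; the agent just has to ask who is calling.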

What happens when the model goes off-script

The model is supposed to return a JSON object:

{"action": "escalate", "speech": "Let me get an engineer on the line.", "ticket_body": "...", "page_target": "..."}

If it returns anything else — malformed JSON, unexpected action type, missing required fields — the system treats that as an ESCALATE. The fallback ticket body says "Triage agent received malformed decision from LLM and bailed."

That's the right fallback. The edge cases where the model outputs garbage are also the edge cases where the situation is probably weird enough to warrant human judgment. Don't trust a confused AI with a P1 decision.

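# A missing, null, or non-string action normalizes to "" and falls through to ESCALATE.
action_str = str(raw.get("action") or "").lower()
if action_str not in {a.value for a in Action}:
    # Unknown action: escalate rather than guess, and keep the raw payload for the ticket.
    return Decision(
        action=Action.ESCALATE,
        speech="Let me get an engineer on the line for you.",
        ticket_body=(
            "Triage agent received malformed decision from LLM "
            f"and bailed. Raw: {raw!r}"
        ),
        page_target=None,
    )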
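
The snippet above covers the unexpected-action case. A fuller sketch that also covers malformed JSON and missing fields might look like the following. parse_decision, fallback, and the choice of REQUIRED_FIELDS are assumptions made for illustration; the Action and Decision definitions are repeated only so the block runs on its own.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    GATHER = "gather"
    RESOLVE = "resolve"
    ESCALATE = "escalate"


@dataclass
class Decision:
    action: Action
    speech: str
    ticket_body: Optional[str] = None
    page_target: Optional[str] = None


REQUIRED_FIELDS = ("action", "speech")  # assumption: which fields count as required


def fallback(raw: object) -> Decision:
    """Any output we can't trust becomes an ESCALATE with the raw payload attached."""
    return Decision(
        action=Action.ESCALATE,
        speech="Let me get an engineer on the line for you.",
        ticket_body=(
            "Triage agent received malformed decision from LLM "
            f"and bailed. Raw: {raw!r}"
        ),
        page_target=None,
    )


def parse_decision(text: str) -> Decision:
    try:
        raw = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return fallback(text)  # not JSON at all
    if not isinstance(raw, dict):
        return fallback(raw)   # valid JSON, but not an object
    if any(not isinstance(raw.get(k), str) for k in REQUIRED_FIELDS):
        return fallback(raw)   # required field missing or not a string
    try:
        action = Action(raw["action"].lower())
    except ValueError:
        return fallback(raw)   # unexpected action type
    return Decision(action, raw["speech"], raw.get("ticket_body"), raw.get("page_target"))
```

Every failure path funnels into the same fallback, so there is exactly one place to audit when asking "what happens if the model says something strange?"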

Testing the decision kernel

Because the telephony is a thin shell around the kernel, the entire decision logic is testable without a phone:

def test_printer_offline_resolves():
    rb = Runbook(
        customer_name="Acme",
        business_hours="Mon-Fri 8a-5p CT",
        known_systems=["printer (HP LaserJet 2600)"],
    )
    transcript = [
        Turn("caller", "Hi, the printer in the accounting department isn't working."),
        Turn("agent", "Got it. Have you tried turning it off and on?"),
        Turn("caller", "Yeah, just tried that, still offline."),
        Turn("agent", "It may need a firmware reset. I can walk you through that..."),
        Turn("caller", "Oh, it came back on. Thanks!"),
    ]
    stub = lambda s, u: {"action": "resolve", "speech": "Great.", "ticket_body": "Printer resolved."}
    decision = run_turn(transcript, rb, llm=stub)
    assert decision.action == Action.RESOLVE

The llm parameter is a seam that tests inject. The production code calls Anthropic; tests inject a deterministic stub. The eight scripted scenarios in the test suite cover: printer offline (P3 resolve), password reset (P3 resolve), server down (P1 escalate), multiple users locked out (P1 escalate), sales misroute (out-of-scope ESCALATE with null page_target), upset caller requesting human (always ESCALATE), ambiguous issue where the agent should GATHER, and a scenario where the caller confirms resolution mid-call.
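
To make the seam concrete, here is a stripped-down sketch of what a run_turn with an injectable llm could look like. The (system, user) -> dict callable shape is inferred from the stub in the test above; the prompt assembly, the trimmed-down Runbook, and the garbage stub are my assumptions, not the production implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional

LLM = Callable[[str, str], dict]  # (system_prompt, user_message) -> decoded decision


class Action(Enum):
    GATHER = "gather"
    RESOLVE = "resolve"
    ESCALATE = "escalate"


@dataclass
class Decision:
    action: Action
    speech: str
    ticket_body: Optional[str] = None
    page_target: Optional[str] = None


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class Runbook:
    customer_name: str
    business_hours: str
    known_systems: List[str] = field(default_factory=list)


SYSTEM_PROMPT = (
    "Bias toward ESCALATE when uncertain. False escalation = mild annoyance. "
    "False non-escalation = fired client."
)


def run_turn(transcript: List[Turn], rb: Runbook, llm: LLM) -> Decision:
    # Assumed prompt layout: constraint first, then runbook context.
    system = (
        f"{SYSTEM_PROMPT}\nCustomer: {rb.customer_name}\n"
        f"Business hours: {rb.business_hours}\n"
        f"Known systems: {', '.join(rb.known_systems)}"
    )
    user = "\n".join(f"{t.speaker}: {t.text}" for t in transcript)
    raw = llm(system, user)  # the only place the model is touched
    action_str = str(raw.get("action") or "").lower() if isinstance(raw, dict) else ""
    if action_str not in {a.value for a in Action}:
        return Decision(
            Action.ESCALATE,
            "Let me get an engineer on the line for you.",
            f"Triage agent received malformed decision from LLM and bailed. Raw: {raw!r}",
            None,
        )
    return Decision(
        Action(action_str), raw.get("speech", ""), raw.get("ticket_body"), raw.get("page_target")
    )


# A stub that returns nonsense exercises the fallback path without any network call.
garbage_stub = lambda s, u: {"action": "dance"}
```

Because the model is just a callable, a test for the fallback path is one line: pass garbage_stub and assert the decision is ESCALATE.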

The metric that matters

MSPs ask about classification accuracy. The answer depends on how you define it, but the operational metric isn't "percentage of issues classified correctly." It's two numbers read together: the percentage of true P1s that reached a human, and the percentage of P3s that didn't wake anyone up.

The asymmetry in the prompt means the model will occasionally escalate a P3. That's fine — the tech answers, says "hey, I'll get back to you in the morning," and goes back to sleep. The ticket is written either way. The unacceptable case is a P1 that got classified as P3 and sat in a queue until 9am.
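
As a sketch of how those two numbers could be computed from a reviewed call log: the Call type and routing_metrics function are hypothetical names, and the sample calls are made up purely to show the arithmetic; they are not pilot data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Call:
    true_priority: str  # "P1" or "P3", assigned by a human reviewer after the fact
    escalated: bool     # did the call reach the on-call engineer?


def routing_metrics(calls: List[Call]) -> Tuple[Optional[float], Optional[float]]:
    """Return (share of P1s that reached a human, share of P3s that woke nobody)."""
    p1 = [c for c in calls if c.true_priority == "P1"]
    p3 = [c for c in calls if c.true_priority == "P3"]
    p1_reached = sum(c.escalated for c in p1) / len(p1) if p1 else None
    p3_quiet = sum(not c.escalated for c in p3) / len(p3) if p3 else None
    return p1_reached, p3_quiet


# Illustrative input only: two P1s that escalated, four P3s of which one escalated.
sample = [
    Call("P1", True), Call("P1", True),
    Call("P3", False), Call("P3", False), Call("P3", False), Call("P3", True),
]
print(routing_metrics(sample))  # (1.0, 0.75)
```

Reporting the pair instead of a single accuracy figure keeps the dangerous miss (a P1 that never escalated) from being averaged away by a pile of correctly handled P3s.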

From the seven pilot calls I've run in the AGJ environment: all three true P1s escalated correctly. Of the four P3s, three resolved without waking anyone and one escalated unnecessarily (the caller said "this is really urgent" about a printer). That's a 6/7 routing accuracy rate, with the one wrong classification in the safe direction.


NightDesk is in pilot. If you're running an MSP and this describes your after-hours problem, the pilot tier is $199/month. nightdesk.io