CostWatch now includes diagnostic hints for AWS cost anomalies

An anomaly alert that says "EC2 spend up 340% this week" is useful. An anomaly alert that says "EC2 spend up 340% this week — check: new instances without auto-termination, forgotten stopped instances still billing for attached EBS, On-Demand vs. Reserved coverage gap" is actionable.

CostWatch now adds a diagnostic hint to every anomaly alert.

How it works

No AI. No API calls at alert time. A lookup table.

_anomaly_hint(service_name) maps 24 AWS service names to a one-line diagnostic checklist. When an anomaly fires, the hint is appended to the alert payload — Slack message, Discord embed, Teams MessageCard, email, PagerDuty details field. All five channels get the same hint.

The hint shows up as a muted line below the spend numbers. It's not the headline — the headline is still the spend delta and the service name. The hint is the "where to look first" line that saves you from opening five different Cost Explorer screens before finding the obvious thing.
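To illustrate the "headline first, hint second" ordering, here is a rough sketch of how a channel-neutral payload could be assembled. The function `build_alert` and the field names are hypothetical, not CostWatch's actual payload schema, and omitting the hint for an unknown service is an assumption.

```python
from typing import Optional


def build_alert(service: str, pct_change: float, hint: Optional[str]) -> dict:
    """Build a channel-neutral alert payload (hypothetical field names)."""
    payload = {
        # The headline stays the spend delta and the service name.
        "headline": f"{service} spend up {pct_change:.0f}% this week",
        "service": service,
        "pct_change": pct_change,
    }
    # The hint is secondary: attach it only when the lookup found one,
    # so a missing hint never blocks or changes the alert itself.
    if hint:
        payload["hint"] = hint
    return payload
```

Keeping the hint as an optional field means each channel renderer can decide how quietly to display it.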

The 24 services covered

The hint table covers the services that drive most unexpected AWS spend:

EC2 — Check: new instances without termination protection, stopped instances with EBS, On-Demand vs. Reserved gap
RDS: Check: snapshot retention policy, Multi-AZ newly enabled, instance class upgrade that wasn't reverted
S3: Check: Intelligent-Tiering not applied, Glacier Instant Retrieval at high request volume, cross-region replication added
Lambda — Check: runaway recursive invocations, provisioned concurrency left enabled, duration ceiling hit
CloudFront — Check: cache hit ratio drop, origin request volume spike, Shield Advanced enrollment
EKS — Check: cluster autoscaler over-provisioning, Fargate pod profile costs vs. EC2 nodegroup
Bedrock — Check: input/output token ratio, model selection (Sonnet vs Haiku vs Opus pricing)

And 17 more.

Why no AI?

The obvious version of this feature would call Claude or another model at alert time: "this service spiked, tell me why." I considered it.

But:

Latency. Anomaly alerts should fire in under a second. An LLM call adds 1–3 seconds and a variable cost per alert.

Reliability. A heuristic lookup doesn't fail if the model is unavailable. Alerts during a cost spike — which is exactly when you want them — should be the most reliable part of the system.

The model doesn't know your account. Without access to your actual Cost Explorer data, an LLM can only tell you generic things about what causes EC2 costs to spike. The heuristic table is more specific because it's built from the most common causes seen in production, not from a model's training data.

The hint isn't a diagnosis. It's a first-look checklist. The goal is to get you to the right Cost Explorer screen in one click, not to replace the investigation.

The hint can be wrong. It's a lookup on a service name, not a read of your account state. But in practice, the first item on the checklist is the right answer more than half the time. That's enough to be worth one extra line per alert.

All alert channels, same hint

The hint appears in all five output channels — SES email, Slack Block Kit, Discord embed, Teams MessageCard, PagerDuty event details. The exact format varies by channel (bold in Slack, muted text in email) but the content is identical.
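As a sketch of "same content, different format", here is how two of the five channels might render the same hint string. The function name `render_hint`, the choice of a Block Kit section block, and the inline email styling are assumptions for illustration. The only detail taken from this post is bold in Slack and muted text in email.

```python
import html


def render_hint(channel: str, hint: str):
    """Render one hint string for a given channel (hypothetical helper)."""
    if channel == "slack":
        # Slack Block Kit section with mrkdwn text; *asterisks* make it bold.
        return {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*{hint}*"},
        }
    if channel == "email":
        # Muted grey line for HTML email; escape the hint before embedding it.
        return f'<p style="color:#888888;font-size:12px">{html.escape(hint)}</p>'
    raise ValueError(f"unsupported channel: {channel}")
```

The formatting lives entirely in the renderer, so the lookup table stays a single source of truth for the hint text.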

No new configuration needed. If you're already receiving anomaly alerts, you get hints automatically.