Mining 18,000 emails for facts about my own life
I had three years of email sitting in a DynamoDB table and a memory store wired into my assistant. The assistant was making decisions with no recall of what was actually in that inbox. So I ran a multi-pass extraction over the corpus with a fan-out of subagents and got real results — including discovering that several of my own assumptions were wrong.
The corpus
- 18,347 items in the DynamoDB mail table
- ~500 MB on disk
- 611 unique senders
- Spans three years
Pass 1 — per-sender summary fan-out
I aggregated the corpus by sender (611 senders), batched them 50 per subagent, and fanned out to 13 parallel subagents. Each subagent read its batch of senders, classified them, and wrote durable facts directly to memory via an MCP tool.
About 422 memory entries got written across topics like
people/*, vendor/*, subscription/*, financial/*, medical/*,
organization/*, and automated/*. About 140 senders got skipped
as pure marketing or test fixtures.
Total wall-clock: under an hour with the subagents running in parallel. Sequentially this would've been most of a day.
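The batch arithmetic above (611 senders at 50 per batch gives 13 batches, the last one holding 11) can be sketched in a few lines. This is a minimal illustration, not the glue code I actually ran: `classify_batch` is a hypothetical stand-in for launching one subagent, and threads stand in for whatever the subagent runtime uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fan_out(senders: Sequence[str], batch_size: int,
            worker: Callable[[Sequence[str]], int]) -> List[int]:
    """Run one worker per batch in parallel; results keep batch order."""
    batches = list(batched(senders, batch_size))
    with ThreadPoolExecutor(max_workers=max(1, len(batches))) as pool:
        return list(pool.map(worker, batches))


def classify_batch(batch: Sequence[str]) -> int:
    # Hypothetical stand-in for a subagent: a real one would classify
    # each sender and write facts to memory. Here we only count.
    return len(batch)


senders = [f"sender{i}@example.com" for i in range(611)]
counts = fan_out(senders, 50, classify_batch)
```

The work is I/O-bound (waiting on a model), so one worker per batch is fine at this scale; a larger corpus would want a cap on `max_workers`.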
Pass 2 — deep dive on the high-signal threads
Thirteen senders looked worth deeper extraction: the actual humans I'd
been corresponding with. I ran one subagent per sender, each pulling full
body text via a global secondary index (GSI) keyed on FROM#<sender>.
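For reference, pulling every item for one sender through a GSI looks roughly like this. It is a sketch under assumptions: the index name `from-index` and key attribute `gsi1pk` are hypothetical (only the `FROM#<sender>` key format comes from my table), and `table` is expected to behave like a boto3 DynamoDB `Table` resource, whose `query` returns a `LastEvaluatedKey` when more pages remain. A tiny in-memory stand-in is included so the pagination loop can be exercised without AWS.

```python
from typing import Any, Dict, Iterator, List


def iter_sender_items(table: Any, sender: str,
                      index_name: str = "from-index",  # hypothetical name
                      key_attr: str = "gsi1pk"         # hypothetical name
                      ) -> Iterator[Dict[str, Any]]:
    """Yield all items for one sender, following query pagination."""
    kwargs: Dict[str, Any] = {
        "IndexName": index_name,
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": key_attr},
        "ExpressionAttributeValues": {":pk": f"FROM#{sender}"},
    }
    while True:
        page = table.query(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        # Resume the query where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


class FakeTable:
    """In-memory stand-in that replays canned pages (not a real client)."""

    def __init__(self, pages: List[Dict[str, Any]]) -> None:
        self.pages = pages
        self.calls: List[Dict[str, Any]] = []

    def query(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(dict(kwargs))
        return self.pages[len(self.calls) - 1]


fake = FakeTable([
    {"Items": [{"id": "1"}, {"id": "2"}], "LastEvaluatedKey": {"id": "2"}},
    {"Items": [{"id": "3"}]},
])
items = list(iter_sender_items(fake, "alice@example.com"))
```

Forgetting the pagination loop is the easy mistake here: a single `query` call returns at most 1 MB of data, which a chatty sender's full bodies can exceed.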
This pass corrected Pass 1 in ways that surprised me:
- A name I thought was a financial advisor was actually cold prospecting from mailing lists. Same with a second "advisor." Neither has an actual account relationship with me.
- A name I'd attributed to one person was actually a different person with a similar last name; the original was a guest speaker at the event the second person organizes.
- A "city official" turned out to be a cultural arts center mailing list with the same first name.
- A debt-collection notice I'd half-accepted as real was paired with cold-prospecting emails from a sales rep at the same alleged creditor, which strongly suggests it's a scam aimed at a legacy email alias.
The lesson here: Pass 1's summaries were good as orientation but wrong on details. Pass 2's per-thread reading was where the actual truth lived. Summaries can't stand in for the close read.
Pass 3 — keyword sweep for known gaps
After Passes 1 and 2 I had a list of things I expected to find but hadn't. Eight gap-buckets: estate, retirement, medical, real-estate, and a few more.
Built a small keyword index over the table, pulled the matching items per bucket, trimmed to ~25-60 per bucket, fanned out 8 parallel subagents. Each subagent's job was to either resolve the gap or report "still missing, here's what I see."
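A keyword index of this kind can be very small. The sketch below is illustrative only: the bucket keywords are made-up examples, the items are toy data, and trimming by sorted ID is a placeholder for whatever relevance ordering makes sense (I'm not claiming this is how the real trim was done).

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical keyword lists; the real buckets used different terms.
BUCKET_KEYWORDS: Dict[str, List[str]] = {
    "estate": ["will", "trust", "executor"],
    "retirement": ["401k", "ira", "pension"],
}


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(items: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each token to the IDs of the items containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for item_id, text in items.items():
        for token in tokenize(text):
            index[token].add(item_id)
    return index


def bucket_items(index: Dict[str, Set[str]],
                 buckets: Dict[str, List[str]],
                 cap: int) -> Dict[str, List[str]]:
    """Collect matching item IDs per bucket, trimmed to `cap` each."""
    result: Dict[str, List[str]] = {}
    for bucket, keywords in buckets.items():
        ids: Set[str] = set()
        for keyword in keywords:
            ids |= index.get(keyword, set())
        result[bucket] = sorted(ids)[:cap]
    return result


toy_items = {
    "m1": "Your IRA rollover statement",
    "m2": "Living trust draft attached",
    "m3": "Weekly sale on shoes",
    "m4": "Pension and IRA summary",
}
matches = bucket_items(build_index(toy_items), BUCKET_KEYWORDS, cap=60)
```

Each bucket's ID list then becomes the input for one subagent.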
Of the 8 gaps, 5 resolved (one with the answer "there is no advisor, he's fully self-directed"). 3 remained unresolvable from email alone — they would need either a PDF body extraction pass or broader keyword terms.
What I'd do differently
I'd add a fourth pass on attachments. A lot of substantive info (tax forms, medical records) lives in PDFs that I never opened. The text body of those emails is just "your document is attached." Without OCRing those attachments, the extraction is blind to a real chunk of the inbox's content.
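The first step of such a pass would be finding which messages carry PDFs at all. Here is a minimal sketch using only Python's standard `email` package. It assumes the raw MIME bytes of each message are available, which may not match how the table actually stores mail, and it stops short of OCR or text extraction.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import List


def pdf_attachments(raw: bytes) -> List[str]:
    """Return the filenames of PDF attachments in a raw MIME message."""
    msg = message_from_bytes(raw, policy=policy.default)
    names: List[str] = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            names.append(part.get_filename() or "(unnamed)")
    return names


# Build a toy message shaped like the "your document is attached" emails.
sample = EmailMessage()
sample["Subject"] = "Your document is attached"
sample.set_content("Your document is attached.")
sample.add_attachment(b"%PDF-1.4 placeholder", maintype="application",
                      subtype="pdf", filename="statement.pdf")

found = pdf_attachments(sample.as_bytes())
```

Messages where this returns a non-empty list and the body is near-empty are the ones worth routing to an extraction step.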
Also, the subagent fan-out pattern worked so well I want to formalize it as a tool. Right now it's bespoke per-job glue code. The core pattern — split a corpus into batches, assign one subagent per batch with a clear extraction schema, let them each write to a shared memory store, then audit the results — is generic. A small "fan-out runner" library would make this trivially reusable for other corpora (Slack history, call transcripts, document folders).
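To make the shape of that library concrete, here is a rough sketch of what a runner could look like. Nothing here exists yet: `MemoryStore`, `RunReport`, `run_fan_out`, and the `extract_people` worker are all hypothetical names, threads stand in for subagents, and the "audit" is just a count of writes plus a list of failed batches.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from threading import Lock
from typing import Callable, Dict, List, Sequence, Tuple


class MemoryStore:
    """Toy shared store; a lock keeps concurrent writes safe."""

    def __init__(self) -> None:
        self._lock = Lock()
        self.entries: Dict[str, List[str]] = {}

    def write(self, topic: str, text: str) -> None:
        with self._lock:
            self.entries.setdefault(topic, []).append(text)


@dataclass
class RunReport:
    batches: int = 0
    facts_written: int = 0
    failures: List[Tuple[int, str]] = field(default_factory=list)


def run_fan_out(corpus: Sequence[str], batch_size: int,
                extract: Callable[[Sequence[str], MemoryStore], int],
                store: MemoryStore, max_workers: int = 8) -> RunReport:
    """Split the corpus, run one worker per batch, and audit the outcome."""
    batches = [corpus[i:i + batch_size]
               for i in range(0, len(corpus), batch_size)]
    report = RunReport(batches=len(batches))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract, batch, store): number
                   for number, batch in enumerate(batches)}
        for future in as_completed(futures):
            try:
                report.facts_written += future.result()
            except Exception as exc:
                # Record the failed batch instead of aborting the run.
                report.failures.append((futures[future], repr(exc)))
    return report


def extract_people(batch: Sequence[str], store: MemoryStore) -> int:
    # Hypothetical worker: writes one fact per document, rejects bad input.
    written = 0
    for doc in batch:
        if doc.startswith("!"):
            raise ValueError(f"unparseable: {doc}")
        store.write(f"people/{doc}", "seen in corpus")
        written += 1
    return written


store = MemoryStore()
report = run_fan_out(["alice", "bob", "!bad", "carol", "dave"], 2,
                     extract_people, store)
```

One design question this surfaces: a worker that fails partway through a batch may already have written some facts, so a real runner would need either per-batch staging or idempotent writes before a retry is safe.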
The other lesson: trust no single-pass summary. Use cheap summaries to find the threads worth reading, then actually read them.