Mining 18,000 emails for facts about my own life

FILE 0x9F·MINING 18,000 EMAILS FOR FACTS ABOUT MY OWN LIFE

April 21, 2026 · llm, agents, memory

I had three years of email sitting in a DynamoDB table and a memory store wired into my assistant, but the assistant was making decisions with no recall of what was actually in that inbox. So I ran a three-pass extraction over the corpus, fanning each pass out across parallel subagents. It worked, and along the way it showed that several of my own assumptions were wrong.

The corpus

18,347 items in the DynamoDB mail table
~500 MB on disk
611 unique senders
Spans three years

Pass 1 — per-sender summary fan-out

I aggregated the items to 611 senders, batched them 50 per subagent, and fanned out to 13 parallel subagents (611 / 50 rounds up to 13). Each subagent looked at a batch's worth of senders, classified them, and wrote durable facts directly to memory via an MCP tool.
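A rough sketch of the aggregate-and-batch step, for illustration only. The item shape (dicts with `sender` and `subject` fields) and the function names are assumptions, not the actual table schema or glue code:

```python
from collections import defaultdict


def group_by_sender(items):
    """Collapse raw mail items into one record per sender."""
    grouped = defaultdict(list)
    for item in items:
        # "sender" and "subject" are assumed field names, not the real schema.
        grouped[item["sender"].strip().lower()].append(item.get("subject", ""))
    return [
        {"sender": sender, "count": len(subjects), "subjects": subjects[:5]}
        for sender, subjects in sorted(grouped.items())
    ]


def batches(records, size=50):
    """Split the per-sender records into fixed-size batches, one per subagent."""
    return [records[i:i + size] for i in range(0, len(records), size)]


# 611 senders at 50 per batch gives 13 batches; the last one holds 11 senders.
demo = [{"sender": f"user{i}@example.com", "subject": "hi"} for i in range(611)]
plan = batches(group_by_sender(demo))
print(len(plan), len(plan[-1]))  # prints "13 11"
```

Keeping only a handful of subjects per sender is one way to keep each subagent's prompt small; the cutoff of 5 here is arbitrary.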

About 422 memory entries got written across topics like people/*, vendor/*, subscription/*, financial/*, medical/*, organization/*, and automated/*. About 140 senders got skipped as pure marketing or test fixtures.
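One way the topic namespaces could be enforced at write time is sketched below. The topic prefixes are the ones listed above; the `MemoryEntry` shape, the skip labels, and `to_entry` are hypothetical and not the real MCP tool's interface:

```python
from dataclasses import dataclass

# Topic prefixes named in the post; everything else here is a hypothetical shape.
TOPIC_PREFIXES = ("people", "vendor", "subscription", "financial",
                  "medical", "organization", "automated")
SKIP_LABELS = {"marketing", "test_fixture"}  # assumed labels for skipped senders


@dataclass(frozen=True)
class MemoryEntry:
    topic: str   # e.g. "vendor/<slug>"
    fact: str    # one durable fact, written as a plain sentence
    source: str  # the sender the fact was derived from


def to_entry(label, slug, fact, sender):
    """Return a MemoryEntry, or None when the sender should be skipped."""
    if label in SKIP_LABELS:
        return None
    if label not in TOPIC_PREFIXES:
        raise ValueError(f"unknown topic prefix: {label}")
    return MemoryEntry(topic=f"{label}/{slug}", fact=fact, source=sender)
```

Rejecting unknown prefixes loudly keeps a misclassifying subagent from inventing new namespaces without anyone noticing.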

Total wall-clock: under an hour with the subagents running in parallel. Sequentially this would've been most of a day.

Pass 2 — deep dive on the high-signal threads

13 senders looked worth deeper extraction — the actual humans I'd been corresponding with. One subagent per sender, pulled full body text via a GSI on FROM#<sender>.
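A minimal sketch of what the per-sender pull might look like against the low-level DynamoDB Query API. The index name, the `gsi1pk` attribute name, and both function names are assumptions; only the `FROM#<sender>` key format comes from the setup described above. The `client` argument is anything with a `query(**kwargs)` method, such as `boto3.client("dynamodb")`:

```python
def sender_query_kwargs(table_name, index_name, sender, start_key=None):
    """Build keyword arguments for a low-level DynamoDB Query on a GSI."""
    kwargs = {
        "TableName": table_name,
        "IndexName": index_name,
        # "gsi1pk" is an assumed attribute name for the index partition key.
        "KeyConditionExpression": "gsi1pk = :pk",
        "ExpressionAttributeValues": {":pk": {"S": f"FROM#{sender}"}},
    }
    if start_key is not None:
        kwargs["ExclusiveStartKey"] = start_key
    return kwargs


def fetch_sender_items(client, table_name, index_name, sender):
    """Follow LastEvaluatedKey until every page for the sender has been read."""
    items, start_key = [], None
    while True:
        page = client.query(
            **sender_query_kwargs(table_name, index_name, sender, start_key)
        )
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return items
```

The pagination loop matters for prolific senders, since a single Query response is capped in size. If the index does not project the body attribute, a follow-up read against the base table would be needed.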

This pass corrected Pass 1 in ways that surprised me:

A name I'd taken for a financial advisor was actually a cold prospector working from mailing lists. Same with a second "advisor." Neither has an actual account relationship with me.
A name I'd attributed to one person was actually a different person with a similar last name; the original was a guest speaker at the event the second person organizes.
A "city official" turned out to be a cultural arts center mailing list with the same first name.
A debt-collection notice I'd half-accepted as real was paired with cold-prospecting emails from a sales rep at the same alleged creditor, which strongly suggests it's a scam aimed at a legacy email alias.

The lesson here: Pass 1's summaries were good as orientation but wrong on details. Pass 2's per-thread reading was where the actual truth lived. Summaries can't stand in for the close read.

Pass 3 — keyword sweep for known gaps

After Passes 1 and 2 I had a list of things I expected to find but hadn't. Eight gap-buckets: estate, retirement, medical, real-estate, and a few more.

Built a small keyword index over the table, pulled the matching items per bucket, trimmed to ~25-60 per bucket, fanned out 8 parallel subagents. Each subagent's job was to either resolve the gap or report "still missing, here's what I see."
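The keyword sweep can be sketched as a small inverted index plus a per-bucket cap. The four bucket names are the ones named above; the keyword lists, the item fields (`id`, `subject`, `body`), and the function names are illustrative guesses rather than what was actually run:

```python
import re
from collections import defaultdict

# Bucket names come from the post; the keyword lists are illustrative guesses.
BUCKETS = {
    "estate": ["will", "trust", "executor"],
    "retirement": ["401k", "ira", "pension"],
    "medical": ["appointment", "prescription", "lab"],
    "real-estate": ["mortgage", "escrow", "deed"],
}


def build_index(items):
    """Map each lowercase token to the set of item ids that contain it."""
    index = defaultdict(set)
    for item in items:
        text = f"{item.get('subject', '')} {item.get('body', '')}".lower()
        for token in set(re.findall(r"[a-z0-9]+", text)):
            index[token].add(item["id"])
    return index


def bucket_matches(index, buckets=BUCKETS, cap=60):
    """Collect matching ids per bucket, trimmed to at most `cap` items."""
    result = {}
    for name, keywords in buckets.items():
        ids = set()
        for kw in keywords:
            ids |= index.get(kw, set())
        # Trimming by sorted id is arbitrary; ranking by date or hit count
        # would be a more sensible choice in practice.
        result[name] = sorted(ids)[:cap]
    return result
```

Matching on whole tokens means "will" does not fire on "willing", which keeps the buckets small, but it also misses variants like "wills". That trade-off is one place where broader keyword terms come in.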

Of the 8 gaps, 5 resolved (one with the answer "there is no advisor, he's fully self-directed"). The other 3 stayed unresolved: closing them would need either a pass that extracts the PDF attachments or broader keyword terms.

What I'd do differently

A fourth pass on attachments. A lot of substantive info (tax forms, medical records) lives in PDFs that I never opened. The text body of those emails is just "your document is attached." Without extracting text from those attachments, with OCR where the PDF is a scan, the extraction is blind to a real chunk of the inbox's content.
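A first step toward that pass could be a cheap filter for emails whose body is a stub but which carry a PDF. This is a heuristic sketch; the `attachments` field (a list of filenames) and the 200-character threshold are assumptions:

```python
def needs_attachment_pass(item, max_body_chars=200):
    """Flag emails whose body is short but which carry a PDF attachment."""
    body = (item.get("body") or "").strip()
    # "attachments" as a list of filenames is an assumed field, not the real schema.
    names = item.get("attachments") or []
    has_pdf = any(name.lower().endswith(".pdf") for name in names)
    return has_pdf and len(body) <= max_body_chars
```

The flagged items would then go to whatever PDF text extraction or OCR step the fourth pass ends up using.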

Also, the subagent fan-out pattern worked so well I want to formalize it as a tool. Right now it's bespoke per-job glue code. The core pattern — split a corpus into batches, assign one subagent per batch with a clear extraction schema, let them each write to a shared memory store, then audit the results — is generic. A small "fan-out runner" library would make this trivially reusable for other corpora (Slack history, call transcripts, document folders).
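To make the shape of such a runner concrete, here is a minimal sketch under stated assumptions. `MemoryStore` is an in-memory stand-in for the real memory store, `worker` is any callable that would invoke a subagent on one batch, and the audit is reduced to counting entries and collecting failed batches. None of these names come from existing code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MemoryStore:
    """In-memory stand-in for the shared memory store (hypothetical interface)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.entries = []

    def write(self, entry):
        with self._lock:
            self.entries.append(entry)


def fan_out(corpus, batch_size, worker, store, max_workers=8):
    """Run `worker(batch, store)` once per batch and return an audit report."""
    batches = [corpus[i:i + batch_size] for i in range(0, len(corpus), batch_size)]
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, b, store): n for n, b in enumerate(batches)}
        for future, n in futures.items():
            try:
                future.result()
            except Exception as exc:  # one failed batch should not sink the run
                failures.append((n, repr(exc)))
    return {"batches": len(batches), "entries": len(store.entries), "failures": failures}
```

Threads fit here because each worker would spend its time waiting on a subagent call rather than computing. Returning failed batch indices lets a rerun target only the batches that broke.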

The other lesson: trust no single-pass summary. Use cheap summaries to find the threads worth reading, then actually read them.