Plaid to DynamoDB with read-time categorization
I wanted a personal-finance dashboard that I owned end-to-end. Plaid on one end, DynamoDB on the other, one Lambda in between, and categorization rules that are applied at read time rather than baked into stored rows.
The shape
- One Lambda (
finances-to-dynamodb), Python 3.12, behind API Gateway v2 with a custom domain - One DynamoDB table (
finances), pay-per-request, PITR on - One EventBridge daily rule for the scheduled Plaid sync
- The Lambda handles both webhook (Plaid pushing updates) and HTTP (the dashboard reading data)
Single-table design:
PK = ITEM#<item_id> SK = META # access token, cursor
PK = ITEM#<item_id> SK = ACCT#<account_id> # account + balance
PK = ITEM#<item_id> SK = TXN#<YYYY-MM-DD>#<id> # transactions, date-sorted
PK = ITEM#<item_id> SK = TXNIDX#<txn_id> # reverse index for webhook
PK = RULES SK = RULE#<id> # categorization rules
The TXN#<YYYY-MM-DD>#<id> sort key gives me cheap date-range
queries per account. The TXNIDX# rows are there so Plaid's
removed webhook (which only sends the transaction ID, not the
date) can find the canonical row to delete.
Read-time categorization
This is the part I'm fondest of. Instead of stamping a category on each transaction when it lands, transactions store Plaid's category as-is, and a separate rules table is applied when the dashboard asks for a summary:
def build_summary(txns, rules):
out = defaultdict(float)
for t in txns:
cat = apply_rules(t, rules) or t["personal_finance_category"]["primary"]
out[cat] += t["amount"]
return out
def apply_rules(txn, rules):
for r in rules:
field = txn.get(r["match_field"], "") or ""
if r["match_type"] == "contains" and r["pattern"].lower() in field.lower():
return r["category"]
# exact / regex variants similarly
return None
The dashboard UI is built around this. Click any transaction's name, the modal suggests the longest word as a pattern and pre-fills a "make a rule" form. Save the rule — the next page load shows the entire history reshaped by it. Zero data migration. No batch job. No "wait, why doesn't this old transaction reflect the new rule" surprise.
For example, my mortgage payment lands under LOAN_PAYMENTS per
Plaid's default categorization. I added a single rule —
name contains "tfs" → MORTGAGE — and $2,250/month immediately moved
to its own bucket across the entire history.
The income bug
Plaid's personal_finance_category.primary == "INCOME" is the
ground truth for paychecks. But my "income" total kept reading high
because I'd been counting two other things as income:
- Loan disbursements (paycheck advances, instant-loan products).
These show up as positive transactions but they're really
transfers — money I owe back. Belongs in a
TRANSFERSbucket, not income. - Refunds (negative non-INCOME transactions). These were inflating income too. They should net against spend in their original category.
Fix in build_summary: only count Plaid's INCOME primary as true
income. Bucket loan disbursements under LOAN_DISBURSEMENTS /
transfers. Let refunds reduce spend where they came from.
April income went from a reported $3,777 to the actual $2,164. The difference (~$1,613) was ~$1,161 in loan advances plus ~$452 in refunds. The bigger number had looked plausible enough that I hadn't caught it for a while. The smaller number is right.
The API Gateway v2 cookies trap
The dashboard authenticates through a single-sign-on flow with my
homelab assistant. After SSO callback set the session cookie,
/api/summary kept returning 401, the JS would redirect back to
SSO, lather, rinse, redirect loop.
Root cause: API Gateway v2 payload format 2.0 puts cookies in
event["cookies"] (a list of name=value strings), not in
headers["cookie"]. My code only looked at the headers, so the
session cookie set by /sso/callback was invisible to
/api/summary on the very next request.
Fix:
cookies_array = event.get("cookies") or []
if cookies_array:
existing = headers.get("cookie", "")
joined = "; ".join(cookies_array)
headers["cookie"] = f"{existing}; {joined}" if existing else joined
This trap will hit anything else I run on API Gateway v2 with cookie auth. Now it's in my "things that bite" list.
What I'd do differently
Read-time categorization is the right default. The temptation to "just stamp the category at write time" is strong — it feels more efficient — but the moment you want to change a rule, you've got a data migration. Doing it at read time costs you a few extra milliseconds per query and buys you a zero-friction way to reshape your history any time you want.