Plaid to DynamoDB with read-time categorization

FILE 0x38·PLAID TO DYNAMODB WITH READ-TIME CATEGORIZATION

April 17, 2026 · aws, personal-finance, dynamodb

I wanted a personal-finance dashboard that I owned end-to-end. Plaid on one end, DynamoDB on the other, one Lambda in between, and categorization rules that are applied at read time rather than baked into stored rows.

The shape

One Lambda (finances-to-dynamodb), Python 3.12, behind API Gateway v2 with a custom domain
One DynamoDB table (finances), pay-per-request, PITR on
One EventBridge daily rule for the scheduled Plaid sync
The Lambda handles both webhook (Plaid pushing updates) and HTTP (the dashboard reading data)

Single-table design:

PK = ITEM#<item_id>   SK = META                    # access token, cursor
PK = ITEM#<item_id>   SK = ACCT#<account_id>       # account + balance
PK = ITEM#<item_id>   SK = TXN#<YYYY-MM-DD>#<id>   # transactions, date-sorted
PK = ITEM#<item_id>   SK = TXNIDX#<txn_id>         # reverse index for webhook
PK = RULES            SK = RULE#<id>               # categorization rules

The TXN#<YYYY-MM-DD>#<id> sort key gives me cheap date-range queries per account. The TXNIDX# rows are there so Plaid's removed webhook (which only sends the transaction ID, not the date) can find the canonical row to delete.

Read-time categorization

This is the part I'm fondest of. Instead of stamping a category on each transaction when it lands, transactions store Plaid's category as-is, and a separate rules table is applied when the dashboard asks for a summary:

def build_summary(txns, rules):
    out = defaultdict(float)
    for t in txns:
        cat = apply_rules(t, rules) or t["personal_finance_category"]["primary"]
        out[cat] += t["amount"]
    return out

def apply_rules(txn, rules):
    for r in rules:
        field = txn.get(r["match_field"], "") or ""
        if r["match_type"] == "contains" and r["pattern"].lower() in field.lower():
            return r["category"]
        # exact / regex variants similarly
    return None

The dashboard UI is built around this. Click any transaction's name, the modal suggests the longest word as a pattern and pre-fills a "make a rule" form. Save the rule — the next page load shows the entire history reshaped by it. Zero data migration. No batch job. No "wait, why doesn't this old transaction reflect the new rule" surprise.

For example, my mortgage payment lands under LOAN_PAYMENTS per Plaid's default categorization. I added a single rule — name contains "tfs" → MORTGAGE — and $2,250/month immediately moved to its own bucket across the entire history.

The income bug

Plaid's personal_finance_category.primary == "INCOME" is the ground truth for paychecks. But my "income" total kept reading high because I'd been counting two other things as income:

Loan disbursements (paycheck advances, instant-loan products). These show up as positive transactions but they're really transfers — money I owe back. Belongs in a TRANSFERS bucket, not income.
Refunds (negative non-INCOME transactions). These were inflating income too. They should net against spend in their original category.

Fix in build_summary: only count Plaid's INCOME primary as true income. Bucket loan disbursements under LOAN_DISBURSEMENTS / transfers. Let refunds reduce spend where they came from.

April income went from a reported $3,777 to the actual $2,164. The difference (~$1,613) was ~$1,161 in loan advances plus ~$452 in refunds. The bigger number had looked plausible enough that I hadn't caught it for a while. The smaller number is right.

The API Gateway v2 cookies trap

The dashboard authenticates through a single-sign-on flow with my homelab assistant. After SSO callback set the session cookie, /api/summary kept returning 401, the JS would redirect back to SSO, lather, rinse, redirect loop.

Root cause: API Gateway v2 payload format 2.0 puts cookies in event["cookies"] (a list of name=value strings), not in headers["cookie"]. My code only looked at the headers, so the session cookie set by /sso/callback was invisible to /api/summary on the very next request.

Fix:

cookies_array = event.get("cookies") or []
if cookies_array:
    existing = headers.get("cookie", "")
    joined = "; ".join(cookies_array)
    headers["cookie"] = f"{existing}; {joined}" if existing else joined

This trap will hit anything else I run on API Gateway v2 with cookie auth. Now it's in my "things that bite" list.

What I'd do differently

Read-time categorization is the right default. The temptation to "just stamp the category at write time" is strong — it feels more efficient — but the moment you want to change a rule, you've got a data migration. Doing it at read time costs you a few extra milliseconds per query and buys you a zero-friction way to reshape your history any time you want.