Producing an audiobook with Kokoro on a Mac mini
I wrote a thriller novel. I wanted an audiobook of it. The traditional path is ACX with a human narrator, which takes six to twelve months and costs a few thousand dollars. The Kindle path is Amazon Virtual Voice, which is fine but Amazon-only and not under my control. I tried local TTS instead.
What was happening
I started on ElevenLabs Pro at $99/month, narrating with Bella and using Justin for the male protagonist's dialogue. Through chapter 11 it sounded great. Two problems: (1) the credit cost was nontrivial at re-render volume, since every plot edit meant re-rendering whole chapters, and (2) my voice cast was at the mercy of ElevenLabs not deprecating or repricing those specific voices mid-project.
What I found
Kokoro is an open-source 82M-parameter TTS model with ~50 included voices. It runs locally on Apple Silicon via MPS. On a Mac mini it renders 24kHz WAV faster than realtime, no per-character cost, no API limits. The voice quality of af_bella is different from ElevenLabs Bella but holds up well — a touch warmer, fewer obvious TTS artifacts on long sentences.
The decision was: re-render every chapter on Kokoro for unified voice across the whole book, rather than carrying two voice eras forever.
The render script is short. The interesting work is in stitching:
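# render_all.py (roughly)
from pathlib import Path
import numpy as np
import soundfile as sf
from kokoro import KModel, KPipeline

model = KModel().to("mps").eval()
pipe = KPipeline(lang_code="a", model=model)  # "a" = American English
Path("out").mkdir(exist_ok=True)
# Assumes one text file per chapter: chapters/ch01.txt, chapters/ch02.txt, ...
for n, chapter in enumerate(sorted(Path("chapters").glob("ch*.txt")), start=1):
    text = chapter.read_text()
    # The pipeline yields (graphemes, phonemes, audio) for each text segment.
    segments = [audio.cpu().numpy() for _, _, audio in pipe(text, voice="af_bella")]
    sf.write(f"out/ch{n:02d}.wav", np.concatenate(segments), 24000)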
Each chapter comes out as a 24kHz WAV. ffmpeg stitches them into a single M4B with chapter markers via an ffmetadata sidecar:
;FFMETADATA1
[CHAPTER]
TIMEBASE=1/1000
START=0
END=1124800
title=Chapter 1
[CHAPTER]
TIMEBASE=1/1000
START=1124800
END=2480100
title=Chapter 2
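# Separate -i inputs are not joined end to end, so the WAVs go through the
# concat demuxer. wavs.txt has one line per chapter WAV: file 'ch01.wav'
ffmpeg -f concat -safe 0 -i wavs.txt \
  -i chapters.ffmetadata \
  -map 0:a -map_metadata 1 -map_chapters 1 \
  -c:a aac -b:a 96k \
  Provenance.m4b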
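The START and END values in the ffmetadata file are running totals of the chapter durations in milliseconds. Here is a minimal sketch of how they could be generated, assuming the chapter files are 16-bit PCM WAVs in out/ (which Python's standard wave module can read). The function names are hypothetical and this is not the script used for the book.

```python
import wave
from pathlib import Path


def wav_duration_ms(path):
    """Return the duration of a PCM WAV file in whole milliseconds."""
    with wave.open(str(path), "rb") as w:
        return round(w.getnframes() * 1000 / w.getframerate())


def build_ffmetadata(durations_ms, titles):
    """Build ffmetadata text with one [CHAPTER] block per chapter."""
    lines = [";FFMETADATA1"]
    start = 0
    for duration, title in zip(durations_ms, titles):
        end = start + duration
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START/END are in milliseconds
            f"START={start}",
            f"END={end}",
            f"title={title}",  # titles containing = ; # or \ would need escaping
        ]
        start = end  # the next chapter begins where this one ends
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    wavs = sorted(Path("out").glob("ch*.wav"))
    durations = [wav_duration_ms(p) for p in wavs]
    titles = [f"Chapter {i}" for i in range(1, len(wavs) + 1)]
    Path("chapters.ffmetadata").write_text(build_ffmetadata(durations, titles))
```

Deriving the offsets from the rendered files means a re-rendered chapter only requires regenerating the sidecar, not editing timestamps by hand.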
The two-voice era required a per-chapter splitter that identifies the male character's dialogue by occurrence index. With it, a future render with a second voice can route those lines through Michael instead of Bella without re-segmenting by hand.
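The splitter itself is not shown here, so what follows is only a guess at its shape. It assumes dialogue is wrapped in double quotes and that each chapter comes with a hand-made set of 0-based occurrence numbers marking which quoted spans belong to the male character. The function name, the regex, and the voice id am_michael are illustrative assumptions.

```python
import re

# Matches straight or curly double-quoted spans (assumed to be dialogue).
QUOTE = re.compile(r'"[^"]*"|“[^”]*”')


def split_by_voice(text, male_indices, narrator="af_bella", male="am_michael"):
    """Split text into (voice, segment) pairs.

    male_indices holds the 0-based occurrence numbers of the quoted spans
    that should be read by the second voice; everything else, including
    other characters' dialogue, stays with the narrator.
    """
    segments = []
    pos = 0
    for i, match in enumerate(QUOTE.finditer(text)):
        if i not in male_indices:
            continue  # leave this quote inside the narrator's run
        if match.start() > pos:
            segments.append((narrator, text[pos:match.start()]))
        segments.append((male, match.group()))
        pos = match.end()
    if pos < len(text):
        segments.append((narrator, text[pos:]))
    return segments
```

Each (voice, segment) pair could then be rendered separately and concatenated in order. Because the indices are tied to quote positions, a plot edit that adds or removes dialogue would mean updating that chapter's index set.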
Final book: 351 MB M4B, 6 hours 8 minutes, single-narrator. Total render time across all chapters: a few hours, run overnight. Cost: zero after electricity.
The fix
Distribution was a separate problem. M4B is native on Apple devices, awkward on Android and Windows, and not importable to Audible or Spotify without going through ACX. I shipped three things:
- KDP Virtual Voice edition. Amazon-side, 40% royalty, appears as a fourth format on the product page. Free to enable from the Kindle eBook row in Bookshelf.
- Direct M4B sale. Stripe Payment Link → Lambda webhook → presigned S3 URL emailed to buyer. Buyers on Apple devices have a native experience; everyone else gets a "how to listen" guide.
- A listen page. Single-file streaming HTML5 player behind a token-gated URL, for the people who don't want to manage files.
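For the direct-sale path, the webhook has to confirm that a request really came from Stripe before it emails anyone a link. Here is a sketch of that check using only the standard library, following Stripe's documented scheme (HMAC-SHA256 over "timestamp.body", compared against the v1 value in the Stripe-Signature header). The function names are hypothetical; in practice the official stripe library's stripe.Webhook.construct_event does this for you.

```python
import hashlib
import hmac
import json
import time


def verify_stripe_signature(payload, header, secret, tolerance=300, now=None):
    """Check a Stripe-Signature header against the raw request body (a str)."""
    parts = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    signed = f"{timestamp}.{payload}".encode()
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    current = time.time() if now is None else now
    fresh = abs(current - int(timestamp)) <= tolerance  # reject replayed events
    return fresh and any(
        hmac.compare_digest(expected.encode(), c.encode()) for c in candidates
    )


def buyer_email(payload):
    """Return the buyer's email for a completed checkout, else None."""
    event = json.loads(payload)
    if event.get("type") != "checkout.session.completed":
        return None
    session = event["data"]["object"]
    return (session.get("customer_details") or {}).get("email")
```

The signature must be computed over the raw body exactly as received, so inside a Lambda handler the body string should be passed through untouched and parsed only after the check succeeds.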
What I'd do differently
I would not have started on ElevenLabs. The "professional voice quality" framing made me think I needed paid TTS, but for a self-published thriller the listener does not care whether the voice came from a $99/month service or a free model on a Mac mini. They care whether the pacing is right and whether the chapters are marked. Both of those are problems you solve with ffmpeg, not with model choice.
The other thing I underestimated: presigned URLs without per-purchase download limits are an S3 bill waiting to happen. A 335 MB file at 24-hour expiry, leaked once, can cost more than I made on the entire book. I now issue a per-purchase token; each request decrements a counter in DynamoDB and mints a fresh five-minute presigned URL. That caps the worst case to a few cents.
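A minimal sketch of that counter-then-mint step, assuming a DynamoDB table keyed on a purchase_token attribute with a numeric downloads_left attribute. Both attribute names and the function name are hypothetical. The function takes the table and S3 client as arguments, so it works with a boto3 Table resource and a boto3 S3 client, or with stand-ins in a test.

```python
def mint_download_url(table, s3, purchase_token, bucket, key, ttl_seconds=300):
    """Spend one download from the token and return a short-lived URL, or None."""
    try:
        # Conditional update: the decrement only happens if downloads remain.
        # A token that does not exist has no downloads_left, so it fails too.
        table.update_item(
            Key={"purchase_token": purchase_token},
            UpdateExpression="SET downloads_left = downloads_left - :one",
            ConditionExpression="downloads_left > :zero",
            ExpressionAttributeValues={":one": 1, ":zero": 0},
        )
    except Exception as exc:
        # botocore errors carry the service error code in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return None  # unknown token or no downloads left
        raise
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,  # five minutes by default
    )
```

With boto3 this might be wired up as table = boto3.resource("dynamodb").Table("purchases") and s3 = boto3.client("s3"), where the table name is again an assumption. Doing the check and the decrement in one conditional update means two simultaneous requests cannot both spend the last download.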