
Cutting a full-book TTS render down to one CLI

May 4, 2026 · tts, audiobook, python

I have a habit of writing things — long-form, novel-shaped things — and then not being able to listen to them on a walk. The commercial-quality audiobook pipelines are gated behind narrators and contracts. I wanted local, fast, free, and good enough that my ears wouldn't bleed on a fourteen-chapter manuscript.

What was happening

The state of local neural TTS is genuinely good now. Kokoro produces narration that's a little flat compared to a human but miles better than the robotic SAPI voices I grew up with. The problem isn't quality, it's the pipeline glue.

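Out of the box, neural TTS gives you "synthesize this paragraph" or maybe "synthesize this chapter." You still need to split the text into chunks the model can handle, cache finished audio so a crash doesn't restart the run, report progress, concatenate and encode each chapter, and package the chapters into one file with chapter markers.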

That was a directory full of half-broken scripts before I made it one CLI.

What I found

Three things matter for a usable render pipeline:

  1. Predictable chunking. You don't want to feed the model a whole chapter at once — it'll OOM on a long one and you have to start over. Chunk on natural boundaries (paragraphs, then sentences) up to a token budget per chunk. Within a chapter the chunks concatenate cleanly because they share a voice and seed.

  2. Resumable per-chunk state. Every chunk gets a stable hash based on its text + voice + seed. If the WAV for that hash already exists in the cache directory, skip it. So a crash in chapter 12 doesn't cost you chapters 1-11 again.

  3. Real-time progress against a clear denominator. "Chunk N of M, audio rendered: H:MM:SS, render time: H:MM:SS, realtime factor: X.X" is enough information to know if the run is healthy. My target is ~3.5x realtime on the Mac mini — anything under 2x means something thermal-throttled.
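
A minimal sketch of the first two points, under stated assumptions: whitespace-separated words stand in for model tokens, sentences are split with a naive regex, and `count_tokens`, `split_sentences`, and `chunk_key` are hypothetical names for illustration. The real tokenizer and the real key layout may differ.

```python
import hashlib
import re


def count_tokens(text: str) -> int:
    # Assumption: word count approximates the model's token count.
    return len(text.split())


def split_sentences(paragraph: str) -> list[str]:
    # Naive split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def chunk_paragraphs(text: str, max_tokens: int = 420) -> list[str]:
    """Pack paragraphs into chunks of at most max_tokens, falling back
    to sentence boundaries for paragraphs that are too long."""
    chunks: list[str] = []
    current: list[str] = []
    used = 0

    def flush() -> None:
        nonlocal used
        if current:
            chunks.append(" ".join(current))
            current.clear()
            used = 0

    for para in re.split(r"\n\s*\n", text):
        para = " ".join(para.split())  # collapse internal whitespace
        if not para:
            continue
        fits = count_tokens(para) <= max_tokens
        units = [para] if fits else split_sentences(para)
        for unit in units:
            n = count_tokens(unit)
            # Start a new chunk when this unit would exceed the budget.
            if current and used + n > max_tokens:
                flush()
            current.append(unit)
            used += n
    flush()
    return chunks


def chunk_key(text: str, voice: str, seed: int) -> str:
    """Stable cache key: the same text, voice and seed give the same key."""
    payload = f"{voice}|{seed}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

A single sentence longer than the budget still becomes its own oversized chunk in this sketch; a stricter version would split it further. Because the key depends only on text, voice, and seed, editing one paragraph invalidates only the chunks whose text actually changed.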

The fix

The chapter-level orchestrator is the part that turned a pile of scripts into a tool. Stripped-down version:

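import time
from hashlib import sha256
from pathlib import Path

# chunk_paragraphs, synthesize, wav_duration, fmt_secs, ffmpeg_concat and
# ffmpeg_to_mp3 are project helpers defined elsewhere.

def render_chapter(chapter_path: Path, voice: str, out_dir: Path) -> Path:
    text = chapter_path.read_text(encoding="utf-8")
    chunks = chunk_paragraphs(text, max_tokens=420)

    cache = out_dir / "cache" / chapter_path.stem
    cache.mkdir(parents=True, exist_ok=True)

    wavs = []
    started = time.monotonic()
    total_audio_s = 0.0

    for i, chunk in enumerate(chunks, start=1):
        # Stable cache key from voice + text; add the seed here too if
        # synthesize() takes one.
        h = sha256(f"{voice}|{chunk}".encode()).hexdigest()[:16]
        wav = cache / f"{i:04d}-{h}.wav"
        if not wav.exists():  # resume: skip chunks that are already rendered
            synthesize(chunk, voice=voice, out=wav)
        audio_s = wav_duration(wav)
        total_audio_s += audio_s
        wavs.append(wav)
        elapsed = time.monotonic() - started
        # Cached chunks add audio but almost no render time, so rtf reads
        # high on a resumed run.
        rtf = (total_audio_s / elapsed) if elapsed else 0
        print(
            f"  chunk {i}/{len(chunks)}: "
            f"audio={fmt_secs(total_audio_s)} "
            f"render={fmt_secs(elapsed)} "
            f"rtf={rtf:.1f}x"
        )

    # Join the chunk WAVs in order, then encode the chapter as mono MP3.
    concat_wav = out_dir / f"{chapter_path.stem}.wav"
    ffmpeg_concat(wavs, concat_wav)
    mp3 = out_dir / f"{chapter_path.stem}.mp3"
    ffmpeg_to_mp3(concat_wav, mp3, bitrate="64k", channels=1)
    return mp3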
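
The orchestrator leans on two small helpers that the stripped-down listing doesn't show. One plausible shape for them, assuming the synthesizer writes plain PCM WAV files (which the standard-library `wave` module can read) and that the progress line wants H:MM:SS; the bodies here are a guess at what `wav_duration` and `fmt_secs` do, not the originals:

```python
import wave
from pathlib import Path


def wav_duration(path: Path) -> float:
    """Duration in seconds, computed from the WAV header.

    Assumes PCM WAV; other encodings may be rejected by the wave module.
    """
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def fmt_secs(seconds: float) -> str:
    """Format a duration as H:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"
```

Reading the duration from the header avoids shelling out to ffprobe once per chunk, which adds up over a few thousand chunks.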

The book-level driver iterates chapters in order, then packages them into a single M4B with chapter markers from ffmetadata:

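import subprocess
from pathlib import Path

# render_ffmetadata and concat_list_file are project helpers; each is
# assumed to return the path of the file it writes.

def build_m4b(mp3s: list[Path], book_meta: dict, out_path: Path) -> None:
    metadata = render_ffmetadata(mp3s, book_meta)
    cmd = [
        "ffmpeg", "-y",
        # Input 0: concat-demuxer list of the chapter MP3s, in order.
        "-f", "concat", "-safe", "0", "-i", str(concat_list_file(mp3s)),
        # Input 1: ffmetadata file with book tags and chapter markers.
        "-f", "ffmetadata", "-i", str(metadata),
        "-map", "0:a",
        "-map_metadata", "1",
        "-map_chapters", "1",
        "-codec:a", "aac", "-b:a", "64k",
        "-movflags", "+faststart",
        str(out_path),
    ]
    subprocess.run(cmd, check=True)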

64 kbps mono is the right starting bitrate for narration. 32 kbps sounds tinny on most earbuds; 96 kbps and up is wasted on speech. Stereo is similarly wasted: narration is one voice in one mono channel, and encoding it as stereo at the same per-channel quality roughly doubles the file size.
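
To put numbers on the bitrate choice, here is the back-of-the-envelope size calculation for a constant-bitrate stream. The ten-hour length is an illustrative assumption, not a measurement of any particular book, and container overhead is ignored:

```python
def encoded_size_mb(duration_hours: float, bitrate_kbps: int) -> float:
    """Approximate size in decimal megabytes at a constant bitrate."""
    bits = bitrate_kbps * 1000 * duration_hours * 3600
    return bits / 8 / 1_000_000


# Compare the three bitrates discussed above for a hypothetical 10-hour book.
for kbps in (32, 64, 96):
    print(f"{kbps} kbps, 10 h: {encoded_size_mb(10, kbps):.0f} MB")
```

At 64 kbps the hypothetical ten hours come to about 288 MB, against 144 MB at 32 kbps and 432 MB at 96 kbps.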

What I'd do differently

The first version tried to be clever about parallelizing chunk synthesis across CPU cores. It didn't help much: the model is already saturating the GPU on the Mac mini, and multi-process synthesis just made the progress output unreadable. Single-threaded with good progress logging beat parallel-with-no-visibility on every dimension I cared about, including total wall time.

The other lesson, which I keep relearning across projects: any multi-hour batch job should print enough state that you can tell from across the room whether it's healthy. "rtf=3.5x" is more useful than a spinner because three weeks from now I'll remember what 3.5x means and I won't remember what spinner-state 4 means.