Self-Hosting Whisper and Kokoro on an RTX 2080 to Replace OpenAI and ElevenLabs

The Alienware in the basement does two new things tonight that it didn't do this morning. It transcribes my Telegram voice notes, and it speaks back. Both run on the RTX 2080 that has mostly been streaming desktops over Sunshine for the last month.

This started as a Codex thread earlier in the week. I'd been sending voice notes to OpenClaw (my self-hosted multi-channel agent) via Telegram and they were silently failing. The bot would just sit there. I went digging in the logs and saw HTTP 400s coming back from gpt-4o-transcribe. I asked Codex what was going on and floated the idea of routing transcription to the Alienware and running Whisper-large locally. Cheaper, more reliable, and probably more accurate on my mumbling. The thread ended with "would you do that?" and then I picked it up.

By 2 AM I had it working in both directions. Voice in, voice out, both local. The symmetry is the punchline of the post.

Part 1: Whisper, speech to text

I used Speaches, the maintained project formerly known as faster-whisper-server. It's a small FastAPI shim around faster-whisper with an OpenAI-compatible /v1/audio/transcriptions endpoint, which means anything that already talks to the OpenAI API can point at it with a one-line base URL change.
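To make the "one-line base URL change" concrete, here is a minimal sketch. The helper name transcription_url is hypothetical, and I'm assuming the usual OpenAI convention of a base URL ending in /v1; only the base differs between the hosted API and Speaches.

```python
from urllib.parse import urljoin

OPENAI_BASE = "https://api.openai.com/v1"
SPEACHES_BASE = "http://100.64.0.11:9000/v1"  # tailnet address used in this post


def transcription_url(base_url: str) -> str:
    """Return the transcription endpoint for an OpenAI-compatible base URL."""
    # urljoin replaces the last path segment unless the base ends with a slash.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, "audio/transcriptions")
```

Calling transcription_url(SPEACHES_BASE) yields the local endpoint; the request body and response shape stay the same as they would against OPENAI_BASE.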

Container, NVIDIA runtime, systemd unit, bound to the Tailscale interface on 100.64.0.11:9000 so nothing outside the tailnet can reach it. The model is Systran/faster-whisper-large-v3 in fp16, cached in a Docker volume, so only the first cold start downloads the weights and every subsequent restart takes a few seconds. Latency on a short voice note is around 2-3 seconds on the 2080, which is well under the noise floor of "you said something to your phone and now the agent is doing something about it."

The part that didn't work: a direct OpenAI-compatible HTTP integration

OpenClaw has a media-understanding subsystem with a config schema for audio models. Naively, I should be able to drop in a model entry that looks like an OpenAI transcription model, point its baseURL at http://100.64.0.11:9000, and be done.

Two things in the way.

First, the config schema for tools.media.audio.models[].request is a .strict() Zod object with an explicit allowlist of keys. Anything outside that allowlist doesn't survive parsing. There's no allowPrivateNetwork field on it, so even if I tried to opt in to a private-network request, the field would be gone before the runner ever saw it.

Second, even if I got the request through, OpenClaw has an SSRF guard. In the bundled ssrf-B5bGsnx-.js the BLOCKED_IPV4_SPECIAL_USE_RANGES set explicitly includes carrierGradeNat, which is the 100.64.0.0/10 range Tailscale uses. So the request would get refused at the fetch layer with a polite "we don't talk to private IPs."
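The range check itself is easy to reproduce. This is a sketch of the idea, not OpenClaw's code: I'm assuming the guard boils down to "is the target IP inside a blocked network", and I only model the carrier-grade NAT block here. The function name is_blocked is hypothetical.

```python
import ipaddress

# 100.64.0.0/10 is the carrier-grade NAT range (RFC 6598) that Tailscale
# assigns addresses from. A real guard would list more special-use ranges.
CARRIER_GRADE_NAT = ipaddress.ip_network("100.64.0.0/10")


def is_blocked(host_ip: str) -> bool:
    """Return True if the address falls inside the carrier-grade NAT block."""
    return ipaddress.ip_address(host_ip) in CARRIER_GRADE_NAT
```

is_blocked("100.64.0.11") is True, which is why a fetch to the Alienware's tailnet address gets refused before any bytes leave the Mac.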

You can argue with either of these in isolation. You can't really argue with both at once.

The workaround: CLI-type audio model

OpenClaw's media-understanding runner supports a type: "cli" model entry in addition to the network-y ones. It expands {{MediaPath}} and {{Language}} into a command and args, runs it, and reads the transcript off stdout. The CLI runs as the same user as the gateway, on the same box, so the SSRF guard never gets a chance to weigh in.
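Here is roughly what that placeholder expansion looks like. It's a sketch under assumptions: the function name expand_args is mine, and I'm guessing that unknown placeholders are left untouched. Expanding each argument separately, rather than building one shell string, means a media path with spaces never needs quoting.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_args(args: list[str], values: dict[str, str]) -> list[str]:
    """Replace {{Name}} placeholders in each argument.

    Unknown placeholder names are left as-is (an assumption, not
    confirmed OpenClaw behavior).
    """
    def substitute(match: re.Match) -> str:
        return values.get(match.group(1), match.group(0))

    return [PLACEHOLDER.sub(substitute, arg) for arg in args]
```

The expanded list can go straight to subprocess.run as an argument vector, with the transcript read from the process's stdout.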

I wrote a tiny bash bridge, whisper-alienware.sh, that POSTs the audio file to Speaches over Tailscale and prints the text field of the JSON response. Forty lines, mostly curl flags and a jq call. Wired it in as tools.media.audio.models[0] with type: "cli".
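The real bridge is bash plus curl and jq. For illustration, here is the same shape as a standard-library Python sketch. The function names are hypothetical, and the form fields (file, model, language) follow the OpenAI transcription API that Speaches mirrors; I haven't checked which optional fields Speaches honors.

```python
import json
import urllib.request
import uuid
from pathlib import Path

SPEACHES_URL = "http://100.64.0.11:9000/v1/audio/transcriptions"
MODEL = "Systran/faster-whisper-large-v3"


def build_multipart(fields: dict[str, str], filename: str, data: bytes, boundary: str) -> bytes:
    """Encode text fields plus one file part named 'file' as multipart/form-data."""
    parts = []
    for name, value in fields.items():
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(head.encode() + value.encode() + b"\r\n")
    file_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_head.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts)


def extract_text(response_body: bytes) -> str:
    """Pull the transcript out of the JSON response (the jq step)."""
    return json.loads(response_body)["text"]


def transcribe(path: str, language: str = "en") -> str:
    """POST one audio file to Speaches and return the transcript."""
    audio = Path(path)
    boundary = uuid.uuid4().hex
    body = build_multipart({"model": MODEL, "language": language}, audio.name, audio.read_bytes(), boundary)
    request = urllib.request.Request(
        SPEACHES_URL,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_text(response.read())
```

A CLI wrapper would just print transcribe(sys.argv[1]) so the transcript lands on stdout, where the runner expects it.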

That sidestepped the schema lock entirely. The schema is happy because every field I'm using is in its allowlist. The SSRF guard is happy because it's not involved. The actual private-network request happens inside curl, where nobody is checking.

The hot reload that lied

OpenClaw watches ~/.openclaw/openclaw.json with chokidar and hot-reloads on change. So when I saved the new tools.media block, the watcher fired and the gateway logged that it had picked up the new config.

That log line was true. The follow-on assumption I made was not.

The media-understanding subsystem caches its model list at boot. Running openclaw capability audio transcribe --file foo.opus from the shell does a fresh config load on every invocation, so my CLI test passed cleanly: voice file in, transcript out, looked great. But the running gateway process still had the old in-memory model list, which meant Telegram voice notes were still going through the original gpt-4o-transcribe path. I confirmed by sending a real voice note and watching the same HTTP 400 that started this whole thing fire again.
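The trap is easier to see in a toy model. Everything here is hypothetical (the class, the config shape, the model names as plain strings); it only demonstrates a long-lived process that reads config once next to a one-shot command that reads it every time.

```python
import json
import tempfile
from pathlib import Path


def load_models(config_path: Path) -> list[str]:
    """Read the model list from a JSON config file."""
    return json.loads(config_path.read_text())["models"]


class Gateway:
    """Long-running process: reads the model list once, at construction."""

    def __init__(self, config_path: Path):
        self.models = load_models(config_path)  # stays cached until restart


def cli_models(config_path: Path) -> list[str]:
    """One-shot CLI: loads the config fresh on every invocation."""
    return load_models(config_path)


def demo() -> tuple[list[str], list[str]]:
    """Edit the config after the gateway boots and compare both views."""
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "config.json"
        config.write_text(json.dumps({"models": ["gpt-4o-transcribe"]}))
        gateway = Gateway(config)
        config.write_text(json.dumps({"models": ["whisper-cli-bridge"]}))
        return gateway.models, cli_models(config)
```

demo() returns the old list for the gateway and the new list for the CLI, which is the exact split I was staring at.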

A green CLI test does not prove the gateway picked up the change. openclaw gateway restart was needed. After the restart, the flow is: voice note arrives at Telegram, OpenClaw routes it to the CLI bridge, bridge POSTs to Speaches over Tailscale, Speaches returns text, OpenClaw acts on it and replies. Works.

Part 2: Kokoro, text to speech

Same machine, other direction. I wanted a /speak slash command in OpenClaw that synthesizes high-fidelity speech, mostly so my agent can read trade reconciliation summaries back to me without sounding like Stephen Hawking. The built-in macOS say command is fine for "ding, build finished" but it is not fine for a paragraph of numbers and tickers.

I'd already used ElevenLabs via my sag skill and loved the Will voice's pacing on numbers, but the per-character billing model gets uncomfortable once you start automating it. I wanted local.

SOTA in this space moves fast: F5-TTS, Chatterbox, fish-speech, Kokoro, all viable. I went with Kokoro-FastAPI (the GPU variant) because Kokoro v1 is only 82M parameters and still gets rated close to ElevenLabs's mid-tier voices in blind tests on neutral text. Real-time factor is about 0.3 on the 2080, which means a 10-second clip renders in around 3 seconds. F5 and Chatterbox are more expressive but they need more glue, and tonight I wanted something that ships.
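Real-time factor is just render time divided by audio duration, so the arithmetic is one multiplication. A tiny sketch, assuming the factor stays roughly constant as clips get longer (I haven't measured that):

```python
def render_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Estimate wall-clock synthesis time from clip length and real-time factor."""
    return audio_seconds * real_time_factor
```

At 0.3, the 10-second clip comes out at about 3 seconds, and a one-minute summary would take about 18.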

systemd unit kokoro-tts.service, bound to 100.64.0.11:8880, OpenAI-compatible /v1/audio/speech endpoint. 67 voice packs loaded, default voice am_michael, output is MP3 at 24 kHz mono, 128 kbps.
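For reference, a request against that endpoint can be built like this. The JSON fields follow the OpenAI speech API; the model name "kokoro" is an assumption based on my reading of the Kokoro-FastAPI docs, and build_speech_request is a hypothetical helper.

```python
import json
import urllib.request

KOKORO_URL = "http://100.64.0.11:8880/v1/audio/speech"


def build_speech_request(text: str, voice: str = "am_michael") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style speech synthesis request."""
    payload = {
        "model": "kokoro",  # assumed model identifier
        "input": text,
        "voice": voice,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        KOKORO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen returns the MP3 bytes in the response body.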

Wiring it in as a skill, not a runtime-config consumer

The integration was cleaner this time because /speak is a skill, not a piece of runtime config. Skills load on demand when an agent decides to invoke one, so I don't have to fight the SSRF guard or the schema. I just write a SKILL.md and a script and put them in ~/.openclaw/skills/speak/.

The skill description covers four invocation modes:

/speak hello there: synthesize the inline text.
/speak with no args: ask "what should I say?" and synthesize the next message.
/speak as a reply to a message: synthesize the replied text verbatim.
/speak as a reply to an uploaded document: offer a light or heavy rewrite first, then synthesize.
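The skill itself is prose in SKILL.md that the agent interprets, not code. Still, the four modes reduce to a small decision table, sketched here with hypothetical field names and a precedence order that is my assumption (inline text wins, then document, then replied message).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Invocation:
    """What came along with the /speak command (hypothetical shape)."""
    inline_text: str = ""
    replied_text: Optional[str] = None
    replied_document: Optional[str] = None  # e.g. an uploaded file's name


def choose_mode(invocation: Invocation) -> str:
    """Map a /speak invocation to one of the four modes."""
    if invocation.inline_text.strip():
        return "inline"    # synthesize the inline text
    if invocation.replied_document:
        return "document"  # offer a light or heavy rewrite first
    if invocation.replied_text:
        return "reply"     # synthesize the replied text verbatim
    return "prompt"        # ask "what should I say?"
```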

The bash bridge speak.sh takes text via flag, stdin, or file, POSTs to Kokoro, saves the MP3 to a temp path, and prints the path. The agent then emits MEDIA:<path> in its response, and OpenClaw's Telegram adapter picks that up and uploads the file as a voice message. On local channels (CLI, desktop UI) the script also plays the file through the speakers with afplay.
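The non-network half of that bridge is small enough to sketch. The function names are hypothetical, and the flag-then-stdin-then-file precedence is an assumption; the post only says all three inputs are accepted.

```python
import tempfile
from typing import Optional


def resolve_text(flag_text: Optional[str], stdin_text: Optional[str], file_text: Optional[str]) -> str:
    """Pick the first non-empty text source (assumed order: flag, stdin, file)."""
    for candidate in (flag_text, stdin_text, file_text):
        if candidate and candidate.strip():
            return candidate.strip()
    raise ValueError("no text to speak")


def save_mp3(audio: bytes) -> str:
    """Write synthesized audio to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as handle:
        handle.write(audio)
        return handle.name


def media_line(path: str) -> str:
    """Format the marker the agent emits so the adapter uploads the file."""
    return f"MEDIA:{path}"
```

The temp file is deliberately not deleted on close, because the Telegram adapter still needs to read it after the script exits.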

A Telegram-shaped quirk

Telegram caps a bot's command list, the one that drives the autocomplete menu, at 100 entries. OpenClaw is currently bundled with 113+ skills, so /speak won't appear in the popup. Typed directly it still works because the agent reads the message, recognizes the slash, loads the skill, and acts on it. The cap is purely a UI thing. Worth knowing if you're wondering why your new skill isn't suggested.
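The effect of the cap is a simple split. In this sketch the helper name is hypothetical, and "first 100 in list order" is an assumption; I don't know how OpenClaw decides which skills get registered.

```python
TELEGRAM_COMMAND_LIMIT = 100  # cap described in this post


def split_for_menu(commands: list[str], limit: int = TELEGRAM_COMMAND_LIMIT) -> tuple[list[str], list[str]]:
    """Return (commands that fit in the menu, commands left out of it)."""
    return commands[:limit], commands[limit:]
```

With 113 skills, 13 end up outside the menu. They are still invocable by typing the command; they just never show up as suggestions.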

What I like about this shape

Cost goes to zero per request. ElevenLabs bills per character, OpenAI bills per audio minute. The Alienware is already on, the GPU is already paid for, and Tailscale is already routing.

Reliability improves. gpt-4o-transcribe throwing silent 400s is the kind of failure that's worse than a hard outage, because it looks fine from outside. Whisper-large on my own box either works or doesn't, and if it doesn't I get to look at the logs.

The threat model is also smaller. The audio never leaves the tailnet. Some of what I say to my agent is mundane ("remind me to call the dentist") and some of it isn't ("here's how to log into the trading workstation"). The first kind I don't really care about. The second kind I'd rather not pipe through a third party.

Things I'd do differently next time

I'd assume the gateway needs a restart by default. The "hot reload detected the change" log line is technically correct and operationally misleading, and I lost about 20 minutes on it.

I'd read the SSRF source before reaching for the obvious integration shape. If I'd grepped for BLOCKED_IPV4_SPECIAL_USE_RANGES first, I would have skipped straight to the CLI workaround instead of trying three variations of allowPrivateNetwork that got silently dropped by the schema.

I'd benchmark Kokoro against F5-TTS for the trade-reconciliation use case specifically. Kokoro sounds great on neutral prose. Numbers, tickers, dollar amounts, and the occasional acronym are exactly the kind of thing where a more expressive model might pull ahead. That's a problem for another evening.

The shape, in one diagram's worth of words

Mac sits on the tailnet. Alienware sits on the tailnet at 100.64.0.11. OpenClaw runs on the Mac. Two containers on the Alienware, both bound to the tailnet interface: Speaches on :9000, Kokoro on :8880. Two small bash bridges on the Mac that POST to those endpoints and print results. Two OpenClaw integration points: one CLI-type audio model entry for STT, one skill for TTS.

Voice note in from Telegram, transcript out. Text in from a slash command, voice note out to Telegram. Same box on the basement shelf doing both.
