The Alienware in the basement does two new things tonight that it didn't do this morning. It transcribes my Telegram voice notes, and it speaks back. Both running on the RTX 2080 that's been mostly streaming desktops over Sunshine for the last month.
This started as a Codex thread earlier in the week. I'd been
sending voice notes to OpenClaw (my self-hosted multi-channel
agent) via Telegram and they were silently failing. The bot would
just sit there. I went digging in the logs and saw HTTP 400s
coming back from gpt-4o-transcribe. I asked Codex
what was going on and floated the idea of routing transcription to
the Alienware and running Whisper-large locally. Cheaper, more
reliable, and probably more accurate on my mumbling. The thread
ended with "would you do that?" and then I picked it up.
By 2 AM I had it working in both directions. Voice in, voice out, both local. The symmetry is the punchline of the post.
Part 1: Whisper, speech to text
I used Speaches, which is the maintained successor to
faster-whisper-server (same project, renamed). It's a
small FastAPI shim around faster-whisper with an
OpenAI-compatible /v1/audio/transcriptions endpoint,
which means anything that already speaks to the OpenAI API can
point at it with a one-line base URL change.
Container, NVIDIA runtime, systemd unit, bound to the Tailscale
interface on 100.64.0.11:9000 so nothing outside the
tailnet can reach it. Model is
Systran/faster-whisper-large-v3 in fp16, cached in a
Docker volume so the first cold start downloads the weights and
every subsequent restart is a few seconds. Latency on a short
voice note is around 2-3 seconds on the 2080, which is well under
the noise floor of "you said something to your phone and now the
agent is doing something about it."
The part that didn't work: a direct OpenAI-compatible HTTP integration
OpenClaw has a media-understanding subsystem with a config schema
for audio models. Naively, I should be able to drop in a model
entry that looks like an OpenAI transcription model, point its
baseURL at http://100.64.0.11:9000, and
be done.
Two things in the way.
First, the config schema for
tools.media.audio.models[].request is a
.strict() Zod object. Anything you add that isn't in
the explicit allowlist gets stripped at parse time. There's no
allowPrivateNetwork field on it, so even if I tried
to opt in to a private-network request, the schema would silently
delete the field before the runner ever saw it.
Second, even if I got the request through, OpenClaw has an SSRF
guard. In the bundled ssrf-B5bGsnx-.js the
BLOCKED_IPV4_SPECIAL_USE_RANGES set explicitly
includes carrierGradeNat, which is the
100.64.0.0/10 range Tailscale uses. So the request
would get refused at the fetch layer with a polite "we don't talk
to private IPs."
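The range check itself is easy to reproduce. A minimal sketch using Python's standard ipaddress module, to show why a Tailscale address lands in the blocked set; the function name is_cgnat is made up for illustration and this is not OpenClaw's code:

```python
import ipaddress

# 100.64.0.0/10 is the carrier-grade NAT (shared address space) range, RFC 6598.
CARRIER_GRADE_NAT = ipaddress.ip_network("100.64.0.0/10")


def is_cgnat(host: str) -> bool:
    """Return True if host is an IPv4 literal inside 100.64.0.0/10."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal, e.g. a hostname
    return addr.version == 4 and addr in CARRIER_GRADE_NAT


if __name__ == "__main__":
    for host in ("100.64.0.11", "100.127.255.255", "100.128.0.1", "8.8.8.8"):
        print(host, is_cgnat(host))
```

The /10 spans 100.64.0.0 through 100.127.255.255, so every tailnet address in that block trips a guard written this way.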
You can argue with either of these in isolation. You can't really argue with both at once.
The workaround: CLI-type audio model
OpenClaw's media-understanding runner supports a
type: "cli" model entry in addition to the network-y
ones. It expands {{MediaPath}} and
{{Language}} into a command and args, runs it, and
reads the transcript off stdout. The CLI runs as the same user as
the gateway, on the same box, so the SSRF guard never gets a
chance to weigh in.
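The placeholder expansion can be pictured roughly like this. It is a Python sketch of the idea only: the argument template and the expand_args helper are hypothetical, and OpenClaw's actual runner may handle this differently.

```python
import re

# Matches {{Name}} style placeholders.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_args(args: list[str], values: dict[str, str]) -> list[str]:
    """Replace {{Name}} placeholders in each argument; unknown names are left as-is."""

    def substitute(match: re.Match) -> str:
        return values.get(match.group(1), match.group(0))

    return [PLACEHOLDER.sub(substitute, arg) for arg in args]


if __name__ == "__main__":
    # Hypothetical argument template for a CLI-type model entry.
    template = ["--file", "{{MediaPath}}", "--language", "{{Language}}"]
    print(expand_args(template, {"MediaPath": "/tmp/note.opus", "Language": "en"}))
```

Expanding each argument separately, rather than building one shell string, keeps a media path with spaces in it as a single argument.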
I wrote a tiny bash bridge, whisper-alienware.sh,
that POSTs the audio file to Speaches over Tailscale and prints
the text field of the JSON response. Forty lines,
mostly curl flags and a jq call. Wired it in as
tools.media.audio.models[0] with
type: "cli".
That sidestepped the schema lock entirely. The schema is happy because every field I'm using is in its allowlist. The SSRF guard is happy because it's not involved. The actual private-network request happens inside curl, where nobody is checking.
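The real bridge is a bash script built on curl and jq. For illustration, here is a Python sketch of the same steps using only the standard library. It assumes the Speaches address and model named above, the OpenAI-style multipart fields (file, model, language), and a JSON response with a text field; the helper names are made up.

```python
import json
import sys
import uuid
import urllib.request

# Assumed endpoint: the Speaches address from this post plus the OpenAI-style path.
SPEACHES_URL = "http://100.64.0.11:9000/v1/audio/transcriptions"
MODEL = "Systran/faster-whisper-large-v3"


def build_multipart(fields: dict[str, str], filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode form fields plus one file part; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_text(response_body: bytes) -> str:
    """Pull the "text" field out of the JSON transcription response."""
    return json.loads(response_body)["text"].strip()


def transcribe(path: str, language: str = "en") -> str:
    """POST one audio file to the transcription endpoint and return the transcript."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": MODEL, "language": language}, path.rsplit("/", 1)[-1], audio
    )
    request = urllib.request.Request(
        SPEACHES_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_text(response.read())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: bridge.py <audio-file> [language]; the transcript goes to stdout.
    print(transcribe(sys.argv[1], *sys.argv[2:3]))
```

The only contract with the runner is the one described above: take a file path, print the transcript on stdout.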
The hot reload that lied
OpenClaw watches ~/.openclaw/openclaw.json with
chokidar and hot-reloads on change. So when I saved the new
tools.media block, the watcher fired and the gateway
logged that it had picked up the new config.
That log line was true. The follow-on assumption I made was not.
The media-understanding subsystem caches its model list at boot.
openclaw capability audio transcribe --file foo.opus
from the shell runs through a fresh config load every invocation,
so my CLI test passed cleanly: voice file in, transcript out,
looked great. But the running gateway process still had the old
in-memory model list, which meant Telegram voice notes were still
going through the original gpt-4o-transcribe path. I
confirmed by sending a real voice note and watching the same HTTP
400 fire that started this whole thing.
A green CLI test does not prove the gateway picked up the change.
openclaw gateway restart was needed. After the
restart, the flow is: voice note arrives at Telegram, OpenClaw
routes it to the CLI bridge, bridge POSTs to Speaches over
Tailscale, Speaches returns text, OpenClaw acts on it and replies.
Works.
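The failure mode is easier to see in a toy model. This Python sketch is not OpenClaw's code, and the config shape and class names are invented; it only shows how a fresh load per CLI invocation and a boot-time snapshot in a long-lived process can disagree about the same file.

```python
import json
import os
import tempfile


def load_models(config_path: str) -> list[str]:
    """Fresh read of the config file, like a one-shot CLI invocation."""
    with open(config_path) as f:
        return json.load(f)["models"]


class Gateway:
    """Long-lived process that snapshots the model list once, at boot."""

    def __init__(self, config_path: str):
        self.config_path = config_path
        self.models = load_models(config_path)  # cached here, never refreshed

    def on_config_change(self) -> str:
        # The watcher notices the change, but nothing re-reads self.models.
        return "config change detected"

    def restart(self) -> None:
        self.models = load_models(self.config_path)


def demo() -> tuple[list[str], list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        with open(path, "w") as f:
            json.dump({"models": ["gpt-4o-transcribe"]}, f)
        gateway = Gateway(path)
        with open(path, "w") as f:
            json.dump({"models": ["whisper-cli-bridge"]}, f)
        gateway.on_config_change()
        fresh = load_models(path)     # what the CLI test sees
        stale = list(gateway.models)  # what the running gateway still uses
        gateway.restart()
        return fresh, stale, list(gateway.models)


if __name__ == "__main__":
    print(demo())
```

The fresh load sees the new entry and the running instance keeps the old one until restart, which is the same split as the green CLI test and the failing Telegram voice note.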
Part 2: Kokoro, text to speech
Same machine, other direction. I wanted a
/speak slash command in OpenClaw that synthesizes
high-fidelity speech, mostly so my agent can read trade
reconciliation summaries back to me without sounding like Stephen
Hawking. macOS's built-in say is fine for "ding, build
finished" but it is not fine for a paragraph of numbers and
tickers.
I'd already used ElevenLabs via my sag skill and
loved the Will voice's pacing on numbers, but the per-character
billing model gets uncomfortable once you start automating it. I
wanted local.
SOTA in this space moves fast: F5-TTS, Chatterbox, fish-speech, Kokoro, all viable. I went with Kokoro-FastAPI (the GPU variant) because Kokoro v1 is only 82M parameters and still gets rated close to ElevenLabs's mid-tier voices in blind tests on neutral text. Real-time factor is about 0.3 on the 2080, which means a 10-second clip renders in around 3 seconds. F5 and Chatterbox are more expressive but they need more glue, and tonight I wanted something that ships.
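The real-time factor arithmetic, as a small Python sketch. The 0.3 figure is the one quoted above; the function name is made up.

```python
def render_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Estimated synthesis time: audio duration multiplied by the real-time factor."""
    return audio_seconds * real_time_factor


if __name__ == "__main__":
    for clip in (10, 60, 300):
        print(f"{clip:>4} s of audio -> about {render_seconds(clip, 0.3):.0f} s to render")
```

At that rate a one-minute summary takes roughly 18 seconds to synthesize, so anything longer than a paragraph or two is a wait rather than a conversation.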
systemd unit kokoro-tts.service, bound to
100.64.0.11:8880, OpenAI-compatible
/v1/audio/speech endpoint. 67 voice packs loaded,
default voice am_michael, output is MP3 at 24 kHz
mono, 128 kbps.
Wiring it in as a skill, not a runtime-config consumer
The integration was cleaner this time because
/speak is a skill, not a piece of runtime config.
Skills load on demand when an agent decides to invoke one, so I
don't have to fight the SSRF guard or the schema. I just write a
SKILL.md and a script and put them in
~/.openclaw/skills/speak/.
The skill description covers four invocation modes:
- /speak hello there: synthesize the inline text.
- /speak with no args: ask "what should I say?" and synthesize the next message.
- /speak as a reply to a message: synthesize the replied text verbatim.
- /speak as a reply to an uploaded document: offer a light or heavy rewrite first, then synthesize.
The bash bridge speak.sh takes text via flag, stdin,
or file, POSTs to Kokoro, saves the MP3 to a temp path, and prints
the path. The agent then emits MEDIA:<path> in
its response and OpenClaw's Telegram adapter picks that up and
uploads the file as a voice message. On local channels (CLI,
desktop UI) it also afplays it through the speakers.
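speak.sh itself is bash. A Python sketch of the same flow, using only the standard library, might look like the following. It assumes the Kokoro address and default voice above and the OpenAI-style JSON body (model, input, voice, response_format); the model name "kokoro" and the helper names are assumptions, and the file-input mode is left out.

```python
import json
import sys
import tempfile
import urllib.request

# Assumed endpoint: the Kokoro address from this post plus the OpenAI-style path.
KOKORO_URL = "http://100.64.0.11:8880/v1/audio/speech"


def build_speech_request(text: str, voice: str = "am_michael") -> urllib.request.Request:
    """Build the POST request for an OpenAI-style speech endpoint."""
    payload = {
        "model": "kokoro",  # assumed model name
        "input": text,
        "voice": voice,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        KOKORO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def speak(text: str, voice: str = "am_michael") -> str:
    """Synthesize text, write the MP3 to a temp file, and return its path."""
    with urllib.request.urlopen(build_speech_request(text, voice), timeout=120) as response:
        audio = response.read()
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as out:
        out.write(audio)
        return out.name


def read_text(argv: list[str]) -> str:
    """Use the arguments as the text, or read stdin when the only argument is "-"."""
    return sys.stdin.read() if argv == ["-"] else " ".join(argv)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: speak.py hello there   or   echo "hello" | speak.py -
    print(speak(read_text(sys.argv[1:])))
```

Printing only the path keeps the script's output trivial for the agent to wrap in a MEDIA: line.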
A Telegram-shaped quirk
Telegram caps a bot's command list, the one behind the
autocomplete popup, at 100 commands. OpenClaw is currently bundled with 113+ skills,
so /speak won't appear in the popup. Typed directly
it still works because the agent reads the message, recognizes the
slash, loads the skill, and acts on it. The cap is purely a UI
thing. Worth knowing if you're wondering why your new skill isn't
suggested.
What I like about this shape
Cost goes to zero per request. ElevenLabs bills per character, OpenAI bills per audio minute. The Alienware is already on, the GPU is already paid for, and Tailscale is already routing.
Reliability improves. gpt-4o-transcribe throwing
silent 400s is the kind of failure that's worse than a hard
outage, because it looks fine from outside. Whisper-large on my
own box either works or doesn't, and if it doesn't I get to look
at the logs.
The threat model is also smaller. The audio never leaves the tailnet. Some of what I say to my agent is mundane ("remind me to call the dentist") and some of it isn't ("here's how to log into the trading workstation"). The first kind I don't really care about. The second kind I'd rather not pipe through a third party.
Things I'd do differently next time
I'd assume the gateway needs a restart by default. The "hot reload detected the change" log line is technically correct and operationally misleading, and I lost about 20 minutes on it.
I'd read the SSRF source before reaching for the obvious
integration shape. If I'd grepped
BLOCKED_IPV4_SPECIAL_USE_RANGES first I would have
skipped straight to the CLI workaround instead of trying three
variations of allowPrivateNetwork that got silently
dropped by the schema.
I'd benchmark Kokoro against F5-TTS for the trade-reconciliation use case specifically. Kokoro sounds great on neutral prose. Numbers, tickers, dollar amounts, and the occasional acronym are exactly the kind of thing where a more expressive model might pull ahead. That's a problem for another evening.
The shape, in one diagram's worth of words
Mac sits on the tailnet. Alienware sits on the tailnet at
100.64.0.11. OpenClaw runs on the Mac. Two containers
on the Alienware, both bound to the tailnet interface: Speaches on
:9000, Kokoro on :8880. Two small bash
bridges on the Mac that POST to those endpoints and print results.
Two OpenClaw config entries: one CLI-type audio model for STT, one
skill for TTS.
Voice note in from Telegram, transcript out. Text in from a slash command, voice note out to Telegram. Same box on the basement shelf doing both.