Building a two-host video pipeline put me through most of the free neural TTS options that can run in GitHub Actions without a GPU. The criteria I care about: zero API cost, acceptable voice quality, runs headless in CI, and doesn't require CUDA at inference time.
Here's a comparison of the four I tested or seriously evaluated.
GitHub: rany2/edge-tts | License: MIT (wrapper) | Voices: 400+ across 100+ languages
edge-tts is a Python wrapper around Microsoft Edge's read-aloud TTS endpoint, the same one that fires when you right-click text in Edge and select "Read aloud." It streams MP3 output. Quality on the `en-US-GuyNeural` and `en-US-AvaNeural` voices is genuinely broadcast-quality; it's noticeably better than older open-source models and competitive with commercial APIs.
Rendering is fast because synthesis happens on the remote endpoint and the audio is streamed back: a 10-minute audio file generates in 30-60 seconds (roughly 10-20x real time) regardless of CI runner hardware.
The catch: it calls an unofficial Microsoft endpoint. Microsoft hasn't published a public contract for it and could restrict access without warning. I've been running it daily for about a month without issues, but this is a real operational risk.
```shell
pip install edge-tts
edge-tts --voice en-US-GuyNeural --text "Hello world" --write-media out.mp3
```
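Because the endpoint is unofficial, it may be worth wrapping the call so a blocked or flaky request doesn't fail the whole CI job. The sketch below is one way to do that, not something edge-tts ships: `synthesize_with_fallback` and `edge_tts_engine` are hypothetical helper names, and the retry counts and delays are arbitrary. It retries each engine with exponential backoff and moves on to the next engine (for example a local model) if the first keeps failing. The only edge-tts calls used are `edge_tts.Communicate(text, voice)` and its async `save()` method.

```python
import asyncio
import time
from typing import Callable, Sequence


def synthesize_with_fallback(
    text: str,
    out_path: str,
    engines: Sequence[Callable[[str, str], None]],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try each engine in order, retrying with exponential backoff.

    Returns the index of the engine that succeeded and raises RuntimeError
    if every engine fails on every attempt.
    """
    last_error = None
    for index, engine in enumerate(engines):
        for attempt in range(attempts):
            try:
                engine(text, out_path)
                return index
            except Exception as exc:  # network errors, HTTP 403, and so on
                last_error = exc
                # Back off before retrying the same engine; no sleep after
                # the final attempt, we just move on to the next engine.
                if attempt < attempts - 1:
                    sleep(base_delay * 2**attempt)
    raise RuntimeError("all TTS engines failed") from last_error


def edge_tts_engine(text: str, out_path: str, voice: str = "en-US-GuyNeural") -> None:
    """Render text to an MP3 file through the edge-tts Python API."""
    import edge_tts  # imported lazily so the helper above has no hard dependency

    asyncio.run(edge_tts.Communicate(text, voice).save(out_path))
```

In a pipeline this would be called as `synthesize_with_fallback(text, "out.mp3", [edge_tts_engine, local_engine])`, where `local_engine` is whatever local renderer you keep as a backup.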
Best for: CI pipelines where voice quality matters and you can accept an external unofficial API dependency.
HuggingFace: hexgrad/Kokoro-82M | License: Apache 2.0 | Params: 82M
Kokoro is a small TTS model that runs entirely locally. Voice quality is good for the model size — noticeably better than older models like Tacotron2 and FastSpeech2, though below edge-tts on naturalness for longer passages.
The main tradeoff for CI: inference runs on CPU at well below real-time on a standard GitHub Actions runner. A 10-minute audio job could take significantly longer than 10 minutes to render, depending on segment count and text density. For short-form content (under 3 minutes) this is usually fine; for longer videos it's the bottleneck.
First run downloads ~320MB of model weights. If you cache these in GitHub Actions, subsequent runs skip the download.
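To cache the weights you need to know which directory to hand to the cache step. The sketch below assumes Kokoro fetches its weights through `huggingface_hub` and therefore lands in the standard Hugging Face cache layout (`<HF_HOME>/hub/models--<org>--<name>`); verify that against your installed version. `hf_model_cache_dir` and `weights_are_cached` are hypothetical helper names, and the lookup is simplified (it ignores `XDG_CACHE_HOME`).

```python
import os
from pathlib import Path


def hf_model_cache_dir(repo_id: str, env=None) -> Path:
    """Return the directory where the Hugging Face hub cache stores a model."""
    env = os.environ if env is None else env
    if "HF_HUB_CACHE" in env:
        hub = Path(env["HF_HUB_CACHE"])
    else:
        # Default location when HF_HOME is unset: ~/.cache/huggingface
        hf_home = env.get("HF_HOME") or Path.home() / ".cache" / "huggingface"
        hub = Path(hf_home) / "hub"
    return hub / ("models--" + repo_id.replace("/", "--"))


def weights_are_cached(repo_id: str, env=None) -> bool:
    """True if at least one downloaded snapshot of the model exists."""
    snapshots = hf_model_cache_dir(repo_id, env) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    print(hf_model_cache_dir("hexgrad/Kokoro-82M"))
```

Printing the path once in a CI step shows which directory to cache; `weights_are_cached` is a cheap way to log whether a run hit the cache or re-downloaded.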
```python
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" = American English
# The pipeline yields (graphemes, phonemes, audio) per segment; audio is 24 kHz.
_, _, audio = next(pipeline("Hello world", voice="af_heart"))
```
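Since "well below real-time" depends heavily on the runner, it helps to measure it once and project the job time from that. The sketch below computes a real-time factor (render time divided by audio duration, so anything above 1 is slower than real time). `measure_rtf` and `projected_render_minutes` are hypothetical helpers, and `render` is any callable you supply that synthesizes a representative clip and returns its sample count.

```python
import time
from typing import Callable


def measure_rtf(
    render: Callable[[], int],
    sample_rate: int,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return render time / audio duration for one sample clip.

    `render` synthesizes a clip and returns the number of audio samples
    it produced; `sample_rate` is the model's output rate in Hz.
    """
    start = clock()
    n_samples = render()
    elapsed = clock() - start
    return elapsed / (n_samples / sample_rate)


def projected_render_minutes(audio_minutes: float, rtf: float) -> float:
    """Estimate wall-clock render time for a target amount of audio."""
    return audio_minutes * rtf
```

For example, a measured factor of 1.5 projects a 10-minute video to about 15 minutes of rendering. The projection is linear, so treat it as a rough budget check only.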
Best for: fully local inference without external API calls, and projects that need auditable, offline-capable TTS.
GitHub: myshell-ai/MeloTTS | License: MIT | Languages: English, Chinese, Japanese, Korean, French, Spanish
MeloTTS from MyShell.ai is a multilingual model with better-than-average English naturalness in my testing. The Python package is `melo-tts` (pip), and the API lets you set speaker ID and speed per utterance without reloading the model between clips, which is useful when you're rendering hundreds of short dialogue segments in a batch.
CPU inference speed is in the same range as Kokoro. Model download is around 500MB. The MIT license is a practical advantage if you're building a product on top of it — no Apache license compatibility questions.
```python
from melo.api import TTS

tts = TTS(language="EN", device="cpu")
tts.tts_to_file("Hello world", tts.hps.data.spk2id["EN-Default"], "out.wav")
```
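The batch case mentioned above can be sketched as a loop that reuses one loaded model and varies speaker and speed per segment. `Segment` and `render_batch` are hypothetical names for illustration; the only MeloTTS calls assumed are `tts.hps.data.spk2id` and `tts.tts_to_file(text, speaker_id, path, speed=...)`, so check the keyword against your installed version.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Segment:
    text: str
    speaker: str = "EN-Default"  # key into tts.hps.data.spk2id
    speed: float = 1.0


def render_batch(tts, segments, out_dir):
    """Render every segment to its own WAV file with a single loaded model."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    speaker_ids = tts.hps.data.spk2id
    paths = []
    for i, seg in enumerate(segments):
        # Zero-padded names keep the clips in order for later concatenation.
        path = out_dir / f"seg_{i:04d}.wav"
        tts.tts_to_file(seg.text, speaker_ids[seg.speaker], str(path), speed=seg.speed)
        paths.append(path)
    return paths
```

With the `tts` object from the snippet above, `render_batch(tts, [Segment("Hi there."), Segment("Slow down.", speed=0.9)], "clips")` would write `clips/seg_0000.wav` and `clips/seg_0001.wav`.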
Best for: multilingual content pipelines, or when you want MIT-licensed local TTS with solid English quality.
GitHub: suno-ai/bark | License: MIT | Size: ~1.7GB (small), ~8GB (large)
Bark is the most capable of the four for voice expressiveness. You can specify laughter (`[laughs]`), sighs, hesitations, and non-speech sounds inline in the prompt text. Quality on the large model is competitive with commercial TTS APIs.
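For reference, this is roughly what an inline-cue prompt looks like in practice. The sketch follows the usage shown in the Bark README (`preload_models`, `generate_audio`, `SAMPLE_RATE`, and the `SUNO_USE_SMALL_MODELS` environment variable); `render_with_bark` and the example prompt are my own illustrative names, and it assumes `scipy` is installed for writing the WAV file.

```python
import os

# Inline cues go straight into the text; "..." reads as a hesitation.
PROMPT = "Well, that build finally passed. [laughs] Only took eleven tries... [sighs]"


def render_with_bark(prompt: str, out_path: str, use_small: bool = True) -> None:
    """Render one short prompt to a WAV file with Bark."""
    if use_small:
        # Must be set before bark is imported, because it is read at import time.
        os.environ["SUNO_USE_SMALL_MODELS"] = "True"
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads the weights on first run
    audio = generate_audio(prompt)
    write_wav(out_path, SAMPLE_RATE, audio)
```

Calling `render_with_bark(PROMPT, "bark_out.wav")` triggers the model download and a slow CPU render.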
The problem for standard CI: the large model needs a GPU with substantial VRAM and takes minutes to render 30 seconds of audio on CPU. The small model fits in RAM but quality drops noticeably. GitHub Actions standard runners have no GPU, making the large model impractical and the small model a significant quality downgrade.
Best for: local GPU inference where expressive voice effects justify the hardware requirement. Not practical for standard CPU-only CI runners.
| Tool | Voice quality | CPU speed | External API | CI practical |
|---|---|---|---|---|
| edge-tts | excellent | fast (streaming) | yes (unofficial) | yes |
| Kokoro-82M | good | slow | no | yes (short video) |
| MeloTTS | good | slow | no | yes (short video) |
| Bark (large) | excellent | very slow | no | no |
For automated video pipelines on standard GitHub Actions runners, edge-tts is the practical choice if you accept the unofficial API dependency. If you need fully local inference and your videos stay under 3-4 minutes, Kokoro or MeloTTS both work within a reasonable job time budget. Bark belongs on a GPU machine, not a free CI runner.
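The same recommendation, written as a small decision function. This is only a restatement of the paragraph above: `pick_tts` is a hypothetical helper, and the 4-minute cutoff is the rough threshold from my own runs, not a hard limit.

```python
def pick_tts(
    allow_external_api: bool,
    video_minutes: float,
    has_gpu: bool = False,
    need_expressive: bool = False,
    local_limit_minutes: float = 4.0,
) -> str:
    """Map pipeline constraints to the tool suggested in this comparison."""
    if has_gpu and need_expressive:
        return "bark"
    if allow_external_api:
        return "edge-tts"
    if video_minutes <= local_limit_minutes:
        return "kokoro or melotts"
    # Fully local on CPU with a long video: still possible, but slow.
    return "kokoro or melotts (render time will be the bottleneck)"
```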
Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.