Four free neural TTS options for CI pipelines — edge-tts, Kokoro, MeloTTS, Bark

A developer evaluated four free neural TTS options—edge-tts, Kokoro, MeloTTS, and Bark—for use in CI pipelines without a GPU. edge-tts offers broadcast-quality voices via an unofficial Microsoft endpoint, while Kokoro and MeloTTS run locally with slower CPU inference. Bark provides the most expressiveness but requires large model downloads.

Building a [two-host video pipeline](https://dev.to/articles/two-host-video-pipeline-edge-tts-pillow-ffmpeg) put me through most of the free neural TTS options that can run in GitHub Actions without a GPU. The criteria I care about: zero API cost, acceptable voice quality, runs headless in CI, and doesn't require CUDA at inference time. Here's a comparison of the four I tested or seriously evaluated.

GitHub: [rany2/edge-tts](https://github.com/rany2/edge-tts) | License: MIT (wrapper) | Voices: 400+ across 100+ languages

edge-tts is a Python wrapper around Microsoft Edge's read-aloud TTS endpoint, the same one that fires when you right-click text in Edge and select "Read aloud." It streams MP3 output. The `en-US-GuyNeural` and `en-US-AvaNeural` voices are genuinely broadcast-quality: noticeably better than older open-source models and competitive with commercial APIs.

It is fast because synthesis happens on the remote endpoint and the audio is streamed back: a 10-minute audio file generates in 30-60 seconds regardless of CI runner hardware.

The catch: it calls an unofficial Microsoft endpoint. Microsoft hasn't published a public contract for it and could restrict access without warning. I've been running it daily for about a month without issues, but this is a real operational risk.

```
pip install edge-tts
edge-tts --voice en-US-GuyNeural --text "Hello world" --write-media out.mp3
```

Best for: CI pipelines where voice quality matters and you can accept an external unofficial API dependency.

HuggingFace: [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) | License: Apache 2.0 | Params: 82M

Kokoro is a small TTS model that runs entirely locally. Voice quality is good for the model size: noticeably better than older models like Tacotron2 and FastSpeech2, though below edge-tts on naturalness for longer passages.

The main tradeoff for CI: inference runs on CPU at well below real-time speed on a standard GitHub Actions runner.
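
To put a number on "below real-time", one option is to measure a real-time factor (RTF, compute seconds per second of audio) on the runner and multiply it out. The sketch below is a rough illustration only: the function names are hypothetical, `synthesize` stands in for whatever wrapper you put around your TTS call, and the RTF of 2.0 is a placeholder, not a measurement of Kokoro or any other model.

```python
import time
from typing import Callable


def measure_rtf(synthesize: Callable[[str], float], text: str) -> float:
    """Return the real-time factor: compute seconds per second of audio.

    `synthesize` is a hypothetical callable that renders `text` and returns
    the duration of the produced audio in seconds.
    """
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    if audio_seconds <= 0:
        raise ValueError("synthesize() must return a positive audio duration")
    return elapsed / audio_seconds


def estimated_render_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock render time; rtf > 1 means slower than real time."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def fits_budget(audio_seconds: float, rtf: float, budget_seconds: float) -> bool:
    """True if the estimated render time fits inside the CI time budget."""
    return estimated_render_seconds(audio_seconds, rtf) <= budget_seconds


# Placeholder numbers, not measurements: 10 minutes of audio at an RTF of 2.0.
print(estimated_render_seconds(600, 2.0))  # 1200.0 seconds, i.e. 20 minutes
```

Measuring the RTF once on a short clip and feeding it to `fits_budget` gives a quick check on whether a given video length is realistic for a CPU-only job before committing to a local model.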
A 10-minute audio job could take significantly longer than 10 minutes to render, depending on segment count and text density. For short-form content under 3 minutes this is usually fine; for longer videos it's the bottleneck.

First run downloads ~320MB of model weights. If you cache these in GitHub Actions, subsequent runs skip the download.

```python
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" = American English
# Per the Kokoro README, the pipeline yields (graphemes, phonemes, audio)
# for each segment; the audio is sampled at 24 kHz.
_, _, audio = next(pipeline("Hello world", voice="af_heart"))
```

Best for: fully local inference without external API calls, and projects that need auditable, offline-capable TTS.

GitHub: [myshell-ai/MeloTTS](https://github.com/myshell-ai/MeloTTS) | License: MIT | Languages: English, Chinese, Japanese, Korean, French, Spanish

MeloTTS from MyShell.ai is a multilingual model with better-than-average English naturalness in my testing. The Python package is `melo-tts` (pip), and the API lets you set speaker ID and speed per utterance without reloading the model between clips, which is useful when you're rendering hundreds of short dialogue segments in a batch.

CPU inference speed is in the same range as Kokoro. Model download is around 500MB. The MIT license is a practical advantage if you're building a product on top of it: no Apache license compatibility questions.

```python
from melo.api import TTS

tts = TTS(language="EN", device="cpu")
# Look up the speaker ID by name, then write the rendered clip to disk.
tts.tts_to_file("Hello world", tts.hps.data.spk2id["EN-Default"], "out.wav")
```

Best for: multilingual content pipelines, or when you want MIT-licensed local TTS with solid English quality.

GitHub: [suno-ai/bark](https://github.com/suno-ai/bark) | License: MIT | Size: ~1.7GB (small), ~8GB (large)

Bark is the most capable of the four for voice expressiveness. You can specify laughter (`[laughs]`), sighs, hesitations, and non-speech sounds inline in the prompt text. Quality on the large model is competitive with commercial TTS APIs.
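
As a rough illustration of inline cues, here is a minimal sketch built on the usage shown in Bark's README (`preload_models`, `generate_audio`, `SAMPLE_RATE`). It assumes the `bark` and `scipy` packages are installed, and that setting `SUNO_USE_SMALL_MODELS` selects the smaller checkpoints. `build_prompt` and `render` are hypothetical helper names, not part of Bark.

```python
import os

# Assumption: opt into Bark's smaller checkpoints before bark is imported.
os.environ.setdefault("SUNO_USE_SMALL_MODELS", "True")


def build_prompt(line: str, cue: str = "") -> str:
    """Append an inline non-speech cue such as "[laughs]" to a line of dialogue."""
    return f"{line} {cue}".strip()


def render(prompt: str, out_path: str) -> None:
    """Render one prompt to a WAV file.

    The heavy imports are local so build_prompt() stays usable on machines
    where bark is not installed.
    """
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads the model weights on first use
    audio = generate_audio(prompt)
    write_wav(out_path, SAMPLE_RATE, audio)
```

A call would look like `render(build_prompt("Well, that build took a while.", "[laughs]"), "bark_out.wav")`. Expect it to be slow on a machine without a GPU.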
The problem for standard CI: the large model needs a GPU with substantial VRAM and takes minutes to render 30 seconds of audio on CPU. The small model fits in RAM but quality drops noticeably. GitHub Actions standard runners have no GPU, making the large model impractical and the small model a significant quality downgrade.

Best for: local GPU inference where expressive voice effects justify the hardware requirement. Not practical for standard CPU-only CI runners.

| Tool | Voice quality | CPU speed | External API | CI practical |
|---|---|---|---|---|
| edge-tts | excellent | fast (streaming) | yes (unofficial) | yes |
| Kokoro-82M | good | slow | no | yes (short video) |
| MeloTTS | good | slow | no | yes (short video) |
| Bark (large) | excellent | very slow | no | no |

For automated video pipelines on standard GitHub Actions runners, edge-tts is the practical choice if you accept the unofficial API dependency. If you need fully local inference and your videos stay under 3-4 minutes, Kokoro or MeloTTS both work within a reasonable job time budget. Bark belongs on a GPU machine, not a free CI runner.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.