Source: dev.to

Four free neural TTS options for CI pipelines — edge-tts, Kokoro, MeloTTS, Bark

A developer evaluated four free neural TTS options—edge-tts, Kokoro, MeloTTS, and Bark—for use in CI pipelines without a GPU. edge-tts offers broadcast-quality voices via an unofficial Microsoft endpoint, while Kokoro and MeloTTS run locally with slower CPU inference. Bark provides the most expressiveness but requires large model downloads.

4 min read · published Jun 26, 2026

Building a two-host video pipeline put me through most of the free neural TTS options that can run in GitHub Actions without a GPU. The criteria I care about: zero API cost, acceptable voice quality, headless operation in CI, and no CUDA requirement at inference time.

Here's a comparison of the four I tested or seriously evaluated.

GitHub: rany2/edge-tts | License: LGPL-3.0 (wrapper) | Voices: 400+ across 100+ languages

edge-tts is a Python wrapper around Microsoft Edge's read-aloud TTS endpoint, the same one that fires when you right-click text in Edge and select "Read aloud." It streams MP3 output. The en-US-GuyNeural and en-US-AvaNeural voices are genuinely broadcast-quality: noticeably better than older open-source models and competitive with commercial APIs.

Generation is fast because synthesis happens on the remote endpoint and the audio is streamed back: a 10-minute audio file generates in 30-60 seconds regardless of CI runner hardware.

The catch: it calls an unofficial Microsoft endpoint. Microsoft hasn't published a public contract for it and could restrict access without warning. I've been running it daily for about a month without issues, but this is a real operational risk.

pip install edge-tts
edge-tts --voice en-US-GuyNeural --text "Hello world" --write-media out.mp3
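
For a two-host script, one option is to build one edge-tts invocation per dialogue line. The sketch below is an assumption about how such a pipeline might be wired, not the code behind this article: the HOST_VOICES mapping, the build_commands and render helpers, and the seg_NNN.mp3 naming are all hypothetical. Only the --voice, --text, and --write-media flags come from the command above.

```python
import subprocess
from pathlib import Path

# Hypothetical host-to-voice mapping; the voice names are the two mentioned above.
HOST_VOICES = {"guy": "en-US-GuyNeural", "ava": "en-US-AvaNeural"}


def build_commands(script, out_dir="audio"):
    """Turn [(host, line), ...] into one edge-tts argv list per dialogue line."""
    commands = []
    for index, (host, line) in enumerate(script):
        out_path = Path(out_dir) / f"seg_{index:03d}.mp3"
        commands.append([
            "edge-tts",
            "--voice", HOST_VOICES[host],
            "--text", line,
            "--write-media", str(out_path),
        ])
    return commands


def render(script, out_dir="audio"):
    """Run each command in order; check=True fails the CI job on any error."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for argv in build_commands(script, out_dir):
        subprocess.run(argv, check=True)
```

Passing argv lists instead of a shell string avoids quoting problems when a dialogue line contains quotes or apostrophes.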

Best for: CI pipelines where voice quality matters and you can accept an external unofficial API dependency.
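
One way to soften the dependency on an unofficial endpoint is to retry a couple of times and then fall back to a local engine. This is a minimal sketch under stated assumptions: primary and fallback are hypothetical callables that take text and return audio bytes (for example, thin wrappers around edge-tts and a local model), and the retry count and delay are placeholders.

```python
import time


def synthesize_with_fallback(text, primary, fallback,
                             retries=2, delay_s=1.0, sleep=time.sleep):
    """Try `primary` up to `retries` times, then hand the text to `fallback`.

    `primary` and `fallback` are hypothetical callables: text in, audio bytes out.
    """
    for attempt in range(retries):
        try:
            return primary(text)
        except Exception:
            # Back off a little longer after each failed attempt.
            sleep(delay_s * (attempt + 1))
    return fallback(text)
```

Catching every exception is deliberately blunt here; a real pipeline would probably narrow it to network and HTTP errors so that bugs in the wrapper still surface.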

HuggingFace: hexgrad/Kokoro-82M | License: Apache 2.0 | Params: 82M

Kokoro is a small TTS model that runs entirely locally. Voice quality is good for the model size — noticeably better than older models like Tacotron2 and FastSpeech2, though below edge-tts on naturalness for longer passages.

The main tradeoff for CI: inference runs on CPU at well below real-time on a standard GitHub Actions runner. A 10-minute audio job could take significantly longer than 10 minutes to render, depending on segment count and text density. For short-form content (under 3 minutes) this is usually fine; for longer videos it's the bottleneck.
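
To make the time budget concrete: if a model needs r seconds of compute per second of audio (its real-time factor), a job with m minutes of audio takes roughly r × m minutes. The helper names below are hypothetical, and any factor you pass in should be measured on your own runner; none of the numbers used with it here are benchmarks.

```python
def estimate_render_minutes(audio_minutes, real_time_factor):
    """Compute time in minutes for a given amount of output audio.

    real_time_factor = seconds of compute per second of audio;
    values above 1.0 mean slower than real time.
    """
    if audio_minutes < 0 or real_time_factor <= 0:
        raise ValueError("audio_minutes must be >= 0 and real_time_factor > 0")
    return audio_minutes * real_time_factor


def fits_budget(audio_minutes, real_time_factor, budget_minutes):
    """True if the estimated render time stays within the job's time budget."""
    return estimate_render_minutes(audio_minutes, real_time_factor) <= budget_minutes
```

With an assumed factor of 2.5, a 3-minute clip fits a 10-minute budget while a 10-minute video does not, which matches the short-form versus long-form split described above.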

First run downloads ~320MB of model weights. If you cache these in GitHub Actions, subsequent runs skip the download.
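
Caching only helps if the cache step points at the directory the weights actually land in. The sketch below assumes the weights are fetched through the Hugging Face Hub client, which the HuggingFace listing suggests but which is worth verifying on your runner. It resolves the Hub cache directory from the documented environment variables; hf_hub_cache_dir is a hypothetical helper name.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=None):
    """Resolve the directory the Hugging Face Hub client caches downloads in."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        # Explicit override for the hub cache itself.
        return Path(env["HF_HUB_CACHE"])
    if env.get("HF_HOME"):
        # The hub cache lives in a "hub" folder under HF_HOME.
        return Path(env["HF_HOME"]) / "hub"
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "huggingface" / "hub"
```

Printing this path in a setup step and using it as the cache path keeps the workflow correct even if someone later sets HF_HOME.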

from kokoro import KPipeline
pipeline = KPipeline(lang_code="a")  # "a" = American English
# Each yielded result unpacks to (graphemes, phonemes, audio); audio is sampled at 24 kHz
_, _, audio = next(pipeline("Hello world", voice="af_heart"))

Best for: fully local inference without external API calls, and projects that need auditable, offline-capable TTS.

GitHub: myshell-ai/MeloTTS | License: MIT | Languages: English, Chinese, Japanese, Korean, French, Spanish

MeloTTS from MyShell.ai is a multilingual model with better-than-average English naturalness in my testing. The Python package is melo-tts (pip), and the API lets you set speaker ID and speed per utterance without reloading the model between clips, which is useful when you're rendering hundreds of short dialogue segments in a batch.

CPU inference speed is in the same range as Kokoro. Model download is around 500MB. The MIT license is a practical advantage if you're building a product on top of it — no Apache license compatibility questions.

from melo.api import TTS
tts = TTS(language="EN", device="cpu")
tts.tts_to_file("Hello world", tts.hps.data.spk2id["EN-Default"], "out.wav")
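
The batch case mentioned above, many short clips rendered against one loaded model, can be sketched as a loop over segments. The Segment type, render_batch, and the synth callable are hypothetical; with MeloTTS, synth could wrap tts.tts_to_file(text, spk2id[speaker], path, speed=speed), assuming the speed keyword behaves as described (check it against the version you install).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    speaker: str          # key into the model's speaker table, e.g. "EN-Default"
    speed: float = 1.0    # per-utterance speaking rate


def render_batch(segments, synth, out_pattern="seg_{:04d}.wav"):
    """Render every segment with one already-loaded model.

    `synth(text, speaker, speed, path)` is a hypothetical callable that
    writes one audio file; the model is loaded once, outside this loop.
    """
    paths = []
    for index, seg in enumerate(segments):
        path = out_pattern.format(index)
        synth(seg.text, seg.speaker, seg.speed, path)
        paths.append(path)
    return paths
```

Keeping the engine behind a callable also makes it straightforward to swap MeloTTS for another local model without touching the batching logic.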

Best for: multilingual content pipelines, or when you want MIT-licensed local TTS with solid English quality.

GitHub: suno-ai/bark | License: MIT | Size: ~1.7GB (small), ~8GB (large)

Bark is the most capable of the four for voice expressiveness. You can specify laughter ([laughs]), sighs, hesitations, and non-speech sounds inline in the prompt text. Quality on the large model is competitive with commercial TTS APIs.
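
If the same script also feeds an engine that does not understand these inline cues, the bracketed tags probably need stripping first. A minimal sketch, assuming cues are written as short square-bracketed words such as [laughs]; strip_cues and its pattern are hypothetical and not part of Bark.

```python
import re

# Matches short bracketed cues such as [laughs] or [sighs]; the pattern is an assumption.
CUE_PATTERN = re.compile(r"\[[a-zA-Z ]{1,20}\]")


def strip_cues(text):
    """Remove bracketed cues and collapse the whitespace they leave behind."""
    without_cues = CUE_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_cues).strip()
```

This keeps one source script usable across engines: Bark receives the original text, the others receive the stripped version.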

The problem for standard CI: the large model needs a GPU with substantial VRAM and takes minutes to render 30 seconds of audio on CPU. The small model fits in RAM but quality drops noticeably. GitHub Actions standard runners have no GPU, making the large model impractical and the small model a significant quality downgrade.

Best for: local GPU inference where expressive voice effects justify the hardware requirement. Not practical for standard CPU-only CI runners.

Tool         | Voice quality | CPU speed        | External API     | Practical in CI
edge-tts     | excellent     | fast (streaming) | yes (unofficial) | yes
Kokoro-82M   | good          | slow             | no               | yes (short video)
MeloTTS      | good          | slow             | no               | yes (short video)
Bark (large) | excellent     | very slow        | no               | no

For automated video pipelines on standard GitHub Actions runners, edge-tts is the practical choice if you accept the unofficial API dependency. If you need fully local inference and your videos stay under 3-4 minutes, Kokoro or MeloTTS both work within a reasonable job time budget. Bark belongs on a GPU machine, not a free CI runner.
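
The recommendation above can be written down as a small rule. This is only one reading of it: pick_engine is a hypothetical helper, the 4-minute cutoff is an assumed value taken from the "3-4 minutes" guidance, and the return strings are labels, not package names.

```python
def pick_engine(video_minutes, allow_external_api,
                has_gpu=False, want_expressive=False, local_limit_minutes=4):
    """Encode the recommendation as a simple decision rule.

    local_limit_minutes is an assumed cutoff for CPU-only local inference.
    """
    if has_gpu and want_expressive:
        return "bark"
    if allow_external_api:
        return "edge-tts"
    if video_minutes <= local_limit_minutes:
        return "kokoro-or-melotts"
    # Nothing in this comparison fits: split the job or add hardware.
    return None
```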

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.
