Chivox MCP — Speech assessment for AI language agents

Chivox released a Model Context Protocol server that turns any LLM into a speech examiner, providing structured scoring for pronunciation, fluency, and rhythm. The tool supports both English and Mandarin, offering granular feedback at syllable, word, and phoneme levels. It is designed for AI language tutors, test-prep agents, and corporate training tools.

If you’re building an AI language tutor, you’ve already solved the easy part: listening. Speech-to-text APIs like Whisper or OpenAI Realtime give you what the user said. The harder problem, and the one that actually matters for learning outcomes, is how well they said it. That gap between transcription and assessment is exactly where Chivox MCP lives.

What it is

Chivox MCP is a Model Context Protocol server that turns any LLM into a linguistics-grade speech examiner. Connect your agent to our streamable HTTP endpoint, pass it an audio clip, and you get back a fully structured, typed score matrix that goes far beyond “correct” or “incorrect.”

You receive:

Holistic dimensions: overall, accuracy, fluency, and rhythm scores.
Granular drill-downs: syllable-, word-, and phoneme-level feedback.
Tonal precision: native support for tonal Mandarin, scoring pronunciation, pinyin, and character enunciation with the same rigor as English.

One schema, sixteen tools, zero friction

The public catalog ships 16 production-ready tools today:

10 English tasks, covering word, sentence, paragraph, phonics, multiple-choice, semi-open, correction, and realtime evaluation.
6 Mandarin tasks: character, pinyin, sentence, paragraph, constrained recognition, and AI-Talk.

Every tool returns the same top-level schema. That means switching a user from an English paragraph exercise to a Mandarin tone drill, or from a word-level prompt to a realtime stream, requires zero schema migration work on your end. If you’ve already wired your agent to MCP, you’ve already integrated Chivox.

This is not a transcription layer

Let’s be clear: Chivox MCP is not another STT wrapper around Whisper. Transcription tells you “the user said ‘beach’”. Chivox tells you “the user replaced /iː/ with /ɪ/ and dropped the final affricate”, and wraps that finding in a typed JSON document your agent can act on immediately.
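To make "act on immediately" concrete, here is a minimal Python sketch of how an agent might turn such a finding into feedback. The result shape is an assumption for illustration only: the field names (`words`, `phonemes`, `expected`, `heard`, `score`), the 0 to 100 scale, and the threshold of 60 are hypothetical and not taken from the Chivox schema, so check the actual tool output before relying on any of them.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical result document. Field names and the 0-100 scale are
# assumptions for illustration, not the actual Chivox schema.
SAMPLE_RESULT = {
    "overall": 72,
    "accuracy": 68,
    "fluency": 80,
    "rhythm": 75,
    "words": [
        {
            "text": "beach",
            "score": 55,
            "phonemes": [
                {"expected": "b", "heard": "b", "score": 95},
                {"expected": "iː", "heard": "ɪ", "score": 40},
                {"expected": "tʃ", "heard": None, "score": 0},
            ],
        }
    ],
}


@dataclass(frozen=True)
class PhonemeIssue:
    word: str
    expected: str
    heard: Optional[str]  # None means the phoneme was dropped
    score: int


def find_phoneme_issues(result: dict, threshold: int = 60) -> List[PhonemeIssue]:
    """Collect every phoneme whose score falls below the threshold."""
    issues = []
    for word in result.get("words", []):
        for ph in word.get("phonemes", []):
            if ph["score"] < threshold:
                issues.append(
                    PhonemeIssue(word["text"], ph["expected"], ph.get("heard"), ph["score"])
                )
    return issues


def describe(issue: PhonemeIssue) -> str:
    """Turn one issue into a sentence a tutoring agent could build a drill from."""
    if issue.heard is None:
        return f"In '{issue.word}', the sound /{issue.expected}/ was dropped."
    return f"In '{issue.word}', /{issue.expected}/ was pronounced as /{issue.heard}/."


for issue in find_phoneme_issues(SAMPLE_RESULT):
    print(describe(issue))
```

With the sample document above, the loop reports the vowel substitution and the dropped affricate as two separate sentences, which is the kind of input a tutoring agent needs to generate a targeted drill rather than a generic "try again".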
Our engine is a dedicated pronunciation-scoring model trained on exam-grade reference audio, and it has served as the backbone of national-scale English examinations in China for over a decade. The MCP layer is simply the modern, standardized interface to that battle-tested infrastructure.

Built for the agents you’re already shipping

If you have shipped an integration with OpenAI Realtime or a Whisper pipeline, you know exactly where transcription ends and pedagogy begins. Chivox MCP is the assessment layer you slot on top:

Tutoring agents that generate targeted phoneme drills instead of generic “try again” feedback.
Test-prep agents that need rubric-aligned, defensible scoring.
Language-learning apps that must evaluate spoken Mandarin tones with linguistic fidelity.
Corporate training tools that assess fluency and rhythm at scale.

Because it speaks MCP, it drops into your existing tool-calling architecture without custom SDKs or fragile REST adapters.

Get started

Chivox MCP is available now through the CHIVOX API Portal. Grab an API key, point your agent at the streamable HTTP endpoint, and start returning exam-room precision in your next conversation turn.

Don’t just transcribe speech. Examine it.

Visit the CHIVOX website: https://api-portal.cloud.chivox.com/global
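For readers wiring this up by hand, the sketch below shows the general shape of an MCP tool invocation: a JSON-RPC 2.0 `tools/call` message, as defined by the Model Context Protocol. Everything Chivox-specific in it is an assumption: the tool names (`english_sentence_eval`, `mandarin_sentence_eval`) and the argument names (`ref_text`, `audio_base64`) are hypothetical placeholders, and authentication is not shown. The initialization handshake that precedes any tool call is also omitted; an MCP client SDK would normally handle it.

```python
import base64
import json
from typing import Any, Dict

# Headers an MCP client sends when POSTing to a Streamable HTTP endpoint.
# Authentication headers are deliberately omitted; see the API portal.
MCP_HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}


def build_tool_call(request_id: int, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def encode_audio(audio_bytes: bytes) -> str:
    """Base64-encode raw audio so it can travel inside a JSON payload."""
    return base64.b64encode(audio_bytes).decode("ascii")


# Placeholder bytes stand in for a real recording.
audio = encode_audio(b"\x00\x01")

# Hypothetical tool and argument names, for illustration only.
english_request = build_tool_call(
    1, "english_sentence_eval",
    {"ref_text": "I walked along the beach.", "audio_base64": audio},
)
mandarin_request = build_tool_call(
    2, "mandarin_sentence_eval",
    {"ref_text": "你好", "audio_base64": audio},
)

body = json.dumps(english_request, ensure_ascii=False)
print(body)
```

In this sketch, switching from the English exercise to the Mandarin one changes only the tool name and arguments. The message envelope stays the same, which mirrors the shared-schema claim on the response side.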