# ESP32 Into a Speech-to-Text Device

> Source: <https://dev.to/david_thomas/esp32-into-a-speech-to-text-device-c3m>
> Published: 2026-05-22 10:22:57+00:00

Typing commands into a serial monitor feels old once you start playing with voice interfaces.
So I decided to try something more interesting: building a small ESP32 speech-to-text system using an INMP441 I2S microphone and an OLED display. The setup listens to speech, sends the audio to a cloud API, and converts spoken words into text almost instantly.
And honestly, seeing your own words appear live on a tiny OLED screen feels surprisingly futuristic for such a small project.
At first, I thought about running everything directly on the ESP32.
Then reality hit.
Speech recognition models are heavy. The ESP32 simply doesn’t have enough processing power or memory to run large speech-to-text models locally in a reliable way. Instead of fighting hardware limitations for days, I used a cloud-based speech recognition service called Wit.ai.
The ESP32 only handles recording the audio, sending it over WiFi, and displaying the returned text.
The cloud handles the difficult AI processing.
Way simpler.
The workflow is actually pretty clean.
The INMP441 microphone captures audio using the I2S protocol. The ESP32 records the audio as 16-bit PCM data and sends it over HTTPS to Wit.ai using WiFi.
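Here is a rough sketch of the "record as 16-bit PCM" step, written as plain C++ so it can be tried off-device. It assumes each raw I2S word is a 32-bit slot with the microphone's 24-bit sample in the upper bits, which is how the INMP441 is usually read. The helper names, the 16 kHz sample rate, and the 3 second clip length are illustrative assumptions, not values taken from this project.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: the 24-bit sample sits left-justified in a 32-bit I2S slot.
// Keeping the top 16 bits gives a signed 16-bit PCM sample.
// (Right-shifting a negative value is arithmetic on ESP32 toolchains and
// is guaranteed to be arithmetic from C++20 onward.)
inline int16_t i2sWordToPcm16(int32_t raw) {
    return static_cast<int16_t>(raw >> 16);
}

// Convert a whole block of raw I2S words into 16-bit PCM samples.
std::vector<int16_t> convertBlock(const std::vector<int32_t>& raw) {
    std::vector<int16_t> pcm;
    pcm.reserve(raw.size());
    for (int32_t word : raw) {
        pcm.push_back(i2sWordToPcm16(word));
    }
    return pcm;
}

// Bytes needed to hold a mono 16-bit recording of the given length.
constexpr std::size_t pcmBytes(uint32_t sampleRateHz, uint32_t seconds) {
    return static_cast<std::size_t>(sampleRateHz) * seconds * sizeof(int16_t);
}
```

With those assumed numbers, `pcmBytes(16000, 3)` comes to 96,000 bytes, which is why clip length matters on a chip with a few hundred kilobytes of RAM. Many sketches shift by fewer than 16 bits to add gain; that is a tuning choice.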
Once processed, Wit.ai sends back the recognized text in JSON format.
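To give a feel for the "extract the text" step, here is a minimal sketch that pulls a string field out of a JSON reply without any library. It assumes the recognized words arrive in a field named `"text"`; the field name and the `extractTextField` helper are assumptions for illustration, so check them against a real response. On the device itself, a proper JSON parser such as ArduinoJson would be the sturdier choice.

```cpp
#include <cstddef>
#include <string>

// Skip spaces, tabs and newlines starting at pos.
static std::size_t skipWhitespace(const std::string& s, std::size_t pos) {
    while (pos < s.size() &&
           (s[pos] == ' ' || s[pos] == '\t' || s[pos] == '\n' || s[pos] == '\r')) {
        ++pos;
    }
    return pos;
}

// Copies the value of the first "text" key into `out` and returns true.
// Returns false (leaving `out` untouched) if the key or a string value is missing.
// Limitations: matches the first occurrence of "text" anywhere in the input
// and does not decode \uXXXX escapes.
bool extractTextField(const std::string& json, std::string& out) {
    const std::string key = "\"text\"";
    std::size_t pos = json.find(key);
    if (pos == std::string::npos) return false;

    pos = skipWhitespace(json, pos + key.size());
    if (pos >= json.size() || json[pos] != ':') return false;
    pos = skipWhitespace(json, pos + 1);
    if (pos >= json.size() || json[pos] != '"') return false;
    ++pos;  // step past the opening quote

    std::string value;
    while (pos < json.size()) {
        char c = json[pos++];
        if (c == '"') {  // closing quote: done
            out = value;
            return true;
        }
        if (c == '\\') {  // simple escape handling
            if (pos >= json.size()) return false;
            char e = json[pos++];
            if (e == 'n') value += '\n';
            else if (e == 't') value += '\t';
            else if (e == 'r') value += '\r';
            else value += e;  // covers \" \\ and \/
        } else {
            value += c;
        }
    }
    return false;  // string was never closed
}
```

The extracted string is what would then be handed to the display code.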
The ESP32 extracts the text and displays it on the OLED screen.
So the whole system behaves almost like a tiny voice assistant.
Press button → speak → get text.
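That press, speak, get-text loop maps nicely onto a small state machine. The sketch below is one hypothetical way to model it; the state and event names are made up for illustration and are not taken from the actual firmware.

```cpp
// Hypothetical device states for the press -> speak -> text flow.
enum class State { Idle, Recording, Uploading, ShowingText };

// Hypothetical events that move the device between states.
enum class Event { ButtonPressed, RecordingDone, TextReceived, RequestFailed };

// Returns the next state; events that do not apply leave the state unchanged.
State nextState(State s, Event e) {
    switch (s) {
        case State::Idle:
            if (e == Event::ButtonPressed) return State::Recording;
            break;
        case State::Recording:
            if (e == Event::RecordingDone) return State::Uploading;
            break;
        case State::Uploading:
            if (e == Event::TextReceived) return State::ShowingText;
            if (e == Event::RequestFailed) return State::Idle;
            break;
        case State::ShowingText:
            if (e == Event::ButtonPressed) return State::Recording;
            break;
    }
    return s;
}
```

Keeping the transitions in one pure function makes it easy to test the flow on a laptop before any hardware is wired up.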
The hardware setup is very small: an ESP32 board, an INMP441 I2S microphone, an OLED display, and a push button.
That’s it.
No extra audio shield.
No Raspberry Pi.
No expensive AI hardware.
I honestly expected cloud AI setup to be painful.
But the process was surprisingly simple:
Done.
The ESP32 sends raw audio directly to:
api.wit.ai
using HTTPS requests.
No custom server setup required.
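For a sense of what "sending raw audio over HTTPS" looks like on the wire, here is a minimal sketch that builds the request header as a string. The `/speech` path, the `Bearer` token header, and the `audio/raw` content type reflect my understanding of Wit.ai's HTTP API and should be checked against its documentation. The function name and the sample-rate parameter are illustrative assumptions.

```cpp
#include <cstddef>
#include <string>

// Builds the HTTP/1.1 header for posting raw 16-bit little-endian PCM.
// Assumptions: endpoint path, auth scheme and content type as described
// in the lead-in; `token` is the app's access token.
std::string buildSpeechRequestHeader(const std::string& token,
                                     std::size_t bodyBytes,
                                     unsigned sampleRateHz) {
    std::string h;
    h += "POST /speech HTTP/1.1\r\n";
    h += "Host: api.wit.ai\r\n";
    h += "Authorization: Bearer " + token + "\r\n";
    h += "Content-Type: audio/raw;encoding=signed-integer;bits=16;rate=" +
         std::to_string(sampleRateHz) + ";endian=little\r\n";
    h += "Content-Length: " + std::to_string(bodyBytes) + "\r\n";
    h += "Connection: close\r\n";
    h += "\r\n";  // blank line ends the header; PCM bytes follow
    return h;
}
```

On the ESP32, this header followed by the PCM bytes would be written to a TLS client connected to port 443, for example `WiFiClientSecure` from the Arduino core.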
One thing I really liked was the OLED status updates.
The display switches between:
It makes the device feel interactive instead of just dumping logs into Serial Monitor.
Once the recognized text appears on the OLED, the project suddenly feels much more polished.
This setup can easily evolve into:
You could even combine it with text-to-speech later and create a complete two-way voice assistant using only ESP32 hardware.
For a small microcontroller project, this one feels surprisingly close to real-world AI systems.