Building a Real-time Voice Assistant with ESP32 and Whisper API
How I built a low-latency voice assistant using an ESP32, an INMP441 microphone, the Whisper API for speech recognition, and Edge-TTS for natural-sounding responses.
Overview
Building a voice assistant that runs on embedded hardware presents unique challenges. In this post, I'll walk through how I combined an ESP32 microcontroller with cloud AI services to create a responsive, natural-language voice assistant.
Hardware Setup
The core components:
- ESP32 Dev Board — The brain of the operation
- INMP441 — I2S MEMS microphone for high-quality audio capture
- MAX98357 — I2S amplifier for audio output
Architecture
ESP32 → Wi-Fi → Flask Backend → Whisper API (STT)
                              → GPT-4 (Response)
                              → Edge-TTS (Voice Synthesis)
ESP32 ← Audio Stream ← Flask Backend
Audio Capture with I2S
The INMP441 connects via I2S for digital audio capture, avoiding analog noise issues:
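#include <driver/i2s.h>

// Configure I2S0 as a master receiver for the INMP441.
void setup_i2s() {
  i2s_config_t i2s_config = {
    .mode = i2s_mode_t(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = 16000,                          // 16 kHz mono is typical for speech recognition
    .bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT,  // the INMP441 sends 24-bit data in 32-bit slots
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,   // mono; assumes the mic's L/R pin is tied low
    .communication_format = I2S_COMM_FORMAT_I2S,   // deprecated in newer ESP-IDF in favor of I2S_COMM_FORMAT_STAND_I2S
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,                            // number of DMA buffers in the ring
    .dma_buf_len = 1024,                           // frames per DMA buffer
  };
  // Abort with a logged error if the driver cannot be installed.
  ESP_ERROR_CHECK(i2s_driver_install(I2S_NUM_0, &i2s_config, 0, NULL));
  // Pin assignment via i2s_set_pin() depends on your wiring and is not shown here.
}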
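With this configuration each sample arrives as a 32-bit word, with the microphone's 24 bits of audio in the upper bits. Before sending audio over Wi-Fi it is common to narrow that to 16-bit PCM, which halves the bandwidth. Here is a minimal sketch of that conversion; the function name `i2s32_to_pcm16` is my own for illustration, and it assumes the samples are MSB-aligned as described.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: narrow raw 32-bit I2S slots to 16-bit PCM by keeping
// the top 16 bits of each slot. Assumes the audio data is MSB-aligned.
// Returns the number of samples written to `out`.
size_t i2s32_to_pcm16(const int32_t* in, size_t count, int16_t* out) {
    for (size_t i = 0; i < count; ++i) {
        // Arithmetic right shift keeps the sign on the compilers used for ESP32.
        out[i] = static_cast<int16_t>(in[i] >> 16);
    }
    return count;
}
```

On the device, the input buffer would be filled by `i2s_read()`. If the result sounds too quiet, a smaller shift acts as a crude gain, at the cost of possible clipping.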
Streaming Audio to Whisper
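The ESP32 streams audio chunks over a WebSocket to a Flask backend, which forwards the audio to OpenAI's Whisper API for transcription. The Whisper endpoint transcribes uploaded audio files rather than a live stream, so the backend has to collect chunks into a short clip before each request; "real-time" here means low latency per utterance, not word-by-word output.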
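To make the chunking concrete, here is a rough sketch of how a captured PCM buffer could be split into fixed-size pieces before being handed to a WebSocket client. The names `ChunkSink` and `send_in_chunks` are hypothetical; the sink stands in for whatever binary-send call your WebSocket library provides, and the chunk size is something you would tune yourself.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical callback type: receives one chunk, returns false if sending failed.
using ChunkSink = std::function<bool(const uint8_t*, size_t)>;

// Split `data` into chunks of at most `chunk_bytes` and pass each to `sink`.
// Stops at the first failed send and returns the number of bytes handed off.
size_t send_in_chunks(const uint8_t* data, size_t len, size_t chunk_bytes,
                      const ChunkSink& sink) {
    if (chunk_bytes == 0) return 0;  // avoid looping forever on a bad size
    size_t sent = 0;
    while (sent < len) {
        size_t remaining = len - sent;
        size_t n = remaining < chunk_bytes ? remaining : chunk_bytes;
        if (!sink(data + sent, n)) break;  // caller can retry from `sent`
        sent += n;
    }
    return sent;
}
```

Returning the byte count rather than a plain success flag lets the caller resume from the right offset after a dropped connection.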
Key Learnings
- Buffer management is critical — too large and latency suffers, too small and you get audio artifacts
- Wi-Fi stability on ESP32 requires proper error handling and reconnection logic
- Streaming TTS via Edge-TTS provides a much more natural voice than pre-recorded prompts
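The first two points can be put into numbers. Below is a small sketch with two helpers, both named by me for illustration: `buffer_latency_ms` estimates how much audio a DMA ring can hold (an upper bound on the delay it adds), and `next_backoff_ms` implements a capped exponential backoff, which is one common way to pace reconnection attempts, though not necessarily the only reasonable one.

```cpp
#include <cstdint>

// Hypothetical helper: worst-case audio held in the DMA ring, in milliseconds.
// `frames_per_buffer` and `buffer_count` correspond to dma_buf_len and dma_buf_count.
uint32_t buffer_latency_ms(uint32_t frames_per_buffer, uint32_t buffer_count,
                           uint32_t sample_rate_hz) {
    if (sample_rate_hz == 0) return 0;  // guard against division by zero
    uint64_t total_frames = static_cast<uint64_t>(frames_per_buffer) * buffer_count;
    return static_cast<uint32_t>(total_frames * 1000u / sample_rate_hz);
}

// Hypothetical helper: double the retry delay after each failed attempt,
// never exceeding `max_ms`. Start from a non-zero delay.
uint32_t next_backoff_ms(uint32_t current_ms, uint32_t max_ms) {
    if (current_ms >= max_ms / 2) return max_ms;  // doubling would reach or pass the cap
    return current_ms * 2;
}
```

With 8 buffers of 1024 frames at 16 kHz, `buffer_latency_ms(1024, 8, 16000)` gives 512 ms, while 4 buffers of 256 frames give 64 ms. In firmware, the backoff delay would typically gate calls such as `WiFi.reconnect()` and be reset once the connection is back.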
Conclusion
This project demonstrates that powerful AI voice interactions are possible on sub-$10 microcontrollers when speech recognition, response generation, and speech synthesis are offloaded to cloud APIs.