Doni Zhang — Engineering & AI

Overview

Building a voice assistant on embedded hardware is hard because a microcontroller has far too little memory and compute to run speech recognition or a large language model locally. In this post, I'll walk through how I combined an ESP32 microcontroller with cloud AI services to build a responsive, natural-language voice assistant.

Hardware Setup

The core components:

ESP32 Dev Board — The brain of the operation
INMP441 — I2S MEMS microphone for high-quality audio capture
MAX98357 — I2S amplifier for audio output

Architecture

ESP32 → Wi-Fi → Flask Backend → Whisper API (STT)
                                  → GPT-4 (Response)
                                  → Edge-TTS (Voice Synthesis)
ESP32 ← Audio Stream ← Flask Backend

Audio Capture with I2S

The INMP441 connects via I2S for digital audio capture, avoiding analog noise issues:

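#include <driver/i2s.h>

// Hypothetical pin assignments; change these to match your wiring.
#define I2S_MIC_SCK 32  // INMP441 SCK (bit clock)
#define I2S_MIC_WS  25  // INMP441 WS (word select)
#define I2S_MIC_SD  33  // INMP441 SD (serial data out)

void setup_i2s() {
  i2s_config_t i2s_config = {
    .mode = i2s_mode_t(I2S_MODE_MASTER | I2S_MODE_RX),  // ESP32 drives the clocks and receives
    .sample_rate = 16000,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT,  // INMP441 sends 24-bit samples in 32-bit slots
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,   // assumes the mic's L/R pin is tied to GND
    .communication_format = I2S_COMM_FORMAT_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 8,
    .dma_buf_len = 1024,
  };
  i2s_pin_config_t pin_config = {
    .bck_io_num = I2S_MIC_SCK,
    .ws_io_num = I2S_MIC_WS,
    .data_out_num = I2S_PIN_NO_CHANGE,  // receive only
    .data_in_num = I2S_MIC_SD,
  };
  i2s_driver_install(I2S_NUM_0, &i2s_config, 0, NULL);
  i2s_set_pin(I2S_NUM_0, &pin_config);  // route the I2S signals to the GPIOs above
}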
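
The configuration above reads 32-bit slots, but the INMP441 only produces 24 bits of audio per sample, and 16-bit PCM is usually enough for speech. A minimal sketch of the conversion follows, assuming the sample sits in the most significant bits of each slot. The helper name `slots_to_pcm16` is hypothetical and is not taken from the project's firmware.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert raw 32-bit I2S slots to 16-bit PCM by keeping the top 16 bits.
// Assumption: each sample is left-justified (MSB-aligned) in its slot.
std::vector<int16_t> slots_to_pcm16(const int32_t* slots, std::size_t count) {
  std::vector<int16_t> pcm;
  pcm.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    // Arithmetic right shift keeps the sign (guaranteed since C++20,
    // and what GCC and Clang do for earlier standards).
    pcm.push_back(static_cast<int16_t>(slots[i] >> 16));
  }
  return pcm;
}
```

On the device this would run on each buffer returned by `i2s_read`, halving the number of bytes sent over Wi-Fi. Shifting by fewer than 16 bits adds digital gain at the cost of possible clipping.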

Streaming Audio to Whisper

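The ESP32 streams audio chunks over a WebSocket to a Flask backend, which forwards the audio to OpenAI's Whisper API for near-real-time transcription. The Whisper API transcribes uploaded audio files rather than an open stream, so the backend has to group the chunks into short segments before each request.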
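
Because the Whisper API takes audio as a file upload, raw PCM needs a container before it can be sent, and WAV is one of the accepted formats. The sketch below builds the standard 44-byte PCM WAV header. It is written in C++ to match the firmware, although the Flask backend could do this step just as well. `make_wav_header` and `put_le` are hypothetical names, and the defaults assume 16 kHz mono 16-bit audio.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Write `bytes` bytes of `value` at `out` in little-endian order.
static void put_le(uint8_t* out, uint32_t value, int bytes) {
  for (int i = 0; i < bytes; ++i) {
    out[i] = static_cast<uint8_t>(value >> (8 * i));
  }
}

// Build a canonical 44-byte header for uncompressed PCM audio.
std::array<uint8_t, 44> make_wav_header(uint32_t data_bytes,
                                        uint32_t sample_rate = 16000,
                                        uint16_t channels = 1,
                                        uint16_t bits_per_sample = 16) {
  std::array<uint8_t, 44> h{};
  const uint16_t block_align =
      static_cast<uint16_t>(channels * bits_per_sample / 8);
  const uint32_t byte_rate = sample_rate * block_align;

  std::memcpy(&h[0], "RIFF", 4);
  put_le(&h[4], 36 + data_bytes, 4);   // file size minus the first 8 bytes
  std::memcpy(&h[8], "WAVE", 4);
  std::memcpy(&h[12], "fmt ", 4);
  put_le(&h[16], 16, 4);               // size of the fmt chunk for PCM
  put_le(&h[20], 1, 2);                // audio format 1 = PCM
  put_le(&h[22], channels, 2);
  put_le(&h[24], sample_rate, 4);
  put_le(&h[28], byte_rate, 4);
  put_le(&h[32], block_align, 2);
  put_le(&h[34], bits_per_sample, 2);
  std::memcpy(&h[36], "data", 4);
  put_le(&h[40], data_bytes, 4);       // number of PCM bytes that follow
  return h;
}
```

The header is followed directly by the PCM bytes for one segment. For example, `make_wav_header(32000)` describes one second of 16 kHz mono 16-bit audio.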

Key Learnings

Buffer management is critical: if buffers are too large, latency suffers; if they are too small, you get audio artifacts.
Wi-Fi stability on the ESP32 requires proper error handling and reconnection logic.
Streaming TTS via Edge-TTS produces a much more natural voice than pre-recorded prompts.
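
The first two points can be made concrete with a little arithmetic. The sketch below computes how much audio one DMA buffer holds and shows one possible reconnect policy, exponential backoff with a cap. Both function names and the 500 ms and 30 s defaults are illustrative assumptions, not values from the project. The latency figure also assumes that `dma_buf_len` counts samples and that the driver uses it unchanged.

```cpp
#include <algorithm>
#include <cstdint>

// Milliseconds of audio held by one DMA buffer of `frames` samples.
double buffer_ms(uint32_t frames, uint32_t sample_rate_hz) {
  return 1000.0 * frames / sample_rate_hz;
}

// Hypothetical reconnect policy: double the wait after every failed
// attempt, starting at base_ms and never exceeding max_ms.
uint32_t reconnect_delay_ms(uint32_t attempt,
                            uint32_t base_ms = 500,
                            uint32_t max_ms = 30000) {
  uint32_t delay = base_ms;
  // Stop doubling once the cap is reached, which also avoids overflow.
  for (uint32_t i = 0; i < attempt && delay < max_ms; ++i) {
    delay *= 2;
  }
  return std::min(delay, max_ms);
}
```

With the I2S settings used in this post, `buffer_ms(1024, 16000)` gives 64 ms per buffer, so eight full buffers hold about half a second of audio. That is the latency ceiling to weigh against the risk of dropped samples when shrinking the buffers.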

Conclusion

This project demonstrates that powerful AI voice interactions are possible on sub-$10 microcontrollers when speech recognition, response generation, and speech synthesis are offloaded to cloud APIs.

Building a Real-time Voice Assistant with ESP32 and Whisper API

Overview

Hardware Setup

Architecture

Audio Capture with I2S

Streaming Audio to Whisper

Key Learnings

Conclusion
