SpeakSmart: Real-Time Public Speaking Coach

Introduction

Public speaking is one of the most universally dreaded skills — and one of the hardest to improve without feedback. Most people practice alone in front of a mirror, or rely on friends to give vague encouragement. Professional coaching exists, but it's expensive and not always accessible.

SpeakSmart is a real-time public speaking coach that fits in your pocket. It uses an ESP32-S3-EYE (a tiny camera and microphone board) to capture both video and audio of you while you speak. A Python backend running on your laptop processes the streams using computer vision and speech analysis, then displays live feedback on a dashboard in your browser.

What it tracks:

Posture & Body Language: The system watches your body positioning, head tilt, and how much you sway while speaking. These are the physical habits that audiences notice subconsciously — a slouched posture or constant swaying signals nervousness even when your words are confident.

Eye Contact: Using facial landmark detection, SpeakSmart estimates how directly you're facing the camera, which serves as a proxy for eye contact with an audience. A low score is a cue to look up from your notes more often.

Speaking Pace: Too fast and your audience can't follow. Too slow and you lose their attention. The system tracks your words per minute in real time and flags when you drift outside the ideal 120–160 WPM range.

Filler Words: "Um", "uh", "like", "you know", and "basically" are counted as you speak using a locally running Whisper speech recognition model, so you can become aware of your habits without anyone keeping score for you.

Pitch & Monotone Detection: A flat, unchanging vocal pitch is one of the fastest ways to lose an audience. The system analyses your pitch range and flags if you're speaking in a monotone, prompting you to vary your delivery.
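
To make the pace and filler-word metrics concrete, here is a minimal sketch of how they could be computed from a transcript. It is not the project's actual code: the function names are made up for illustration, and it assumes the speech recogniser has already produced plain text for a known time window.

```python
import re

# Filler phrases named in the list above; multi-word phrases are matched whole.
FILLERS = ["um", "uh", "like", "you know", "basically"]
IDEAL_WPM = (120, 160)  # ideal range quoted in this guide


def words_per_minute(word_count, seconds):
    """Average speaking rate over a window lasting `seconds`."""
    if seconds <= 0:
        return 0.0
    return word_count * 60.0 / seconds


def pace_status(wpm, ideal=IDEAL_WPM):
    """Classify a WPM value against the ideal range."""
    low, high = ideal
    if wpm < low:
        return "too slow"
    if wpm > high:
        return "too fast"
    return "ok"


def count_fillers(transcript, fillers=FILLERS):
    """Count whole-word occurrences of each filler in a transcript."""
    text = transcript.lower()
    counts = {}
    for filler in fillers:
        # \b keeps "um" from matching inside words such as "umbrella"
        pattern = r"\b" + re.escape(filler) + r"\b"
        counts[filler] = len(re.findall(pattern, text))
    return counts
```

For example, 70 words spoken in 30 seconds works out to 140 WPM, which falls inside the ideal range. Note that a simple word match like this also counts legitimate uses of "like", so treat the filler total as an approximate signal.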

This project is ideal for students preparing presentations, professionals working on pitches, or anyone who wants to become a more confident and compelling speaker. It is also a practical introduction to embedded systems, computer vision, and real-time audio analysis.

Supplies

Hardware

ESP32-S3-EYE development board (includes OV2640 camera and ES7210 microphone)
AliExpress
Mouser Electronics
Micro-USB cable (data-capable, not charge-only)
Amazon
Computer with WiFi (tested on Mac M4, works on Windows/Linux too)
2.4GHz WiFi network (the ESP32 does not support 5GHz)

Software — ESP Firmware

VS Code
Download Link
PlatformIO IDE extension (VS Code marketplace)
Download Link
C/C++ extension (VS Code marketplace)

Software — Python Backend

Python 3.9–3.12
pip packages: fastapi, uvicorn, opencv-python, mediapipe, librosa, requests, websockets, faster-whisper

Software — Models (auto-downloaded on first run)

MediaPipe Pose Landmarker Lite
Whisper tiny model for filler word detection

Download the Project Files

https://github.com/search?q=speaksmart&type=repositories

The project files are in the GitHub repository linked above. Download them before proceeding.

Install VS Code and Extensions

Download and install VS Code. Then open the Extensions panel (Ctrl+Shift+X / Cmd+Shift+X) and install:

PlatformIO IDE
C/C++

PlatformIO will take a few minutes to install its core tools on first launch. Wait for it to finish before proceeding.

Configure WiFi Credentials

SpeakSmart relies on WiFi to communicate between your laptop and the camera.

Open esp_firmware/include/config.h and update these two lines:

#define WIFI_SSID "your network name"

#define WIFI_PASSWORD "your password"

The ESP32 only supports 2.4GHz networks. If your router broadcasts both 2.4GHz and 5GHz under the same name, you may need to log into your router and separate them.

Install VS Code and PlatformIO

Download and install VS Code from https://code.visualstudio.com

Open VS Code and click the Extensions icon on the left sidebar

Search for PlatformIO IDE and click Install
Search for C/C++ by Microsoft and click Install

Wait for PlatformIO to finish installing its core — this takes 2–5 minutes and only happens once. You will see a loading indicator in the bottom status bar.

Download and Unzip the Project Files

Download the project zip file attached to this Instructable

Extract it — on Mac, double-click the zip. On Windows, right-click → Extract All

You should see a folder called speaksmart containing two subfolders: esp_firmware and backend

Configure WiFi Credentials

In VS Code, go to File → Open Folder and select the esp_firmware folder
In the file explorer on the left, open include/config.h
Find these two lines and update them with your network name and password:

#define WIFI_SSID "your network name here"

#define WIFI_PASSWORD "your password here"

This step is essential and must be repeated (followed by a fresh firmware upload) every time you move to a different WiFi network. Make sure your laptop is connected to the same network, and that the network is 2.4GHz.

Flash the Firmware to the ESP32

Plug the ESP32-S3-EYE into your computer with the Micro-USB data cable from the supplies list
Put the board into download mode:
- Hold the BOOT button on the board
- While holding BOOT, press and release RESET
- Release BOOT
Click the PlatformIO icon in the left sidebar, then go to "Project tasks" → ESP32S3EYE → General → Upload
The first upload may take a long time because it downloads all the required resources. After that, uploads should take less than 20 seconds.
Wait for the terminal to show SUCCESS.

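If the upload fails with an upload error, add "upload_port = COM3" (Windows) or "upload_port = /dev/ttyUSB0" (Linux) to `platformio.ini`, replacing the port name with the one your board actually appears on. On a Mac, serial port names typically start with /dev/cu.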

Find the ESP's IP Address

Click the plug icon in the PlatformIO toolbar to open the Serial Monitor.
Press the RESET button on the ESP32 board
Watch for output like this:

[WiFi] Connected — IP: 192.168.x.x

[Camera] Initialised

[SpeakSmart] Ready

Stream : http://192.168.x.x

Stream Audio : ws://192.168.x.x:81

Copy the IP address. You will need it in the next step.
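
Before moving on, it can help to confirm that your laptop can actually reach the board at that address. The following is a small sketch using only Python's standard library. The function names are illustrative, and it assumes the firmware listens on ports 80 (video) and 81 (audio) as shown in the serial output above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


def check_esp(ip, ports=(80, 81)):
    """Check the video (80) and audio (81) ports reported by the firmware."""
    return {port: port_is_open(ip, port) for port in ports}
```

Calling check_esp with the IP you copied returns a dictionary of port numbers and True/False values. If both come back False, the laptop and the board are probably on different networks, or the board has not finished connecting.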

Configure and Run the Backend

Open backend/main.py and find line 28:

ESP_IP = "192.168.x.x"

Replace the IP with the one you copied from the Serial Monitor in the previous step.

Save the file, then open a terminal in the backend folder and run:

python3 main.py

If everything starts correctly, you should see output similar to this:

[pose_analyser] PoseAnalyser ready
[speaksmart] Vision loop started
[audio_consumer] Audio WebSocket connected
[filler_detector] Whisper model ready
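
If main.py stops with an import error instead of printing these messages, one of the pip packages from the supplies list is probably missing. Here is a quick, hedged way to check. The mapping from pip names to import names reflects how these packages are normally imported, and the helper name is made up for this example.

```python
import importlib.util

# pip package name -> module name it is imported as
REQUIRED = {
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "opencv-python": "cv2",
    "mediapipe": "mediapipe",
    "librosa": "librosa",
    "requests": "requests",
    "websockets": "websockets",
    "faster-whisper": "faster_whisper",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

Running print(missing_packages()) lists anything that still needs to be installed with pip. An empty list means all the listed packages can be found.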

Open the Dashboard

Open your browser and go to: http://localhost:8000
The status dot in the top left should turn green and show LIVE
The timer will start counting
Position the ESP32-S3-EYE so it has a clear view of you — roughly at chest height, 1–2 metres away, in a well-lit room

Within a few seconds you should see:

Live video feed on the left
Posture score, eye contact %, and sway updating
Volume chart drawing in the bottom panel

Run a Session

Stand in front of the camera and speak naturally for at least 60 seconds.

Posture: rises when you stand tall and drops when you slouch. Try deliberately rounding your shoulders and watch the score fall.

Volume: if the level is too low and speech is not being detected, increase the "volume" value in config.h, then upload the firmware again so the change takes effect.

Eye Contact: rises when you look directly at the camera and drops when you look away. Practice looking up from your notes.

WPM: watch this while varying your pace deliberately. The ideal range is 120–160 WPM for most presentations.

Filler count: deliberately say "um" a few times to confirm it is working. The count should increment within a few seconds.
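
To give a feel for what sits behind posture-style numbers, here is a simplified sketch of two measurements that can be derived from pose landmarks: how tilted the shoulder line is, and how much the body sways from side to side. This is an illustration under assumptions rather than the project's scoring code. It assumes you already have shoulder positions as (x, y) pixel coordinates for each frame, and the function names are hypothetical.

```python
import math
import statistics


def shoulder_tilt_degrees(left, right):
    """Angle of the shoulder line from horizontal, in degrees (0 = level).

    `left` and `right` are (x, y) points in pixel coordinates. Absolute
    differences are used so the result does not depend on point order.
    """
    dx = abs(right[0] - left[0])
    dy = abs(right[1] - left[1])
    return math.degrees(math.atan2(dy, dx))


def sway(midpoint_xs):
    """Side-to-side sway: spread of the shoulder midpoint's x position
    over a window of recent frames (population standard deviation)."""
    if len(midpoint_xs) < 2:
        return 0.0
    return statistics.pstdev(midpoint_xs)
```

A level shoulder line gives a tilt of 0 degrees, and standing perfectly still gives a sway of 0. How values like these are turned into a score is a design choice; any thresholds would need tuning against real footage.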

When you are done, click Reset Session to see your full report with scores and coaching tips, then start a new session.

How It Works

The ESP32-S3-EYE is a small WiFi-enabled board with a built-in camera and microphone. Once flashed with our firmware, it joins your WiFi network and broadcasts two streams:

Video — a live MJPEG camera stream over HTTP on port 80
Audio — raw PCM microphone data over WebSocket on port 81
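
As an illustration of what can be done with raw PCM audio, the sketch below turns a chunk of bytes into a loudness level. It assumes the samples are signed 16-bit little-endian mono, which is a common layout but is not confirmed here for this firmware. The function names are made up for the example.

```python
import math
import struct


def pcm16_to_samples(chunk):
    """Decode little-endian signed 16-bit PCM bytes into a list of ints."""
    usable = len(chunk) - (len(chunk) % 2)  # ignore a trailing odd byte
    return list(struct.unpack("<%dh" % (usable // 2), chunk[:usable]))


def rms_dbfs(samples):
    """RMS level relative to 16-bit full scale, in dBFS (0 dB = loudest)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # pure silence has no finite dB value
    return 20.0 * math.log10(rms / 32768.0)
```

A signal at half of full scale comes out at roughly -6 dBFS. A level like this, computed per chunk, is the kind of value a volume chart could plot over time.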

A Python program running on your laptop connects to both streams simultaneously. It runs two analysis pipelines in parallel:

Vision pipeline — each video frame is processed by MediaPipe, a Google computer vision library, which detects your body pose and facial landmarks to measure posture, head tilt, shoulder alignment, and eye contact
Speech pipeline — the audio stream is analysed by librosa for volume, pitch, and speaking pace, and by a locally-running Whisper AI model for filler word detection
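
The monotone check in the speech pipeline can be sketched as follows. This is one plausible approach, not necessarily the one used in the project: it assumes a pitch tracker has already produced one pitch estimate in Hz per audio frame (with None or NaN for unvoiced frames), and the 2-semitone threshold is an assumed value that would need tuning.

```python
import math
import statistics


def voiced_frames(f0_hz):
    """Keep only frames with a usable pitch (drops None, NaN, and zeros)."""
    # NaN > 0 is False, so NaN values are filtered out by the comparison
    return [f for f in f0_hz if f is not None and f > 0]


def pitch_spread_semitones(f0_hz):
    """Standard deviation of pitch, in semitones around the median.

    Returns None when there are too few voiced frames to judge.
    """
    voiced = voiced_frames(f0_hz)
    if len(voiced) < 2:
        return None
    median = statistics.median(voiced)
    semitones = [12.0 * math.log2(f / median) for f in voiced]
    return statistics.pstdev(semitones)


def is_monotone(f0_hz, threshold=2.0):
    """Flag speech whose pitch spread stays below `threshold` semitones."""
    spread = pitch_spread_semitones(f0_hz)
    return spread is not None and spread < threshold
```

Working in semitones rather than raw Hz keeps the measure comparable between low and high voices, since the same musical interval spans more Hz at higher pitches.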

The Python program hosts a web server at http://localhost:8000. Opening that address in your browser shows the live dashboard, which receives all metrics via WebSocket and updates in real time.

The code itself is commented and broken into clearly labelled parts, which makes it easier to follow.