Whisper Speech Recognition

Introduction

Voice commands are among the most natural ways for humans to interact with robots. OpenAI Whisper is a state-of-the-art automatic speech recognition (ASR) system that converts spoken language into text with high accuracy, even in noisy environments.

This section covers Whisper architecture, ROS 2 integration, and building a complete voice command pipeline for humanoid robots.

Why Whisper for Robotics?

Comparison with Other ASR Systems

| System | Accuracy | Languages | Robustness | Latency | Cost |
|---|---|---|---|---|---|
| Whisper | 95%+ | 99 languages | Excellent | ~1-3s | Free (self-hosted) |
| Google Speech-to-Text | 95%+ | 125+ | Good | ~0.5s | $0.006/15s |
| Mozilla DeepSpeech | 90%+ | English | Moderate | ~1s | Free |
| CMU Sphinx | 85% | Limited | Poor | Fast | Free |

Whisper Advantages for Humanoids:

  • Multilingual: Supports 99 languages out of the box
  • Noise robust: Trained on 680,000 hours of diverse audio
  • No internet required: Runs locally on robot
  • Open source: Free and customizable
  • High accuracy: Matches commercial systems

Whisper Model Sizes

| Model | Parameters | VRAM | Speed | WER (English) | Use Case |
|---|---|---|---|---|---|
| tiny | 39M | ~1GB | 32x realtime | 5.0% | Fast prototyping |
| base | 74M | ~1GB | 16x realtime | 3.5% | Mobile robots |
| small | 244M | ~2GB | 6x realtime | 2.8% | Humanoids |
| medium | 769M | ~5GB | 2x realtime | 2.3% | High accuracy |
| large | 1550M | ~10GB | 1x realtime | 2.0% | Server deployment |

For humanoid robots, the small model provides the best speed/accuracy trade-off.

Whisper Architecture

How Whisper Works

Pipeline:

  1. Audio Input: Raw audio at 16kHz sampling rate
  2. Mel Spectrogram: Convert audio to frequency representation
  3. Encoder: Extract audio features with Transformer
  4. Decoder: Generate text token by token
  5. Output: Transcribed text
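
The same pipeline is exposed through Whisper's Python API. The sketch below runs each stage explicitly, mirroring the usage shown in the openai/whisper README:

import whisper

model = whisper.load_model("small")

# 1. Load audio and pad/trim it to the 30-second window Whisper expects
audio = whisper.load_audio("command.wav")
audio = whisper.pad_or_trim(audio)

# 2. Convert the audio to a log-Mel spectrogram on the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# 3-5. Encode, decode token by token, and read out the text
options = whisper.DecodingOptions(language="en")
result = whisper.decode(model, mel, options)
print(result.text)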

Key Features

Timestamps: Whisper provides segment-level timestamps by default, and word-level timestamps on request:

{
  "text": "Bring me a drink",
  "segments": [
    {"start": 0.0, "end": 0.5, "text": "Bring"},
    {"start": 0.5, "end": 0.8, "text": "me"},
    {"start": 0.8, "end": 1.0, "text": "a"},
    {"start": 1.0, "end": 1.5, "text": "drink"}
  ]
}
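
Word-level timing comes from the word_timestamps flag on transcribe (available in recent openai-whisper releases); each segment then carries a words list with per-word start/end times and probabilities:

result = model.transcribe("command.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['start']:.2f}-{word['end']:.2f}: {word['word']}")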

Language Detection: Automatically detects the spoken language. detect_language is a method on the loaded model and operates on a mel spectrogram:

mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(audio)).to(model.device)
_, probs = model.detect_language(mel)
# probs maps language codes to probabilities, e.g. {"en": 0.98, ...}

Translation: Translates non-English speech directly to English:

# Spanish audio: "Tráeme una bebida"
result = model.transcribe(audio, task="translate")
# Output: "Bring me a drink"

Installation and Setup

Install Whisper

# Install Whisper
pip install openai-whisper

# Install ffmpeg (required for audio processing)
sudo apt install ffmpeg

# Test installation
whisper --help

Download Model

import whisper

# Download and cache model (one-time operation)
model = whisper.load_model("small") # ~500MB download

Models are cached in ~/.cache/whisper/.
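
load_model also accepts a target device and a custom cache directory. A minimal sketch (the /opt/models path is just an example):

# Load directly onto the GPU and cache weights in a custom directory
model = whisper.load_model("small", device="cuda", download_root="/opt/models")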

ROS 2 Integration

Voice Command Pipeline

The pipeline built below: microphone → /audio/input topic → Whisper ASR node → /voice/command topic → LLM planner → robot actions.

Whisper ROS 2 Node

Create a ROS 2 node that subscribes to audio and publishes transcriptions:

import rclpy
from rclpy.node import Node
from audio_common_msgs.msg import AudioStamped
from std_msgs.msg import String
import whisper
import numpy as np

class WhisperNode(Node):
    def __init__(self):
        super().__init__('whisper_asr')

        # Load Whisper model
        self.get_logger().info("Loading Whisper model...")
        self.model = whisper.load_model("small")
        self.get_logger().info("Whisper model loaded")

        # Subscribe to audio stream
        self.audio_sub = self.create_subscription(
            AudioStamped,
            '/audio/input',
            self.audio_callback,
            10
        )

        # Publish transcriptions
        self.text_pub = self.create_publisher(
            String,
            '/voice/command',
            10
        )

        # Audio buffer
        self.audio_buffer = []
        self.sample_rate = 16000

    def audio_callback(self, msg):
        # Convert ROS audio message to numpy array
        audio_data = np.frombuffer(msg.audio.data, dtype=np.int16)

        # Accumulate audio (transcribe once 3 seconds are buffered)
        self.audio_buffer.extend(audio_data)

        if len(self.audio_buffer) >= self.sample_rate * 3:
            # Convert to float32 and normalize
            audio = np.array(self.audio_buffer, dtype=np.float32) / 32768.0

            # Transcribe with Whisper
            result = self.model.transcribe(audio, language="en")

            # Publish transcription
            text_msg = String()
            text_msg.data = result["text"]
            self.text_pub.publish(text_msg)

            self.get_logger().info(f"Transcribed: {result['text']}")

            # Clear buffer
            self.audio_buffer = []

def main(args=None):
    rclpy.init(args=args)
    node = WhisperNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Voice Activity Detection (VAD)

Use VAD to detect when user is speaking (avoid transcribing silence):

import webrtcvad

class WhisperNodeWithVAD(Node):
    def __init__(self):
        super().__init__('whisper_asr_vad')
        self.model = whisper.load_model("small")
        self.sample_rate = 16000

        # Initialize VAD (mode 3 = most aggressive filtering)
        self.vad = webrtcvad.Vad(3)

        self.audio_sub = self.create_subscription(
            AudioStamped, '/audio/input', self.audio_callback, 10)
        self.text_pub = self.create_publisher(String, '/voice/command', 10)

        self.audio_buffer = []
        self.is_speaking = False

    def audio_callback(self, msg):
        # webrtcvad expects 16-bit PCM frames of 10, 20, or 30 ms
        audio_chunk = msg.audio.data

        # Check if speech is present
        is_speech = self.vad.is_speech(audio_chunk, self.sample_rate)

        if is_speech:
            if not self.is_speaking:
                self.get_logger().info("Speech detected")
                self.is_speaking = True
            self.audio_buffer.extend(np.frombuffer(audio_chunk, dtype=np.int16))

        elif self.is_speaking:
            # Speech ended, transcribe
            self.get_logger().info("Speech ended, transcribing...")
            self.transcribe_buffer()
            self.is_speaking = False

    def transcribe_buffer(self):
        if len(self.audio_buffer) < self.sample_rate * 0.5:
            # Too short, probably noise
            self.audio_buffer = []
            return

        audio = np.array(self.audio_buffer, dtype=np.float32) / 32768.0
        result = self.model.transcribe(audio)
        self.text_pub.publish(String(data=result["text"]))
        self.audio_buffer = []

Handling Noisy Environments

Challenges in Robotics

Humanoid robots operate in:

  • Noisy environments: Motors, fans, people talking
  • Varying distances: User may be 1m or 5m away
  • Reverberation and wind: indoor echoes, outdoor wind noise
  • Overlapping speech: Multiple people talking

Noise Reduction Strategies

1. Microphone Array

Use multiple microphones with beamforming:

# 4-mic array configuration (positions in meters, robot head frame)
mic_positions = [
    [0.0, 0.05, 0.0],   # Left
    [0.0, -0.05, 0.0],  # Right
    [0.05, 0.0, 0.0],   # Front
    [-0.05, 0.0, 0.0]   # Back
]

# Apply delay-and-sum beamforming
def beamform(audio_channels, target_direction):
    # Calculate per-microphone time delays for the target direction
    delays = compute_delays(mic_positions, target_direction)

    # Align the channels and sum (coherent in the target direction)
    aligned = [delay_audio(ch, d) for ch, d in zip(audio_channels, delays)]
    return np.sum(aligned, axis=0)
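
compute_delays and delay_audio are left undefined above; a minimal far-field sketch of both, assuming a 16 kHz sample rate, a distant source, and integer-sample delays:

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
SAMPLE_RATE = 16000

def compute_delays(mic_positions, target_direction):
    # Project mic positions onto the unit vector toward the source;
    # mics closer to the source hear the wavefront earlier and
    # therefore need a larger delay to align with the farthest mic.
    direction = np.asarray(target_direction, dtype=float)
    direction /= np.linalg.norm(direction)
    projections = np.asarray(mic_positions) @ direction
    delays_sec = (projections - projections.min()) / SPEED_OF_SOUND
    return np.round(delays_sec * SAMPLE_RATE).astype(int)  # in samples

def delay_audio(channel, delay_samples):
    # Shift right by delay_samples, zero-padding the front
    return np.pad(channel, (delay_samples, 0))[:len(channel)]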

2. Wake Word Detection

Only transcribe after detecting wake word ("Hey Robot"):

import pvporcupine

class WakeWordNode(Node):
    def __init__(self):
        super().__init__('wake_word')

        # Initialize Porcupine wake word detector.
        # Note: recent pvporcupine versions require an access key, and a
        # custom phrase like "hey robot" needs a trained .ppn keyword file;
        # built-in keywords (e.g. "porcupine", "jarvis") work out of the box.
        self.porcupine = pvporcupine.create(keywords=["hey robot"])
        self.model = whisper.load_model("small")

        self.sample_rate = 16000
        self.audio_buffer = []
        self.listening = False

    def audio_callback(self, msg):
        # Porcupine expects frames of exactly porcupine.frame_length int16 samples
        audio_chunk = np.frombuffer(msg.audio.data, dtype=np.int16)

        # Detect wake word
        keyword_index = self.porcupine.process(audio_chunk)

        if keyword_index >= 0:
            self.get_logger().info("Wake word detected!")
            self.listening = True
            self.audio_buffer = []

        elif self.listening:
            # Accumulate audio after the wake word
            self.audio_buffer.extend(audio_chunk)

            if len(self.audio_buffer) >= self.sample_rate * 5:
                # Transcribe command (see transcribe_buffer in the VAD node above)
                self.transcribe_buffer()
                self.listening = False

3. Speaker Diarization

Identify who is speaking (useful for multi-person environments):

from pyannote.audio import Pipeline

# Load speaker diarization model (requires a Hugging Face auth token)
diarization = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Apply to audio (pyannote accepts a file path or a {"waveform", "sample_rate"} dict)
diarization_result = diarization(audio)

# Extract segments for the first speaker (helper sketched below)
speaker_audio = extract_speaker_segments(audio, diarization_result, speaker="SPEAKER_00")

# Transcribe only that speaker
result = model.transcribe(speaker_audio)
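
extract_speaker_segments is not a library function; a minimal sketch, assuming 16 kHz mono float32 audio and pyannote's itertracks interface:

import numpy as np

def extract_speaker_segments(audio, diarization_result, speaker, sr=16000):
    # Concatenate all samples attributed to the target speaker label
    chunks = []
    for turn, _, label in diarization_result.itertracks(yield_label=True):
        if label == speaker:
            chunks.append(audio[int(turn.start * sr):int(turn.end * sr)])
    return np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.float32)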

Complete Voice Command System

Architecture

class VoiceCommandSystem(Node):
    def __init__(self):
        super().__init__('voice_command_system')

        # Load Whisper
        self.whisper = whisper.load_model("small")

        # Audio input
        self.audio_sub = self.create_subscription(
            AudioStamped, '/audio/input', self.audio_callback, 10
        )

        # Command output
        self.command_pub = self.create_publisher(
            String, '/voice/command', 10
        )

        # Feedback to user
        self.tts_pub = self.create_publisher(
            String, '/audio/speak', 10
        )

    def audio_callback(self, msg):
        # Convert raw int16 bytes to normalized float32 before transcribing
        audio = np.frombuffer(msg.audio.data, dtype=np.int16).astype(np.float32) / 32768.0

        # Transcribe audio
        result = self.whisper.transcribe(audio)
        command = result["text"]

        self.get_logger().info(f"Heard: {command}")

        # Publish command for LLM planner
        self.command_pub.publish(String(data=command))

        # Acknowledge receipt
        self.tts_pub.publish(String(data=f"I heard: {command}"))

def main():
    rclpy.init()
    node = VoiceCommandSystem()
    rclpy.spin(node)

Integration with LLM Planner

class VoiceControlledRobot(Node):
    def __init__(self):
        super().__init__('voice_robot')

        # Whisper transcription subscriber
        self.command_sub = self.create_subscription(
            String, '/voice/command', self.command_callback, 10
        )

        # LLM client
        self.llm = LLMPlanner()

    def command_callback(self, msg):
        command = msg.data
        self.get_logger().info(f"Processing command: {command}")

        # LLM decomposes command into an action plan
        action_plan = self.llm.decompose(command)

        # Execute actions
        for action in action_plan:
            self.execute_action(action)

    def execute_action(self, action):
        if action["action"] == "navigate":
            self.navigate(action["location"])
        elif action["action"] == "grasp":
            self.grasp(action["object"])
        # ...

Launch File

# launch/voice_control.launch.py
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Audio capture node
        Node(
            package='audio_capture',
            executable='audio_capture_node',
            name='audio_capture',
            parameters=[{'device': 'default'}]
        ),

        # Whisper ASR node
        Node(
            package='humanoid_vla',
            executable='whisper_node',
            name='whisper_asr',
            parameters=[
                {'model_size': 'small'},
                {'language': 'en'}
            ]
        ),

        # LLM planner node
        Node(
            package='humanoid_vla',
            executable='llm_planner',
            name='llm_planner'
        ),

        # Robot executor
        Node(
            package='humanoid_vla',
            executable='robot_executor',
            name='executor'
        )
    ])
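
Assuming the humanoid_vla package is built and sourced, the whole pipeline starts with one command:

ros2 launch humanoid_vla voice_control.launch.py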

Testing and Validation

Test Scenarios

test_commands = [
    "Bring me a glass of water",
    "Clean the table",
    "Follow me to the kitchen",
    "Pick up the red cup",
    "Go to the living room"
]

for cmd in test_commands:
    print(f"\nTesting: {cmd}")
    audio = synthesize_speech(cmd)  # any TTS works here; placeholder for testing
    result = model.transcribe(audio)
    print(f"Transcribed: {result['text']}")
    # Compare case- and punctuation-insensitively (Whisper adds punctuation)
    assert result["text"].strip().strip(".").lower() == cmd.lower()

Accuracy Metrics

Word Error Rate (WER):

def compute_wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # Word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp_words) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[-1][-1] / len(ref_words)

reference = "bring me a glass of water"
hypothesis = model.transcribe(audio)["text"].lower()
wer = compute_wer(reference, hypothesis)
print(f"WER: {wer * 100:.1f}%")  # Target: less than 5%
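
The jiwer package (pip install jiwer) computes the same metric as a one-liner:

from jiwer import wer

print(f"WER: {wer(reference, hypothesis) * 100:.1f}%")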

Production Optimization

GPU Acceleration

Whisper runs significantly faster on GPUs:

import time
import whisper

# CPU vs GPU performance comparison
model_cpu = whisper.load_model("small", device="cpu")
model_gpu = whisper.load_model("small", device="cuda")

audio = whisper.load_audio("command.wav")

# CPU inference
start = time.time()
result_cpu = model_cpu.transcribe(audio)
cpu_time = time.time() - start  # ~3-5 seconds

# GPU inference
start = time.time()
result_gpu = model_gpu.transcribe(audio)
gpu_time = time.time() - start  # ~0.5-1.5 seconds

print(f"Speedup: {cpu_time / gpu_time:.1f}x faster on GPU")
# Typical output: Speedup: 4-6x faster on GPU

Optimization Tips:

  • Keep model loaded in VRAM (don't reload per request)
  • Batch multiple audio chunks when possible
  • Use FP16 precision for 2x speedup with minimal quality loss
  • Consider model quantization (INT8) for edge devices
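
transcribe exposes an fp16 flag (it defaults to True on GPU). A minimal sketch of the first and third tips, keeping one loaded model and using half precision:

import whisper

# Load once at startup and reuse for every request
model = whisper.load_model("small", device="cuda")

def handle_request(audio):
    # Half-precision inference on the GPU
    return model.transcribe(audio, fp16=True)["text"]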

Model Quantization

Reduce model size and inference time. The optimum/ONNX Runtime stack can export and quantize Whisper, but a simpler route is the faster-whisper package (CTranslate2 backend), which loads INT8 weights directly:

# pip install faster-whisper
from faster_whisper import WhisperModel

# compute_type="int8" uses 8-bit quantized weights (~4x smaller than FP32)
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("command.wav", language="en")
text = " ".join(segment.text.strip() for segment in segments)

# INT8: roughly 4x smaller and ~2x faster, with <2% accuracy drop

Streaming Recognition

For interactive systems, stream partial results:

import numpy as np
import whisper
from whisper.audio import SAMPLE_RATE  # 16000 Hz

class StreamingWhisper(Node):
    def __init__(self):
        super().__init__('streaming_whisper')

        self.model = whisper.load_model("small")
        self.buffer_duration = 5  # seconds
        self.audio_buffer = []

    def audio_callback(self, msg):
        self.audio_buffer.extend(np.frombuffer(msg.audio.data, dtype=np.int16))

        # Process every 5 seconds
        if len(self.audio_buffer) >= SAMPLE_RATE * self.buffer_duration:
            # Transcribe the most recent window (as normalized float32)
            window = self.audio_buffer[-SAMPLE_RATE * self.buffer_duration:]
            audio = np.array(window, dtype=np.float32) / 32768.0
            result = self.model.transcribe(audio)

            # Publish partial result (publisher setup omitted for brevity)
            self.publish_partial_transcript(result["text"])

            # Keep last 2 seconds for context
            self.audio_buffer = self.audio_buffer[-SAMPLE_RATE * 2:]

Benefits:

  • Lower perceived latency (user sees progress)
  • Can interrupt robot mid-command
  • Better for long commands or conversations

Language Detection and Multilingual Support

Auto-detect user language:

model = whisper.load_model("small")

def transcribe_multilingual(audio):
    # Detect language from the first 30 seconds
    audio_segment = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio_segment).to(model.device)

    _, probs = model.detect_language(mel)
    detected_language = max(probs, key=probs.get)

    print(f"Detected language: {detected_language} ({probs[detected_language]*100:.1f}%)")

    # Transcribe in the detected language
    result = model.transcribe(audio, language=detected_language)

    return result["text"], detected_language

# Example usage
text, lang = transcribe_multilingual(audio)
# Output: Detected language: es (98.5%)
# "Tráeme un vaso de agua"

# Can also translate to English
result = model.transcribe(audio, task="translate")
# Output: "Bring me a glass of water"

Applications:

  • Hotels serving international guests
  • Multilingual eldercare facilities
  • Export markets with diverse users

Domain Adaptation with Prompting

Improve accuracy for robotics jargon:

# Add a context prompt to bias decoding toward robotics terminology
result = model.transcribe(
    audio,
    initial_prompt="This is a voice command for a humanoid robot. Common commands include navigate, grasp, place, detect objects, and clean."
)

# Without prompt: "navi gate to kitchen" (incorrect spacing)
# With prompt: "navigate to kitchen" (correct)

# Can also bias toward specific vocabulary
result = model.transcribe(
    audio,
    initial_prompt="Vocabulary: ROS, Nav2, MoveIt, Isaac Sim, Visual SLAM"
)

Error Recovery Strategies

Handle recognition errors gracefully:

class RobustVoiceCommandSystem(Node):
    def __init__(self):
        super().__init__('robust_voice')
        self.model = whisper.load_model("small")
        self.confidence_threshold = 0.7

    def process_command(self, audio):
        # Transcribe with per-word probabilities
        result = self.model.transcribe(audio, word_timestamps=True)

        # Word entries live inside each segment's "words" list
        words = [w for seg in result["segments"] for w in seg.get("words", [])]
        if not words:
            return None
        avg_confidence = np.mean([w["probability"] for w in words])

        if avg_confidence < self.confidence_threshold:
            # Low confidence, ask for repeat
            self.speak("I didn't catch that. Could you repeat?")
            return None

        # Check for nonsensical output (gibberish detection)
        if contains_gibberish(result["text"]):
            self.speak("I didn't understand. Please try again.")
            return None

        # Valid command
        return result["text"]

def contains_gibberish(text):
    """Heuristic gibberish detection"""
    words = text.split()
    if not words:
        return True

    # Too many very short words
    if sum(len(w) <= 2 for w in words) / len(words) > 0.5:
        return True

    # Check against a vocabulary list (load_english_vocabulary is app-specific)
    valid_words = set(load_english_vocabulary())
    unknown_ratio = sum(w.lower() not in valid_words for w in words) / len(words)

    return unknown_ratio > 0.4  # >40% unknown words

Summary

Whisper enables natural voice interaction for humanoid robots:

Key Features:

  • 95%+ accuracy across 99 languages
  • Robust to noise and accents
  • Self-hosted (no internet required)
  • Word-level timestamps for precise control

ROS 2 Integration:

  • Audio subscriber node
  • Real-time transcription publishing
  • Integration with LLM planner for action decomposition

Production Optimizations:

  • GPU acceleration (4-6x speedup)
  • Model quantization for edge deployment
  • Streaming recognition for low latency
  • Domain adaptation with prompting
  • Multilingual auto-detection

Best Practices:

  • Use small model for speed/accuracy balance
  • Apply VAD to avoid transcribing silence
  • Use wake word detection in noisy environments
  • Microphone arrays for directional audio capture
  • Confidence thresholding for error recovery

Limitations:

  • 1-3 second latency (not real-time for full transcription, but can stream)
  • GPU recommended for fast inference
  • Struggles with extreme accents or domain-specific jargon without prompting

Next: Learn how to ground language in visual perception with multimodal models.


Continue to: Multimodal Perception

References

Radford, A., et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI Technical Report.

OpenAI. (2024). Whisper [Software]. GitHub. https://github.com/openai/whisper
