Capstone: Autonomous Humanoid Architecture

Introduction

You've learned the individual components:

  • LLM Planning: Decompose natural language into actions
  • Whisper: Convert speech to text
  • Multimodal Perception: Ground language in vision

Now it's time to integrate everything into a production-ready VLA system for autonomous humanoid robots.

This capstone covers:

  • End-to-end VLA architecture
  • ROS 2 node design patterns
  • Three complete natural language → action examples
  • Deployment considerations and best practices

End-to-End VLA Architecture

System Overview

ROS 2 Node Architecture

humanoid_vla/
├── audio_processing_node # Audio capture, VAD, noise reduction
├── whisper_asr_node # Speech recognition
├── llm_planner_node # Task decomposition
├── grounding_node # Vision-language matching
├── segmentation_node # SAM segmentation
├── grasp_planner_node # 3D grasp pose computation
├── navigation_manager_node # Nav2 interface
├── manipulation_manager_node # MoveIt interface
├── action_executor_node # Orchestrates execution
└── state_manager_node # Robot state tracking
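
Each entry in this tree is a separate ROS 2 node communicating over topics. Stripped of the ROS plumbing, the hand-off between stages can be sketched in plain Python — all message types and stage stubs below are hypothetical stand-ins for the real nodes, not the system's actual interfaces:

```python
from dataclasses import dataclass, field

# Hypothetical message types mirroring what the nodes would exchange on topics.
@dataclass
class Transcript:
    text: str

@dataclass
class Plan:
    actions: list = field(default_factory=list)

def whisper_asr_node(audio: bytes) -> Transcript:
    # Stand-in for Whisper inference: pretend the audio decodes to this text.
    return Transcript(text="clean the table")

def llm_planner_node(transcript: Transcript) -> Plan:
    # Stand-in for the LLM call: a canned plan keyed on the command.
    if "clean" in transcript.text:
        return Plan(actions=["navigate:table", "detect_objects", "grasp", "place"])
    return Plan(actions=["speak:unrecognized command"])

def action_executor_node(plan: Plan) -> list:
    # Executes (here: merely echoes) each planned action in order.
    return [f"executed {a}" for a in plan.actions]

def run_pipeline(audio: bytes) -> list:
    # Mimics the topic chain: audio -> ASR -> planner -> executor.
    return action_executor_node(llm_planner_node(whisper_asr_node(audio)))
```

In the real system each function becomes a node that subscribes to the previous stage's topic and publishes to the next; the data flow is identical.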

Complete Examples

Example 1: "Clean the table"

User Command:

User: "Clean the table"

Step 1: Speech Recognition

# Whisper ASR Node
audio_input = microphone.capture()
text = whisper.transcribe(audio_input)
# Output: "clean the table"

Step 2: LLM Decomposition

# LLM Planner Node
prompt = f"""
Robot capabilities: navigate, detect_objects, grasp, place, speak
Current location: living_room
Known locations: table, counter, trash_bin

User command: "clean the table"

Decompose into action sequence (JSON):
"""

llm_response = llm.chat(prompt)

# LLM Output:
{
  "plan": [
    {"action": "speak", "text": "I'll clean the table for you"},
    {"action": "navigate", "location": "table"},
    {"action": "detect_objects", "area": "table_surface"},
    {"action": "foreach", "objects": "detected_objects", "actions": [
      {"action": "grasp", "object": "$object"},
      {"action": "classify_object", "object": "$object"},
      {"action": "navigate", "location": "$trash_or_counter"},
      {"action": "place", "location": "$destination"}
    ]},
    {"action": "speak", "text": "Table is clean"}
  ],
  "reasoning": "Need to detect objects on table, categorize them, and place in appropriate location (trash or counter)"
}

Step 3: Visual Grounding

# Grounding Node
camera_image = camera.get_latest_frame()

# Detect all objects on table
detections = owl_vit.detect(
    image=camera_image,
    queries=["objects on a table"]
)

# detections: [
# {id: 0, label: "cup", bbox: [320, 240, 80, 120], score: 0.91},
# {id: 1, label: "plate", bbox: [500, 280, 100, 100], score: 0.87},
# {id: 2, label: "napkin", bbox: [200, 300, 60, 40], score: 0.79}
# ]

Step 4: Segmentation

# Segmentation Node
for det in detections:
    mask = sam.segment(camera_image, bbox=det.bbox)
    det.mask = mask

Step 5: 3D Pose Estimation

# Grasp Planner Node
depth_image = depth_camera.get_latest_frame()

for det in detections:
    # Get depth at object center
    center_depth = depth_image[det.bbox.center_y, det.bbox.center_x]

    # Convert to 3D point
    point_3d = camera_model.projectPixelTo3dRay(
        (det.bbox.center_x, det.bbox.center_y)
    ) * center_depth

    # Compute grasp pose
    grasp_pose = compute_grasp_pose(
        mask=det.mask,
        point_3d=point_3d,
        surface_normal=estimate_normal(depth_image, det.mask)
    )

    det.grasp_pose = grasp_pose

Step 6: Classification and Sorting

# Action Executor Node
for det in detections:
    # Navigate to object
    navigate_to(det.grasp_pose.position)

    # Grasp object
    grasp(det.grasp_pose)

    # Classify: trash or keep?
    classification = clip.classify(
        camera.get_gripper_view(),
        labels=["trash", "dishes", "clean_items"]
    )

    if classification == "trash":
        destination = "trash_bin"
    elif classification == "dishes":
        destination = "sink"
    else:
        destination = "counter"

    # Navigate and place
    navigate_to(destination)
    place(destination)

    # Return to table
    navigate_to("table")

# Final feedback
speak("Table is clean")

Execution Timeline:

0:00 - User speaks: "Clean the table"
0:02 - Whisper transcribes
0:03 - LLM generates plan
0:04 - Robot: "I'll clean the table for you"
0:06 - Navigate to table (2 seconds)
0:08 - Detect 3 objects on table
0:09 - Segment and compute grasp poses
0:10 - Grasp cup → Classify: dishes → Place in sink (15s)
0:25 - Grasp plate → Classify: dishes → Place in sink (15s)
0:40 - Grasp napkin → Classify: trash → Place in bin (12s)
0:52 - Robot: "Table is clean"

Example 2: "Bring me a drink"

User Command:

User: "Bring me a drink"

Steps 1-2: Speech + LLM Planning

llm_response = {
    "plan": [
        {"action": "speak", "text": "I'll get you a drink"},
        {"action": "navigate", "location": "kitchen"},
        {"action": "detect_objects", "query": "drinks"},
        {"action": "select_object", "criteria": "closest_drink"},
        {"action": "grasp", "object": "selected_drink"},
        {"action": "navigate", "location": "user"},
        {"action": "place", "location": "table_near_user"},
        {"action": "speak", "text": "Here's your drink"}
    ]
}

Step 3: Visual Grounding in Kitchen

# Navigate to kitchen
navigate_to("kitchen")

# Detect drinks
drink_detections = owl_vit.detect(
    camera.get_latest_frame(),
    queries=["a bottle of water", "a can of soda", "a glass"]
)

# Select closest accessible drink
selected_drink = min(
    drink_detections,
    key=lambda d: distance(robot_position, d.position_3d)
)

Step 4: Grasp Planning with Collision Avoidance

# Segment drink
mask = sam.segment(camera_image, bbox=selected_drink.bbox)

# Compute multiple grasp candidates
grasp_candidates = compute_grasp_candidates(
    object_mask=mask,
    depth_image=depth_image,
    num_candidates=5
)

# Check collision-free grasps with MoveIt
for candidate in grasp_candidates:
    trajectory = moveit.plan_grasp(candidate)
    if trajectory.success and not trajectory.collision:
        grasp_pose = candidate
        break

Step 5: Execute Grasp

# Pre-grasp pose
moveit.move_to_pose(grasp_pose.pre_grasp)

# Open gripper
gripper.open()

# Approach
moveit.move_to_pose(grasp_pose.grasp)

# Close gripper
gripper.close()

# Lift
moveit.move_to_pose(grasp_pose.lift)

Step 6: Return to User

# Detect user with person detection
user_detections = detect_persons(camera_image)
user_position = user_detections[0].position_3d

# Navigate to user
navigate_to_person(user_position, safe_distance=1.0)

# Find nearby table
table_detections = owl_vit.detect(camera_image, ["a table"])
place_location = table_detections[0].surface_point

# Place drink gently
place_object(place_location, gentle=True)

# Feedback
speak("Here's your drink")

Example 3: "Follow me to the kitchen"

User Command:

User: "Follow me to the kitchen"

Steps 1-2: Speech + LLM Planning

llm_response = {
    "plan": [
        {"action": "speak", "text": "I'll follow you"},
        {"action": "track_person", "target": "user"},
        {"action": "follow_at_distance", "distance": 1.5, "until": "kitchen"},
        {"action": "speak", "text": "We're in the kitchen"}
    ]
}

Step 3: Person Detection and Tracking

class PersonFollower(Node):
    def __init__(self):
        super().__init__('person_follower')

        # Person detector (YOLOv8-pose)
        self.person_detector = YOLOv8Pose()

        # State
        self.tracking_person_id = None
        self.following = False

    def follow_person(self):
        while self.following:
            # Detect persons
            detections = self.person_detector.detect(camera.get_frame())

            if self.tracking_person_id is None:
                # Start tracking closest person
                self.tracking_person_id = min(
                    detections,
                    key=lambda d: d.distance
                ).id

            # Get tracked person
            person = next(
                (d for d in detections if d.id == self.tracking_person_id),
                None
            )

            if person is None:
                # Lost track
                self.stop_and_search()
                continue

            # Maintain distance
            target_distance = 1.5  # meters
            current_distance = person.distance

            if current_distance > target_distance + 0.3:
                # Person too far, move closer
                self.navigate_toward(person.position, speed=0.4)
            elif current_distance < target_distance - 0.3:
                # Person too close, back up
                self.navigate_away(person.position, speed=0.2)
            else:
                # Maintain position
                self.stop()

            # Check if reached kitchen
            current_location = self.get_current_location()
            if current_location == "kitchen":
                self.following = False
                speak("We're in the kitchen")

Step 4: Obstacle Avoidance While Following

def navigate_toward(self, person_position, speed):
    # Get costmap from Nav2
    costmap = self.get_local_costmap()

    # Compute vector to person
    direction = person_position - self.get_position()
    direction_normalized = direction / np.linalg.norm(direction)

    # Check if path is clear
    if is_path_clear(costmap, direction_normalized):
        # Move directly toward person
        velocity = direction_normalized * speed
        self.publish_velocity(velocity)
    else:
        # Compute detour around obstacle
        detour_path = self.plan_local_detour(person_position)
        self.follow_path(detour_path, speed)

Step 5: Re-acquisition if Person Lost

def stop_and_search(self):
    self.stop()
    speak("I lost you. Please wave.")

    # Rotate in place to search
    for angle in [0, 90, 180, 270]:
        self.rotate_to(angle)
        time.sleep(1)

        # Detect persons
        detections = self.person_detector.detect(camera.get_frame())

        if len(detections) > 0:
            # Found person, resume tracking
            self.tracking_person_id = detections[0].id
            speak("Found you!")
            return

    # Still lost
    speak("I can't find you. Please come back to me.")

Production Deployment

Hardware Requirements

Compute:

  • NVIDIA Jetson Orin (32GB) or higher
  • GPU acceleration essential for real-time VLA

Sensors:

  • Stereo cameras (Visual SLAM + depth)
  • Microphone array (4+ mics for beamforming)
  • IMU (balance and state estimation)
  • LiDAR (optional, for robust navigation)

Actuators:

  • 7-DOF arms (manipulation)
  • 6-DOF legs (bipedal walking)
  • 2-finger gripper (grasping)

Software Stack

# docker-compose.yml
services:
  ros2_core:
    image: osrf/ros:humble-desktop
    command: ros2 launch humanoid_vla full_system.launch.py

  whisper_asr:
    image: humanoid_vla/whisper:latest
    runtime: nvidia
    environment:
      - MODEL_SIZE=small
      - LANGUAGE=en

  llm_planner:
    image: humanoid_vla/llm_planner:latest
    environment:
      - LLM_API_KEY=${OPENAI_API_KEY}
      - LLM_MODEL=gpt-4

  vision_pipeline:
    image: humanoid_vla/vision:latest
    runtime: nvidia
    devices:
      - /dev/video0:/dev/video0

  nav2:
    image: osrf/ros:humble-navigation
    command: ros2 launch nav2_bringup bringup_launch.py
Launch Configuration

# full_system.launch.py
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        # Audio pipeline
        Node(package='humanoid_vla', executable='audio_processing_node'),
        Node(package='humanoid_vla', executable='whisper_asr_node'),

        # Vision pipeline
        Node(package='humanoid_vla', executable='camera_node'),
        Node(package='humanoid_vla', executable='grounding_node'),
        Node(package='humanoid_vla', executable='segmentation_node'),

        # Planning and execution
        Node(package='humanoid_vla', executable='llm_planner_node'),
        Node(package='humanoid_vla', executable='action_executor_node'),

        # Navigation
        Node(package='nav2_bringup', executable='nav2_bringup'),

        # Manipulation
        Node(package='moveit_ros_move_group', executable='move_group'),

        # State management
        Node(package='humanoid_vla', executable='state_manager_node'),
    ])

Debugging and Troubleshooting

Common VLA System Failures

1. Speech Recognition Failures

Symptom: Whisper outputs garbled text or empty strings
Causes:
- Background noise overwhelming speech
- Microphone too far from speaker
- Low audio volume
- Wrong language model loaded

Debugging:
1. Check audio levels: ros2 topic echo /audio/input
2. Record raw audio and test Whisper offline
3. Verify microphone gain settings
4. Test with synthetic TTS audio (eliminates mic issues)

Fix:
- Adjust VAD sensitivity
- Use noise cancellation (RNN Noise Suppression)
- Move to directional microphone or array
- Add wake word detection to segment speech
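
The audio-level check from the debugging steps can be run offline on a recorded clip with a simple energy-based measure before blaming Whisper. A rough sketch — the 0.02 RMS threshold and 400-sample frame size are illustrative values, not tuned defaults:

```python
import numpy as np

def speech_ratio(samples: np.ndarray, frame_len: int = 400, threshold: float = 0.02) -> float:
    """Fraction of fixed-size frames whose RMS energy exceeds a VAD-style threshold.

    A near-zero ratio on a clip that should contain speech points to mic gain
    or distance problems rather than an ASR bug.
    """
    n = len(samples) // frame_len
    if n == 0:
        return 0.0
    frames = samples[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return float((rms > threshold).mean())
```

If the ratio looks healthy but transcripts are still garbled, the problem is more likely in the model or language configuration than in audio capture.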

2. LLM Planning Failures

Symptom: LLM generates infeasible or nonsensical plans
Causes:
- Insufficient context in prompt
- Hallucination (LLM invents capabilities)
- Prompt injection attack
- Outdated robot state information

Debugging:
1. Log full LLM prompt + response
2. Manually test prompt in ChatGPT/Claude web UI
3. Check if robot state is current
4. Validate action constraints

Fix:
- Enrich prompt with environment state
- Add validation layer before execution
- Use constrained decoding (force JSON schema)
- Implement prompt versioning and A/B testing
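
The validation layer mentioned above can start as simple as checking every step of the LLM's JSON plan against a whitelist of capabilities and their required fields, so hallucinated actions are rejected before execution. A minimal sketch — the capability table here is a hypothetical example, not the system's actual schema:

```python
# Hypothetical capability table: action name -> required fields.
ROBOT_CAPABILITIES = {
    "speak": {"text"},
    "navigate": {"location"},
    "detect_objects": set(),
    "grasp": {"object"},
    "place": {"location"},
}

def validate_plan(plan: dict) -> list:
    """Return a list of problems; an empty list means the plan passed."""
    errors = []
    for i, step in enumerate(plan.get("plan", [])):
        action = step.get("action")
        if action not in ROBOT_CAPABILITIES:
            # The LLM invented a capability the robot does not have.
            errors.append(f"step {i}: unknown action '{action}'")
            continue
        missing = ROBOT_CAPABILITIES[action] - step.keys()
        if missing:
            errors.append(f"step {i}: '{action}' missing fields {sorted(missing)}")
    return errors
```

Rejected plans can be fed back to the LLM with the error list as context, which is usually enough for it to self-correct.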

3. Grounding Failures

Symptom: Robot can't locate objects mentioned in command
Causes:
- Object not in camera view
- Poor lighting (overexposed/underexposed)
- Object occluded by other objects
- OWL-ViT query phrasing mismatch

Debugging:
1. Visualize bounding boxes: ros2 run rqt_image_view rqt_image_view
2. Check detection confidence scores
3. Test with known objects in controlled lighting
4. Compare CLIP embeddings between query and image

Fix:
- Add head movement to scan environment
- Improve lighting or use HDR cameras
- Try multiple query phrasings ("red cup" vs "a red ceramic mug")
- Lower the confidence threshold (at the cost of more false positives)
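
Trying multiple query phrasings can be wrapped in a small helper that keeps the single highest-scoring detection. A sketch assuming the detector is any callable returning OWL-ViT-style detection dicts with a 'score' key (the callable itself would be whatever wrapper the grounding node already uses):

```python
def detect_with_paraphrases(detect, image, phrasings, min_score=0.3):
    """Run a detector over several query phrasings; return the best detection.

    `detect(image, query)` is assumed to return a list of dicts with at
    least a 'score' key. Returns None if no detection clears min_score.
    """
    best = None
    for query in phrasings:
        for det in detect(image, query):
            if det["score"] >= min_score and (best is None or det["score"] > best["score"]):
                # Remember which phrasing worked, for logging/debugging.
                best = dict(det, query=query)
    return best
```

Logging which phrasing won also builds a record of what vocabulary the detector actually responds to in your environment.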

4. Execution Failures

Symptom: Robot starts action but fails partway through
Causes:
- Collision detected by safety systems
- Joint limits reached (object out of reach)
- Gripper fails to grasp (slippery object)
- Lost localization (SLAM drift)

Debugging:
1. Check TF tree: ros2 run tf2_tools view_frames
2. Monitor joint states: ros2 topic echo /joint_states
3. Visualize costmap in RViz (check for phantom obstacles)
4. Review MoveIt collision scene

Fix:
- Replan with updated obstacle map
- Adjust grasp pose (different approach angle)
- Reset SLAM map and relocalize
- Reduce velocity limits for more precise control
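
Retrying with a different approach angle amounts to iterating over grasp candidates in score order, as in Example 2's collision-checking loop. A sketch — `plan_fn` stands in for the MoveIt planning call, and the `.success`/`.collision` result shape mirrors the earlier example rather than any specific MoveIt API:

```python
def try_grasp_candidates(candidates, plan_fn, max_attempts=3):
    """Try grasp candidates best-score-first until one plans collision-free.

    Returns (candidate, trajectory), or (None, None) if every attempt fails,
    at which point the executor should trigger a replan or ask for help.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    for cand in ranked[:max_attempts]:
        traj = plan_fn(cand)
        if traj.success and not traj.collision:
            return cand, traj
    return None, None
```

Capping attempts keeps a bad perception result from stalling the executor indefinitely.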

Performance Monitoring

class VLAPerformanceMonitor(Node):
    def __init__(self):
        super().__init__('vla_monitor')

        self.metrics = {
            "whisper_latency": [],
            "llm_latency": [],
            "grounding_latency": [],
            "action_success_rate": [],
            "total_task_time": []
        }

    def log_metrics(self):
        self.get_logger().info(f"""
=== VLA Performance Metrics ===
Whisper avg latency: {np.mean(self.metrics['whisper_latency']):.2f}s
LLM avg latency: {np.mean(self.metrics['llm_latency']):.2f}s
Grounding avg latency: {np.mean(self.metrics['grounding_latency']):.2f}s
Action success rate: {np.mean(self.metrics['action_success_rate'])*100:.1f}%
Avg task completion time: {np.mean(self.metrics['total_task_time']):.1f}s
""")

    def alert_if_degraded(self):
        # Alert if latency spikes
        if np.mean(self.metrics['llm_latency'][-10:]) > 5.0:
            self.get_logger().warn("LLM latency degraded > 5s")

        # Alert if success rate drops
        if np.mean(self.metrics['action_success_rate'][-10:]) < 0.7:
            self.get_logger().error("Action success rate below 70%!")

Integration Testing Strategy

def test_vla_pipeline_end_to_end():
    """Test complete pipeline from speech to execution."""

    # Test 1: Simple navigation command
    test_audio = load_audio("test_data/bring_me_cup.wav")
    result = vla_system.execute_audio_command(test_audio)

    assert result.success
    assert "cup" in result.detected_objects
    assert result.final_position == "near_user"

    # Test 2: Multi-step task
    test_audio = load_audio("test_data/clean_table.wav")
    result = vla_system.execute_audio_command(test_audio)

    assert result.success
    assert result.actions_completed >= 5  # Detect, grasp, place for multiple objects
    assert result.objects_cleared_from_table > 0

    # Test 3: Ambiguous command handling
    test_audio = load_audio("test_data/pick_it_up.wav")
    result = vla_system.execute_audio_command(test_audio)

    assert result.requested_clarification
    assert "which object" in result.clarification_question.lower()

    # Test 4: Impossible task handling
    test_audio = load_audio("test_data/lift_car.wav")
    result = vla_system.execute_audio_command(test_audio)

    assert not result.success
    assert result.failure_reason == "exceeds_weight_limit"
    assert result.alternative_suggested

Best Practices

1. Human-in-the-Loop

Always confirm critical actions:

def execute_action_with_confirmation(action):
    if is_critical_action(action):
        speak(f"I'm about to {action}. Should I proceed?")
        response = wait_for_confirmation(timeout=10)

        if response != "yes":
            speak("Action cancelled")
            return False

    return execute_action(action)

2. Graceful Degradation

Handle failures at every level:

def robust_execute(action_plan):
    for i, action in enumerate(action_plan):
        try:
            result = execute_action(action)

            if result.failed:
                # Ask LLM to replan from current state
                new_plan = llm.replan(
                    original_plan=action_plan,
                    failed_action=action,
                    error=result.error,
                    remaining_actions=action_plan[i+1:]
                )
                return robust_execute(new_plan)

        except Exception as e:
            speak(f"Something went wrong: {e}")
            return False

    return True

3. Safety Constraints

Enforce safety rules in code:

def validate_action(action):
    # No grasping while navigating
    if robot.is_navigating() and action.type == "grasp":
        return False, "Cannot grasp while moving"

    # Weight limit
    if action.type == "grasp" and action.object.weight > 5.0:
        return False, "Object too heavy"

    # Person proximity
    if person_detected_in_path() and action.type == "navigate":
        return False, "Person in path"

    return True, "Safe"

4. Monitoring and Logging

Log everything for debugging:

class ActionLogger(Node):
    def __init__(self):
        super().__init__('action_logger')

        self.action_log = []

    def log_action(self, action, result):
        entry = {
            "timestamp": time.time(),
            "action": action,
            "result": result,
            "robot_state": self.get_robot_state(),
            "battery": self.get_battery_level()
        }

        self.action_log.append(entry)

        # Save to database
        self.db.insert(entry)

Summary

This capstone demonstrated:

Complete VLA Pipeline:

  • Speech → Whisper → LLM → Grounding → Segmentation → Execution
  • ROS 2 node architecture for production systems
  • Real-time integration with Nav2 and MoveIt

Three Complete Examples:

  1. "Clean the table": Detection → Classification → Sorting
  2. "Bring me a drink": Navigation → Manipulation → Delivery
  3. "Follow me": Person tracking → Dynamic following

Production Deployment:

  • Hardware requirements (Jetson Orin + sensors)
  • Docker containerization
  • Launch file orchestration

Best Practices:

  • Human-in-the-loop confirmation
  • Graceful degradation and replanning
  • Safety constraints enforcement
  • Comprehensive logging

The Future: VLA systems are rapidly evolving. Expect:

  • Faster LLMs (sub-second planning)
  • Better vision models (GPT-4V successors)
  • End-to-end learned VLA policies
  • Improved sim-to-real transfer

You now have the complete toolkit to build autonomous humanoid robots that understand and execute natural language commands!


Congratulations! You've completed Module 4. Continue to References for the complete bibliography.

References

Ahn, M., et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv preprint arXiv:2204.01691.

Driess, D., et al. (2023). PaLM-E: An Embodied Multimodal Language Model. arXiv preprint arXiv:2303.03378.

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv preprint arXiv:2307.15818.