Multimodal Perception

Introduction

Language alone isn't enough for robots. When a user says "pick up the red cup on the left," the robot must:

See the environment (vision)
Understand the command (language)
Connect language to visual objects (grounding)
Reason about spatial relationships (left, on, near)

Multimodal perception bridges language and vision, enabling robots to understand referring expressions and execute visually-grounded commands.

The Grounding Problem

What is Grounding?

Grounding maps linguistic references to physical entities in the world.

Example:

Command: "Pick up the red cup"

Vision: [Object 1: Blue plate], [Object 2: Red cup], [Object 3: Green book]

Grounding: "red cup" → Object 2 → Bounding box [320, 240, 150, 200]

Action: grasp(object_id=2, bbox=[320, 240, 150, 200])

Without grounding, the LLM knows what a "red cup" is conceptually but can't locate it in the camera image.

Challenges

1. Referring Expressions

"The cup" (which cup? there are 3)
"The cup on the left" (left relative to robot or user?)
"The big red cup" (how big is "big"?)
"My cup" (requires tracking ownership)

2. Partial Observability

Objects occluded by other objects
Object partially out of frame
Object behind the robot (not visible)

3. Ambiguity

"Pick up the book" when there are 5 books on the table
"The red one" when there are 3 red objects

Solution: Vision-language models that learn joint representations of images and text.

Vision-Language Models

CLIP: Contrastive Language-Image Pre-training

CLIP learns a shared embedding space where similar images and text have similar representations.

Training:

Image: [Photo of a red cup]
Text: "A red ceramic cup on a table"

CLIP learns: embed(image) ≈ embed(text)

Usage for Robotics:

import torch
import clip
from PIL import Image

# Load CLIP model
model, preprocess = clip.load("ViT-B/32", device="cuda")

# Image from robot camera
image = Image.open("robot_view.jpg")
image_input = preprocess(image).unsqueeze(0).to("cuda")

# Possible objects
text_queries = ["a red cup", "a blue plate", "a green book"]
text_inputs = clip.tokenize(text_queries).to("cuda")

# Compute similarities
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

    # Cosine similarity
    similarity = (image_features @ text_features.T).softmax(dim=-1)

print(similarity)
# Output: [[0.85, 0.10, 0.05]]  # 85% confident it's a red cup

Application: Zero-shot object classification without training on robotics data.

OWL-ViT: Open-Vocabulary Object Detection

OWL-ViT extends CLIP to detect objects described by text queries.

Query:

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("robot_view.jpg")
text_queries = ["a red cup", "a blue plate"]

inputs = processor(text=text_queries, images=image, return_tensors="pt")
outputs = model(**inputs)

# Get bounding boxes
boxes = outputs.pred_boxes[0].detach().cpu().numpy()
scores = outputs.logits[0].softmax(-1).max(-1).values.detach().cpu().numpy()

for query, box, score in zip(text_queries, boxes, scores):
    if score > 0.3:
        print(f"{query}: {box} (confidence: {score:.2f})")

# Output:
# a red cup: [0.45, 0.32, 0.15, 0.20] (confidence: 0.87)

Application: Detect objects from natural language descriptions without retraining.

SAM: Segment Anything Model

SAM generates pixel-perfect segmentation masks for objects.

Usage:

from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# Set image
image = cv2.imread("robot_view.jpg")
predictor.set_image(image)

# Provide bounding box prompt (from OWL-ViT)
input_box = np.array([320, 240, 470, 440])  # [x1, y1, x2, y2]

masks, scores, _ = predictor.predict(box=input_box)

# masks[0] is binary mask for the object
segmentation_mask = masks[0]

Application: Precise object boundaries for grasp planning and collision avoidance.

ROS 2 Multimodal Pipeline

Architecture

Grounding Node Implementation

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from std_msgs.msg import String
from vision_msgs.msg import Detection2DArray, Detection2D
from cv_bridge import CvBridge
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection

class GroundingNode(Node):
    def __init__(self):
        super().__init__('grounding_node')

        # Load OWL-ViT
        self.processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
        self.model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

        # Subscribe to camera
        self.image_sub = self.create_subscription(
            Image, '/camera/image_raw', self.image_callback, 10
        )

        # Subscribe to grounding queries
        self.query_sub = self.create_subscription(
            String, '/grounding/query', self.query_callback, 10
        )

        # Publish detections
        self.detection_pub = self.create_publisher(
            Detection2DArray, '/grounding/detections', 10
        )

        self.bridge = CvBridge()
        self.latest_image = None

    def image_callback(self, msg):
        # Store latest camera frame
        self.latest_image = self.bridge.imgmsg_to_cv2(msg, "bgr8")

    def query_callback(self, msg):
        query = msg.data
        self.get_logger().info(f"Grounding query: {query}")

        if self.latest_image is None:
            self.get_logger().warn("No camera image available")
            return

        # Detect objects matching query
        detections = self.ground_query(query, self.latest_image)

        # Publish detections
        detection_msg = Detection2DArray()
        detection_msg.detections = detections
        self.detection_pub.publish(detection_msg)

    def ground_query(self, query, image):
        # Run OWL-ViT
        inputs = self.processor(text=[query], images=image, return_tensors="pt")
        outputs = self.model(**inputs)

        # Extract boxes and scores
        boxes = outputs.pred_boxes[0].detach().cpu().numpy()
        scores = outputs.logits[0].softmax(-1).max(-1).values.detach().cpu().numpy()

        detections = []
        for box, score in zip(boxes, scores):
            if score > 0.3:
                detection = Detection2D()
                detection.bbox.center.x = box[0]
                detection.bbox.center.y = box[1]
                detection.bbox.size_x = box[2]
                detection.bbox.size_y = box[3]
                detection.results[0].score = float(score)
                detections.append(detection)

        return detections

LLM Integration with Vision

class MultimodalPlanner(Node):
    def __init__(self):
        super().__init__('multimodal_planner')

        # Subscribe to voice commands
        self.command_sub = self.create_subscription(
            String, '/voice/command', self.command_callback, 10
        )

        # Grounding query publisher
        self.grounding_pub = self.create_publisher(
            String, '/grounding/query', 10
        )

        # Detection subscriber
        self.detection_sub = self.create_subscription(
            Detection2DArray, '/grounding/detections', self.detection_callback, 10
        )

        self.llm = LLMClient()
        self.current_detections = None

    def command_callback(self, msg):
        command = msg.data  # "Pick up the red cup"

        # Ask LLM to extract grounding query
        prompt = f"""
        User command: "{command}"

        Extract the object to be detected. Return only the object description.

        Example:
        Command: "Pick up the red cup"
        Object: "a red cup"

        Command: "Navigate to the table"
        Object: "a table"

        Command: "{command}"
        Object:
        """

        grounding_query = self.llm.generate(prompt).strip()
        self.get_logger().info(f"Grounding query: {grounding_query}")

        # Request grounding
        self.grounding_pub.publish(String(data=grounding_query))

    def detection_callback(self, msg):
        self.current_detections = msg.detections

        if len(self.current_detections) == 0:
            self.get_logger().warn("No objects detected")
            return

        # Select best detection
        best_detection = max(self.current_detections, key=lambda d: d.results[0].score)

        self.get_logger().info(f"Grounded object at: {best_detection.bbox}")

        # Execute grasp action
        self.execute_grasp(best_detection.bbox)

Spatial Reasoning

Handling Referring Expressions

Left/Right:

def resolve_spatial_reference(detections, reference):
    """
    detections: List of bounding boxes
    reference: "left", "right", "center"
    """
    if reference == "left":
        # Sort by x-coordinate, return leftmost
        return min(detections, key=lambda d: d.bbox.center.x)
    elif reference == "right":
        return max(detections, key=lambda d: d.bbox.center.x)
    elif reference == "center":
        image_center_x = 640  # Image width / 2
        return min(detections, key=lambda d: abs(d.bbox.center.x - image_center_x))

Size Comparisons:

def resolve_size_reference(detections, reference):
    """
    reference: "big", "small", "biggest", "smallest"
    """
    if reference in ["big", "biggest"]:
        return max(detections, key=lambda d: d.bbox.size_x * d.bbox.size_y)
    elif reference in ["small", "smallest"]:
        return min(detections, key=lambda d: d.bbox.size_x * d.bbox.size_y)

Distance from Robot:

def resolve_distance_reference(detections, depth_map, reference):
    """
    reference: "closest", "farthest"
    """
    distances = []
    for det in detections:
        # Get depth at object center
        x, y = int(det.bbox.center.x), int(det.bbox.center.y)
        distance = depth_map[y, x]
        distances.append(distance)

    if reference == "closest":
        idx = np.argmin(distances)
    elif reference == "farthest":
        idx = np.argmax(distances)

    return detections[idx]

Combining Constraints

command = "Pick up the small red cup on the left"

# Parse command
attributes = parse_command(command)
# {"color": "red", "size": "small", "position": "left", "object": "cup"}

# Detect all cups
all_cups = detect_objects("a cup")

# Filter by color
red_cups = filter_by_text(all_cups, "red cup")

# Filter by size
small_red_cups = resolve_size_reference(red_cups, "small")

# Filter by position
target_cup = resolve_spatial_reference([small_red_cups], "left")

# Execute grasp
execute_grasp(target_cup)

State Estimation and Context

Tracking Objects Over Time

class ObjectTracker(Node):
    def __init__(self):
        super().__init__('object_tracker')

        self.tracked_objects = {}  # {object_id: DetectionHistory}

    def update(self, new_detections):
        for det in new_detections:
            # Match to existing object or create new
            object_id = self.match_detection(det)

            if object_id is None:
                # New object
                object_id = self.create_new_track()

            # Update track
            self.tracked_objects[object_id].add_detection(det)

    def match_detection(self, det):
        # Simple IOU-based matching
        for obj_id, track in self.tracked_objects.items():
            if iou(det.bbox, track.latest_bbox) > 0.5:
                return obj_id
        return None

Application: Resolve "it", "that", "the same object" references.

Contextual Understanding

# User says: "Pick up the cup"
# Robot sees 3 cups

# Use context to disambiguate:
context = {
    "last_mentioned_object": "red cup",
    "user_gaze_direction": [0.5, 0.3],  # User looking at center-left
    "recent_actions": ["navigate to table"]
}

# Prioritize object in user's gaze direction
def select_object_with_context(detections, context):
    gaze_x, gaze_y = context["user_gaze_direction"]

    # Find object closest to gaze
    best_match = min(detections, key=lambda d:
        distance([d.bbox.center.x, d.bbox.center.y], [gaze_x, gaze_y])
    )

    return best_match

Complete Example: "Bring me the red cup"

Step 1: Voice Command

User: "Bring me the red cup"
Whisper → Text: "bring me the red cup"

Step 2: LLM Decomposition

llm_plan = [
    {"action": "detect", "query": "a red cup"},
    {"action": "navigate", "target": "detected_object"},
    {"action": "grasp", "object": "detected_object"},
    {"action": "navigate", "target": "user"},
    {"action": "place", "location": "table"}
]

Step 3: Visual Grounding

# Execute detect action
grounding_query = "a red cup"
detections = owl_vit_detect(camera_image, grounding_query)

# detections[0]: bbox=[320, 240, 150, 200], score=0.89

Step 4: Segmentation

# Get precise mask for grasp planning
mask = sam_segment(camera_image, bbox=detections[0].bbox)

Step 5: Grasp Planning

# Compute 3D grasp pose from mask and depth
depth = get_depth_at_bbox(detections[0].bbox)
grasp_pose = compute_grasp_pose(mask, depth)

Step 6: Execution

# Navigate to object
navigate(grasp_pose.position)

# Grasp
grasp(grasp_pose)

# Navigate to user
navigate(user_position)

# Place
place(table_position)

Advanced Grounding Techniques

Visual Question Answering (VQA) for Verification

After detecting an object, verify it matches the command using VQA:

from transformers import ViltProcessor, ViltForQuestionAnswering

class GroundingVerifier(Node):
    def __init__(self):
        super().__init__('grounding_verifier')

        # Load VQA model
        self.vqa_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
        self.vqa_model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

    def verify_detection(self, image, bbox, expected_object):
        """
        Verify that detected bounding box contains the expected object.

        Example:
        Command: "Pick up the red cup"
        Detected: bbox at [320, 240, 150, 200]
        Verify: Is this actually a red cup?
        """

        # Crop to bounding box
        x, y, w, h = bbox
        cropped_image = image[y:y+h, x:x+w]

        # Ask VQA model
        questions = [
            f"Is this a {expected_object}?",
            f"What color is this {expected_object.split()[-1]}?",  # "red cup" → "cup"
            "What object is this?"
        ]

        answers = []
        for question in questions:
            inputs = self.vqa_processor(cropped_image, question, return_tensors="pt")
            outputs = self.vqa_model(**inputs)
            logits = outputs.logits
            answer_idx = logits.argmax(-1).item()
            answer = self.vqa_model.config.id2label[answer_idx]
            answers.append(answer)

        # Verify answers match expectations
        if expected_object in answers[0].lower() and answers[0].lower() == "yes":
            self.get_logger().info(f"Verified: {expected_object}")
            return True
        else:
            self.get_logger().warn(f"Verification failed: expected '{expected_object}', but VQA says '{answers[2]}'")
            return False

Application: Reduces false positives from object detection, especially in cluttered environments.

Scene Graphs for Complex Spatial Reasoning

Build a scene graph to understand object relationships:

from scene_graph_benchmark.scene_parser import SceneParser

class SceneGraphGrounding(Node):
    def __init__(self):
        super().__init__('scene_graph_grounding')
        self.parser = SceneParser()

    def parse_scene(self, image):
        """
        Generate scene graph with object relationships.

        Scene Graph:
        - cup [on] table
        - book [next to] cup
        - person [behind] table
        """

        scene_graph = self.parser.parse(image)

        # Example output:
        # {
        #   "objects": [
        #     {"id": 0, "label": "cup", "bbox": [320, 240, 80, 120]},
        #     {"id": 1, "label": "table", "bbox": [100, 300, 600, 200]},
        #     {"id": 2, "label": "book", "bbox": [450, 250, 100, 80]}
        #   ],
        #   "relationships": [
        #     {"subject": 0, "predicate": "on", "object": 1},  # cup on table
        #     {"subject": 2, "predicate": "next_to", "object": 0}  # book next to cup
        #   ]
        # }

        return scene_graph

    def ground_with_relationships(self, command, scene_graph):
        """
        Ground referring expression using relationships.

        Command: "Pick up the cup on the table"
        → Find cup object
        → Verify it has "on" relationship with table
        """

        # Parse command for relationships
        if "on the table" in command:
            # Find table object
            table = next(o for o in scene_graph["objects"] if o["label"] == "table")

            # Find objects on table
            on_table_ids = [
                r["subject"]
                for r in scene_graph["relationships"]
                if r["predicate"] == "on" and r["object"] == table["id"]
            ]

            # Filter for cup
            cups_on_table = [
                o for o in scene_graph["objects"]
                if o["label"] == "cup" and o["id"] in on_table_ids
            ]

            return cups_on_table[0]  # Return first matching cup

Handling Negations and Exclusions

Support commands like "pick up the cup that's NOT red":

def ground_with_negation(command, image):
    """
    Handle negative constraints in commands.

    Examples:
    - "Pick up the cup that's not red"
    - "Bring me a book, but not the blue one"
    - "Navigate to any table except the one in the kitchen"
    """

    # Parse negation
    if "not" in command or "except" in command:
        # Extract base object and negated property
        if "not red" in command:
            base_object = "cup"
            excluded_property = "red"

            # Detect all cups
            all_cups = owl_vit.detect(image, "a cup")

            # Filter by color exclusion
            non_red_cups = []
            for cup in all_cups:
                # Check color
                color = clip.classify(
                    crop_bbox(image, cup.bbox),
                    labels=["red", "blue", "white", "black", "green"]
                )

                if color != "red":
                    non_red_cups.append(cup)

            return non_red_cups[0]  # Return first non-red cup

Temporal Reasoning for Dynamic Scenes

Track objects across time for temporal commands:

class TemporalGrounding(Node):
    def __init__(self):
        super().__init__('temporal_grounding')

        self.object_history = {}  # {object_id: [detection_t0, detection_t1, ...]}

    def update(self, detections, timestamp):
        """
        Track objects over time.

        Enables commands like:
        - "Pick up the cup that just moved"
        - "Bring me the object that was on the table a minute ago"
        - "Navigate to where you saw the person last"
        """

        for det in detections:
            obj_id = det.id

            if obj_id not in self.object_history:
                self.object_history[obj_id] = []

            self.object_history[obj_id].append({
                "timestamp": timestamp,
                "bbox": det.bbox,
                "position_3d": det.position_3d
            })

    def ground_temporal_query(self, query):
        """
        Query: "the cup that just moved"
        → Find object with highest recent position change
        """

        if "just moved" in query:
            # Calculate recent motion for all objects
            motion_scores = {}

            for obj_id, history in self.object_history.items():
                if len(history) < 2:
                    continue

                # Compare last 2 positions
                recent_positions = [h["position_3d"] for h in history[-5:]]
                motion = np.std(recent_positions, axis=0)  # Standard deviation
                motion_magnitude = np.linalg.norm(motion)

                motion_scores[obj_id] = motion_magnitude

            # Return object with most motion
            moving_object_id = max(motion_scores, key=motion_scores.get)
            return self.get_current_detection(moving_object_id)

        elif "was on the table" in query and "ago" in query:
            # Find object that was on table in past
            time_delta = parse_time_delta(query)  # "a minute ago" → 60 seconds
            target_timestamp = time.time() - time_delta

            # Search history for object on table at target time
            for obj_id, history in self.object_history.items():
                for h in history:
                    if abs(h["timestamp"] - target_timestamp) < 5:  # Within 5s
                        if self.was_on_table(h["position_3d"]):
                            return self.get_current_detection(obj_id)

Compositional Grounding for Complex Descriptions

Handle multi-attribute descriptions:

def ground_compositional(command, image):
    """
    Complex description: "the small red cup next to the blue plate on the left side of the table"

    Strategy:
    1. Parse into components: [small, red, cup, next_to, blue plate, left side, table]
    2. Ground each component independently
    3. Combine with logical AND
    """

    components = parse_description(command)
    # {
    #   "object": "cup",
    #   "attributes": ["small", "red"],
    #   "spatial": "left side of table",
    #   "relations": ["next to blue plate"]
    # }

    # Step 1: Detect all cups
    candidates = owl_vit.detect(image, "a cup")

    # Step 2: Filter by attributes
    for attr in components["attributes"]:
        if attr == "small":
            candidates = [c for c in candidates if is_small(c)]
        elif attr == "red":
            color_candidates = []
            for c in candidates:
                color = classify_color(crop_bbox(image, c.bbox))
                if color == "red":
                    color_candidates.append(c)
            candidates = color_candidates

    # Step 3: Filter by spatial constraints
    if "left side" in components["spatial"]:
        table = detect_table(image)
        table_center_x = table.bbox.center_x
        candidates = [c for c in candidates if c.bbox.center_x < table_center_x]

    # Step 4: Filter by relations
    if "next to blue plate" in components["relations"]:
        blue_plate = detect_blue_plate(image)
        candidates = [
            c for c in candidates
            if is_next_to(c.bbox, blue_plate.bbox, threshold=100)  # pixels
        ]

    # Return best match
    return candidates[0] if candidates else None

Production Deployment Considerations

Latency Optimization

Vision-language models can be slow. Optimize for real-time performance:

class OptimizedGroundingPipeline(Node):
    def __init__(self):
        super().__init__('optimized_grounding')

        # Pre-load all models at startup
        self.owl_vit = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
        self.owl_vit.to("cuda")  # Move to GPU
        self.owl_vit.eval()  # Inference mode

        self.sam = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth"))

        # Cache recent detections (avoid recomputing)
        self.detection_cache = {}
        self.cache_ttl = 2.0  # seconds

    def detect_with_cache(self, image, query):
        """Use cached detections if available and recent"""

        cache_key = f"{hash_image(image)}_{query}"

        if cache_key in self.detection_cache:
            cached_detection, timestamp = self.detection_cache[cache_key]
            if time.time() - timestamp < self.cache_ttl:
                self.get_logger().info("Using cached detection")
                return cached_detection

        # Run detection
        detection = self.owl_vit.detect(image, query)
        self.detection_cache[cache_key] = (detection, time.time())

        return detection

Performance Targets:

Object detection: <200ms
Segmentation: <100ms
Total grounding latency: <500ms

Failure Recovery

Handle grounding failures gracefully:

class RobustGroundingNode(Node):
    def ground_with_retry(self, command, max_attempts=3):
        """Retry grounding with progressively relaxed constraints"""

        for attempt in range(max_attempts):
            try:
                # Attempt 1: Strict matching
                if attempt == 0:
                    result = self.ground_strict(command)

                # Attempt 2: Relaxed constraints
                elif attempt == 1:
                    self.get_logger().warn("Strict grounding failed, relaxing constraints")
                    result = self.ground_relaxed(command)

                # Attempt 3: Ask user for help
                else:
                    self.get_logger().error("Grounding failed, requesting user help")
                    self.speak("I can't find that object. Could you point to it or describe it differently?")
                    return None

                if result:
                    return result

            except Exception as e:
                self.get_logger().error(f"Grounding attempt {attempt+1} failed: {e}")

        return None

    def ground_relaxed(self, command):
        """Lower confidence thresholds and try multiple query phrasings"""

        # Try multiple phrasings
        queries = generate_query_variations(command)
        # "red cup" → ["a red cup", "red ceramic cup", "red mug", "cup that is red"]

        for query in queries:
            detections = self.owl_vit.detect(image, query, confidence_threshold=0.2)  # Lower threshold
            if detections:
                return detections[0]

        return None

Summary

Multimodal perception enables robots to understand visually-grounded commands:

Key Technologies:

CLIP: Zero-shot image-text similarity
OWL-ViT: Open-vocabulary object detection
SAM: Pixel-perfect segmentation masks
VQA: Visual question answering for verification
Scene Graphs: Understanding object relationships

Grounding Pipeline:

Parse natural language command
Extract object description and constraints
Detect object in camera image
Verify detection with VQA or secondary checks
Segment object for grasp planning
Execute action with 3D pose

Advanced Techniques:

Compositional grounding for complex multi-attribute descriptions
Temporal reasoning for dynamic scenes and motion
Negation handling ("not red", "except")
Scene graph reasoning for spatial relationships
Multi-modal verification to reduce false positives

Challenges:

Ambiguous referring expressions ("the cup" → which cup?)
Partial observability (occluded objects)
Spatial reasoning (left/right, near/far)
Latency constraints for real-time interaction

Solutions:

Multi-constraint filtering (color + size + position)
Context tracking (user gaze, recent mentions)
Object tracking over time
Caching and model optimization for speed
Graceful degradation with retry strategies

Production Considerations:

Target <500ms total grounding latency
Cache recent detections
Implement retry with relaxed constraints
Always verify detections before execution

Next: Integrate everything into a complete VLA architecture.

Continue to: Capstone: Autonomous Humanoid Architecture

References

Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.

Minderer, M., et al. (2022). Simple Open-Vocabulary Object Detection with Vision Transformers. arXiv preprint arXiv:2205.06230.

Kirillov, A., et al. (2023). Segment Anything. arXiv preprint arXiv:2304.02643.

Introduction​

The Grounding Problem​

What is Grounding?​

Challenges​

Vision-Language Models​

CLIP: Contrastive Language-Image Pre-training​

OWL-ViT: Open-Vocabulary Object Detection​

SAM: Segment Anything Model​

ROS 2 Multimodal Pipeline​

Architecture​

Grounding Node Implementation​

LLM Integration with Vision​

Spatial Reasoning​

Handling Referring Expressions​

Combining Constraints​

State Estimation and Context​

Tracking Objects Over Time​

Contextual Understanding​

Complete Example: "Bring me the red cup"​

Step 1: Voice Command​

Step 2: LLM Decomposition​

Step 3: Visual Grounding​

Step 4: Segmentation​

Step 5: Grasp Planning​

Step 6: Execution​

Advanced Grounding Techniques​

Visual Question Answering (VQA) for Verification​

Scene Graphs for Complex Spatial Reasoning​

Handling Negations and Exclusions​

Temporal Reasoning for Dynamic Scenes​

Compositional Grounding for Complex Descriptions​

Production Deployment Considerations​

Latency Optimization​

Failure Recovery​

Summary​

References​