LLM-Driven Action Planning
Introduction
Large Language Models (LLMs) like GPT-4, Claude, and Llama have revolutionized how robots understand and execute natural language commands. Instead of programming every possible task, we can leverage LLMs to decompose high-level instructions into executable action sequences.
This section covers prompt engineering for robotics, task decomposition strategies, and grounding language in robot capabilities.
Why LLMs for Robot Planning?
Traditional Approach: Explicit Programming
# Every task requires explicit code
if command == "clean_table":
    navigate_to_table()
    detect_objects_on_table()
    for obj in detected_objects:
        grasp_object(obj)
        navigate_to_bin()
        place_object()
        navigate_to_table()
elif command == "bring_drink":
    ...  # Another 20+ lines of code
elif command == "follow_person":
    ...  # Another 20+ lines of code
# 100+ commands = 1000+ lines of brittle code
Problems:
- Requires programming every scenario
- Can't handle variations ("clean the kitchen table" vs "tidy up the table")
- Breaks with unexpected commands
- No reasoning about novel situations
LLM Approach: Natural Language Decomposition
# Universal command executor
user_command = "Clean the table"

# LLM decomposes into action sequence
action_plan = llm.decompose(user_command, robot_capabilities)

# Execute actions
for action in action_plan:
    robot.execute(action)
LLM Output:
[
    {"action": "navigate", "target": "table"},
    {"action": "detect_objects", "surface": "table"},
    {"action": "grasp", "object": "cup"},
    {"action": "place", "location": "counter"},
    {"action": "grasp", "object": "plate"},
    {"action": "place", "location": "counter"}
]
Advantages:
- Handles novel commands without reprogramming
- Generalizes to variations
- Reasons about feasibility
- Explains its decisions
Prompt Engineering for Robotics
The Core Prompt Structure
A robotics prompt must include:
- Robot Capabilities: What actions can the robot perform?
- Environment Context: What does the robot know about its surroundings?
- Task Constraints: Safety rules, physics limitations
- Output Format: Structured action sequence
Example: Humanoid Assistant Prompt
SYSTEM_PROMPT = """
You are a motion planner for a humanoid robot with these capabilities:
ACTIONS:
- navigate(location: str) - Walk to named location
- detect_objects(area: str) - Find objects in area using vision
- grasp(object: str) - Pick up object with gripper
- place(location: str) - Put held object down
- speak(message: str) - Say something to user
KNOWN LOCATIONS:
- kitchen, living_room, bedroom, bathroom
- table, counter, shelf, floor
CONSTRAINTS:
- Can only hold one object at a time
- Must navigate before grasping
- Cannot grasp objects heavier than 5kg
- Must maintain balance while carrying objects
TASK: Decompose the user's command into a sequence of actions.
OUTPUT FORMAT (JSON):
[
    {"action": "navigate", "location": "kitchen"},
    {"action": "detect_objects", "area": "counter"},
    {"action": "grasp", "object": "cup"}
]
If the task is impossible, explain why.
"""
User Command Example
user_command = "Bring me a glass of water from the kitchen"
response = llm.chat([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_command}
])
LLM Response:
{
    "plan": [
        {"action": "speak", "message": "I'll get you water from the kitchen"},
        {"action": "navigate", "location": "kitchen"},
        {"action": "detect_objects", "area": "counter"},
        {"action": "grasp", "object": "glass"},
        {"action": "navigate", "location": "sink"},
        {"action": "place", "location": "sink"},
        {"action": "speak", "message": "Filling glass with water"},
        {"action": "grasp", "object": "glass"},
        {"action": "navigate", "location": "user"},
        {"action": "place", "location": "table"}
    ],
    "reasoning": "I need to navigate to kitchen, find a glass, fill it with water at the sink, then bring it to you."
}
Task Decomposition Strategies
Hierarchical Decomposition
Break complex tasks into subtasks recursively:
"Clean the house"
├── Clean the living room
│   ├── Detect objects on floor
│   ├── Pick up each object
│   └── Place in appropriate location
├── Clean the kitchen
│   ├── Clear the counter
│   └── Wipe surfaces
└── Clean the bedroom
    └── ...
Prompt Technique:
HIERARCHICAL_PROMPT = """
First, decompose the high-level task into major subtasks.
Then, for each subtask, generate the specific action sequence.
Example:
Task: "Prepare breakfast"
Subtasks:
1. Get ingredients (eggs, bread, butter)
2. Cook eggs
3. Toast bread
4. Plate and serve
Actions for subtask 1:
- navigate(kitchen)
- detect_objects(refrigerator)
- grasp(eggs)
- place(counter)
...
"""
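Once both stages return, the per-subtask action lists can be merged into a single executable sequence. A minimal sketch, assuming a hypothetical response shape in which the LLM names each subtask and attaches its actions:

```python
def flatten_hierarchical_plan(decomposition):
    """Merge per-subtask action lists into one flat, ordered sequence.
    Assumed (hypothetical) response shape:
    {"subtasks": [{"name": str, "actions": [dict, ...]}, ...]}"""
    flat = []
    for subtask in decomposition["subtasks"]:
        flat.extend(subtask["actions"])
    return flat

# Abbreviated version of the breakfast example above
breakfast = {"subtasks": [
    {"name": "get_ingredients",
     "actions": [{"action": "navigate", "location": "kitchen"},
                 {"action": "grasp", "object": "eggs"}]},
    {"name": "cook_eggs",
     "actions": [{"action": "place", "location": "stove"}]},
]}
plan = flatten_hierarchical_plan(breakfast)
# plan is a single 3-step sequence, in subtask order
```

Keeping the subtask boundaries around (rather than discarding them after flattening) also makes it easier to report progress per subtask during execution.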
Sequential vs Parallel Planning
Sequential (safe but slow):
[
    {"action": "navigate", "location": "kitchen"},
    {"action": "grasp", "object": "cup"},
    {"action": "navigate", "location": "bedroom"},
    {"action": "place", "location": "table"}
]
Parallel (efficient but complex):
{
    "plan": [
        {
            "parallel_group": [
                {"action": "navigate", "location": "kitchen"},
                {"action": "detect_objects", "area": "living_room"}
            ]
        },
        {"action": "grasp", "object": "cup"}
    ]
}
Most humanoid robots execute sequentially due to hardware limitations.
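On platforms that can overlap perception with locomotion, the parallel format can be executed with a thread pool. A sketch, where `execute_action` stands in for the robot's (hypothetical) single-action executor:

```python
from concurrent.futures import ThreadPoolExecutor

def run_plan(plan, execute_action):
    """Execute a plan mixing plain actions and parallel groups.
    Members of a "parallel_group" run concurrently; groups and
    plain actions still execute in plan order."""
    for step in plan:
        if "parallel_group" in step:
            with ThreadPoolExecutor() as pool:
                # list() forces all group members to finish before moving on
                list(pool.map(execute_action, step["parallel_group"]))
        else:
            execute_action(step)

executed = []
run_plan(
    [{"parallel_group": [{"action": "navigate", "location": "kitchen"},
                         {"action": "detect_objects", "area": "living_room"}]},
     {"action": "grasp", "object": "cup"}],
    lambda a: executed.append(a["action"]),
)
# "grasp" only runs after both parallel actions have completed
```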
Grounding Language in Robot Capabilities
The Grounding Problem
LLMs can generate infeasible plans:
Bad LLM Output:
{"action": "teleport", "location": "kitchen"} // No teleportation!
{"action": "grasp", "object": "elephant"} // Too heavy!
{"action": "fly", "location": "ceiling"} // Can't fly!
Solution 1: Constrained Action Space
Only allow LLM to select from valid actions:
VALID_ACTIONS = {
    "navigate": {"params": ["location"], "type": "string"},
    "grasp": {"params": ["object"], "type": "string"},
    "place": {"params": ["location"], "type": "string"},
    "detect_objects": {"params": ["area"], "type": "string"}
}

def validate_plan(plan):
    for action in plan:
        if action["action"] not in VALID_ACTIONS:
            return False, f"Invalid action: {action['action']}"
        # Validate parameters: every declared parameter must be present
        for param in VALID_ACTIONS[action["action"]]["params"]:
            if param not in action:
                return False, f"Missing parameter '{param}' for {action['action']}"
    return True, "Plan is valid"

llm_plan = llm.generate_plan(user_command)
valid, message = validate_plan(llm_plan)
if not valid:
    # Ask LLM to regenerate with error feedback
    llm_plan = llm.regenerate_plan(user_command, error=message)
Solution 2: Physics Validation
Check if plan obeys physics before execution:
def is_physically_feasible(action):
    if action["action"] == "grasp":
        object_weight = get_object_weight(action["object"])
        if object_weight > MAX_GRIPPER_WEIGHT:
            return False, "Object too heavy"
        robot_pose = get_current_pose()
        object_pose = get_object_pose(action["object"])
        if distance(robot_pose, object_pose) > ARM_REACH:
            return False, "Object out of reach"
    return True, "Feasible"
Solution 3: Iterative Refinement
Use LLM to self-correct:
def execute_with_feedback(plan, max_retries=3):
    # Cap the recursion so a stubborn LLM can't replan forever
    if max_retries == 0:
        raise RuntimeError("Could not produce a feasible plan")
    for action in plan:
        feasible, reason = is_physically_feasible(action)
        if not feasible:
            # Ask LLM to fix the plan
            revised_plan = llm.chat([
                {"role": "user", "content": f"Action {action} failed: {reason}. Revise the plan."}
            ])
            return execute_with_feedback(revised_plan, max_retries - 1)
        result = robot.execute(action)
        if result.failed:
            # LLM handles execution failures
            revised_plan = llm.handle_failure(action, result.error)
            return execute_with_feedback(revised_plan, max_retries - 1)
    return True
Example: "Clean the Table"
Step 1: User Command
command = "Clean the table"
Step 2: LLM Decomposes Task
Prompt:
User command: "Clean the table"
Generate a plan using these actions:
- navigate(location)
- detect_objects(area)
- grasp(object)
- place(location)
The robot is currently in the living room.
LLM Response:
{
    "plan": [
        {"action": "navigate", "location": "table", "reasoning": "Move to table to see objects"},
        {"action": "detect_objects", "area": "table", "reasoning": "Identify what needs cleaning"},
        {"action": "grasp", "object": "cup", "reasoning": "Pick up cup"},
        {"action": "navigate", "location": "counter"},
        {"action": "place", "location": "counter"},
        {"action": "navigate", "location": "table"},
        {"action": "grasp", "object": "plate"},
        {"action": "navigate", "location": "counter"},
        {"action": "place", "location": "counter"}
    ]
}
Step 3: Validation
valid, msg = validate_plan(llm_plan)
print(f"Plan valid: {valid}") # True
Step 4: Execution
for action in llm_plan:
    print(f"Executing: {action}")
    robot.execute(action)
Output:
Executing: navigate to table
Executing: detect_objects on table
Found: cup, plate, book
Executing: grasp cup
Executing: navigate to counter
Executing: place on counter
Executing: navigate to table
Executing: grasp plate
Executing: navigate to counter
Executing: place on counter
Advanced Error Handling
Handling Ambiguous Commands
Real users don't speak precisely. The LLM must request clarification when needed:
AMBIGUITY_HANDLER_PROMPT = """
If the command is ambiguous, return a clarification request instead of a plan.
Examples of ambiguous commands:
- "Pick it up" → What object?
- "Go there" → Where specifically?
- "Bring me one" → One of what?
- "Clean up" → Clean what area?
Response format for ambiguous commands:
{
    "needs_clarification": true,
    "question": "Which object would you like me to pick up?",
    "options": ["cup", "plate", "book"]  // If detectable
}
"""
# Example interaction
user: "Pick it up"
llm: {"needs_clarification": true, "question": "What would you like me to pick up?"}
user: "The red cup"
llm: {"plan": [{"action": "grasp", "object": "red cup"}, ...]}
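The clarify-then-retry exchange above can be driven by a small loop. A minimal sketch, where `ask_llm` and `ask_user` are hypothetical callbacks for the planner and the voice interface:

```python
def resolve_command(command, ask_llm, ask_user, max_turns=3):
    """Loop until the LLM returns a concrete plan rather than a
    clarification request; each user answer is folded into the command."""
    for _ in range(max_turns):
        reply = ask_llm(command)
        if not reply.get("needs_clarification"):
            return reply["plan"]
        command = f"{command} ({ask_user(reply['question'])})"
    raise RuntimeError("Command still ambiguous after clarification")

# Stubbed interaction mirroring the dialogue above
def fake_llm(cmd):
    if "red cup" in cmd:
        return {"plan": [{"action": "grasp", "object": "red cup"}]}
    return {"needs_clarification": True, "question": "What should I pick up?"}

plan = resolve_command("Pick it up", fake_llm, lambda question: "the red cup")
# plan == [{"action": "grasp", "object": "red cup"}]
```

Bounding the number of turns matters in practice: if the user's answers never disambiguate the command, the robot should give up gracefully rather than loop.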
Handling Partial Failures
When part of a plan fails, the LLM can adapt:
def execute_with_adaptation(plan, original_command):
    completed_actions = []
    for i, action in enumerate(plan):
        result = robot.execute(action)
        if result.success:
            completed_actions.append(action)
        else:
            # Replan from current state
            replan_prompt = f"""
            Original goal: {original_command}
            Completed successfully: {completed_actions}
            Failed action: {action}
            Error: {result.error}
            Remaining actions: {plan[i+1:]}

            Generate a new plan to achieve the goal from current state.
            You may need to:
            - Skip the failed action if not critical
            - Find an alternative approach
            - Ask for human help if truly stuck
            """
            new_plan = llm.generate(replan_prompt)
            return execute_with_adaptation(new_plan, original_command)
    return True
Example Scenario:
Original Plan: Navigate to kitchen → Grasp cup → Navigate to user
Failed at: Grasp cup (cup too far from edge, out of reach)
Replanned:
1. Navigate closer to counter (adjust position)
2. Retry grasp
3. If still fails, ask user: "Could you move the cup closer to the edge?"
Handling Resource Constraints
LLMs must consider battery, payload capacity, and time constraints:
RESOURCE_AWARE_PROMPT = """
Robot state:
- Battery: {battery_level}% (return to charger below 20%)
- Current payload: {current_weight}kg / {max_weight}kg max
- Time available: {time_budget} minutes
Generate a plan that:
1. Completes before battery critical
2. Doesn't exceed weight capacity
3. Finishes within time budget
If impossible, explain which constraint is violated and suggest alternatives.
"""
# Example output when battery low:
{
    "plan": [
        {"action": "navigate", "location": "charger"},
        {"action": "charge", "duration": 30},
        {"action": "navigate", "location": "kitchen"},
        {"action": "grasp", "object": "cup"}
    ],
    "reasoning": "Battery at 15%, recharging first to avoid mid-task shutdown"
}
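Filling such a template from live robot state is a plain `str.format` call. A sketch using an abbreviated template and made-up numbers:

```python
# Abbreviated version of the resource-aware prompt above
RESOURCE_TEMPLATE = (
    "Robot state:\n"
    "- Battery: {battery_level}% (return to charger below 20%)\n"
    "- Current payload: {current_weight}kg / {max_weight}kg max\n"
    "- Time available: {time_budget} minutes\n"
)

prompt = RESOURCE_TEMPLATE.format(
    battery_level=15, current_weight=0.0, max_weight=5.0, time_budget=10
)
# The filled prompt tells the LLM the battery is already below the 20%
# threshold, so it should schedule a charging stop before the task.
```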
Best Practices
1. Provide Rich Context
Include environment state, robot capabilities, and constraints in every prompt.
def build_context_prompt(robot_state):
    return f"""
    Robot capabilities: {robot_state.capabilities}
    Current location: {robot_state.location}
    Battery level: {robot_state.battery}%
    Objects in view: {robot_state.detected_objects}
    Recent actions: {robot_state.action_history[-3:]}
    Known locations: {robot_state.semantic_map.locations}

    User command: {{command}}
    """
2. Use Structured Output
Always request JSON or another structured format; structured output is far easier to parse and validate than free-form text.
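Even when instructed to emit JSON, models sometimes wrap it in prose or markdown fences, so parse defensively. A sketch using the standard library's `json.JSONDecoder.raw_decode`, which tolerates trailing text after the JSON value:

```python
import json

def parse_plan(llm_text):
    """Extract the first JSON value ([...] or {...}) from an LLM reply,
    ignoring any surrounding prose or markdown fences."""
    starts = [i for i in (llm_text.find("["), llm_text.find("{")) if i != -1]
    if not starts:
        raise ValueError("No JSON found in LLM output")
    # raw_decode parses one JSON value and ignores whatever follows it
    obj, _ = json.JSONDecoder().raw_decode(llm_text[min(starts):])
    return obj

plan = parse_plan(
    'Sure! Here is the plan:\n[{"action": "navigate", "location": "table"}]\nDone.'
)
# plan[0]["action"] == "navigate"
```

A failed parse is itself useful feedback: the error message can be sent back to the LLM with a request to re-emit valid JSON.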
3. Few-Shot Examples
Include 2-3 example decompositions in the prompt:
FEW_SHOT_EXAMPLES = """
Example 1:
Command: "Bring me a book"
Plan:
[
    {"action": "navigate", "location": "shelf"},
    {"action": "detect_objects", "area": "shelf"},
    {"action": "grasp", "object": "book"},
    {"action": "navigate", "location": "user"},
    {"action": "place", "location": "table"}
]

Example 2:
Command: "Put away the toys"
Plan: [...]
"""
4. Safety Constraints
Always include safety rules in the prompt:
SAFETY RULES:
- Never navigate while holding fragile objects
- Stop immediately if person is detected in path
- Do not grasp sharp objects without confirmation
- Confirm before discarding any object
- Maintain 0.5m distance from people while navigating
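Prompt-level rules alone cannot guarantee compliance, so it helps to mirror the mechanically checkable ones as a hard filter in code. A sketch, with hypothetical object categories:

```python
SHARP_OBJECTS = {"knife", "scissors"}          # hypothetical catalogue
FRAGILE_OBJECTS = {"glass", "vase", "plate"}   # hypothetical catalogue

def violates_safety_constraint(action, held_object=None, confirmed=False):
    """Return True if an action breaks one of the checkable safety rules."""
    if (action["action"] == "grasp"
            and action.get("object") in SHARP_OBJECTS and not confirmed):
        return True  # sharp objects require explicit confirmation
    if action["action"] == "navigate" and held_object in FRAGILE_OBJECTS:
        return True  # never navigate while holding fragile objects
    return False

assert violates_safety_constraint({"action": "grasp", "object": "knife"})
assert not violates_safety_constraint({"action": "grasp", "object": "cup"})
```

Rules that depend on live perception (a person stepping into the path) belong in the low-level controller, not the planner; this filter only catches violations visible in the plan itself.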
5. Graceful Failure
Prompt LLM to explain when tasks are impossible:
{
    "plan": null,
    "impossible": true,
    "reason": "Cannot grasp the car—it exceeds the 5kg weight limit.",
    "alternative": "I could push the car if it has wheels, or I can call for human assistance."
}
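The executor should branch on this flag before touching the plan. A minimal sketch, where `robot` and `speak` stand in for hypothetical platform interfaces:

```python
def dispatch(response, robot, speak):
    """Route an LLM response: execute the plan, or voice the explanation."""
    if response.get("impossible"):
        speak(response["reason"])
        if response.get("alternative"):
            speak(response["alternative"])
        return False
    for action in response["plan"]:
        robot.execute(action)
    return True

# Stubbed usage
said = []

class FakeRobot:
    def execute(self, action):
        said.append(("exec", action["action"]))

ok = dispatch(
    {"plan": None, "impossible": True, "reason": "Too heavy."},
    FakeRobot(),
    lambda msg: said.append(("say", msg)),
)
# ok is False; nothing was executed, only the reason was spoken
```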
6. Multi-Step Validation
Validate plans before execution at multiple levels:
def validate_plan_comprehensively(plan):
    # Syntax validation
    if not is_valid_json(plan):
        return False, "Invalid JSON format"

    # Action validation
    for action in plan:
        if action["action"] not in VALID_ACTIONS:
            return False, f"Unknown action: {action['action']}"

    # Physics validation (is_physically_feasible returns (bool, reason))
    for action in plan:
        feasible, _ = is_physically_feasible(action)
        if not feasible:
            return False, f"Infeasible action: {action}"

    # Safety validation
    for action in plan:
        if violates_safety_constraint(action):
            return False, f"Unsafe action: {action}"

    # Temporal validation (can the plan complete before the battery dies?)
    estimated_time = sum(estimate_duration(a) for a in plan)
    battery_time = battery_level / battery_consumption_rate
    if estimated_time > battery_time:
        return False, "Insufficient battery to complete plan"

    return True, "Plan validated"
Summary
LLM-driven action planning enables humanoid robots to:
- Understand natural language commands
- Decompose high-level tasks into action sequences
- Reason about feasibility and constraints
- Adapt to novel commands without reprogramming
Key Techniques:
- Prompt engineering with robot capabilities and constraints
- Structured JSON output for parsing
- Validation layers (action space, physics, safety)
- Iterative refinement with error feedback
Limitations:
- LLMs can hallucinate infeasible actions
- Require validation before execution
- Need rich context to generate good plans
- Prompt engineering is task-specific
Next: Integrate speech recognition with Whisper to enable voice commands.
Continue to: Whisper Speech Recognition
References
Ahn, M., et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv preprint arXiv:2204.01691.
Huang, W., et al. (2022). Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv preprint arXiv:2207.05608.