Figure: A humanoid robot interacting with humans using symbolic overlays — from scripts to semantics, robots powered by foundation models.

The Shift from Scripted to Semantic: How Foundation Models are Solving the 'General Purpose' Problem in Humanoid Robotics

How foundation models replace brittle scripts with semantic policies to make humanoid robots genuinely general-purpose.

Introduction

Humanoid robots have been stuck in a paradox: physically capable platforms that are still limited by brittle, hand-written scripts. Engineers can craft sophisticated controllers and reflexes, but general-purpose behavior in unstructured environments remains elusive. Foundation models — large, pre-trained models with broad semantic understanding — are changing that calculus. This article explains how shifting from scripted policies to semantic policies, powered by foundation models, addresses the general-purpose problem for humanoid robots. Expect concrete architectural patterns, a practical example, and a checklist you can apply to your next robot integration.

Why scripted approaches hit a ceiling

Scripted systems work when the world is predictable. Classic stacks split the robot into perception, planning, and control, and engineers implement rule-based decision trees and state machines on top. That approach has recurring problems:

  - Every new object, phrasing, or scene variation requires another hand-written rule, so coverage grows only by enumeration.
  - State machines are brittle: an unanticipated input falls through to a failure branch or, worse, triggers silently wrong behavior.
  - The scripts encode no notion of intent, object function, or social context, so they cannot generalize beyond what their authors foresaw.

In short: scripted robots optimize for coverage by enumeration, not understanding. That makes them expensive to extend and fragile in real-world settings where semantics (intent, object function, social cues) matter.
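To make the brittleness concrete, here is a hypothetical scripted handler in the rule-based style described above; the sensor fields, object names, and return codes are all illustrative, not from any real stack:

```python
# Hypothetical scripted policy: every situation must be enumerated by hand.
# Any object, phrasing, or scene the author did not anticipate falls through
# to the failure branch -- there is no notion of what the request *means*.
def scripted_deliver(sensor_state: dict) -> str:
    if sensor_state.get("object") == "mug" and sensor_state.get("mug_visible"):
        return "EXEC_PICK_MUG"
    if sensor_state.get("object") == "cup" and sensor_state.get("cup_visible"):
        return "EXEC_PICK_CUP"
    # "bring the blue mug", "grab my coffee", a tumbler, an occluded mug:
    # all of these end up here, because none were enumerated.
    return "FAIL_UNKNOWN_SITUATION"

print(scripted_deliver({"object": "mug", "mug_visible": True}))
print(scripted_deliver({"object": "tumbler", "tumbler_visible": True}))
```

Extending coverage means adding another `if` branch per case, which is exactly the enumeration trap.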

What “semantic” policies mean in robotics

A semantic policy is a decision-making layer that reasons over high-level meaning rather than hand-crafted state transitions. Instead of “if sensor S reads X then execute Y,” a semantic policy evaluates statements like “deliver the cup to the person holding the red notebook” or “clear the surface while respecting fragile objects.” Key properties:

  - Goal-directed: it operates on what should be achieved, not on enumerated sensor-to-action rules.
  - Compositional: high-level goals decompose into reusable skills that can be recombined for tasks nobody scripted.
  - Grounded: goals and objects are tied to perceptual evidence, so the policy can reason about intent, object function, and context.

Foundation models provide the semantic grounding: they map multi-modal inputs (vision, language, proprioception) into a shared latent or symbolic space where high-level goals can be reasoned about and composed.

What foundation models bring to humanoid robotics

Foundation models are not a silver bullet, but they are a force multiplier for semantic policies. Here’s what they enable:

  - Open-vocabulary perception: recognizing objects and people by description rather than from a fixed label set.
  - Language as a task interface: instructions arrive as natural language and are mapped to structured goals.
  - Compositional planning: broad pre-training lets the model decompose novel goals into sequences of known skills.

Concretely, a foundation model can translate an instruction like “bring the mug to Alice” into a semantic plan: locate mug → assess grip affordance → plan collision-free trajectory → handover while respecting personal space.
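One way to represent such a plan in code is an ordered list of skill invocations. The skill names and fields below are illustrative choices, not a fixed API:

```python
# A semantic plan as an ordered list of skill invocations.
# Skill names ("locate", "pick", "move", "handover") and their
# parameters are illustrative, not a standardized schema.
plan = [
    {"skill": "locate",   "target": "mug"},
    {"skill": "pick",     "object": "mug", "grip": "handle"},
    {"skill": "move",     "frame": "person:Alice", "keep_clearance_m": 0.5},
    {"skill": "handover", "person": "Alice"},
]

def describe(plan: list) -> str:
    """Render the plan as a human-readable arrow chain for logging."""
    return " -> ".join(step["skill"] for step in plan)

print(describe(plan))  # locate -> pick -> move -> handover
```

Keeping the plan as plain data (rather than generated code) makes it easy to log, verify, and replay.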

Architectural pattern: Semantic layer over skills

A practical architecture separates concerns into layers:

  - Perception layer: turns raw sensors into semantic facts (objects, affordances, human states).
  - Semantic layer: a foundation-model-backed planner that maps facts plus an instruction to a sequence of skill invocations.
  - Skill layer: verified low-level controllers (grasping, locomotion, handover) that execute those invocations.

This pattern is clean, testable, and incremental for teams migrating from scripted systems.

Communication between layers is key: use typed messages (goal objects, affordance tuples, confidence scores). The semantic layer should not output raw motor commands; it outputs sequences of skill invocations and parameters.

Example message types

Use small, typed messages to separate concerns. Represent them as JSON-like objects when sending between modules: { "type": "Affordance", "object_id": "mug_42", "grip": "rim", "confidence": 0.92 }.

Note the semantic layer relies on confidence scores and fallbacks. If the perception confidence is low, the system should ask for clarification or switch to a verification subroutine.
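A minimal sketch of such a typed message and its confidence gate, using a dataclass whose fields mirror the JSON example above (the threshold and fallback strings are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Affordance:
    """Typed perception-to-semantic-layer message, mirroring the JSON above."""
    object_id: str
    grip: str
    confidence: float

CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff, tune per deployment

def next_action(aff: Affordance) -> str:
    """Gate on confidence: act, or fall back to a verification subroutine."""
    if aff.confidence >= CONFIDENCE_THRESHOLD:
        return f"invoke_skill:pick({aff.object_id}, {aff.grip})"
    # Low confidence: verify or ask for clarification instead of acting.
    return f"verify:{aff.object_id}"

print(next_action(Affordance("mug_42", "rim", 0.92)))  # invoke_skill:pick(mug_42, rim)
print(next_action(Affordance("mug_42", "rim", 0.41)))  # verify:mug_42
```

Freezing the dataclass keeps messages immutable as they cross module boundaries.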

Bridging perception and action with language

Language is the natural interface for semantic policies. Foundation models accept language prompts and visual context to produce plans. Implementing this requires two practical elements:

  1. Prompt engineering as interface: Compose concise, structured prompts that include the robot’s capabilities and the observed semantic facts.
  2. Deterministic decoding for plans: Avoid stochastic generations for action plans. Use decoding strategies or fine-tuning that produce consistent outputs for the same semantic state.

A minimal prompt might look like: “You are a robot with skills: pick(object, grip), move(frame), handover(person). Observe: [Affordance objects]. Goal: deliver mug_42 to Alice. Output: ordered skill calls with parameters.” Treat the foundation model’s output as a program that the skill layer executes after verification.
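The prompt assembly described above can be sketched as a small builder function; the section headers and layout are one possible convention, not a fixed format:

```python
def build_prompt(instruction: str, facts: list, capabilities: list) -> str:
    """Assemble a structured prompt from robot capabilities, observed
    semantic facts, and the operator's goal. Layout is illustrative."""
    fact_lines = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"You are a robot with skills: {', '.join(capabilities)}.\n"
        f"Observe:\n{fact_lines}\n"
        f"Goal: {instruction}\n"
        "Output: ordered skill calls with parameters, one per line."
    )

prompt = build_prompt(
    "deliver mug_42 to Alice",
    [{"type": "Affordance", "object_id": "mug_42", "grip": "rim", "confidence": 0.92}],
    ["pick(object, grip)", "move(frame)", "handover(person)"],
)
print(prompt)
```

Because the prompt is assembled from typed facts rather than free text, the same semantic state always produces the same prompt, which supports the deterministic-decoding requirement.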

Practical example: semantic planner invoking low-level controllers

Below is a compact pseudocode pipeline. The semantic planner translates observations and an instruction into a sequence of skills. Implement this as a deterministic model call followed by verification.

# Perception produces semantic facts
observations = perceive_scene()
# facts = list of affordances and human states
facts = extract_affordances(observations)

# High-level instruction from the operator
instruction = "Bring the blue mug to the person wearing a green scarf"

# Model config sent with the prompt (deterministic decoding)
model_config = {"temperature": 0.0, "topK": 50, "maxTokens": 256}

# Semantic prompt builder
prompt = build_prompt(instruction, facts, robot_capabilities)
plan = call_foundation_model(prompt, model_config)

# plan = [ {"skill": "pick", "object": "mug_17", "grip": "handle"}, ... ]

# Verification and execution
for step in plan:
    if not verify_preconditions(step, observations):
        raise RuntimeError(f"Precondition failed: {step}")
    result = execute_skill(step)
    if not result.success:
        handle_failure(result)
        break
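The `verify_preconditions` call in the pipeline can be sketched as a simple check of the planned step against the observed scene; the observation schema and the allowed skill set here are assumptions for illustration:

```python
def verify_preconditions(step: dict, observations: dict) -> bool:
    """Check a planned skill call against current observations before any
    motor command is issued. Observation schema is illustrative."""
    visible = set(observations.get("visible_objects", []))
    target = step.get("object")
    if target is not None and target not in visible:
        # The plan references an object perception never reported:
        # likely a model hallucination, so refuse to execute.
        return False
    return step.get("skill") in {"pick", "move", "handover"}

obs = {"visible_objects": ["mug_17", "notebook_3"]}
print(verify_preconditions({"skill": "pick", "object": "mug_17", "grip": "handle"}, obs))  # True
print(verify_preconditions({"skill": "pick", "object": "mug_99"}, obs))  # False
```

This keeps the skill layer the final authority: a plan is advice until each step passes verification.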

Notes on this example:

  - The model call is deterministic (temperature 0.0), so the same semantic state yields the same plan.
  - Every step is verified against current observations before execution; a hallucinated object fails the precondition check instead of reaching the controllers.
  - Failures are handled explicitly rather than retried blindly, keeping the skill layer the final authority over what the robot actually does.

Limitations and practical considerations

Foundation models have weaknesses that influence design choices:

  - Hallucination: plans can reference objects or skills that do not exist, so every output must be verified before execution.
  - Latency and cost: large-model inference is slow relative to control loops, which confines it to the planning layer, never the reflex layer.
  - Non-determinism and drift: without deterministic decoding and versioned models, the same scene can yield different plans.

A pragmatic rollout strategy: start with non-critical semantic tasks (informational prompts, task planning) and incrementally allow the semantic layer to control skills as you add verification logic.

Summary / Checklist

  - Keep the semantic layer separate from skills; it outputs skill invocations and parameters, never raw motor commands.
  - Use small, typed messages (goals, affordance tuples, confidence scores) between layers.
  - Make model calls deterministic for planning, and treat the output as a program the skill layer executes after verification.
  - Verify preconditions for every step, and define fallbacks (clarification, verification subroutines) for low-confidence perception.
  - Roll out incrementally: start with non-critical semantic tasks and expand the layer's authority as verification matures.

Final note: moving from scripted to semantic policies is an engineering shift, not a single-model replacement. The biggest win is organizational: you move effort from enumerating edge cases to defining robust verification and compositional skill interfaces. Foundation models give robots the semantic understanding they lacked; your job as an engineer is to keep that semantic power predictable, verifiable, and safe.
