The Shift from Scripted to Semantic: How Foundation Models are Solving the 'General Purpose' Problem in Humanoid Robotics
How foundation models replace brittle scripts with semantic policies to make humanoid robots genuinely general-purpose.
Introduction
Humanoid robots have been stuck in a paradox: physically capable platforms that are still limited by brittle, hand-written scripts. Engineers can craft sophisticated controllers and reflexes, but general-purpose behavior in unstructured environments remains elusive. Foundation models — large, pre-trained models with broad semantic understanding — are changing that calculus. This article explains how shifting from scripted policies to semantic policies, powered by foundation models, addresses the general-purpose problem for humanoid robots. Expect concrete architectural patterns, a practical example, and a checklist you can apply to your next robot integration.
Why scripted approaches hit a ceiling
Scripted systems work when the world is predictable. Classic stacks split the robot into perception, planning, and control, where engineers implement rule-based decision trees and state machines. That approach has problems:
- Scale: Adding new behaviors or handling edge cases requires manual rule additions that interact in unpredictable ways.
- Brittleness: Perception noise or novel contexts break conditional checks and state transitions.
- Combinatorial explosion: The number of explicit scripts grows faster than the diversity of environments.
In short: scripted robots optimize for coverage by enumeration, not understanding. That makes them expensive to extend and fragile in real-world settings where semantics (intent, object function, social cues) matter.
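To make the enumeration problem concrete, here is a minimal sketch of a scripted policy. The rule set and names are hypothetical, purely for illustration: every new (object, location) pair needs its own branch, and unanticipated inputs fall through.

```python
# A scripted policy enumerates cases explicitly; every new object or
# context requires another hand-written branch (hypothetical rules).
def scripted_policy(obj, location):
    if obj == "mug" and location == "table":
        return "pick_mug_from_table"
    if obj == "mug" and location == "shelf":
        return "pick_mug_from_shelf"
    if obj == "plate" and location == "table":
        return "pick_plate_from_table"
    # ...every new (object, location) pair needs another rule
    return "no_rule"
```

An unseen object like a bowl simply falls through to `"no_rule"`, even though the semantics ("pick it up from the surface") are identical to the cases already covered.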
What “semantic” policies mean in robotics
A semantic policy is a decision-making layer that reasons over high-level meaning rather than hand-crafted state transitions. Instead of “if sensor S reads X then execute Y,” a semantic policy evaluates statements like “deliver the cup to the person holding the red notebook” or “clear the surface while respecting fragile objects.” Key properties:
- Abstract actions: Use goal-oriented actions (pick-up, handover, point) instead of low-level motor sequences.
- Compositionality: Combine sub-goals to form new behavior without writing new rules.
- Robustness to perceptual noise: Operating at semantic level is less sensitive to transient sensor glitches.
Foundation models provide the semantic grounding: they map multi-modal inputs (vision, language, proprioception) into a shared latent or symbolic space where high-level goals can be reasoned about and composed.
What foundation models bring to humanoid robotics
Foundation models are not a silver bullet, but they are a force multiplier for semantic policies. Here’s what they enable:
- Cross-modal grounding: A single model can link visual observations to language and action affordances.
- Few-shot adaptation: New tasks can be specified via examples or prompts instead of full retraining.
- Commonsense and social reasoning: Pretraining on diverse data gives models an understanding of object function and human intent that is hard to encode manually.
Concretely, a foundation model can translate an instruction like “bring the mug to Alice” into a semantic plan: locate mug → assess grip affordance → plan collision-free trajectory → handover while respecting personal space.
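A semantic plan like the one above can be represented as plain data before it ever touches a controller. This sketch uses hypothetical skill and parameter names to show the shape of such a plan:

```python
# Hypothetical semantic plan for "bring the mug to Alice": goal-oriented
# steps, not motor commands. Skill names are illustrative.
plan = [
    {"skill": "locate", "target": "mug"},
    {"skill": "pick", "object": "mug", "grip": "best_affordance"},
    {"skill": "move", "frame": "alice"},
    {"skill": "handover", "person": "alice", "respect_personal_space": True},
]

def skills_used(plan):
    """Return the ordered skill names, useful for quick plan inspection."""
    return [step["skill"] for step in plan]
```

Because the plan is data, it can be logged, diffed, and verified before execution, which matters later when we treat model output as a program.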
Architectural pattern: Semantic layer over skills
A practical architecture separates concerns into layers. This pattern is clean, testable, and incremental for teams migrating from scripted systems.
- Perception layer: Outputs semantics (object labels, affordances, human states) instead of raw pixels. This layer uses models fine-tuned for detection and segmentation, optionally backed by a perception foundation model.
- Semantic layer (foundation-model based): Receives symbolic observations and instructions. It generates plans or intent sequences, composes skills, and reasons under uncertainty.
- Skill layer (primitive controllers): Robust, verified controllers for grasping, balancing, walking, and trajectory following. These remain low-level and optimized for safety.
- Reflex/safety layer: Fast, reactive checks that can override the semantic layer for immediate hazards (slips, impacts). Keep this minimal and verifiable.
Communication between layers is key: use typed messages (goal objects, affordance tuples, confidence scores). The semantic layer should not output raw motor commands; it outputs sequences of skill invocations and parameters.
Example message types
Use small, typed messages to separate concerns. Represent them as JSON-like objects when sending between modules: { "type": "Affordance", "object_id": "mug_42", "grip": "rim", "confidence": 0.92 }.
Note the semantic layer relies on confidence scores and fallbacks. If the perception confidence is low, the system should ask for clarification or switch to a verification subroutine.
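One way to realize this is with typed message objects and an explicit confidence gate. The sketch below mirrors the JSON example above; the threshold value and routing labels are assumptions for illustration:

```python
from dataclasses import dataclass

# Typed message mirroring the JSON Affordance example above.
@dataclass
class Affordance:
    object_id: str
    grip: str
    confidence: float

CONFIDENCE_THRESHOLD = 0.8  # assumed tuning value

def route(affordance):
    """Dispatch on perception confidence: act directly, or fall back
    to a clarification/verification subroutine."""
    if affordance.confidence >= CONFIDENCE_THRESHOLD:
        return "execute"
    return "ask_for_clarification"
```

Typed messages make the layer boundary testable: you can unit-test the routing logic without any robot hardware in the loop.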
Bridging perception and action with language
Language is the natural interface for semantic policies. Foundation models accept language prompts and visual context to produce plans. Implementing this requires two practical elements:
- Prompt engineering as interface: Compose concise, structured prompts that include the robot’s capabilities and the observed semantic facts.
- Deterministic decoding for plans: Avoid stochastic generation for action plans. Use decoding strategies or fine-tuning that produce consistent outputs for the same semantic state.
A minimal prompt might look like: “You are a robot with skills: pick(object, grip), move(frame), handover(person). Observe: [Affordance objects]. Goal: deliver mug_42 to Alice. Output: ordered skill calls with parameters.” Treat the foundation model’s output as a program that the skill layer executes after verification.
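A prompt builder of this shape can be sketched as a small pure function. This is a hypothetical implementation assembling the structure described above from skill signatures and observed facts:

```python
# Hypothetical build_prompt: composes a structured prompt from the
# robot's skill signatures and observed semantic facts.
def build_prompt(instruction, facts, capabilities):
    skills = ", ".join(capabilities)
    fact_lines = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"You are a robot with skills: {skills}.\n"
        f"Observe:\n{fact_lines}\n"
        f"Goal: {instruction}\n"
        "Output: ordered skill calls with parameters."
    )
```

Keeping the builder pure and deterministic means the same semantic state always produces the same prompt, which is a precondition for deterministic plan generation.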
Practical example: semantic planner invoking low-level controllers
Below is a compact pseudocode pipeline. The semantic planner translates observations and an instruction into a sequence of skills. Implement this as a deterministic model call followed by verification.
```python
# Perception produces semantic facts
observations = perceive_scene()

# facts = list of affordances and human states
facts = extract_affordances(observations)

# High-level instruction from the operator
instruction = "Bring the blue mug to the person wearing a green scarf"

# Model config sent with the prompt (deterministic decoding)
model_config = {"temperature": 0.0, "topK": 50, "maxTokens": 256}

# Semantic prompt builder
prompt = build_prompt(instruction, facts, robot_capabilities)
plan = call_foundation_model(prompt, model_config)
# plan = [ {"skill": "pick", "object": "mug_17", "grip": "handle"}, ... ]

# Verification and execution
for step in plan:
    if not verify_preconditions(step, observations):
        raise RuntimeError(f"Precondition failed: {step}")
    result = execute_skill(step)
    if not result.success:
        handle_failure(result)
        break
```
Notes on this example:
- Keep the model call isolated: the model recommends a plan; the robot verifies every step.
- `model_config` uses `temperature` 0.0 to encourage deterministic outputs.
- Verification ensures the semantic layer cannot cause unsafe low-level commands.
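The precondition check is where hallucinated objects get caught. Here is one minimal way to implement it, assuming plan steps are dictionaries and perception supplies the set of observed entity IDs (both assumptions for this sketch):

```python
# Hypothetical verify_preconditions: every entity a plan step references
# must have actually been observed, guarding against model hallucination.
def verify_preconditions(step, observed_ids):
    referenced = [v for k, v in step.items() if k in ("object", "person")]
    return all(entity in observed_ids for entity in referenced)
```

A step referencing an unobserved object (say, a mug the model invented) fails verification and never reaches the skill layer.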
Limitations and practical considerations
Foundation models have weaknesses that influence design choices:
- Hallucination: Models may invent objects or affordances. Always verify against sensor data.
- Latency: Large models can be slow. Use smaller distilled models for latency-critical decisions or run local model caches for common prompts.
- Safety and verification: Treat model outputs as suggestions. The safety stack must remain the ground truth for emergency intervention.
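Deterministic decoding makes one latency mitigation almost free: because the same prompt yields the same plan at temperature 0.0, model calls for common prompts can be memoized. A minimal sketch (the cache policy and model interface are assumptions):

```python
import hashlib

# With deterministic decoding, identical prompts yield identical plans,
# so results of common prompts can be cached locally.
_plan_cache = {}

def cached_plan(prompt, call_model):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _plan_cache:
        _plan_cache[key] = call_model(prompt)
    return _plan_cache[key]
```

A production version would bound the cache size and invalidate entries when the robot's capabilities or environment model change.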
A pragmatic rollout strategy: start with non-critical semantic tasks (informational prompts, task planning) and incrementally allow the semantic layer to control skills as you add verification logic.
Summary / Checklist
- Replace brittle conditionals with semantic facts and affordances.
- Introduce a semantic layer powered by a foundation model to reason over goals and compose skills.
- Keep low-level skill controllers deterministic and verifiable; the semantic layer should not emit motor commands.
- Use deterministic decoding (e.g., `temperature` 0.0) and structured prompts for plan generation.
- Always verify model outputs against perception and include a reflex safety layer for immediate hazards.
- Start with low-risk tasks and iterate: add tests that assert plan preconditions and fallbacks.
Final note: moving from scripted to semantic policies is an engineering shift, not a single-model replacement. The biggest win is organizational: you move effort from enumerating edge cases to defining robust verification and compositional skill interfaces. Foundation models give robots the semantic understanding they lacked; your job as an engineer is to keep that semantic power predictable, verifiable, and safe.