
The world of AI agents is evolving at light speed, and with it, the language we use to describe it. New terms emerge, old ones get repurposed, and sometimes, concepts blur into an unintelligible mess. This rapid evolution can be incredibly confusing for newcomers and seasoned practitioners alike.
One common source of confusion revolves around terms like “harness” and “scaffold.” After ICLR 2026, a question perfectly captured this sentiment: “What do you mean by the terms ‘harness’ and ‘scaffold’ in the context of agents? I have heard a lot of explanations, but I could not understand why they did not converge to a single explanation.” This article aims to bring clarity to these often-misunderstood terms, focusing on the concepts that frequently get mixed up or are assumed to be universally understood.
Untangling the Core Components: Model, Scaffolding, and Harness
At the heart of any AI agent system is the model. This is your large language model (LLM) β think Claude, Qwen, GPT, or Kimi. On its own, the model is a text-in, text-out machine with no memory between calls and no internal loop. It can suggest using a tool, but it needs an external system to actually execute that call. Essentially, it answers a single prompt and then stops.
Then we have scaffolding, which is the behavior-defining layer wrapped around the model. This includes elements like the system prompt, detailed tool descriptions, how the model’s responses are parsed, and its context management β essentially, what it remembers across steps. Scaffolding dictates how the model perceives the world and interacts within it, both during training and inference.
The harness is the dynamic execution layer within the agent. Itβs what brings the agent to life, responsible for calling the model, handling its tool invocations, and determining when to stop the process. If the scaffolding provides the instructions and tools the model works with, the harness is the engine that actually makes the agent run and execute those instructions.
While some products, like Claude Code, broadly use “harness” to encompass everything that isn’t the model, distinguishing between scaffold and harness becomes crucial in complex scenarios, such as during a training pipeline. Harness engineering is the critical discipline of designing this layer effectively, ensuring proper error handling, setting stopping conditions, and implementing essential guardrails. This applies to both training and inference stages.
What Defines an AI Agent?
In its most fundamental sense, an agent, originating from reinforcement learning, is a function that takes an observation and returns an action. This action then influences the environment, which in turn provides a new observation, perpetuating a continuous loop. This core loop remains central to how modern LLM agents function.
In the LLM landscape, the definition of an agent has expanded significantly. An agent is now understood as the combination of a model and all the surrounding infrastructure that enables it to act, rather than merely respond. It transforms raw text generation into a dynamic entity capable of receiving information, making decisions, and acting upon those decisions in a continuous loop.
Consider a coding agent: the system prompt, tool descriptions, and required output format constitute the scaffolding. The iterative loop that invokes the model, manages its tool calls, and decides when to terminate the process is the harness. In many frameworks, the common shorthand is “Agent = Model + Harness.” Products like Claude Code or Cursor are essentially specific harnesses built on top of specific models, intricately designed and optimized together. This means that two products using the same underlying model can deliver vastly different experiences due to their distinct harnesses.
Key Concepts for Agent Performance and Training
Context engineering involves meticulously designing what goes into an agent’s context window. This includes everything the model sees at each step, from the system prompt and tool descriptions to conversation history and retrieved knowledge. This isn’t a static decision; the harness dynamically manages this context throughout the agent’s run. Getting this wrong during training can necessitate retraining the model, whereas at inference, it’s a more straightforward text adjustment.
A policy defines an agent’s behavior: for any given situation, it outlines the probability of taking each possible action. While part of this policy is encoded in the model’s weights, a significant portion also stems from the surrounding scaffolding and harness. The same model can exhibit dramatically different behaviors depending on its prompts, tools, memory, and execution loop. Essentially, the policy dictates the behavior, while the agent is the complete system that enacts it within an environment.
Tool use refers to how agents interact with external systems like APIs, code interpreters, databases, or web search. The model expresses its intent to use a tool in a structured format, which the harness then processes, routing the call to the appropriate function. The result is then fed back into the agent’s context, and the loop continues.
Skills are reusable, structured packages of knowledge designed for multi-step tasks. While a tool performs a single action, a skill bundles everything necessary to achieve a larger goal, such as “investigate a bug, form a hypothesis, and write a fix.” They are portable and can be loaded on demand. This differs from a sub-agent, which is an agent called by another agent to handle a specific subtask. A sub-agent possesses its own model and scaffold, reasons independently, and returns a result, capable of using its own tools and even calling further sub-agents.
Understanding Training-Specific Terminology
When an agent is being trained, a specialized vocabulary comes into play. The RL environment is any interactive, stateful object that accepts an action, updates its internal state, and returns an observation. In LLM training, actions are often tool calls. A filesystem, for instance, acts as a simple environment: creating a file updates its state, and an updated file listing serves as the observation. For a more detailed exploration, consult specialized guides on RL environments.
The trainer is the component responsible for improving the agent. It runs numerous agent episodes, scores their outcomes, and uses these scores to update the inner model’s weights. A “rollout” represents one complete agent run from start to finish, capturing everything the agent observed, did, and the reward it received at each step. This raw data, also known as a trajectory or trace, is what reinforcement learning algorithms learn from.
Finally, the reward is the score that signals to the training algorithm whether the model is improving. This can be verifiable (e.g., tests passing), learned (e.g., human preferences or an LLM-as-judge), sparse (a single score at the end of an episode), or dense (a score at each step). This critical feedback is what the trainer utilizes to update the model’s weights effectively. Rubrics further refine rewards by breaking them down into explicit, weighted dimensions rather than a single numerical value.
Source: Hugging Face Blog