AI, Agents, and the Art of Orchestration

The AI industry has a terminology problem. "Model," "agent," "orchestrator," and "skill" get thrown around interchangeably in blog posts, podcasts, and product marketing — as if they're the same thing. They're not. The distinctions matter, and getting them wrong leads to architectures that are expensive, fragile, and slow.

This article lays out the precise hierarchy: what a model is, what makes an agent an agent, why orchestration often matters more than model choice, and how a discipline called context engineering is emerging as a key optimization in agentic AI.

The Hierarchy: Model, Agent, Orchestrator

A language model (Claude, GPT, Gemini) is a reasoning engine. It takes input and produces output, one turn at a time. It has no persistent goals, no autonomous decision-making, and no independent access to the world unless tools are explicitly wired to it. Give it a prompt, get a response. That's it.

An agent is an architecture built on top of a model. It adds goal persistence, planning and task decomposition, autonomous tool use in a loop, and state management across steps. The formula is straightforward:

Agent = Model + Goal + Loop + Tools + Memory

When you interact with a chatbot, the model is reactive — you prompt, it responds. A true agent takes a high-level objective like "find and fix all P1 bugs in this repo" and autonomously plans, executes, retries, and reports back.
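The formula above can be made concrete with a minimal sketch. Everything here is hypothetical and for illustration only: the `Agent` class, the action dictionaries, and `scripted_model` (a stand-in for a real LLM call) are assumptions, not any product's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical action shape: {"tool": name, "arg": text} or {"done": final_answer}.
Action = dict


@dataclass
class Agent:
    model: Callable[[str, list], Action]         # Model: chooses the next action
    goal: str                                    # Goal: persists across every step
    tools: dict[str, Callable[[str], str]]       # Tools: named callables
    memory: list = field(default_factory=list)   # Memory: (action, observation) history
    max_steps: int = 10                          # Loop: bounded so it always terminates

    def run(self) -> str:
        for _ in range(self.max_steps):
            action = self.model(self.goal, self.memory)
            if "done" in action:
                return action["done"]
            tool = self.tools.get(action["tool"])
            if tool is None:
                observation = f"unknown tool: {action['tool']}"
            else:
                observation = tool(action["arg"])
            self.memory.append((action, observation))
        return "stopped: step limit reached"


def scripted_model(goal: str, memory: list) -> Action:
    """Stand-in for an LLM call: use one tool, then report the result."""
    if not memory:
        return {"tool": "list_bugs", "arg": "P1"}
    _, observation = memory[-1]
    return {"done": f"{goal}: handled {observation}"}


agent = Agent(
    model=scripted_model,
    goal="fix P1 bugs",
    tools={"list_bugs": lambda severity: f"2 {severity} bugs"},
)
result = agent.run()
```

Remove the loop, the tools, and the memory, and what remains is a single model call: the reactive chatbot case.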

An orchestration layer is the coordination code that connects models, tools, data, and workflows into a functioning system. It defines what system prompt gets injected, what tools are available, how execution flows between steps, and how errors are handled. Not all orchestration produces an agent — RAG pipelines, tool routing, and evaluation pipelines are all orchestrated systems without being agents. An agent is a specific orchestration pattern: one that adds goal persistence and autonomous action loops.

This is why the same model behaves so differently in different products. GitHub Copilot in VS Code using Claude Sonnet and Claude Code using Claude Sonnet can produce very different results on the same task, because they differ in system prompts, tool definitions, loop logic, and context management strategies. Once models cross a capability threshold for a given task, orchestration quality tends to dominate system performance. This is why the "which LLM is best" debate can be misleading: for many production use cases, the orchestration design matters more.

Now that we know what these components are, the question becomes: what makes the difference between systems that work and systems that don't?

Why Multi-Model Architectures Are Winning

Using a single model for every agent task is the equivalent of running your entire backend as one monolithic service. It works at small scale, but as systems grow, multi-model architectures become increasingly advantageous. The right decomposition isn't just by modality (text vs. images) but along several axes:

Task complexity. A routing agent classifying incoming requests needs a small, fast, cheap model. An analyst agent synthesizing financial data needs a frontier model with strong reasoning. A reader agent doing bulk text extraction sits somewhere in between. Don't match models to modality — match them to cognitive demand.

Latency constraints. Some agent calls are in the user-facing hot path (sub-second required). Others are background batch jobs where cost matters more than speed. You'd use different models for the same type of task based on where it sits in the request lifecycle.

Cost efficiency. The emerging architecture is a mixture of models behind a router: an orchestrator agent (frontier model) that understands the goal and decomposes it, specialist agents (right-sized models) that execute subtasks, and a router layer that maps subtask type to model selection — factoring in cost, latency, and accuracy requirements.

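This is the "mixture of experts" concept applied at the system architecture level rather than the model architecture level: instead of a gating network routing tokens to expert subnetworks inside one model, a router sends whole subtasks to separate models.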
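As a rough sketch of the router layer described above, the snippet below picks the cheapest model that satisfies a subtask's quality and latency requirements. The model names, prices, latencies, and the 1-to-3 quality scale are all invented for illustration; a real router would draw these from measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    cost_per_1k_tokens: float  # illustrative numbers, not real pricing
    p50_latency_ms: int
    quality: int               # made-up scale: 1 (basic) to 3 (frontier)


# Hypothetical catalog; the names are placeholders, not real products.
CATALOG = [
    ModelSpec("small-fast", 0.1, 200, 1),
    ModelSpec("mid-tier", 1.0, 800, 2),
    ModelSpec("frontier", 10.0, 3000, 3),
]


@dataclass(frozen=True)
class Subtask:
    kind: str
    min_quality: int
    max_latency_ms: int


def route(task: Subtask, catalog=CATALOG) -> ModelSpec:
    """Return the cheapest model meeting the task's quality and latency limits."""
    eligible = [
        m for m in catalog
        if m.quality >= task.min_quality and m.p50_latency_ms <= task.max_latency_ms
    ]
    if not eligible:
        raise ValueError(f"no model satisfies constraints for {task.kind!r}")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


routing = route(Subtask("classify-request", min_quality=1, max_latency_ms=500))
analysis = route(Subtask("synthesize-financials", min_quality=3, max_latency_ms=10_000))
```

Under these assumed numbers the request classifier lands on the small model and the financial synthesis on the frontier one. A task that demands frontier quality within 500 ms raises an error rather than silently degrading, which is one possible design choice among several.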

A practical warning: don't over-engineer the model selection upfront. Start with one good model, measure where it's overkill (wasting cost and latency) or underkill (poor accuracy), then selectively swap in specialized models where the data tells you to. Premature model optimization is as real a problem as premature code optimization. The hard problem isn't picking models — it's building the observability and evaluation infrastructure to know when a cheaper model is good enough.
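One way to ground the "good enough" decision is a small evaluation harness that compares a cheaper candidate against the current baseline on labeled cases. The sketch below is a simplified assumption of what that could look like: the two lambdas stand in for real model calls, exact-match scoring is the crudest possible metric, and the `max_drop` tolerance is arbitrary.

```python
from typing import Callable, Sequence

Case = tuple[str, str]  # (prompt, expected label)


def accuracy(model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Fraction of cases where the model output exactly matches the label."""
    correct = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return correct / len(cases)


def cheaper_is_good_enough(baseline, candidate, cases, max_drop: float = 0.02) -> bool:
    """Accept the candidate if it loses at most max_drop accuracy vs. the baseline."""
    return accuracy(baseline, cases) - accuracy(candidate, cases) <= max_drop


# Stand-ins for real model calls (hypothetical behavior).
baseline = lambda p: "refund" if "refund" in p else "other"
candidate = lambda p: "refund" if p.startswith("refund") else "other"

cases = [
    ("refund please", "refund"),
    ("I want a refund", "refund"),
    ("hello", "other"),
    ("where is my order", "other"),
]

baseline_acc = accuracy(baseline, cases)
candidate_acc = accuracy(candidate, cases)
swap_ok = cheaper_is_good_enough(baseline, candidate, cases)
```

Here the candidate misses one of four cases, so the swap is rejected at the default tolerance. In practice the case set would be sampled from production traffic and be far larger than four examples.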

Write Skills, Not Agents

There's an emerging architectural pattern among practitioners building production agent systems: favor specialized, reusable procedural modules over monolithic agents. Different ecosystems call these modules different things — skills, tools, commands, workflows, capabilities, plugins — but the underlying principle is the same. It echoes the Unix philosophy: small, composable units that do one thing well.

Boris Cherny — the creator and head of Claude Code at Anthropic — embodies this philosophy in practice. His workflow tips, shared across multiple public threads (see References), reveal a consistent pattern: reliability comes from specialization plus constraint, not from building one monolithic agent that tries to do everything. In his words, agents are not "one big agent" — they're modular roles. Skills (as Claude Code calls them) are the reusable knowledge those roles draw on.

The distinction matters because it directly impacts the most underappreciated bottleneck in agent systems: context window management.

The Context Window Is a Fixed Resource

A context window is not infinite memory; it's a finite budget. Every token loaded into it is a token the model must attend to while reasoning. Load irrelevant context, and you lower the signal-to-noise ratio. The model doesn't simply ignore the noise; it attends to it, which degrades output quality and increases cost.

This creates a counterintuitive problem: the more instructions you give an agent "just in case," the worse it performs on the actual task. Front-loading 50K tokens of general-purpose instructions to cover every possible scenario means the model is reasoning through irrelevant context on every single call. A well-written skill with conditional references might achieve the same coverage with 2K tokens — a 25x reduction.
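The arithmetic is worth spelling out, because the overhead is paid on every call. The sketch below uses the 50K and 2K figures from the paragraph above; the count of 100 calls is an assumption chosen only to show how the gap scales.

```python
def context_overhead(tokens_per_call: int, calls: int) -> int:
    """Total instruction tokens processed across all calls."""
    return tokens_per_call * calls


CALLS = 100  # assumed number of agent invocations, for illustration

front_loaded = context_overhead(50_000, CALLS)  # everything loaded up front
lazy = context_overhead(2_000, CALLS)           # skill with conditional references
reduction = front_loaded / lazy
```

With these inputs the front-loaded design processes 5,000,000 instruction tokens against 200,000 for the lazy design. The ratio stays at 25x regardless of call count, but the absolute gap grows linearly with usage.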

Cherny's team learned this empirically. Their shared CLAUDE.md files — the persistent instruction sets that guide agent behavior — are kept deliberately concise. The community best practice that emerged targets under 200 lines per instruction file. Large instruction sets often reduce reasoning efficiency and increase cost — attention dilution is real, even if its exact impact varies by task.

Even more telling: for code agent workflows, the Claude Code team tried sophisticated approaches to context retrieval, including local vector databases, recursive model-based indexing, and RAG pipelines. All had downsides, among them stale indexes, permission complexity, and noise. Plain glob and grep, driven by the model itself, outperformed them. The simpler, more targeted approach won because it loaded less noise into context. This pattern applies primarily to code environments, where filenames, symbols, and directory structure already encode semantics; RAG can still be the better choice for general knowledge retrieval where no such structure exists.
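To show how little machinery "glob and grep" requires, here is a minimal sketch of a search tool an agent could be given. It is not Claude Code's implementation; the function name, the output format, and the `max_hits` cap are assumptions made for this example.

```python
import re
from pathlib import Path


def grep(root: str, pattern: str, glob: str = "**/*.py", max_hits: int = 20) -> list[str]:
    """Return 'path:line_no: text' for lines under root that match pattern.

    The cap on hits keeps the tool from flooding the context window.
    """
    regex = re.compile(pattern)
    hits: list[str] = []
    for path in sorted(Path(root).glob(glob)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append(f"{path}:{line_no}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits


# Example call (results depend on the files under the current directory):
# grep(".", r"def \w+_handler", glob="**/*.py")
```

There is no index to go stale and no separate permission model: the tool reads whatever the process can read at the moment it is called, and the model chooses the pattern.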

Skills as Lazy-Loaded Context

A skill, in this architecture, is a self-contained piece of knowledge or workflow that gets loaded only when needed. Think of it as lazy loading for AI context. Instead of dumping everything the agent might need into the context window at startup, you write conditional workflows:

If the task involves API changes:
  1. Read ./docs/api-conventions.md
  2. Read ./docs/breaking-change-policy.md
  3. Run /api-compatibility-check

If the task involves database changes:
  1. Read ./docs/migration-checklist.md
  2. Run /schema-validate

Otherwise:
  Skip the above, proceed with standard workflow

This is a routing tree inside the skill itself. The agent only loads what the current branch requires. It's progressive disclosure applied to AI context management.
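The same routing tree can be expressed as data plus a small selection function. The sketch below reuses the paths and commands from the example above; the keyword matching is a deliberate simplification and an assumption of this example, since in a real skill the model itself decides which branch applies.

```python
import re

# Each branch: (trigger keywords, documents to read, commands to run).
BRANCHES = [
    (
        ("api",),
        ["./docs/api-conventions.md", "./docs/breaking-change-policy.md"],
        ["/api-compatibility-check"],
    ),
    (
        ("database", "schema", "migration"),
        ["./docs/migration-checklist.md"],
        ["/schema-validate"],
    ),
]


def plan_context(task: str) -> tuple[list[str], list[str]]:
    """Return (docs, commands) for the branches the task description triggers.

    Nothing is read here; the function only decides what would be loaded.
    """
    words = set(re.findall(r"[a-z]+", task.lower()))
    docs: list[str] = []
    commands: list[str] = []
    for keywords, branch_docs, branch_commands in BRANCHES:
        if words & set(keywords):
            docs += branch_docs
            commands += branch_commands
    return docs, commands


api_docs, api_commands = plan_context("Rename a field in the public API")
other_docs, other_commands = plan_context("Fix a typo in the README")
```

The README task triggers no branch, so it loads no extra documents at all, which is the point of the pattern.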

Cherny's team operationalizes this: if you do something more than once a day, turn it into a skill or command. They build skills for BigQuery analytics, code review, tech debt analysis, security review — each one a focused, reusable procedure that loads its specific context on demand rather than polluting every session.

Three Principles of Context Engineering

Context engineering is emerging as a real discipline — one that encompasses prompt design, retrieval strategies, context compression, skill routing, memory management, and token budgeting. What follows focuses on one critical slice: how skill design and instruction loading strategy affect agent performance. Three principles are emerging from practice:

Principle 1: Skills with references minimize token utilization. Write skills that point to documents rather than embedding the documents' content. A reference like "read ./docs/migration-checklist.md" costs a few tokens in the skill definition. The actual document only gets loaded when the skill triggers. Across hundreds of agent invocations, this difference compounds into massive cost and quality improvements.

Principle 2: Right-sized agents for right-sized tasks. This goes beyond choosing the right model (Opus for reasoning, Haiku for classification). It means scoping the agent's role narrowly. A small, constrained agent with a focused skill set outperforms a general-purpose agent on the exact same model, because it has less ambiguity in its instructions and less noise in its context. Cherny's team runs specialized subagents — a code-simplifier, a verify-app agent, a build-validator — rather than one agent that does everything.

Principle 3: Conditional workflows minimize loaded context. Skills should implement branching logic that determines what context to load based on the current task. This is the architectural equivalent of database query optimization: don't SELECT * when you only need two columns.

The Most Important Tip: Verification

Across all of Cherny's workflow advice, one principle stands above everything else: give the agent a way to verify its work. If the agent has a feedback loop (tests to run, a browser to check UI, a linter to validate code), output quality improves by roughly two to three times, by Cherny's estimate, compared to unverified generation.

This is the equivalent of test-driven development for agents. The verification infrastructure you build around an agent matters more than the prompt engineering you put into it. Browser testing, test suites, type checkers, simulators — these are the feedback loops that close the gap between "it generated something" and "it generated the right thing."
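A verification loop can be sketched in a few lines: generate, check, and feed any failure message back into the next attempt. The names below are hypothetical, `fake_generate` stands in for a model call, and the use of `exec` is acceptable only because the input is the example's own stub; real agent output should run in a sandbox.

```python
from typing import Callable, Optional


def generate_with_verification(
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Retry generation until verify returns None (no failure message)."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        feedback = verify(candidate)
        if feedback is None:
            return candidate, attempt
    raise RuntimeError(f"no verified output after {max_attempts} attempts: {feedback}")


def fake_generate(task: str, feedback: Optional[str]) -> str:
    """Stand-in for a model call: the first draft is wrong, the revision is right."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def run_tests(source: str) -> Optional[str]:
    """Execute the candidate and return a failure message, or None on success."""
    namespace: dict = {}
    exec(source, namespace)  # trusted stub input only; sandbox real agent output
    got = namespace["add"](2, 3)
    return None if got == 5 else f"add(2, 3) returned {got}, expected 5"


code, attempts = generate_with_verification(fake_generate, run_tests, "write add(a, b)")
```

In this run the first draft fails the check and the second passes. Without `run_tests`, the buggy first draft would have been returned as the final answer.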

The investment in verification compounds over time because every verified outcome can feed back into the agent's instruction set (CLAUDE.md, skills, memory), making future executions more reliable. Cherny's team treats code review as the place to encode standards — tagging @claude on pull requests to update CLAUDE.md with new learnings. It's institutional memory that compounds.

The Synthesis

The architecture pattern that practitioners are converging on: load the minimum context needed for the current task, at the moment it's needed, using the smallest capable model, with a verification loop to confirm the output.

Systems that ignore these principles tend to waste tokens, latency, and compute — and produce less reliable results.

This synthesis — modular agents, lazy-loaded skills, right-sized models, verification feedback loops — is the foundation. But building it correctly is only half the challenge. The other half is knowing how it breaks.

References

Boris Cherny's Claude Code Workflow Tips (Primary Sources):

Part 1 (Jan 2, 2026) — 13 tips: x.com/bcherny thread
Part 2 (Jan 31, 2026) — 10 tips: x.com/bcherny thread
Part 3 (Feb 11, 2026) — 12 tips on customization
Part 4 (Feb 20, 2026) — 5 tips
All tips compiled: howborisusesclaudecode.com

Boris Cherny Interviews & Podcasts:

"Building Claude Code with Boris Cherny" — Gergely Orosz, The Pragmatic Engineer: newsletter.pragmaticengineer.com
"Head of Claude Code: What happens after coding is solved" — Lenny Rachitsky's Podcast: lennysnewsletter.com
"Boris Cherny (Creator of Claude Code) On How His Career Grew" — Developing.dev: developing.dev

Community Resources:

Claude Code Skills & Agents Best Practices: github.com/shanraisshan/claude-code-best-practice

Shaped in collaboration with Claude, an AI assistant by Anthropic, during a rainy and windy Pacific Northwest morning where engineering problems meet philosophical questions.