For most of 2024 and 2025, the default answer was simple: give the task to one agent, use the biggest context window available, and wait. Sometimes it worked. Often, the model quietly lost the thread partway through.
Anthropic described the problem directly: long-horizon tasks require agents to stay coherent across many steps, often beyond what a context window can reliably support. Bigger windows helped, but they did not solve it.
Anthropic had already shipped tools to help. Subagents let the main agent delegate side tasks to isolated workers, each with its own fresh context, collecting summaries back into the main conversation. Skills packaged repeatable workflows into Markdown files — a recipe Claude could follow on demand. Agent teams went further still: multiple independent Claude sessions, each with its own context window, coordinating through a shared task list and messaging each other directly.
All of this was real progress. But each tool still had the same structural ceiling.
With subagents, skills, and agent teams, Claude is the orchestrator: it holds the plan, decides turn by turn what to spawn or assign next, and every result that comes back from a worker lands in its context window. The orchestrating context therefore grows with the number of agents and eventually reaches its limit. At that point the orchestration degrades, and the same failure modes appear.
Anthropic identified three failure modes that appear consistently when one context window, whether it belongs to a single agent or to a lead orchestrating a small team, is responsible for a task too large to track cleanly (Figure 1).

These are not bugs. They are what happens when the plan is a thought, and thoughts degrade.
The cost became hard to ignore in early 2026, when Jarred Sumner, creator of Bun, needed to port, file-by-file, about 750,000 lines of Zig to Rust. In the past, a task like this would have taken a team months. Sumner’s pattern was simple: do one unit of work, run an adversarial review, then apply the changes. He later called Dynamic Workflows “the state of the art today for reliably using agents to complete medium-to-large projects.” The result: 750,000 lines of Rust, 99.8% of the existing test suite passing, and only 11 days from first commit to merge.
The key idea is that Claude does not have to keep the whole plan in its head. The workflow moves the plan into code. The script holds the loop, the branches, and the intermediate results. Claude only needs to handle the current step and the final synthesis. The plan becomes a JavaScript file. It does not forget, drift, or stop halfway and call the job done.
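To make the idea concrete, here is a minimal sketch of a plan held in a script rather than in the model's head. It follows the pattern described above: do one unit of work, review it adversarially, then apply it. The names `doUnit`, `reviewUnit`, and `runPlan` are hypothetical stand-ins, not the Dynamic Workflows API; in a real run each step would be a call to a fresh-context agent.

```javascript
// Hypothetical stand-ins for agent calls; a real workflow would dispatch
// each of these to a fresh-context agent instead of running local code.
function doUnit(file) {
  return { file, patch: `ported ${file}` };
}

function reviewUnit(result) {
  // Adversarial review stub: reject anything with an empty patch.
  return { approved: result.patch.length > 0, result };
}

// The plan lives here: the loop, the branch, and the intermediate results
// are ordinary variables, so nothing depends on the model remembering them.
function runPlan(files) {
  const applied = [];
  const rejected = [];
  for (const file of files) {
    const result = doUnit(file);
    const review = reviewUnit(result);
    if (review.approved) {
      applied.push(result.file);
    } else {
      rejected.push(result.file);
    }
  }
  const done = applied.length + rejected.length === files.length;
  return { applied, rejected, done };
}

console.log(runPlan(['a.zig', 'b.zig', 'c.zig']));
```

The `done` flag is computed from the bookkeeping, not from an agent's opinion, which is the point: the script cannot stop halfway and declare success.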
That is the problem Dynamic Workflows were built to solve. And that is what this article covers.
By the end, you will understand exactly where subagents, skills, and agent teams reach their limits and why — not as a vague intuition, but as a structural argument you can apply to your own tasks. You will know the six composition patterns that cover the majority of real-world workflow problems, how to write a workflow prompt that actually produces a useful harness, and how to avoid the two most expensive mistakes people make when starting out. You will also know when a workflow is the wrong tool — because Dynamic Workflows consume substantially more tokens than a standard session, and reaching for them on the wrong task is its own kind of failure.
A dynamic workflow is like replacing one exhausted person with a small, focused team.
Instead of asking one AI to carry the whole project from start to finish, you split the work into clean pieces. One agent handles one task. Another checks the result. Another moves the work forward. As a result, no one gets tired in the middle and starts cutting corners. No one gives themselves a perfect score just because they wrote the answer. And no one forgets the original brief, because each agent only has to hold one clear piece of the job.
Claude’s dynamic workflow helps you do this. It splits the job across a team of fresh-context Claudes. Each one handles a smaller piece, another layer checks the work, and the results are merged back into one answer for you.
The keyword here is harness. A harness is the scaffolding around the model: the part that decides how a task is planned, divided, checked, and executed. The default Claude Code harness is built mainly for coding tasks, but Anthropic’s team found that dynamic harnesses are “sometimes even more useful for non-technical work.” With a dynamic workflow, Claude creates the harness on the spot, shaped around the task you give it.
Before going further, it helps to separate a workflow from a few other words that often get mixed together. Tools, agents, harnesses, and workflows are often used as if they mean the same thing. They do not. The cleanest way to separate them — I’m borrowing this framing from AlphaSignal — is asking one question: who holds the plan? (Figure 2)

An agent team and a dynamic workflow look alike at first, but they are quite different. The table below compares them, along with subagents.
| | Subagent | Agent team | Dynamic workflow |
| --- | --- | --- | --- |
| Who holds the plan | the main Claude (orchestrator), in its head | the peers, between them | a JavaScript program |
| Lifecycle | fire-and-forget, one job | long-running, ongoing | runs once, returns one answer |
| Talk to each other? | no — the orchestrator routes everything, and a subagent can’t even spawn its own subagents | yes — they coordinate as peers over time | no — agents work off to the side in script variables; only the final result comes back |
| Feels like | an intern you hand one task | colleagues on a shared project | an assembly line you designed |
That raises another question: what does “dynamic” mean, and how does a dynamic harness differ from a static one?
You could always build a harness yourself. You could wire up the Agent SDK, or run claude -p in a loop, and create a fixed system that you use again and again. That is a static harness: useful, repeatable, but designed in advance.
A dynamic harness is the reverse. Claude writes the harness in the moment, shaped around the task you just gave it. It plans the structure, splits the work, runs the agents, checks the outputs, and then throws the harness away when the job is done — unless you press s to save it.
Static harnesses are general-purpose; dynamic ones are tailor-made and disposable.
Claude can build dynamic workflows now because Opus 4.8 is capable enough to design the right harness on the fly. As the Anthropic team said, it is “intelligent enough to write a custom harness tailor-made for your use case.”
Anthropic introduces six workflow patterns, and I ran some tests to show how they work. Two of them appear throughout the tests in this article.
Fan-out-and-synthesize is probably one of the most common patterns. One task splits across several agents, each with its own clean context so they can’t contaminate each other. A synthesize step then waits for all of them and merges their work into one result (Figure 3).
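The shape of fan-out-and-synthesize can be sketched in a few lines of plain JavaScript. This is an illustration under assumptions, not the script Claude generates: `runAgent` is a hypothetical stub standing in for a fresh-context agent, and the roles are example inputs.

```javascript
// Hypothetical agent call: each invocation receives only its own role and
// task, standing in for a fresh, isolated context window.
async function runAgent(role, task) {
  return { role, objection: `${role} objects to: ${task}` };
}

// Fan out: start every agent at once.
// Synthesize: Promise.all is the barrier that waits for everyone,
// then the results are merged into a single object.
async function fanOutAndSynthesize(task, roles) {
  const results = await Promise.all(roles.map((role) => runAgent(role, task)));
  return {
    task,
    objections: results.map((r) => r.objection),
  };
}

fanOutAndSynthesize('subscription plan', ['investor', 'customer', 'competitor'])
  .then((report) => console.log(report));
```

`Promise.all` preserves input order, so the merged list lines up with the list of roles regardless of which agent finishes first.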

Adversarial verification is another common pattern (Figure 4).

The quickest way to understand dynamic workflows is to use one on a problem that has nothing to do with code.
So I gave Claude a plain business plan for a restaurant subscription model and asked it to tear the idea apart from three hostile angles at once: a risk-averse investor, a demanding customer, and an incumbent competitor. Each agent worked independently. Then a final synthesizer pulled the results together and returned the three strongest objections, plus how I could answer them.
Here’s that run (Figure 5), sped up:

This is the fan-out-and-synthesize pattern: three agents fan out across the same problem from different viewpoints, then one agent synthesizes the results. The whole run took about thirteen seconds.
The important part was not the speed. It was the separation. Because the agents did not share the same context window, they did not quietly influence each other or soften each other’s conclusions. Each one came back with a different kind of view.
Here are the answers:
That is what made the workflow useful. It did not just give me “feedback on the business plan.” It gave me three different objections from three different pressure points: economics, customer value, and defensibility. A single chat would probably have blended those into one polite, mildly useful critique. The workflow made the disagreement sharper. And the nicest part: I did not write a line of code.
The setup is small. You switch the model to Opus 4.8 (I’ll explain why later), and you trigger the workflow in one of three ways. The reliable way is to put the word workflow in your prompt. The second way is to set effort to ultracode, which turns on extra-high reasoning and lets Claude decide for itself whether to build a workflow. Be careful with ultracode, though: it costs more tokens, so reach for it only when you want auto-orchestration.
The third way applies once you have saved a workflow that worked well: you can trigger it again through /<name>. There are two save locations:
.claude/workflows/ (project-shared; accessible to everyone who clones the repository)
~/.claude/workflows/ (personal; accessible from all your projects, but only to you)
The reason Opus 4.8 matters is that the orchestrator has the hardest job. It is not just answering the question. It is deciding how to split the task, writing the workflow script, assigning work to sub-agents, choosing tools, tracking outputs, and synthesizing the final result. So the pattern is: use the smartest model for orchestration, then use smaller or cheaper models for the worker agents when the sub-tasks are narrower.
The objective: I take a multi-file repo and ask Claude to audit it with a workflow that combines fan-out-and-synthesize and adversarial verification.
Prompt: audit the repo with a workflow: fan out finders and verify each finding, synthesize a severity-ranked report. use 200k token

As Figure 6 shows, Claude creates a workflow with 3 phases, Find → Verify → Synthesis, and uses 6 finders for 6 dimensions: security, correctness, data integrity, accessibility, code quality, and repo hygiene. Because I did not specify which aspects to look into, Claude chose these 6 on its own.
It started to run the workflow. To check the progress, use the /workflows command.
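The three phases can be sketched as a small pipeline. This is a simplified illustration, not the generated workflow file: `find` and `verify` are hypothetical stubs, and the stubbed verifier rejects every "issue B" only so that the filtering step is visible.

```javascript
// Hypothetical finder: returns candidate issues for one dimension.
async function find(dimension) {
  return [
    { dimension, title: `${dimension} issue A`, severity: 2 },
    { dimension, title: `${dimension} issue B`, severity: 1 },
  ];
}

// Hypothetical verifier: one per candidate, trying to prove it is not real.
// This stub rejects every "issue B" so the effect of verification shows up.
async function verify(candidate) {
  return { ...candidate, confirmed: !candidate.title.endsWith('B') };
}

async function audit(dimensions) {
  // Phase 1 (Find): fan out one finder per dimension.
  const perDimension = await Promise.all(dimensions.map((d) => find(d)));
  const candidates = perDimension.flat();
  // Phase 2 (Verify): fan out one verifier per candidate.
  const verified = await Promise.all(candidates.map((c) => verify(c)));
  // Phase 3 (Synthesis): keep confirmed findings, highest severity first.
  const confirmed = verified
    .filter((v) => v.confirmed)
    .sort((a, b) => b.severity - a.severity);
  return { candidateCount: candidates.length, confirmed };
}

audit(['security', 'correctness']).then((report) => console.log(report));
```

Note that the number of verifiers is not known until the finders return, which is why the verification phase can be far larger than the finder phase.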

Inside /workflows (Figure 7), 6 agents are running, and the bad thing is that they’re all Opus 4.8 and they’re consuming ~50k tokens each. My wallet will run out soon.
After 2 minutes, the finders were all done and had found 50 candidate issues (Figure 8). As a result, 50 verifying agents were launched, one per issue, to check whether each issue was real or just a false positive. All of them used Opus 4.8.
That is usually unnecessary. The orchestrator benefits from the strongest model because it has to design the workflow, split the task, manage the agents, and synthesize the result. But many verification tasks are narrower: check this one issue, inspect the evidence, and decide whether it holds up. For that kind of focused work, a cheaper model is often enough.
Therefore, in the next test, I switched the worker agents to Sonnet. The goal was not to make the workflow weaker. It was to keep Opus where it mattered most — orchestration and synthesis — while using a cheaper model for the repeated verification work.
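One way to picture this split is a role-to-model table that the script consults before spawning an agent. The sketch below is an assumption about how such routing could be expressed; the role names, the short model labels, and `modelFor` are illustrative and not taken from the generated workflow.

```javascript
// Role-to-model routing table. The labels are illustrative; the exact
// identifiers a workflow script accepts are an assumption here.
const MODEL_BY_ROLE = new Map([
  ['orchestrator', 'opus'],
  ['synthesizer', 'opus'],
  ['finder', 'sonnet'],
  ['verifier', 'sonnet'],
]);

// Fail loudly on an unknown role instead of silently falling back to the
// most expensive model.
function modelFor(role) {
  if (!MODEL_BY_ROLE.has(role)) {
    throw new Error(`no model configured for role "${role}"`);
  }
  return MODEL_BY_ROLE.get(role);
}

console.log(modelFor('verifier'));
console.log(modelFor('synthesizer'));
```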

Another try with Sonnet as agents and Opus as orchestrator and synthesizer.
Prompt: audit the repo with a workflow: fanout finders and verify each finding, synthesize a severity-ranked report. use 200k token. Use Sonnet for all agents and Opus as orchestrator and synthesizer
In Figure 9, Claude launched 7 finder agents on Sonnet 4.6, which used 254k tokens and took 5 minutes 17 seconds to find 71 candidate issues. Sonnet clearly took longer than Opus here.

You can check verification details of each issue in the workflows window as in Figure 10.

Verifying the 71 issues consumed roughly 1.5 million tokens. The run cost much less than the Opus one, but the finder agents took significantly longer.
Here is the result of the synthesizer (Opus 4.8) in Figure 11.

The important thing is to read the report, then review and revise it, before putting Claude to work on the code.
The finder agents still flagged several issues that the verifying agents later confirmed but that are intentional: the app is designed to behave that way, so reporting them only creates more checking work for us. I therefore wanted to add some constraints to the workflow before it ran, so that these issues would not be picked up during scanning.
Prompt: audit the repo with a workflow: fanout finders and verify each finding, synthesize a severity-ranked report. use 200k token. Use Sonnet for all agents and Opus as orchestrator and synthesizer. Write the workflow and give me the link to access and revise it before running.

Perfect. Claude gives me the workflow script to review and revise, and I can then start it simply by telling Claude to run the workflow (Figure 12).
I used a shorter codebase and simpler prompt to demonstrate the components of the JavaScript workflow file in Figure 13.

For my testing codebase, here is the scope that I want to revise:
{
  key: 'correctness',
  prompt: `Audit for CORRECTNESS / LOGIC bugs. Focus: the deterministic date-based daily pick, shuffle behavior, the "last 5 worn excluded" history logic (off-by-one, wraparound, per-wardrobe isolation), wardrobe-gender switching, 2-piece/3-piece filter, theme auto-switch by hour (6am-6pm boundaries), localStorage key handling. Trace edge cases (empty male wardrobe, all outfits recently worn). Read app.js and collection.js.`,
},
{
  key: 'docs-accuracy',
  prompt: `Audit DOCUMENTATION ACCURACY. Compare README.md and docs/*.md claims against actual code behavior. Focus: features described that don't match implementation, wrong localStorage keys, stale config, deployment steps that won't work, outdated counts ("all 40 outfits"). Read README.md, docs/codebase-summary.md, docs/deployment-guide.md, then verify against the code.`,
},
I removed shuffle behavior, theme auto-switch by hour (6am-6pm boundaries), the “Trace edge cases (empty male wardrobe, all outfits recently worn)” instruction, and the entire 'docs-accuracy' finder. I also checked the rest of the JS file to make sure those points were gone everywhere.
You can also ask Claude to exclude that, but this is simple, so I prefer to do this myself.
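If you would rather script the edit than make it by hand, the same narrowing can be expressed as a filter over the finder definitions. The sketch below assumes the finders are plain objects in the same { key, prompt } shape as the excerpt above; the prompts are abbreviated and `narrowScope` is a hypothetical helper, not part of the generated file.

```javascript
// Abbreviated, hypothetical finder definitions in { key, prompt } shape.
const finders = [
  {
    key: 'correctness',
    prompt: 'Audit for CORRECTNESS / LOGIC bugs. Focus: daily pick, shuffle behavior, history logic.',
  },
  { key: 'docs-accuracy', prompt: 'Audit DOCUMENTATION ACCURACY.' },
];

// Drop whole finders by key, and strip out-of-scope phrases from the
// prompts of the finders that remain.
function narrowScope(defs, excludedKeys, excludedPhrases) {
  return defs
    .filter((def) => !excludedKeys.includes(def.key))
    .map((def) => ({
      ...def,
      prompt: excludedPhrases.reduce(
        (text, phrase) => text.split(phrase).join(''),
        def.prompt
      ),
    }));
}

const narrowed = narrowScope(finders, ['docs-accuracy'], ['shuffle behavior, ']);
console.log(narrowed);
```

Plain string removal is fragile if the wording changes between runs, so it is still worth reading the edited prompts before starting the workflow.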
So the number of aspects the finder agents look for drops from 7 to 6, and one aspect has a smaller scope (Figure 14).

Six finder agents found 44 distinct candidate issues, and the verifiers confirmed 40 of them. The whole process used 51 agents, took 9 minutes and 52 seconds, and consumed ~1.66 million tokens.
I ran the same codebase with a single agent in one pass: no team, no verification. It found 47 issues, more than the workflow’s 44, for a third of the tokens. However, because it ran no verification, those 47 still include the same 2 wrong findings that the verifier agents in the workflow had caught and removed. The chart below shows the comparison (Figure 15).
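The agent count is consistent with a simple breakdown: one finder per dimension, one verifier per candidate issue, and one synthesizer. That breakdown is my reading of the numbers above rather than something the run reports directly, but it is a handy way to estimate the size of a run before starting it.

```javascript
// Estimated agent count for a find -> verify -> synthesize workflow,
// assuming one verifier per candidate and a single synthesizer.
function agentCount(finders, candidates, synthesizers = 1) {
  return finders + candidates + synthesizers;
}

console.log(agentCount(6, 44)); // prints 51
```

Because the verifier count tracks the number of candidates, a noisy finder phase is what makes a workflow expensive.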

If you focus on raw coverage and don’t mind reviewing the findings yourself, the single agent is the more economical choice, with a trade-off in quality.
Dynamic workflows use a lot more tokens than a normal Claude Code session. That’s because they run several sub-agents in the background, and each one works in its own separate context window. So you shouldn’t use them for every task. If you do, you can burn through your plan in just a few hours. The better approach is to use them only when the task truly needs multiple agents working in parallel. Figure 16 lists a few key signals that can help you decide when a workflow is worth using.

Dynamic Workflows are expensive, so I wanted to test whether the cheapest model, Haiku, could save tokens and cost. We cannot change the orchestrator and synthesizer; they must be Opus, that’s non-negotiable. So let’s switch the subagents to Haiku.
Surprisingly, the workflow finished in ~7.5 minutes, using 37 agents and 1.35 million tokens. It found 23 candidate issues, far fewer than the Sonnet run above, and all 23 survived verification.
But the cost story was not as simple as “cheaper model, cheaper workflow.” Haiku found only 23 issues with 1.35 million tokens. The Sonnet version found 40 issues with 1.66 million tokens. So even though Haiku is cheaper per token, the token efficiency was worse. It needed more turns to do the same kind of analytical work, and every extra turn meant re-reading more context. The lesson is simple: a smaller model is not automatically cheaper in practice. If it takes more steps to think through the task, it can burn through its price advantage very quickly.
Haiku costs roughly one-third as much as Sonnet per token. On paper, that looks like an easy win. But in this test, Haiku used about 1.4 times as many tokens per confirmed issue (roughly 59k versus 42k), which eats into that price advantage. In the end, the Haiku fan-out was roughly the same cost as Sonnet, maybe around 10% cheaper, and only slightly faster in real time. So “just route everything to the smallest model” is not a reliable rule. A smaller model can lose its price advantage if it needs more tokens to get the job done.
One more note about quality, which I think is quite important. There were 14 issues that appeared in both versions. That was quite surprising, and it suggests that the agents were doing useful work even though they were isolated from each other. However, there were also 2 issues where the two versions disagreed. Surprisingly, Haiku was right on both, while Sonnet was wrong. This does not show which model is better; it shows that neither model performs with perfect consistency. One reason is that I gave Claude a vague and broad prompt. So next I tested with a more specific focus.
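The token-efficiency comparison can be checked from the run totals reported above. The sketch below only divides total tokens by confirmed issues; it makes no claim about billing, because the totals cover the whole run, including the Opus orchestration and synthesis.

```javascript
// Tokens spent per confirmed issue, using whole-run totals.
function tokensPerConfirmedIssue(totalTokens, confirmedIssues) {
  return totalTokens / confirmedIssues;
}

const sonnetPerIssue = tokensPerConfirmedIssue(1660000, 40);
const haikuPerIssue = tokensPerConfirmedIssue(1350000, 23);

console.log(
  Math.round(sonnetPerIssue),
  Math.round(haikuPerIssue),
  (haikuPerIssue / sonnetPerIssue).toFixed(2)
);
// prints: 41500 58696 1.41
```

So although the Haiku run used fewer tokens in total, each confirmed issue cost it about 1.4 times as many tokens as in the Sonnet run.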
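Comparing two runs comes down to a set comparison over their findings. The sketch below assumes each finding has already been reduced to a stable ID; the IDs are made up for illustration, and in practice deciding that two differently worded findings are the same issue is the hard part.

```javascript
// Split the findings of two runs into shared and run-specific groups.
// IDs are hypothetical; real findings would need normalizing first
// (for example, file path plus a short rule name).
function compareRuns(runA, runB) {
  const setA = new Set(runA);
  const setB = new Set(runB);
  return {
    both: runA.filter((id) => setB.has(id)),
    onlyA: runA.filter((id) => !setB.has(id)),
    onlyB: runB.filter((id) => !setA.has(id)),
  };
}

const comparison = compareRuns(
  ['storage-key', 'history-off-by-one', 'missing-alt-text'],
  ['history-off-by-one', 'storage-key', 'unused-import']
);
console.log(comparison);
```

The `both` group is the most trustworthy part of the output, since two isolated runs arrived at it independently.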
New prompt: audit the repo in term of security vulnerabilities, including secrets, auth, injection, dependencies, data handling, with a workflow: fanout finders and verify each finding, synthesize a severity-ranked report. use 200k token. Use Haiku for all agents and Opus as orchestrator and synthesizer. Write the workflow and give me the link to access and revise it before running.
How the run of Haiku went:
And for Sonnet agents:
One important detail: all 3 issues confirmed in the Haiku run were also found in the Sonnet run. That is more consistent than the previous run. One possible reason is that this time the prompt gave the agents a specific angle to investigate, instead of asking them to look at the whole system from a broad view. That makes sense. The workflow used 5 agents, and each agent focused only on one aspect of security. Because the scope was narrower, the agents could dig deeper into the same type of problem instead of spreading their attention across too many possible issue categories. When an agent isn’t forced to prioritize across a wide surface area, it naturally spends more of its reasoning budget on the specific problem it was handed — and that leads to more thorough, reproducible findings.
Hence, even if you’re using Dynamic Workflows with isolated subagents, your prompt still needs to be as specific as possible. Narrower prompts reduce run-to-run variance and push agents toward the same conclusions, which is exactly what you want when consistency and reliability matter.
A useful saved workflow should feel like project automation, not like a transcript of one lucky run. It should be clean enough that another teammate can open it and quickly understand: who owns it, what inputs it expects, which tools it is allowed to use, what each sub-agent is responsible for, and what level of proof is required before the workflow can call the task done.
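One lightweight way to enforce that checklist is to keep a small metadata object next to the script and refuse to save a workflow that leaves any of it blank. The field names below (`owner`, `inputs`, `allowedTools`, `agents`, `doneWhen`) are hypothetical, chosen to mirror the list above; they are not a format Claude Code defines.

```javascript
// Hypothetical metadata fields mirroring the checklist: owner, inputs,
// allowed tools, per-agent responsibilities, and the required proof.
const REQUIRED_FIELDS = ['owner', 'inputs', 'allowedTools', 'agents', 'doneWhen'];

// Return the names of fields that are absent or empty.
function missingFields(meta) {
  return REQUIRED_FIELDS.filter((field) => {
    const value = meta[field];
    if (value === undefined || value === null) return true;
    if (typeof value === 'string' || Array.isArray(value)) return value.length === 0;
    return false;
  });
}

const exampleMeta = {
  owner: 'platform-team',
  inputs: ['repo path'],
  allowedTools: ['read-only file access'],
  agents: [{ role: 'finder', responsibility: 'list candidate issues' }],
  doneWhen: 'every finding has a verifier verdict',
};

console.log(missingFields(exampleMeta)); // prints []
```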
If the workflow worked well and you want to reuse it, press s in the workflow menu to save it to ~/.claude/workflows. You can also move the script into a skill if the goal is to share the method with your team and make it easier to reuse across similar tasks.
But don’t save a workflow just because the first run succeeded. A successful run only proves that it worked once. Save it when the orchestration itself is valuable: when the script is easier to inspect, reuse, and improve than writing a normal Claude Code prompt again from scratch.
Below are some suggestions for prompts for your reference. Add your details when you want to use one of them:
Stress-test a plan: “Take the plan below and run a workflow where separate agents tear it apart — a skeptical investor, a hard-to-please customer, an incumbent competitor — each independent. Then synthesize the three sharpest objections and the strongest answer to each.”
Audit a repo: “Run a workflow to audit this repository. Fan out agents for logic bugs, unsafe routes, weak auth, missing authorization, exposed secrets, risky dependencies, and data leaks. For each finding, spawn a separate agent to adversarially verify it: try to prove it’s not real. Synthesize a severity-ranked report with file paths and fixes. use 200k tokens.”
Make it cheap: “Build it so the finder agents run on model: 'haiku' while the orchestrator stays on Opus 4.8 and does the final synthesis. Report tokens and wall-clock time.”
Reproduce a flaky test: “This test fails maybe 1 in 50 runs. Set up a workflow to reproduce it: form theories and adversarially test them in worktrees. /goal don’t stop until one theory works.”
Verify a draft: “Go through this draft and use a workflow to verify every technical claim against the codebase and sources. I don’t want to ship anything wrong.”
Rank by real priority (tournament): “I have a list of findings/options. Use a workflow to rank them by [real exploitability / impact / whatever matters] — but instead of scoring each one, run a pairwise tournament and rank by who wins. Then show me the top three and why.”
Root-cause a heisenbug: “This bug is intermittent and the obvious cause looks wrong. Use a workflow: split the investigation by evidence — one agent on the symptoms, one on the code, one on the data/logs — then have separate agents try to refute each theory, and synthesize the cause that survives.”
Triage a backlog safely: “Use a workflow to triage this backlog: classify each item (fix-now / escalate / needs-a-decision), dedupe into families, and route. Anything that reads untrusted input must be read-only — keep it separate from whatever proposes changes.”
Route by task shape: “Use a workflow with a classifier that looks at each task and routes it to the cheapest capable model — small models for mechanical work, Opus for the ambiguous, security-critical reasoning — then runs each on its chosen model.”
Check house rules: “Use a workflow to check this code against our rules in CLAUDE.md — one verifier per rule, plus a skeptic that hunts for false positives. I care more about not crying wolf than about catching every nit.”
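The tournament prompt above describes an algorithm worth seeing in code. Below is a minimal round-robin sketch: every pair of items is compared once, and items are ranked by how many comparisons they win. The `judge` function and the sample findings are hypothetical; in a workflow, each comparison would be a separate agent asked which of the two items matters more.

```javascript
// Hypothetical judge: stands in for an agent that compares two items.
// Here it simply prefers the higher numeric impact.
function judge(a, b) {
  return a.impact >= b.impact ? a : b;
}

// Round-robin tournament: every pair meets once; rank by number of wins.
function tournamentRank(items, pickWinner) {
  const wins = new Map(items.map((item) => [item.id, 0]));
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const winner = pickWinner(items[i], items[j]);
      wins.set(winner.id, wins.get(winner.id) + 1);
    }
  }
  // Copy before sorting so the caller's array is left untouched.
  return [...items].sort((x, y) => wins.get(y.id) - wins.get(x.id));
}

const findings = [
  { id: 'xss', impact: 7 },
  { id: 'sqli', impact: 9 },
  { id: 'typo', impact: 1 },
];

console.log(tournamentRank(findings, judge).map((f) => f.id));
// prints: [ 'sqli', 'xss', 'typo' ]
```

A full round-robin needs n(n-1)/2 comparisons, so with one agent per comparison the cost grows quickly; for long lists it may be worth pre-filtering before running the tournament.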