A Harness for Every Task: Putting a Team of Claudes on One Job

1.

For most of 2024 and 2025, the default answer was simple: give the task to one agent, use the biggest context window available, and wait. Sometimes it worked. Often, the model quietly lost the thread partway through.

Anthropic described the problem directly: long-horizon tasks require agents to stay coherent across many steps, often beyond what a context window can reliably support. Bigger windows helped, but they did not solve it.

Anthropic had already shipped tools to help. Subagents let the main agent delegate side tasks to isolated workers, each with its own fresh context, collecting summaries back into the main conversation. Skills packaged repeatable workflows into Markdown files — a recipe Claude could follow on demand. Agent teams went further still: multiple independent Claude sessions, each with its own context window, coordinating through a shared task list and messaging each other directly.

All of this was real progress. But each tool still had the same structural ceiling.

With subagents, skills, and agent teams, Claude is still the orchestrator: it holds the plan, decides turn by turn what to spawn or assign next, and receives every result a worker sends back. All of those results accumulate in the orchestrator’s context window, so that context grows with the number of agents and eventually reaches its limit. As a result, orchestration degrades, and the same failure modes appear.

Anthropic identified three failure modes that appear consistently when one context window, whether it belongs to a single agent or to a lead orchestrating a small team, is responsible for a task too large to track cleanly (Figure 1).

Figure 1. One mind, one context window, and the three ways it quietly fails on a big job. Image by author with help of ChatGPT

First, agentic laziness. The AI starts the task but does not fully finish. It may stop early, skip some files, or assume the remaining work is similar enough. Then it confidently says the whole task is done. This is like a person checking only part of a long spreadsheet but marking the entire spreadsheet as reviewed.

Second, Self-preferential bias. The AI is not very strict when judging its own output. If you ask it, “Did you follow the instructions?” it often says yes, because it tends to give itself the benefit of the doubt. It may miss its own mistakes or overrate the quality of its answer.

Third, Goal drift. Over a long task, the AI slowly loses track of the original goal. It may remember the main task, but forget important details like “do not include X”, “do not skip any file” or “only use this format”. The longer the conversation or task becomes, the more likely this drift happens.

These are not bugs. They are what happens when the plan is a thought, and thoughts degrade.

The cost became hard to ignore in early 2026, when Jarred Sumner, creator of Bun, needed to port, file-by-file, about 750,000 lines of Zig to Rust. In the past, a task like this would have taken a team months. Sumner’s pattern was simple: do one unit of work, run an adversarial review, then apply the changes. He later called Dynamic Workflows “the state of the art today for reliably using agents to complete medium-to-large projects.” The result: 750,000 lines of Rust, 99.8% of the existing test suite passing, and only 11 days from first commit to merge.

The key idea is that Claude does not have to keep the whole plan in its head. The workflow moves the plan into code. The script holds the loop, the branches, and the intermediate results. Claude only needs to handle the current step and the final synthesis. The plan becomes a JavaScript file. It does not forget, drift, or stop halfway and call the job done.
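To make "the plan becomes code" concrete, here is a minimal sketch of the work, review, apply loop described above. The `runAgent` function is a hypothetical stub that returns canned results; it is not the real workflow API, and the file names are invented for illustration.

```javascript
// Hypothetical stand-in for spawning a fresh-context agent. This is NOT the
// real workflow API; it returns canned results so the sketch can run.
async function runAgent(role, input) {
  if (role === "review") {
    // Simulated adversarial reviewer: rejects any draft that mentions "unsafe".
    return { approved: !input.draft.includes("unsafe") };
  }
  // Simulated worker: "ports" one file by tagging its name.
  return { draft: `ported:${input.file}` };
}

// The plan is ordinary control flow, so it cannot be forgotten mid-task.
async function portAll(files) {
  const results = {}; // intermediate results live in a variable
  for (const file of files) { // the loop visits every file, no skipping
    const work = await runAgent("work", { file });
    const review = await runAgent("review", { draft: work.draft });
    // The branch: only work that passes review is applied.
    results[file] = review.approved ? "applied" : "needs-rework";
  }
  return results;
}

portAll(["a.zig", "unsafe_ffi.zig"]).then((r) => console.log(r));
```

The point of the sketch is where the state sits: the loop counter, the branch, and the `results` map belong to the script, so no model has to remember them.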

That is the problem Dynamic Workflows were built to solve. And that is what this article covers.

By the end, you will understand exactly where subagents, skills, and agent teams reach their limits and why — not as a vague intuition, but as a structural argument you can apply to your own tasks. You will know the six composition patterns that cover the majority of real-world workflow problems, how to write a workflow prompt that actually produces a useful harness, and how to avoid the two most expensive mistakes people make when starting out. You will also know when a workflow is the wrong tool — because Dynamic Workflows consume substantially more tokens than a standard session, and reaching for them on the wrong task is its own kind of failure.

2. What a dynamic workflow is

A dynamic workflow is like replacing one exhausted person with a small, focused team.

Instead of asking one AI to carry the whole project from start to finish, you split the work into clean pieces. One agent handles one task. Another checks the result. Another moves the work forward. As a result, no one gets tired in the middle and starts cutting corners. No one gives themselves a perfect score just because they wrote the answer. And no one forgets the original brief, because each agent only has to hold one clear piece of the job.

Claude’s dynamic workflow helps you do this. It splits the job across a team of fresh-context Claudes. Each one handles a smaller piece, another layer checks the work, and the results are merged back into one answer for you.

The keyword here is harness. A harness is the scaffolding around the model: the part that decides how a task is planned, divided, checked, and executed. The default Claude Code harness is built mainly for coding tasks. Anthropic’s team found that these dynamic harnesses are “sometimes even more useful for non-technical work.” A dynamic workflow builds the harness on the spot, shaped around the task you give it.

Before going further, it helps to separate a workflow from a few other words that often get mixed together. Tools, agents, harnesses, and workflows are often used as if they mean the same thing. They do not. The cleanest way to separate them — I’m borrowing this framing from AlphaSignal — is asking one question: who holds the plan? (Figure 2)

Figure 2. Subagents vs. agent teams vs. dynamic workflow. One question: who holds the plan? Image by author with help of ChatGPT

A subagent is a helper the main Claude sends out for one specific job. The plan still stays with the main Claude. The subagent does its part, sends the result back, and that result appears in your chat. It is mostly fire-and-forget. As the table below shows, a subagent cannot create its own helpers or talk to other subagents.
An agent team is different. It is a group of Claudes working side by side, coordinating as peers. The plan does not sit inside one Claude. It lives between them. They can message each other, adjust as the work unfolds, and continue across one larger shared task. It is more like giving a project to a small team.
A dynamic workflow is different again. Claude writes a small JavaScript program for the task itself. In this case, the plan lives in code. The agents do their work off to the side, their outputs are stored in variables, and only the final merged answer comes back to you.

An agent team and a dynamic workflow can look alike, but they differ in where the plan lives and in how the agents interact. The table below lays out the differences.

	Subagent	Agent team	Dynamic workflow
Who holds the plan	the main Claude (orchestrator), in its head	the peers, between them	a JavaScript program
Lifecycle	fire-and-forget, one job	long-running, ongoing	runs once, returns one answer
Talk to each other?	no — the orchestrator routes everything, and a subagent can’t even spawn its own subagents	yes — they coordinate as peers over time	no — agents work off to the side in script variables; only the final result comes back
Feels like	an intern you hand one task	colleagues on a shared project	an assembly line you designed

You might also ask: what makes a workflow dynamic, and how does it differ from a static one?

You could always build a harness yourself. You could wire up the Agent SDK, or run claude -p in a loop, and create a fixed system that you use again and again. That is a static harness: useful, repeatable, but designed in advance.

A dynamic harness is the reverse. Claude writes the harness in the moment, shaped around the task you just gave it. It plans the structure, splits the work, runs the agents, checks the outputs, and then throws the harness away when the job is done — unless you press s to save it.

Static harnesses are general-purpose; dynamic ones are tailor-made and disposable.

Claude can build dynamic workflows now because Opus 4.8 is capable enough to design the right harness on the fly. As the Anthropic team put it, the model is “intelligent enough to write a custom harness tailor-made for your use case.”

3. The real test

3.1 Patterns that make dynamic workflows useful

Anthropic introduces six workflow patterns, and I ran some tests with them to show how they work in practice. They are:

Fan-out-and-synthesize — split the work, then merge them. Each piece gets its own agent and clean context; a final synthesizer waits for all of them before combining results.
Adversarial verification — for every finding, spawn a separate agent whose only job is to disprove it. A skeptic checking the optimist.
Classify-and-act — use a classifier agent to sort each item first, then route it to the right handler. A front desk.
Generate-and-filter — brainstorm wide, then filter by a rubric: dedupe, verify, keep only what survives scrutiny.
Tournament — spawn N agents that each attempt the same task differently, then have a judge agent compare them in pairs until one wins. Good for taste and naming.
Loop-until-done — for jobs of unknown size, keep spawning agents until a stop condition is met (no new findings, no more errors) rather than a fixed number of passes.
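Two of the patterns in this list, tournament and loop-until-done, are easiest to see as control flow. The sketch below is an illustration under stated assumptions: `judge` and `findPass` are hypothetical stand-ins for agent calls, the candidate names are invented sample data, and the `maxPasses` cap is my own safety addition rather than part of the pattern.

```javascript
// Hypothetical judge: in a real workflow this would be an agent call.
// Here the "better" candidate is simply the shorter name.
function judge(a, b) {
  return a.length <= b.length ? a : b;
}

// Tournament: compare candidates in pairs until one remains.
function tournament(candidates) {
  let round = [...candidates];
  while (round.length > 1) {
    const next = [];
    for (let i = 0; i < round.length; i += 2) {
      // An odd candidate out advances without a match.
      next.push(i + 1 < round.length ? judge(round[i], round[i + 1]) : round[i]);
    }
    round = next;
  }
  return round[0];
}

// Loop-until-done: keep running passes until a pass finds nothing new.
// `findPass` is a hypothetical stand-in for spawning a finder agent.
function loopUntilDone(findPass, maxPasses = 10) {
  const seen = new Set();
  for (let pass = 0; pass < maxPasses; pass++) {
    const fresh = findPass(pass).filter((x) => !seen.has(x));
    if (fresh.length === 0) break; // stop condition: no new findings
    fresh.forEach((x) => seen.add(x));
  }
  return [...seen];
}

// Invented sample data for the demo.
console.log(tournament(["Forkful", "DinnerLoop", "Sup"]));
const passes = [["bug-1", "bug-2"], ["bug-2", "bug-3"], ["bug-3"]];
console.log(loopUntilDone((i) => passes[i] ?? []));
```

In both functions the stop rule is explicit in the code, which is what distinguishes these patterns from asking one agent to "keep going until it feels finished".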

Fan-out-and-synthesize is probably the most common pattern. One task splits across several agents, each with its own clean context so they can’t contaminate each other, and then a synthesize step, a barrier that waits for everyone, merges their work into one result (Figure 3).

Figure 3. Fan-out-and-synthesize: split into clean-context agents, then a barrier merges everyone’s work into one result. Image by author with help of ChatGPT
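The barrier in this pattern maps naturally onto `Promise.all`. The following is a minimal sketch, assuming a hypothetical `runAgent` stub in place of a real agent call; the piece names are placeholders.

```javascript
// Hypothetical stand-in for spawning one fresh-context agent.
async function runAgent(piece, task) {
  return `${piece}: report on "${task}"`;
}

async function fanOutAndSynthesize(task, pieces) {
  // Fan out: start every agent at once; none sees another's output.
  const pending = pieces.map((p) => runAgent(p, task));
  // Barrier: Promise.all resolves only after every agent has finished.
  const reports = await Promise.all(pending);
  // Synthesize: a real workflow would hand `reports` to one more agent.
  return { count: reports.length, merged: reports.join("\n") };
}

fanOutAndSynthesize("audit the repo", ["security", "correctness", "docs"])
  .then((out) => console.log(out.merged));
```

Because each call receives only its own piece and the shared task, the isolation between agents is structural rather than something the agents have to respect.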

Adversarial verification is another common pattern (Figure 4).

Figure 4. Adversarial verification: a finding faces a panel of refuters; majority-refute kills it, the rest survive. Image by author with help of ChatGPT
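The majority-refute rule in the figure can be sketched as a small voting function. This is an illustration only: `tryToRefute` is a hypothetical stub, and the numeric `evidence` score is an invented device that makes the stub's votes differ, not something a real refuter agent would receive.

```javascript
// Hypothetical refuter: a real one would be an agent told to disprove the
// finding. This stub refutes when the evidence score is at or below its id,
// so stricter refuters (higher id) refute more often.
async function tryToRefute(finding, refuterId) {
  return { refuterId, refuted: finding.evidence <= refuterId };
}

// Returns true when the finding survives the panel.
async function verify(finding, panelSize = 3) {
  const votes = await Promise.all(
    Array.from({ length: panelSize }, (_, i) => tryToRefute(finding, i))
  );
  const refutes = votes.filter((v) => v.refuted).length;
  // Majority-refute kills the finding; anything else survives.
  return refutes <= panelSize / 2;
}

async function filterFindings(findings) {
  const verdicts = await Promise.all(findings.map((f) => verify(f)));
  return findings.filter((_, i) => verdicts[i]);
}

// Invented sample findings for the demo.
filterFindings([
  { id: "A", evidence: 0 },
  { id: "B", evidence: 2 },
  { id: "C", evidence: 1 },
]).then((kept) => console.log(kept.map((f) => f.id)));
```

With a panel of three, a finding needs at most one refutation to survive, so a single overly harsh refuter cannot kill a real issue on its own.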

3.2 Dynamic Workflow on a non-technical problem

The quickest way to understand dynamic workflows is to use one on a problem that has nothing to do with code.

So I gave Claude a plain business plan for a restaurant subscription model and asked it to tear the idea apart from three hostile angles at once: a risk-averse investor, a demanding customer, and an incumbent competitor. Each agent worked independently. Then a final synthesizer pulled the results together and returned the three strongest objections, plus how I could answer them.

Here’s that run (Figure 5), sped up:

Figure 5. Fan-out-and-synthesize: three critics attack a business plan in parallel; one Opus agent synthesizes. 4 agents, ~262k tokens, ~13s. *Image by author*

This is the fan-out-and-synthesize pattern: three agents fan out across the same problem from different viewpoints, then one agent synthesizes the results. The whole run took about thirteen seconds.

The important part was not the speed. It was the separation. Because the agents did not share the same context window, they did not quietly influence each other or soften each other’s conclusions. Each one came back with a different kind of view.

Here are the answers:

The investor attacked the math: The economics are too thin to survive churn. At $29/month and roughly 40% margin, the product makes only about $11.60 in gross profit per customer per month. With a $35 customer acquisition cost, the business needs customers to stay long enough for lifetime value to clearly beat acquisition cost. But food subscriptions usually face churn, and one weak retention month can push the model underwater. Answer: fix the unit economics before scaling: increase revenue per user through annual plans or add-ons, prove low cohort churn, and model LTV-to-CAC explicitly.
The customer attacked the value: The pitch leaned too hard on ideas like rotating menus and carbon-neutral delivery. Those may sound good in a deck, but they may not be what customers care about most when choosing dinner. Most customers want speed, flexibility, and less daily decision-making. Answer: make the value more practical: lead with time saved, convenience, and how the service makes weeknight meals easier.
The competitor attacked the moat: A rotating menu and carbon-neutral delivery can be copied quickly. Neither creates much switching cost. A larger competitor could imitate the surface-level features, undercut the price, or bundle the offer into an existing delivery network. Answer: build a stronger moat: per-city logistics density, personalization, switching credits, or habits that make the service harder to replace.
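The investor's objection is plain arithmetic, so it helps to run the numbers. The price, margin, and acquisition cost below come from the plan; the churn rates are my own illustrative assumptions, and the constant-churn lifetime formula (expected lifetime of 1 / churn months) is a common simplification rather than anything the agents reported.

```javascript
const price = 29;    // $/month, from the plan
const margin = 0.4;  // gross margin, from the plan
const cac = 35;      // customer acquisition cost in $, from the plan

const grossProfit = price * margin;      // gross profit per customer per month
const paybackMonths = cac / grossProfit; // months needed to recover CAC

// ASSUMPTION: constant monthly churn, so expected lifetime is 1 / churn months.
function ltvToCac(monthlyChurn) {
  const ltv = grossProfit / monthlyChurn;
  return ltv / cac;
}

console.log(`gross profit per month: $${grossProfit.toFixed(2)}`);
console.log(`CAC payback: ${paybackMonths.toFixed(2)} months`);
// ASSUMPTION: 10% and 25% monthly churn are illustrative values only.
for (const churn of [0.1, 0.25]) {
  console.log(`churn ${churn}: LTV/CAC = ${ltvToCac(churn).toFixed(2)}`);
}
```

A customer has to stay about three months just to repay the acquisition cost, which is why one weak retention month matters so much in this model.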

That is what made the workflow useful. It did not just give me “feedback on the business plan.” It gave me three different objections from three different pressure points: economics, customer value, and defensibility. A single chat would probably have blended those into one polite, mildly useful critique. The workflow made the disagreement sharper. And the nicest part: I did not write a line of code.

3.3 Enable dynamic workflows

The setup is small. You switch the model to Opus 4.8 (I’ll explain why below), and you trigger the workflow in one of three ways. The reliable way is to put the word workflow in your prompt. The second is to set effort to ultracode, which turns on extra-high reasoning and lets Claude decide for itself whether to build a workflow. Be careful with ultracode, though: it costs more tokens, so reach for it only when you want auto-orchestration.

The third applies once you have saved a workflow that worked well: you can run it again with /<name>. There are two save locations:

.claude/workflows/ (project-shared; available to everyone who clones the repository)
~/.claude/workflows/ (personal; available in all your projects, but only to you)

The reason Opus 4.8 matters is that the orchestrator has the hardest job. It is not just answering the question. It is deciding how to split the task, writing the workflow script, assigning work to sub-agents, choosing tools, tracking outputs, and synthesizing the final result. So the pattern is: use the smartest model for orchestration, then use smaller or cheaper models for the worker agents when the sub-tasks are narrower.
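One way to picture this split is a role-to-model table that the script consults when it builds each agent. This is a sketch under assumptions: the role names, the shape of the spec object, and the short model strings are all illustrative and are not taken from the actual workflow format.

```javascript
// ASSUMPTION: role names, model strings, and the spec shape are illustrative.
// Only the idea (strongest model for orchestration and synthesis, cheaper
// models for narrow worker tasks) comes from the article.
const MODEL_BY_ROLE = {
  orchestrator: "opus",
  synthesizer: "opus",
  finder: "sonnet",
  verifier: "sonnet",
};

function agentSpec(role, prompt) {
  const model = MODEL_BY_ROLE[role];
  if (!model) throw new Error(`Unknown role: ${role}`);
  return { role, model, prompt };
}

console.log(agentSpec("verifier", "Check this one finding against the code."));
```

Keeping the mapping in one place means a later experiment, such as swapping the workers to a smaller model, is a one-line change.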

3.4 Let’s test them out

3.4.1 Default approach

The objective: I take a multi-file repo and ask Claude to audit it with a workflow that combines fan-out-and-synthesize and adversarial verification.

Prompt: audit the repo with a workflow: fan out finders and verify each finding, synthesize a severity-ranked report. use 200k token

Figure 6: Claude Code answer for the workflow creation. *Image by author*

As Figure 6 shows, Claude creates a workflow with 3 phases (Find → Verify → Synthesize) and uses 6 finders for 6 dimensions: security, correctness, data integrity, accessibility, code quality, and repo hygiene. Because I did not specify which aspects to look into, Claude chose these 6 on its own.

The workflow then started to run. To check progress, use the /workflows command.

Figure 7: Workflow progress. *Image by author*

Inside /workflows (Figure 7), 6 agents are running. The bad news is that they are all Opus 4.8 and each is consuming ~50k tokens. My wallet will run out soon.

After 2 minutes, the finders were all done and had found 50 candidate issues (Figure 8). That means 50 verifier agents, one per issue, each checking whether its issue is real or a false positive. And all of them use Opus 4.8.

That is usually unnecessary. The orchestrator benefits from the strongest model because it has to design the workflow, split the task, manage the agents, and synthesize the result. But many verification tasks are narrower: check this one issue, inspect the evidence, and decide whether it holds up. For that kind of focused work, a cheaper model is often enough.

Therefore, in the next test, I switched the worker agents to Sonnet. The goal was not to make the workflow weaker. It was to keep Opus where it mattered most — orchestration and synthesis — while using a cheaper model for the repeated verification work.

Figure 8: Finder agents result. *Image by author*

3.4.2 Cheaper model for agents

The second try uses Sonnet for the worker agents and Opus as orchestrator and synthesizer.

Prompt: audit the repo with a workflow: fanout finders and verify each finding, synthesize a severity-ranked report. use 200k token. Use Sonnet for all agents and Opus as orchestrator and synthesizer

In Figure 9, Claude set up 7 finder agents on Sonnet 4.6. They used 254k tokens and found 71 candidate issues in 5 minutes 17 seconds. In this run, the Sonnet finders took noticeably longer than the Opus finders did.

Figure 9: Finder agents with Sonnet. *Image by author*

You can check verification details of each issue in the workflows window as in Figure 10.

Figure 10: Verification window. Image by author

Verifying the 71 issues consumed roughly 1.5 million tokens. That costs much less than running the same work on Opus, but the finder agents ran significantly longer.

Here is the result of the synthesizer (Opus 4.8) in Figure 11.

Figure 11: Synthesizer result. Image by author

Most importantly, read the report it produced, then review and revise it before putting Claude to work on the code.

The finder agents still flagged several issues that the verifier agents later confirmed as real. However, those issues are inherent to how the app is designed: they are intentional, so flagging them only creates more checking work for us. I therefore want to add constraints to the workflow before it runs so that these issues are not picked up during scanning.

3.4.3 Revise the workflow before running

Figure 12: Claude stopped after providing the workflow script to amend. *Image by author*

Perfect. Claude hands me the workflow script to review and revise, and afterwards I start it simply by telling Claude to run the workflow (Figure 12).

I used a shorter codebase and simpler prompt to demonstrate the components of the JavaScript workflow file in Figure 13.

Figure 13. Fan-out-and-synthesize — walked through line by line, then run (4 agents, ~262k tokens, ~13s). *Image by author*

For my testing codebase, here is the scope that I want to revise:

{
    key: 'correctness',
    prompt: `Audit for CORRECTNESS / LOGIC bugs. Focus: the deterministic date-based daily pick, shuffle behavior, the "last 5 worn excluded" history logic (off-by-one, wraparound, per-wardrobe isolation), wardrobe-gender switching, 2-piece/3-piece filter, theme auto-switch by hour (6am-6pm boundaries), localStorage key handling. Trace edge cases (empty male wardrobe, all outfits recently worn). Read app.js and collection.js.`,
},
{
    key: 'docs-accuracy',
    prompt: `Audit DOCUMENTATION ACCURACY. Compare README.md and docs/*.md claims against actual code behavior. Focus: features described that don't match implementation, wrong localStorage keys, stale config, deployment steps that won't work, outdated counts ("all 40 outfits"). Read README.md, docs/codebase-summary.md, docs/deployment-guide.md, then verify against the code.`,
},

I removed “shuffle behavior”, “theme auto-switch by hour (6am-6pm boundaries)”, the sentence “Trace edge cases (empty male wardrobe, all outfits recently worn)”, and the entire 'docs-accuracy' finder. I also checked the rest of the JS file to make sure nothing else referenced the removed points.

You can also ask Claude to make these exclusions, but the change is simple, so I prefer to do it myself.

So the number of aspects the finder agents cover drops from 7 to 6, and one of them has a narrower scope (Figure 14).

Figure 14: Workflow running process. *Image by author*
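The same scope edit can be expressed as a small transformation, which is handy if the exclusion list grows. This is a sketch under assumptions: the real script stores each finder's focus inside a prompt string, while here I use a hypothetical `focus` array to keep the filtering readable, and the 'security' entry is invented to show that untouched finders pass through unchanged.

```javascript
// Hypothetical, shortened finder list. The `focus` array is an assumption;
// the actual workflow script keeps these topics inside a prompt string.
const finders = [
  {
    key: "correctness",
    focus: ["daily pick", "shuffle behavior", "history logic", "theme auto-switch by hour"],
  },
  { key: "docs-accuracy", focus: ["README claims"] },
  { key: "security", focus: ["localStorage key handling"] },
];

// Drop whole finders by key, and drop individual topics from the rest.
function reviseScope(list, dropKeys, dropTopics) {
  return list
    .filter((f) => !dropKeys.includes(f.key))
    .map((f) => ({ ...f, focus: f.focus.filter((t) => !dropTopics.includes(t)) }));
}

const revised = reviseScope(
  finders,
  ["docs-accuracy"],
  ["shuffle behavior", "theme auto-switch by hour"]
);

// Double-check that nothing excluded survives anywhere in the result.
const leftovers =
  JSON.stringify(revised).match(/shuffle|theme auto-switch|docs-accuracy/g) ?? [];

console.log(revised.map((f) => f.key), "leftovers:", leftovers.length);
```

The final check mirrors the manual step of searching the rest of the file for the removed points.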

The six finder agents found 44 distinct candidate issues, and the verifiers confirmed 40 of them. The whole process used 51 agents, took 9 minutes and 52 seconds, and consumed ~1.66 million tokens.

3.4.4 Compare to a single agent running

I ran the same codebase with a single agent in one pass: no team, no verification. It found 47 issues, more than the workflow’s 44, using a third of the tokens. However, because it ran no verification, those 47 include the same 2 wrong findings that the workflow’s verifier agents had caught and removed. The chart below shows the differences (Figure 15).

Figure 15: Comparison of single agent and workflow. *Image by author* *with help from* *ChatGPT*

If you care most about raw coverage and don’t mind reviewing the findings yourself, the single agent is the more economical choice, at some cost in quality.

4. When to use a workflow

Dynamic workflows use a lot more tokens than a normal Claude Code session. That’s because they run several sub-agents in the background, and each one works in its own separate context window. So you shouldn’t use them for every task. If you do, you can burn through your plan in just a few hours. The better approach is to use them only when the task truly needs multiple agents working in parallel. A few key signals, shown in Figure 16, can help you decide when a workflow is worth using.

Figure 16: When to use Dynamic Workflow. *Image by author* *with help from* *ChatGPT*

The first is that the task can be split into independent pieces. If each agent depends on another agent’s output, they mostly end up waiting for each other. At that point, there is not much value in starting a workflow, because you lose the main benefit: parallel work. The less the tasks depend on one another, the more useful the workflow becomes. You get better parallelism, and the results come back faster.

The second signal is whether the task is large enough to need more than one context window. Workflows run multiple sub-agents, and each sub-agent has its own fresh context window. That only makes sense when the task is big enough to benefit from being divided into chunks. Otherwise, you are just spending extra time and tokens for no real gain. This is also useful because each sub-agent returns only its final result. Its detailed reasoning stays inside its own working file and does not enter the main context window unless you ask for it. That keeps the main conversation cleaner and leaves more room for the final synthesis.

The next signal is whether the task needs verification. In some cases, a wrong answer is expensive. You do not want to move forward based on a weak security finding, a false bug report, or a risky migration plan. For tasks like that, it can be worth using extra agents to cross-check the result before you trust it. But verification is not free. More agents mean more tokens and more time. So the task should actually deserve that level of checking. Do not spawn five agents just because you recently heard an AI tech CEO say that more tokens means more money.

The last signal is whether the task is deterministic. A workflow uses code to call agents in a fixed structure. So if the task has a clear shape and can be broken into known steps, a workflow works well. But if the task needs an agent to decide what to do next during runtime, then a workflow is probably not the right tool.

A useful way to think about this is whether the task is wide or deep. A wide task can be split into many smaller tasks that run at the same time. That is where workflows shine. They call multiple agents in parallel, let each one work on its own part, and then bring the results together. A deep task moves step by step. Each step depends on what happened before it. For that kind of task, the /goal command is usually a better fit. It takes one task at a time and keeps moving forward, instead of trying to run many things in parallel.
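The four signals can be written down as a checklist you run against a task before reaching for a workflow. This is only a sketch: the field names are my own, and the article does not prescribe a scoring rule, so the function reports which signals are missing and leaves the decision to you.

```javascript
// ASSUMPTION: field names are illustrative; the article names the four
// signals but does not prescribe how to weigh them.
const SIGNALS = {
  independentPieces: "can be split into independent pieces",
  exceedsOneContext: "is too large for one context window",
  wrongAnswerIsExpensive: "needs verification",
  knownShape: "has a clear, fixed shape",
};

function assess(task) {
  const missing = Object.keys(SIGNALS).filter((k) => !task[k]);
  // Wide tasks (independent pieces, known shape) suit a workflow;
  // deep tasks, where each step depends on the last, suit /goal.
  const shape = task.independentPieces && task.knownShape ? "wide" : "deep";
  return { shape, missing: missing.map((k) => SIGNALS[k]) };
}

console.log(assess({
  independentPieces: true,
  exceedsOneContext: true,
  wrongAnswerIsExpensive: true,
  knownShape: true,
}));
console.log(assess({ independentPieces: false, knownShape: false }));
```

An empty `missing` list is the strong case for a workflow; a long one suggests a single session will be cheaper.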

5. Can we use Dynamic Workflow economically?

Dynamic Workflows are expensive, so I wanted to test whether the cheapest model, Haiku, can cut tokens and cost. We cannot change the orchestrator and synthesizer; they must be Opus, and that is non-negotiable. So let’s switch the subagents to Haiku.

Surprisingly, the workflow finished in ~7.5 minutes. It used 37 agents and 1.35 million tokens. It found 23 candidate issues, far fewer than the Sonnet run above, and all 23 survived verification.

But the cost story was not as simple as “cheaper model, cheaper workflow.” Haiku found only 23 issues with 1.35 million tokens. The Sonnet version found 40 issues with 1.66 million tokens. So even though Haiku is cheaper per token, the token efficiency was worse. It needed more turns to do the same kind of analytical work, and every extra turn meant re-reading more context. The lesson is simple: a smaller model is not automatically cheaper in practice. If it takes more steps to think through the task, it can burn through its price advantage very quickly.
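The efficiency gap is easier to see per confirmed issue. The sketch below only divides the figures reported above (both token counts are approximate); it makes no assumption about prices.

```javascript
// Figures from the two runs above; token counts are approximate.
const runs = {
  sonnet: { tokens: 1_660_000, confirmed: 40 },
  haiku: { tokens: 1_350_000, confirmed: 23 },
};

function tokensPerIssue({ tokens, confirmed }) {
  return tokens / confirmed;
}

const sonnetPerIssue = tokensPerIssue(runs.sonnet);
const haikuPerIssue = tokensPerIssue(runs.haiku);
const ratio = haikuPerIssue / sonnetPerIssue;

console.log(`Sonnet: ${Math.round(sonnetPerIssue)} tokens per confirmed issue`);
console.log(`Haiku:  ${Math.round(haikuPerIssue)} tokens per confirmed issue`);
console.log(`Haiku / Sonnet: ${ratio.toFixed(2)}x`);
```

Haiku used fewer tokens in total, but each confirmed issue cost it roughly 1.4 times as many tokens as it cost Sonnet.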

Haiku costs roughly one-third as much as Sonnet per token. On paper, that looks like an easy win. But in this test, Haiku used about 1.4 times as many tokens per confirmed issue (roughly 59k versus 42k). Those two numbers pull in opposite directions. In the end, by my estimate, the Haiku fan-out came out at roughly the same cost as Sonnet, maybe around 10% cheaper, and only slightly faster in real time. So “just route everything to the smallest model” is not a reliable rule. A smaller model can lose its price advantage if it needs more tokens to get the job done.

One more note about quality, which I think is important. Fourteen issues appeared in both versions. That overlap was larger than I expected, and it suggests the agents were doing useful work even though they were isolated from each other. However, there were also 2 issues where the two versions disagreed. Surprisingly, Haiku was right on both, while Sonnet was wrong. This does not show which model is better; it shows that a model does not perform with 100% consistency. One reason is that I gave Claude a vague, broad prompt. So next I test with a more specific focus.

New prompt: audit the repo in term of security vulnerabilities, including secrets, auth, injection, dependencies, data handling, with a workflow: fanout finders and verify each finding, synthesize a severity-ranked report. use 200k token. Use Haiku for all agents and Opus as orchestrator and synthesizer. Write the workflow and give me the link to access and revise it before running.

Here is how the Haiku run went:

15 agents, ~572k subagent tokens, ~3.5 min wall-clock
5 Haiku finders → Haiku adversarial verifiers → Opus synthesizer
9 raw findings → 3 confirmed, 6 refuted. All “high” ratings were removed.

And for Sonnet agents:

23 agents, ~1.3M tokens across both passes, ~2.51 min
5 Sonnet finders → Sonnet adversarial verifiers → Opus synthesizer
18 raw findings → 13 confirmed, 5 rejected → deduped to 8 distinct issues. No critical/high survived adversarial verification.

One important detail: all 3 issues confirmed in the Haiku run were also found in the Sonnet run. That is more consistent than the previous run. One possible reason is that this time the prompt gave the agents a specific angle to investigate, instead of asking them to look at the whole system from a broad view. That makes sense. The workflow used 5 agents, and each agent focused only on one aspect of security. Because the scope was narrower, the agents could dig deeper into the same type of problem instead of spreading their attention across too many possible issue categories. When an agent isn’t forced to prioritize across a wide surface area, it naturally spends more of its reasoning budget on the specific problem it was handed — and that leads to more thorough, reproducible findings.

So even if you’re using Dynamic Workflows with isolated subagents, your prompt still needs to be as specific as possible. Narrower prompts reduce run-to-run variance and push agents toward the same conclusions, which is exactly what you want when consistency and reliability matter.

6. Keep the good ones after the run

A useful saved workflow should feel like project automation, not like a transcript of one lucky run. It should be clean enough that another teammate can open it and quickly understand: who owns it, what inputs it expects, which tools it is allowed to use, what each sub-agent is responsible for, and what level of proof is required before the workflow can call the task done.
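One lightweight way to meet that bar is a metadata header at the top of the saved script, plus a check that nothing is left blank. This is purely a documentation convention I am suggesting: every field name and value below is hypothetical, and nothing here is read by the workflow runtime.

```javascript
// ASSUMPTION: this header is a documentation convention, not a feature of
// the workflow runtime. All field names and values are hypothetical.
const WORKFLOW_META = {
  owner: "platform-team",
  inputs: ["repoPath"],
  allowedTools: ["read-only file access"],
  agents: {
    finder: "one per audit dimension",
    verifier: "one per candidate finding",
    synthesizer: "ranks confirmed findings by severity",
  },
  doneWhen: "every candidate finding has a verifier verdict",
};

// Returns the names of required fields that are absent or empty.
function missingFields(meta) {
  const required = ["owner", "inputs", "allowedTools", "agents", "doneWhen"];
  return required.filter((key) => {
    const value = meta[key];
    if (value === undefined || value === null) return true;
    if (typeof value === "string" || Array.isArray(value)) return value.length === 0;
    return Object.keys(value).length === 0; // plain object such as `agents`
  });
}

console.log(missingFields(WORKFLOW_META)); // prints []
```

A teammate who opens the file then sees ownership, inputs, permissions, responsibilities, and the completion rule before reading any orchestration code.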

If the workflow worked well and you want to reuse it, press s in the workflow menu to save it to ~/.claude/workflows. You can also move the script into a skill if the goal is to share the method with your team and make it easier to reuse across similar tasks.

But don’t save a workflow just because the first run succeeded. A successful run only proves that it worked once. Save it when the orchestration itself is valuable: when the script is easier to inspect, reuse, and improve than writing a normal Claude Code prompt again from scratch.

Below are some prompt suggestions for reference. Add your own details when you use one of them:

Stress-test a plan: “Take the plan below and run a workflow where separate agents tear it apart — a skeptical investor, a hard-to-please customer, an incumbent competitor — each independent. Then synthesize the three sharpest objections and the strongest answer to each.”

Audit a repo: “Run a workflow to audit this repository. Fan out agents for logic bugs, unsafe routes, weak auth, missing authorization, exposed secrets, risky dependencies, and data leaks. For each finding, spawn a separate agent to adversarially verify it — try to prove it’s not real. Synthesize a severity-ranked report with file paths and fixes. use 200k tokens.”

Make it cheap: “Build it so the finder agents run on model: 'haiku' while the orchestrator stays on Opus 4.8 and does the final synthesis. Report tokens and wall-clock time.”

Reproduce a flaky test: “This test fails maybe 1 in 50 runs. Set up a workflow to reproduce it — form theories and adversarially test them in worktrees. /goal don’t stop until one theory works.”

Verify a draft: “Go through this draft and use a workflow to verify every technical claim against the codebase and sources. I don’t want to ship anything wrong.”

Rank by real priority (tournament): “I have a list of findings/options. Use a workflow to rank them by [real exploitability / impact / whatever matters] — but instead of scoring each one, run a pairwise tournament and rank by who wins. Then show me the top three and why.”

Root-cause a heisenbug: “This bug is intermittent and the obvious cause looks wrong. Use a workflow: split the investigation by evidence — one agent on the symptoms, one on the code, one on the data/logs — then have separate agents try to refute each theory, and synthesize the cause that survives.”

Triage a backlog safely: “Use a workflow to triage this backlog: classify each item (fix-now / escalate / needs-a-decision), dedupe into families, and route. Anything that reads untrusted input must be read-only — keep it separate from whatever proposes changes.”

Route by task shape: “Use a workflow with a classifier that looks at each task and routes it to the cheapest capable model — small models for mechanical work, Opus for the ambiguous, security-critical reasoning — then runs each on its chosen model.”

Check house rules: “Use a workflow to check this code against our rules in CLAUDE.md — one verifier per rule, plus a skeptic that hunts for false positives. I care more about not crying wolf than about catching every nit.”

Sources

Thariq Shihipar & Sid Bidasaria (Anthropic), “A harness for every task: dynamic workflows in Claude Code” (the why, the patterns, the prompting tips, save/share).
Factory.ai, “The Context Window Problem: Scaling Agents Beyond Token Limits”.
Engineering at Anthropic, “Effective context engineering for AI agents”.
Chroma Technical Report, “Context Rot: How Increasing Input Tokens Impacts LLM Performance”.
Anthropic, “Building effective agents” (background on the underlying orchestration patterns).
Anthropic, “Introducing dynamic workflows in Claude Code”.

A Harness for Every Task: Putting a Team of Claudes on One Job | Towards Data Science
