
Research Note · March 2026

Five Hard Problems for Next-Gen Agents

Foundation models are getting faster, cheaper, and more capable every quarter. But some problems don't yield to better models. They are structural, mathematical, and will define the constraints of autonomous agent systems for years to come.


The premise

In 6–12 months, we expect foundation models to be 3–5x cheaper, 2–3x faster, with context windows approaching 1M tokens and near-perfect tool use. Single-agent capability will be effectively solved.

This makes the interesting question not "how do we make agents smarter?" but "what breaks when smart agents operate at scale?"

We've identified five fundamental problems that model improvements alone cannot solve. These are the problems that next-generation agent infrastructure must address.


Problem 1

The Entropy Problem

Every change increases disorder unless an active force counteracts it.

This is the second law of thermodynamics applied to software. Every commit adds code, dependencies, and interactions between components. Complexity grows superlinearly with codebase size.

Agents make this dramatically worse. A team of 10 agents producing 200 commits per day generates code faster than any human or agent can comprehend the whole. Each individual commit may be correct. The system as a whole decays.

Longer context windows don't help. You can fit more code in the window, but understanding complexity is not the same as seeing text. A 1M-token context window holding a 1M-token codebase still can't reason about all pairwise interactions between components.

Why brute force fails

You cannot solve entropy by throwing compute at it. Complexity is a property of the system's structure, not a resource problem. Adding more agents to manage complexity adds more complexity. The only solution is architecture that actively reduces entropy — agents that simplify, delete, and refactor, not just generate.

The research question

Can we build entropy-aware agent systems where every change must either prove it doesn't increase system complexity, or explicitly declare an entropy debt?
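One way to make this question concrete is a gate in the merge pipeline that compares a crude complexity score before and after a change. The sketch below is purely illustrative: the score (a dependency-edge count) and the `entropy_gate` / `declared_debt` mechanism are assumptions for the sake of the example, not a real metric or tool.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    imports: set = field(default_factory=set)  # modules this one depends on

def complexity_score(modules: list) -> int:
    """Crude proxy for system entropy: total dependency edges.
    Real metrics (coupling, churn, cyclomatic depth) would be richer."""
    return sum(len(m.imports) for m in modules)

def entropy_gate(before: list, after: list, declared_debt: int = 0) -> bool:
    """Accept a change only if it does not raise complexity,
    or if the increase is covered by an explicitly declared debt."""
    delta = complexity_score(after) - complexity_score(before)
    return delta <= declared_debt

# Example: a change that introduces one new dependency edge.
before = [Module("api", {"db"}), Module("db")]
after = [Module("api", {"db", "cache"}), Module("db"), Module("cache")]
print(entropy_gate(before, after))                    # -> False: undeclared entropy increase
print(entropy_gate(before, after, declared_debt=1))   # -> True: debt explicitly declared
```

The point of the sketch is the shape of the contract, not the metric: any change either proves non-increase or leaves an auditable debt record behind.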


Problem 2

The Verification Asymmetry

Generation is O(n). Verification is O(2^n).

Writing a function takes seconds. Proving it handles all edge cases requires exploring a combinatorial space of inputs, states, and interactions. This asymmetry is baked into computation itself (showing that no bad input exists is co-NP-hard in general, and undecidable for unbounded programs), and no model improvement changes it.

Today's "verification" is sampling-based: run some tests, check some cases. This catches known failure modes. But agent-generated code introduces bugs in unknown scenarios — the ones not covered by existing tests, precisely because no human anticipated them.

As agents produce more code, the verification gap widens. Human review already can't keep up. A team of agents generating 50 PRs per day makes manual review fiction. But automated review (another agent reading the diff) is just another form of sampling, not proof.

Why brute force fails

More compute lets you sample more cases, but the space is exponential: doubling your test budget doubles a coverage fraction that was already negligible. You can't brute-force your way to mathematical confidence.
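A back-of-the-envelope calculation makes the exponential point concrete. Assume, purely for illustration, a function whose behavior depends on n independent boolean conditions, so its input space has 2^n points:

```python
def coverage(n_inputs: int, samples: int) -> float:
    """Fraction of a 2^n boolean input space covered by `samples` distinct tests."""
    space = 2 ** n_inputs
    return min(samples, space) / space

# A generous budget of one million distinct test cases:
print(f"{coverage(20, 1_000_000):.2%}")   # ~95% of a 20-bit space
print(f"{coverage(40, 1_000_000):.6%}")   # under a millionth of a 40-bit space
print(f"{coverage(60, 1_000_000):.2e}")   # effectively zero at 60 bits
```

Adding twenty boolean conditions costs a factor of a million in coverage; no sampling budget keeps up.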

The research question

What is the practical verification frontier — the boundary between properties that machines can prove and properties that require human judgment? How do we maximize the machine-provable surface, and honestly flag everything beyond it?


Problem 3

The Coordination Tax

N agents have O(N²) potential conflicts.

This is Brooks' Law for agents. When multiple agents work on the same codebase, every pair can potentially create a conflict — not just textual merge conflicts (git handles those), but semantic conflicts where two changes are individually correct but combined produce a bug.

  5 agents  →     10 potential conflicts → manageable
 20 agents  →    190 potential conflicts → fragile
 50 agents  →  1,225 potential conflicts → unmanageable
100 agents  →  4,950 potential conflicts → impossible
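The table above is just the pairwise-count formula N(N-1)/2, which grows quadratically:

```python
def potential_conflicts(n_agents: int) -> int:
    """Number of agent pairs; each pair is a potential semantic conflict."""
    return n_agents * (n_agents - 1) // 2

for n in (5, 20, 50, 100):
    print(f"{n:>3} agents -> {potential_conflicts(n):>5,} potential conflicts")
```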

Current solutions use a central orchestrator that assigns non-overlapping tasks. This works at small scale but becomes a bottleneck — the orchestrator itself must understand the full codebase to avoid assigning conflicting work.

Why brute force might partially work

Unlike the first two problems, coordination might yield to brute force. If one superintelligent agent with a 10M-token context can hold the entire codebase and orchestrate all others, the N² problem reduces to a single-point bottleneck. Whether that bottleneck scales depends on model capability curves.

The research question

Is there a coordination architecture — inspired by distributed systems (consensus protocols, CRDTs, partition tolerance) — that achieves O(N log N) coordination cost instead of O(N²)?
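As a rough cost comparison, consider a hypothetical hierarchical scheme in which agents are grouped into small teams coordinated up a tree (an assumption for illustration, not an established protocol). That would replace all-pairs coordination with roughly N log N interactions:

```python
import math

def pairwise_cost(n: int) -> int:
    """All-to-all coordination: every pair of agents may conflict."""
    return n * (n - 1) // 2

def tree_cost(n: int, fanout: int = 5) -> int:
    """Hypothetical hierarchy: each agent coordinates within its team
    and up one level; cost scales like N * log_fanout(N)."""
    return math.ceil(n * math.log(max(n, 2), fanout))

for n in (10, 100, 1000):
    print(f"{n:>4} agents: pairwise {pairwise_cost(n):>7,} vs tree {tree_cost(n):>6,}")
```

At 1,000 agents the quadratic scheme implies roughly half a million pairwise interactions; the tree-shaped one stays in the low thousands. Whether any real protocol achieves that bound is exactly the open question.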


Problem 4

The Context Boundary

Information degrades at every boundary: sessions, agents, projects, time.

When an agent session ends, the reasoning behind its decisions is lost. The next session sees the code but not the why. When Agent A modifies a file, Agent B sees the diff but not A's intent. When a lesson is learned in Project X, Project Y re-discovers it from scratch.

What's lost isn't text — it's semantic context. The decision rationale, the alternatives considered and rejected, the constraints that shaped the solution. Code is a lossy compression of intent.

Why brute force might partially work

Larger context windows help. If you can fit every commit message, every PR discussion, every design document into context, you recover some signal. But context windows solve the storage problem, not the retrieval problem. Having 1M tokens of project history in context doesn't mean the agent will find the one paragraph that explains why a particular design choice was made.

The research question

How do we design agent communication protocols that transmit intent + evidence + constraints instead of just code diffs? What is the right unit of persistent knowledge for agent systems?
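One candidate unit of persistent knowledge is a structured handoff record that travels with the diff and carries intent, evidence, and constraints explicitly. The schema below is a hypothetical sketch (field names and example values are invented), not a proposed standard:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Handoff:
    """Hypothetical agent-to-agent handoff: the diff plus the 'why'."""
    intent: str                                        # what the change tries to achieve
    evidence: list = field(default_factory=list)       # tests run, benchmarks, logs
    constraints: list = field(default_factory=list)    # invariants the change must preserve
    rejected_alternatives: list = field(default_factory=list)  # paths considered and why not
    diff_ref: str = ""                                 # pointer to the code change itself

msg = Handoff(
    intent="Cache session lookups to cut p99 latency",
    evidence=["load test: p99 180ms -> 40ms"],
    constraints=["cache TTL must not exceed session lifetime"],
    rejected_alternatives=["denormalizing the sessions table: too invasive"],
    diff_ref="commit abc123",
)
print(json.dumps(asdict(msg), indent=2))
```

The design choice worth noting: the record keeps rejected alternatives, because "what we decided not to do, and why" is precisely the semantic context that a diff discards.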


Problem 5

The Intent Gap

The hardest part of software was never "how to write it" — it's "what to write."

As models get more capable, you can give vaguer instructions and get working code. This feels like progress — but it makes the intent gap worse.

When the agent "guesses" your intent correctly, it's magic. When it guesses wrong, you don't know what went wrong, because your intent was never precise to begin with. The spec was in your head, unexamined, and the code the agent produced is a confident implementation of the wrong thing.

This problem gets harder with scale. One agent misinterpreting intent produces a bad PR. Fifty agents misinterpreting a vague directive produce a codebase that confidently solves the wrong problem from fifty different angles.

Why brute force fails

This isn't a capability problem. The model could be infinitely smart and still couldn't execute intent that doesn't exist in precise form. Better models make it easier to skip the hard work of specifying what you actually want — which makes the problem worse, not better.

The research question

Can agents serve as intent crystallizers — not just executing vague instructions, but helping humans discover and formalize what they actually mean through dialogue, prototypes, and counterexamples?


Where we focus

Not all five problems are equally tractable for a small research lab. We evaluate each on three axes: whether brute-force scaling can solve it, whether we have a credible research angle, and whether it's urgent now versus years out.

Core: Entropy + Verification

These two are immune to brute force. No amount of compute turns O(2^n) into O(n). No context window makes complexity disappear. They are the most urgent — already visible in our own experiments with multi-agent development. We focus here.

Active monitoring: Coordination + Context

Both are real, but both might yield to raw model capability gains. If 12 months from now a single agent can hold an entire codebase in context and orchestrate 100 others, these problems transform. We research them but don't bet the lab on them.

Adjacent: Intent

Important, but fundamentally an HCI problem. Foundation model labs are investing billions in making intent specification easier. We don't compete here. We watch and build on their progress.


Evidence from our experiments

These aren't theoretical concerns. Over 22 rounds of multi-agent development on our research platform cc-manager (107 tasks, 91 successes), we've directly observed all five.

The next phase of our research translates these observations into reusable infrastructure — starting with entropy measurement and verification protocols.


These are the problems we believe will define the next generation of autonomous agent systems. Model improvements will raise the floor, but these structural constraints will remain the ceiling.

Follow our research on GitHub →