Your team adopted Cursor or Claude Code six months ago and your engineers say they’re more productive than they’ve ever been. Even standups are faster and the team is merging more pull requests per week than ever. Velocity is up.
Step back and look at the org from above. The headcount didn’t change and the roadmap didn’t double. Major launches still take roughly as long as they used to. Somewhere between “we’re shipping more code per engineer than ever” and “the business is moving faster,” the multiplier you expected got eaten.
This isn’t a measurement problem. It’s a level problem.
Dan Shapiro recently published the cleanest articulation of what’s going on. He mapped AI-assisted software development onto the NHTSA’s five levels of driving automation. The framework gives us a shared vocabulary for what’s actually changing as teams put AI to work on code, and for where almost everyone gets stuck.
The short version: most “AI-native” teams have climbed to level 2. Some have pushed into level 3. That’s where the climb stops. Almost everyone plateaus there, and the plateau is the entire reason the deflation you were promised hasn’t shown up in your roadmap.
We’ve been building Overcut with this climb specifically in mind. The teams we work with are trying to reach level 4 in a real, durable way, on real codebases, with real production constraints. It is still a big lift in 2026. This post is about what we’ve learned about why the plateau happens, what climbing past it actually takes, and where the road goes after that.
A quick recap of the levels
If you haven’t read Shapiro’s post, read it. It’s the right anchor and we won’t rebuild it here. The spine is worth holding in your head as you read on.
Level zero is manual coding with AI as a fancy search engine. Level one is the AI intern that writes your unit tests and docstrings. Level two is real pair-programming flow, where most “AI-native” developers live and where everyone feels at their most productive. Level three is the human-in-the-loop reviewer, where the agent runs in five tabs and your life becomes diffs. Level four is the PM mode, where you write a spec, leave for twelve hours, and come back to passing tests. Level five is the Dark Factory, where it isn’t really a software process anymore. It’s a black box that turns specs into software.
The line that matters most in Shapiro’s piece, and the one we keep coming back to with customers, is this: every level after two feels like you are done. You are not done.
Why the plateau happens
Three forces keep teams stuck at level 3, and they feed into each other. Wishing for progress or tweaking your prompts won’t get you through the ceiling.
The first is trust that doesn’t compound. Every agent run starts cold. The agent doesn’t remember that this codebase has a flaky integration test suite that fails twice a week for reasons unrelated to your change. It doesn’t remember that the last three times it tried to refactor the billing module it broke a webhook nobody told it about. It doesn’t remember that this team prefers small PRs split by concern, and that any diff over 400 lines gets bounced. Without compounding context, the agent makes the same class of mistake on Tuesday that it made on Monday, and there’s no safe operating mode except “review every diff.” That is the definition of level 3.
The second is verification that doesn’t scale. When the bottleneck moves from typing to reading, the math gets worse, not better. A human can write maybe 50 lines of code in an hour. The same human can review somewhere between 200 and 600 lines of code in an hour, depending on density and risk. So when agents start producing code at 5x or 10x human speed, the team becomes a queue feeding a single reviewer. The reviewer becomes the new bottleneck, and the agent’s throughput advantage collapses against it. To move past level 3, verification has to run faster than generation. Not as fast. Faster. With margin.
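The arithmetic behind that queue can be sketched in a few lines. The figures come straight from the paragraph above (50 lines written per hour, a 5x to 10x agent speedup, 200 to 600 lines reviewed per hour); the function names are ours and purely illustrative, and the model deliberately ignores everything except raw line counts.

```python
# Back-of-the-envelope model of the review bottleneck.
# Assumption: throughput is measured only in lines per hour, as in the text.

WRITE_LINES_PER_HOUR = 50  # human writing speed quoted in the text


def generated_lines_per_hour(agents: int, speedup: float) -> float:
    """Lines per hour produced by `agents` agents, each `speedup` times a human."""
    return agents * speedup * WRITE_LINES_PER_HOUR


def review_backlog_growth(agents: int, speedup: float,
                          review_rate: float, reviewers: int = 1) -> float:
    """Unreviewed lines added to the queue each hour (0.0 if review keeps up)."""
    produced = generated_lines_per_hour(agents, speedup)
    return max(0.0, produced - reviewers * review_rate)


if __name__ == "__main__":
    # One agent at 10x against a careful reviewer (200 lines/hour).
    print(review_backlog_growth(agents=1, speedup=10, review_rate=200))
    # Five agents at 10x against the fastest reviewer in the quoted range.
    print(review_backlog_growth(agents=5, speedup=10, review_rate=600))
```

Even at the optimistic end of the review range, a single reviewer only keeps pace with one agent. The moment a second agent starts, the queue grows every hour.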
The third is specs that aren’t load-bearing. To leave the room for twelve hours, the spec has to do the work that the back-and-forth conversation used to do. Most specs aren’t built for this. They’re wishlists with vibes, written for humans who already know the system. An agent reads a wish and produces a wish-shaped answer. A spec that an agent can execute against is closer to an integration test than a Notion doc. It defines what success looks like in terms the verifier can check, what the boundaries are, what’s explicitly out of scope. Most teams have never written one. The skill is rare, and the tools to help write them are newer.
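To make "closer to an integration test than a Notion doc" concrete, here is a minimal sketch of a spec a verifier could check mechanically. Everything in it (the `Spec` fields, the `violations` function, the example paths and check names) is hypothetical and chosen for illustration; it is one possible shape, not a standard format.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Spec:
    goal: str
    required_checks: tuple[str, ...]    # checks the verifier must report as passing
    in_scope: tuple[str, ...]           # glob patterns the change may touch
    out_of_scope: tuple[str, ...] = ()  # glob patterns that are explicitly off-limits


def violations(spec: Spec, changed_files: list[str],
               check_results: dict[str, bool]) -> list[str]:
    """Return every way a finished change fails the spec (empty list = success)."""
    problems = []
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in spec.out_of_scope):
            problems.append(f"out of scope: {path}")
        elif not any(fnmatch(path, pattern) for pattern in spec.in_scope):
            problems.append(f"outside declared boundary: {path}")
    for name in spec.required_checks:
        # A check that was never reported counts as failing.
        if not check_results.get(name, False):
            problems.append(f"check not passing: {name}")
    return problems


if __name__ == "__main__":
    spec = Spec(
        goal="Retry invoice export on timeout",
        required_checks=("export_retries_on_timeout",),
        in_scope=("exports/*",),
        out_of_scope=("billing/*",),
    )
    print(violations(spec, ["billing/webhook.py"], {}))
```

The point of the shape is that success, boundaries, and exclusions are all things a program can evaluate without asking the author what they meant.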
Notice how these three reinforce each other. Without compounding memory, you can’t trust the agent enough to skip review. Without scalable verification, you can’t replace review even if you trusted the agent. Without load-bearing specs, you can’t leave the room even if both other problems were solved. Teams that try to climb to level 4 by fixing only one of these slide back to level 3 within a week.
What climbing actually requires
We think of the climb to level 4 as six things that all have to be true at the same time. Miss any one and the team backs into level 3 without noticing. Our work at Overcut is built around making all six achievable for a real team on a real codebase, not just for a five-person greenfield startup operating at level 4 by sheer talent and luck.
Compounding memory. Memory that learns from outcomes, not just from observations. An agent that fails the same way three times should fail differently the fourth time. This is the difference between a tool you supervise and a teammate you trust. The mechanism matters as much as the existence of memory. Naively recording everything is worse than recording nothing, because stale or low-signal context points the agent confidently in the wrong direction. We built our memory system to be biased toward skepticism: a bad memory costs more than a missing one, and the system reflects that in how it weights what it surfaces. The result is memory that gets sharper, not noisier, over time.
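One way to picture a skepticism bias is an asymmetric score, where each contradiction outweighs several confirmations. The sketch below is an illustration of that idea only, not a description of Overcut's memory system; the `Memory` fields, the penalty of 3.0, and the surfacing threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    note: str
    confirmed: int = 0     # runs where acting on this note helped
    contradicted: int = 0  # runs where acting on this note hurt


def score(memory: Memory, penalty: float = 3.0) -> float:
    """Asymmetric score: one contradiction cancels `penalty` confirmations."""
    return memory.confirmed - penalty * memory.contradicted


def surface(memories: list[Memory], min_score: float = 1.0) -> list[Memory]:
    """Show the agent only memories that have earned trust, strongest first."""
    trusted = [m for m in memories if score(m) >= min_score]
    return sorted(trusted, key=score, reverse=True)


if __name__ == "__main__":
    memories = [
        Memory("integration suite is flaky; rerun before blaming the diff", 4, 0),
        Memory("billing refactors are safe without webhook tests", 3, 1),
        Memory("team prefers tabs", 0, 0),  # observed once, never validated
    ]
    for m in surface(memories):
        print(m.note)
```

Under these assumed weights, a note that was right three times and wrong once is withheld, and so is a note that was recorded but never validated. Missing context is treated as the cheaper failure.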
Verification that scales with output. Tests that run themselves. Critics that read a diff faster than a human can. Runtime checks that catch the bad output before a human sees it. Property-based and behavioral checks that say “this used to do X and now it does Y, is that intended.” The principle is that the cost of checking has to drop faster than the cost of generating, or the whole economy of the team inverts. This is the pillar where we are most opinionated about treating verification as a system, not as a habit.
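The "this used to do X and now it does Y" check can be sketched as a behavioral diff: run the old and new implementations over the same inputs and report only where they disagree. The helper and the two toy functions below are hypothetical. A production version would typically generate inputs with a property-based testing library rather than a hand-written list.

```python
from typing import Any, Callable, Iterable


def behavior_diff(old: Callable[[Any], Any], new: Callable[[Any], Any],
                  inputs: Iterable[Any]) -> list[tuple[Any, Any, Any]]:
    """Return (input, old_output, new_output) for every input where behavior changed."""
    changes = []
    for value in inputs:
        before, after = old(value), new(value)
        if before != after:
            changes.append((value, before, after))
    return changes


def dollars_truncated(cents: int) -> int:
    """Old behavior: drop the fractional dollars."""
    return cents // 100


def dollars_rounded(cents: int) -> int:
    """New behavior: round to the nearest dollar (Python rounds halves to even)."""
    return round(cents / 100)


if __name__ == "__main__":
    for value, before, after in behavior_diff(
            dollars_truncated, dollars_rounded, [0, 99, 100, 150, 250]):
        print(f"input {value}: was {before}, now {after}. Is that intended?")
```

The output is a short list of concrete behavior changes, which is a much smaller thing for a human to adjudicate than the diff that caused them.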
Specs that compile. Not specs in the document sense. Specs in the “if this is wrong, the system fails loudly and early” sense. A spec the agent can interpret, the verifier can check against, and a human can write in twenty minutes. We don’t think this means a new spec DSL the world has to learn. It means treating the existing artifacts a team already produces (tickets, acceptance criteria, design docs) as the raw material and giving the system enough structure around them that they become executable. The cost of writing a real spec has to come down for level 4 to be reachable for normal teams, not just exceptional ones.
Governance. Who is allowed to ship what, through which path, with what approvals. Level 4 does not mean the agent has the keys to production. It means the policies that used to live in someone’s head are encoded somewhere the agent reads from before it acts. Branch rules, deploy windows, who can approve a database migration, what touches PII, which services are off-limits without a human review. Without governance encoded as something the agent respects, security and platform teams correctly refuse to let agents touch anything that matters, and the level 4 dream dies in a procurement review.
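As a rough illustration of policy the agent reads before it acts, the sketch below encodes two of the rules mentioned above: paths that need human review and a deploy window. The `Policy` shape, the `decide` function, and the example paths and hours are assumptions made for the example, not a real product API.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Policy:
    human_review_paths: tuple[str, ...]  # glob patterns that always need a human
    deploy_hours_utc: range              # hours of the day when shipping is allowed


def decide(policy: Policy, changed_files: list[str],
           hour_utc: int, human_approved: bool) -> str:
    """Return 'allowed', or a reason the agent must stop and wait."""
    sensitive = [
        path for path in changed_files
        if any(fnmatch(path, pattern) for pattern in policy.human_review_paths)
    ]
    if sensitive and not human_approved:
        return "blocked: human approval required for " + ", ".join(sensitive)
    if hour_utc not in policy.deploy_hours_utc:
        return "queued: outside deploy window"
    return "allowed"


if __name__ == "__main__":
    policy = Policy(human_review_paths=("migrations/*", "pii/*"),
                    deploy_hours_utc=range(9, 17))
    print(decide(policy, ["api/handlers.py"], hour_utc=10, human_approved=False))
    print(decide(policy, ["migrations/0042_add_index.sql"], hour_utc=10,
                 human_approved=False))
```

The design choice that matters is the default: anything matching a sensitive pattern is blocked unless a human has explicitly approved it, so a missing approval can never be read as consent.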
Visibility and auditable logs. When an agent works for twelve hours and ships ten changes, you need to reconstruct, after the fact, what it did, why it did it, what it considered, what it skipped, and what it chose not to tell you. Not for compliance theater. For the engineer who has to debug the one run in fifty that went sideways. And for the next two months of improvement: every run is signal, but only if you can read it back. We treat the audit log as a first-class product surface, not as a compliance tax bolted on at the end.
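A minimal sketch of what "reconstruct after the fact" could look like: one structured record per agent decision, written as a JSON line and queryable by run. The field names and the `AuditEvent` type are hypothetical and only mirror the questions in the paragraph above (what it did, why, what it considered, what it skipped).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditEvent:
    run_id: str
    action: str   # what the agent did
    reason: str   # why it did it
    considered: list[str] = field(default_factory=list)  # alternatives it weighed
    skipped: list[str] = field(default_factory=list)     # work it chose not to do


def to_log_line(event: AuditEvent) -> str:
    """Serialize one event as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


def events_for_run(lines: list[str], run_id: str) -> list[dict]:
    """Read log lines back and keep only the events from one run."""
    return [event for event in map(json.loads, lines) if event["run_id"] == run_id]


if __name__ == "__main__":
    log = [
        to_log_line(AuditEvent("run-7", "skip_test", "suite marked flaky",
                               considered=["rerun"], skipped=["tests/integration"])),
        to_log_line(AuditEvent("run-8", "open_pr", "spec checks passed")),
    ]
    for event in events_for_run(log, "run-7"):
        print(event["action"], "because", event["reason"])
```

Recording `considered` and `skipped` alongside the action is what lets an engineer debug the one run in fifty that went sideways without rerunning it.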
Continuous improvement. The system has to get better between Tuesday and Friday, not between v1 and v2. Every run produces evidence about what went right, what went wrong, where verification missed, where memory misled. A platform that doesn’t pipe that signal back into memory, into evals, into routing decisions, into the verifier, is a platform that ages instead of compounding. This is the pillar that turns level 4 from a stunt into a durable operating mode. It is also the pillar that most teams underinvest in because the return looks small in any single week.
If you read those six and feel like the list is too long, that’s the right reaction. Getting one of them to a level that actually supports trust is hard. Getting all six aligned, on a real team, on a real codebase, is the work. It’s why level 4 isn’t a setting you flip on. It’s a posture an organization has to grow into.
Level 4 is the real fight. Level 5 is the direction.
Let’s be plain about where we are. Getting an organization to a durable level 4 is still a big lift in 2026. Not a victory lap. Most teams that think they’re at level 4 are running level-3 review cycles under a different name, or they’re at level 4 on greenfield work and level 2 on anything that touches a load-bearing service. The honest state of the art is that a few small teams operate at level 4 most of the time, and most large teams aspire to it. That’s the fight we’re in today. Almost everything Overcut ships is in service of getting real teams, on real codebases, to operate at level 4 without quietly slipping back down.
But we’re building with level 5 in mind.
The Dark Factory isn’t “no humans involved.” It’s humans involved at a different layer. Humans set goals, define constraints, pick what’s worth building, and adjudicate when the system asks for a tiebreaker. The factory turns those into software. The interesting work moves up the stack. What used to be a senior engineer’s day becomes a product leader’s day, and what used to be a product leader’s day becomes a founder’s day.
We’re not at level 5. We don’t know anyone who is fully at level 5 except, as Shapiro says, a handful of small teams doing the nearly-unbelievable. But the systems you need to make level 4 actually work at scale (compounding memory, verification that scales, specs that compile, governance, audit, continuous improvement) are the same systems level 5 sits on top of. What turns level 5 from a vision into a deployment is a further set of capabilities built on that foundation. Multi-agent coordination across services. Cross-codebase reasoning that respects the boundaries between owned and shared code. Agents that author their own specs from product signals and then negotiate with each other about scope. None of these are science fiction. All of them depend on the level 4 foundation being real.
We’re heading there.
What we still can’t do
A post that only talks about what works isn’t worth reading. Three things we can’t claim yet.
Long-running agent trust over weeks, not hours, is hard to evaluate. A team that operates at level 4 for a sprint may or may not still be at level 4 after a quarter, and the things that cause regressions (drift in the codebase, drift in the team’s conventions, a new service that the agent hasn’t seen) are different from the things that cause failure in any single run. We have good signal on per-run reliability and weaker signal on quarter-over-quarter trust. We’re investing here, but we don’t want to claim more than we can prove.
Spec authoring is still mostly manual. The pillar above, “specs that compile,” describes what we want. The work to get from a Notion doc to an executable spec is partially tooled and partially craft. The best teams we work with treat spec authoring as a senior skill, not a junior chore. We don’t have a clean answer yet for organizations whose specs are still wishlists, and we’re cautious about teams that want to skip this part. They are the teams most at risk of believing they’re at level 4 when they’re really at level 3 with extra steps.
The framework itself has edges. Shapiro’s five levels are sharp for individual contributors and small teams. They’re less sharp for coordinated work across many agents, work that spans code and non-code surfaces, and work where the spec itself emerges from a conversation between product, design, and engineering. We don’t know yet whether level 5 for those kinds of work looks the same as level 5 for a single agent shipping into a single repo. We suspect it doesn’t. That’s open territory and we’re working in it.
The climb generalizes
Software is the canary. The five levels were articulated for code first because code is where the feedback loops are tightest and the verification surfaces are richest. Tests run in seconds. Type systems catch whole classes of error. Production tells you, quickly, when you’re wrong. None of those things are as crisp for other kinds of knowledge work, which is why the AI-assisted version of code is further ahead than the AI-assisted version of legal review or operations or product analytics.
But the same five levels apply to anything where the output can be specified and checked. The plateau looks the same. The forces that hold teams at level 3 look the same. The six pillars that get a team to level 4 look the same, with the names changed. The teams that learn to climb on code will climb fastest on everything else, because the muscle they built (treating verification as a system, treating memory as compounding, treating governance as code) is the muscle every other domain will need next.
The Dark Factory is coming. Maybe next year, maybe later, and not by accident. It will come for the teams that did the unglamorous work of getting to a durable level 4 first. We are on a mission to help more teams be those teams.