
A quick recap of the levels
If you haven’t read Shapiro’s post, read it. It’s the right anchor and we won’t rebuild it here. The spine is worth holding in your head as you read on. Level zero is manual coding with AI as a fancy search engine. Level one is the AI intern that writes your unit tests and docstrings. Level two is real pair-programming flow, where most “AI-native” developers live and where everyone feels at their most productive. Level three is the human-in-the-loop reviewer, where the agent runs in five tabs and your life becomes diffs. Level four is the PM mode, where you write a spec, leave for twelve hours, and come back to passing tests. Level five is the Dark Factory, where it isn’t really a software process anymore. It’s a black box that turns specs into software.

The line that matters most in Shapiro’s piece, and the one we keep coming back to with customers, is this: every level after two feels like you are done. You are not done.

Why the plateau happens
Three forces keep teams stuck at level 3, and they feed into each other. Wishing for progress or tweaking your prompts won’t get you through the ceiling.

The first is trust that doesn’t compound. Every agent run starts cold. The agent doesn’t remember that this codebase has a flaky integration test suite that fails twice a week for reasons unrelated to your change. It doesn’t remember that the last three times it tried to refactor the billing module it broke a webhook nobody told it about. It doesn’t remember that this team prefers small PRs split by concern, and that any diff over 400 lines gets bounced. Without compounding context, the agent makes the same class of mistake on Tuesday that it made on Monday, and there’s no safe operating mode except “review every diff.” That is the definition of level 3.

The second is verification that doesn’t scale. When the bottleneck moves from typing to reading, the math gets worse, not better. A human can write maybe 50 lines of code in an hour. The same human can review somewhere between 200 and 600 lines of code in an hour, depending on density and risk. At 5x or 10x human speed, a single agent produces roughly 250 to 500 lines an hour, which is about everything one reviewer can read in that hour. So when agents start producing code at that rate, the team becomes a queue feeding a single reviewer. The reviewer becomes the new bottleneck, and the agent’s throughput advantage collapses against it. To move past level 3, verification has to run faster than generation. Not as fast. Faster. With margin.

The third is specs that aren’t load-bearing. To leave the room for twelve hours, the spec has to do the work that the back-and-forth conversation used to do. Most specs aren’t built for this. They’re wishlists with vibes, written for humans who already know the system. An agent reads a wish and produces a wish-shaped answer. A spec that an agent can execute against is closer to an integration test than a Notion doc. It defines what success looks like in terms the verifier can check, what the boundaries are, what’s explicitly out of scope.
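
To make that concrete, here is a minimal sketch of a spec written as something a verifier could check rather than as prose. Everything in it is an assumption for illustration: the names `AgentSpec`, `ProposedChange`, and `violations`, the file paths, and the stand-in check are ours and don’t come from any real tool. The 400-line limit simply reuses the PR-size preference mentioned above.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import Callable


@dataclass
class ProposedChange:
    """Hypothetical summary of what an agent hands back."""
    files: list[str]
    diff_lines: int


@dataclass
class AgentSpec:
    """Hypothetical spec with the three parts the text describes."""
    goal: str
    # Success: named checks a verifier can run, each returning True or False.
    success_checks: dict[str, Callable[[], bool]]
    # Boundaries: where the agent may write and how large a diff may be.
    allowed_paths: list[str]
    max_diff_lines: int
    # Explicitly out of scope: path patterns the agent must not touch.
    out_of_scope: list[str] = field(default_factory=list)


def violations(spec: AgentSpec, change: ProposedChange) -> list[str]:
    """Return every way the change fails the spec; an empty list means it passes."""
    problems = []
    if change.diff_lines > spec.max_diff_lines:
        problems.append(f"diff too large: {change.diff_lines} > {spec.max_diff_lines}")
    for path in change.files:
        if any(fnmatch(path, pattern) for pattern in spec.out_of_scope):
            problems.append(f"out of scope: {path}")
        elif not any(fnmatch(path, pattern) for pattern in spec.allowed_paths):
            problems.append(f"outside boundary: {path}")
    for name, check in spec.success_checks.items():
        if not check():
            problems.append(f"check failed: {name}")
    return problems


# Illustrative values only; the lambda stands in for a real test run.
spec = AgentSpec(
    goal="Add CSV export to the invoices page",
    success_checks={"export returns a header row": lambda: True},
    allowed_paths=["src/invoices/*", "tests/invoices/*"],
    max_diff_lines=400,
    out_of_scope=["src/billing/*"],
)
change = ProposedChange(
    files=["src/invoices/export.py", "src/billing/webhook.py"],
    diff_lines=520,
)
for problem in violations(spec, change):
    print(problem)
```

The choice worth noticing is that the function returns a list of concrete violations instead of a single pass or fail, so an agent or a reviewer can see exactly which boundary was crossed. A real spec would replace the lambda with actual tests.
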
Most teams have never written one. The skill is rare, and the tools that help write these specs are newer still.

Notice how these three reinforce each other. Without compounding memory, you can’t trust the agent enough to skip review. Without scalable verification, you can’t replace review even if you trusted the agent. Without load-bearing specs, you can’t leave the room even if both other problems were solved. Teams that try to climb to level 4 by fixing only one of these slide back to level 3 within a week.

What climbing actually requires
We think of the climb to level 4 as six things that all have to be true at the same time. Miss any one and the team backs into level 3 without noticing. Our work at Overcut is built around making all six achievable for a real team on a real codebase, not just for a five-person greenfield startup operating at level 4 by sheer talent and luck.