AI · Code Review · Opinion

AI Reviewing Its Own Code Is a Conflict of Interest

Every AI code generation company is shipping a review product. The author and the reviewer are the same engine with a different prompt. That is not a code review.

Sagnik · May 5, 2026 · 7 min read

Somebody asked me last week if Autter uses AI to review code. Yes. Obviously. Then they asked the next question, which is the better one.

"Isn't that the same thing the generation tools are doing now?"

No. And the difference matters more than anyone is being honest about.

There's a quiet pattern unfolding in developer tooling right now. Every AI code generation company is shipping a review product. Cursor has one. Copilot has one. Claude Code has one. The pitch is roughly the same across all of them: the same model that wrote your code can now check it before it ships.

This sounds reasonable until you say it out loud.

The author and the reviewer are the same engine with a different prompt.

We have a word for this in every other industry. We do not let auditors audit their own firms. We do not let pharma companies run their own clinical trials and publish the results. We do not let credit rating agencies grade the bonds they helped structure, and the one time we did, the global financial system fell over.

But somehow in software, we are watching the same model write the code, ship the code, and grade the code, and calling it a feature.

The Structural Problem

Captain Patch standing at the harbour gate looking suspiciously at a vessel that's also wearing the harbour master uniform.

A model trained to produce confident, plausible code is the worst possible reviewer for code that is confident and plausible but wrong.

The failure modes of the generator are the blind spots of the reviewer. They were trained on the same data. They share the same priors. They have the same instinct for what "looks right." If a model is going to hallucinate a function call, it will also fail to flag that hallucinated function call when it sees it again, because nothing inside that model is set up to disagree with itself.

This is not a tuning problem. You cannot prompt your way out of a shared epistemology.

The whole point of code review is adversarial: a second pair of eyes that does not share your assumptions, did not write the thing, has no ego attached to the diff, and is willing to push back. When the second reviewer is just the first author wearing a different hat, you have not added a check. You have added a rubber stamp with extra latency.

A second pair of eyes that shares your assumptions is not a second pair of eyes.

There's a deeper version of this. Models are increasingly trained on synthetic data, which often includes outputs from other models. The reviewer model has, in some part, been trained on code that looks exactly like what the generator would produce. Of course it thinks the code looks fine. It thinks the code looks fine because it has seen a million versions of code that looks fine, and most of that code came from somewhere structurally identical to where the new code is coming from.

You are not getting an independent opinion. You are getting the same opinion in a different font.
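To make that concrete, here's a back-of-the-envelope sketch. Every number in it is an assumption I'm making up for illustration; the only point is the shape of the arithmetic. When the reviewer's misses are independent of the author's mistakes, the error rates multiply. When they're correlated, the review layer stops subtracting risk.

```python
# Toy model of reviewer independence. All numbers are illustrative
# assumptions, not measurements of any real model.

p_bug = 0.05   # chance the generator ships a defect in a given diff
p_miss = 0.30  # chance a reviewer misses that defect when it sees one

# Independent reviewer: its blind spots are uncorrelated with the
# generator's failure modes, so the error rates multiply.
residual_independent = p_bug * p_miss  # 0.015 defects per diff

# Same-engine reviewer: model shared blind spots as a correlation
# factor that inflates the miss rate on exactly the defects the
# generator tends to produce.
correlation = 0.9  # 1.0 = reviewer shares every blind spot
p_miss_shared = p_miss + (1 - p_miss) * correlation  # 0.93
residual_same_engine = p_bug * p_miss_shared  # ~0.047 defects per diff

print(f"independent reviewer : {residual_independent:.3f}")
print(f"same-engine reviewer : {residual_same_engine:.3f}")
# At correlation = 1.0 the residual collapses back to p_bug itself:
# the review pass removes nothing.
```

At full correlation the reviewer catches only what the author would have caught anyway, which is another way of saying the second pass did nothing.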

Lesson

The author and the reviewer cannot be the same intelligence, full stop. If the only thing changing between the two passes is the system prompt, you do not have a review. You have a vibe check.

The Economic Problem

Captain Patch holding a clipboard while Captain Scout points at a tilted scale labeled 'ship more' on one side and 'block more' on the other.

The structural problem is the cleaner argument. The economic one is the messier and more important one.

Generation tools live and die on usage. Their pricing depends on tokens generated, lines suggested, completions accepted. Their growth depends on developers feeling like the tool is shipping their work for them. Friction is the enemy of conversion. Friction is the enemy of the metric that justifies the next round.

Now imagine you bolt a review layer onto that same product.

What does that review layer want to do?

In theory, it wants to block bad code. In practice, every block is friction. Every false positive is a developer pausing, getting annoyed, and considering whether they want this tool in their workflow at all. The product team running the review feature reports to the same exec running the generation feature. The generation feature is the revenue. The review feature is, at best, a retention play. At worst, it is a cost center that slows down the feature paying everyone's salaries.

Guess which one wins when those two priorities collide.

This is not theoretical. We have seen it already in adjacent categories. Static analysis tools that started strict and got friendlier over time because nobody was upgrading. Linters that became opt-in. Security scanners that quietly stopped failing builds and started filing tickets nobody read. The market punishes friction. The pressure on a review layer that lives inside a generation product is to slowly, quietly, become a suggestion layer. And then a comment layer. And then a feature on a slide that nobody actually runs.

The market punishes friction. A review layer that costs the parent product conversion will be tuned friendlier until it stops costing it conversion.

The incentive gradient runs in one direction. Toward leniency. Toward suggestions. Toward not blocking the merge.

That is not a code review. That is a feature designed to look like one for as long as it is politically convenient.

Lesson

If your reviewer and your author are owned by the same business, the reviewer is going to lose every quarterly review meeting. Not because anyone is malicious. Because of how incentives compound when nobody is watching.

Why Independent Enforcement Wins

Captain Patch standing at the harbour gate while a row of vessels from different shipyards line up, each waiting for clearance.

The category that scales here is not better in-house review from generation vendors. It is independent enforcement.

Three things have to be true for a review layer to actually work in the AI era.

The reviewer cannot be financially incentivized to ship more code. Its business model has to be independent of the volume of code that gets through. Ideally, it should be incentivized to block code that is wrong, not wave through code that is plausible.

The reviewer cannot share the model that wrote the code. Different training data, different priors, different failure modes. You want diversity in the stack the same way you want diversity in a code review panel.

The reviewer has to be able to actually block. Not file a comment. Not open an issue. Not suggest a change in a thread that gets ignored at 4:47pm on a Friday. Block. Sit on the merge button and refuse to move until the issue is resolved.

This is what we mean when we say merge gate, not advisory layer. The entire product category comes down to the difference between a tool that has the authority to say no and a tool that posts a polite suggestion that everyone scrolls past.

Block. Sit on the merge button. Refuse to move until the issue is resolved.
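For the avoidance of doubt about what "able to block" means mechanically, here is a minimal sketch of a gate running as a required CI status check. The findings format and the fetch_findings helper are hypothetical scaffolding; the real mechanism is the exit code. CI treats a nonzero exit as a failed check, and a branch protection rule that marks the check as required keeps the merge button disabled until it goes green.

```python
#!/usr/bin/env python3
"""Sketch of a blocking merge gate run as a required CI status check.

The findings shape and fetch_findings() are hypothetical placeholders;
the real mechanism is the exit code at the bottom.
"""

import sys


def fetch_findings(pr_number: int) -> list[dict]:
    # Hypothetical: a real gate would call the reviewer's API here.
    return [
        {"severity": "high", "title": "hallucinated function call"},
        {"severity": "low", "title": "naming nit"},
    ]


def main() -> int:
    findings = fetch_findings(pr_number=1234)
    blocking = [f for f in findings if f["severity"] == "high"]

    for finding in blocking:
        print(f"BLOCKING: {finding['title']}")

    # Nonzero exit -> failed required check -> the merge stays closed.
    # An advisory layer prints the same text and returns 0 regardless.
    return 1 if blocking else 0


if __name__ == "__main__":
    sys.exit(main())
```

The difference between a gate and an advisory layer is, quite literally, that return statement.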

Generation will keep commoditizing. Copilot to Cursor to Claude Code in roughly nine-to-eighteen-month cycles. Whichever one is winning today is not the one winning two years from now. That's fine. That market should commoditize. Generation is becoming a feature of the IDE, then a feature of the OS, then a utility.

Governance does not commoditize the same way.

Branch protection rules, contributor history, blast radius patterns, migration risk profiles, false positive suppression that learned from your team's last six months of reviews. These compound. They compound across vendors. They compound across model generations. The value of a governance layer six months in is significantly higher than the value of the same governance layer on day one, and the value of any individual generation model six months in is roughly half of what it was at launch.
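A minimal sketch of why that state compounds, with a hypothetical schema (none of these field names are Autter's actual data model): the learned records live outside any particular model, so swapping the reviewer model underneath them loses nothing.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Suppression:
    """One learned false-positive suppression. Hypothetical schema."""
    rule_id: str          # which check fired, e.g. "missing-migration-lock"
    path_glob: str        # where the team has said "this is fine here"
    reason: str           # the human justification, kept for audit
    learned_from_pr: int  # provenance: the review that taught us this


# The compounding asset is this table, not the model that applies it.
# Swap the reviewer model and the table still holds its value; swap a
# generation model and you are back to day one.
suppressions = [
    Suppression("missing-migration-lock", "scripts/backfill/*",
                "one-off backfills, reviewed by hand", learned_from_pr=412),
    Suppression("broad-exception-catch", "tools/ci/*",
                "CI glue may catch-all and log", learned_from_pr=587),
]


def is_suppressed(rule_id: str, path: str) -> bool:
    return any(s.rule_id == rule_id and fnmatch(path, s.path_glob)
               for s in suppressions)
```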

The moat is in the layer above generation, not inside it.

Lesson

Generation is the commodity. Governance is the compounding asset. The companies that confuse those two will be re-platforming every eighteen months.

The Honest Version

Captain Patch sitting at a small table with two cups of coffee, having an argument that looks affectionate.

I want to be honest about where this argument can be pushed back on.

Yes, Autter uses AI to review code. We are not pretending otherwise. The argument is not "AI cannot review AI." The argument is that the reviewing AI cannot be the same AI, owned by the same company, optimizing for the same metric, as the generating AI.

We are model-agnostic. The model that ran your last review is not the model that wrote your code. We do not sell code generation. We do not have a Cursor competitor sitting in the next office over influencing how strict our defaults should be next quarter. Our entire business depends on us blocking the right things and not blocking the wrong things. If we get loose, we lose. If we get noisy, we lose. The incentive points in the right direction.

That is not a feature. That is a structural fact about being an independent layer.

The companies bundling review into their generation products will keep telling you it's the same thing. It is not. Watch which incentives bend over the next eighteen months. You will see it.

The Real Lesson

The conversation we should be having in the industry is not "Is AI good at code review?" That question is settled enough. AI can do parts of code review well, and parts of it badly, and the engineering is going to keep getting better.

The conversation we should actually be having is who owns the reviewer.

If your code review layer is owned by the company that wrote the code, you are not running a review process. You are running a marketing channel for the generation product. It will work fine until the moment it doesn't, and the moment it doesn't will be the production incident you cannot explain in the postmortem.

The merge button is the most consequential decision in modern software. Who gets to press it, when, and under what conditions is governance. It does not belong to a feature flag in someone's IDE plugin.

It belongs at the gate.

Building Autter in public. We're a merge gate that does not write your code, does not sell you a model, and does not have a quarterly metric that depends on shipping more diffs. We just decide what clears the harbour. If that sounds like the kind of layer your team has been quietly missing, drop us a line at hi@autter.dev or book 30 minutes.

P.S. Tanvi read this draft and said the only thing more predictable than an AI reviewing its own code is me writing a 2,000-word blog post about it. She also said to publish it anyway because being predictable is fine when you're predictably right. I am choosing to take that as the compliment it almost was.
