How do you tell who's thinking?
Right now, on your team, there is at least one engineer shipping work they could not reproduce on a whiteboard.
The code compiles. The tests pass. The PR is pristine. The reasoning, however, was not theirs to begin with, and they do not know that. From the inside, borrowing a thought feels exactly like having one.
You will find out the next time someone asks the cache question. The architecture diagram is on the screen, the design looks clean, and a staff engineer leans forward:
“Walk me through why you ruled out a write-through cache here. What happens when this specific node restarts under load?”
The pause that follows should be a retrieval pause: the engineer digging into their own mental model for the constraint they had considered, the alternative they had weighed, the second-order effect they were tracking when they made the call.
Today, the pause is a void.
Friction was load-bearing
This is not a eulogy for the old days of Stack Overflow spelunking.
AI is a fantastic multiplier. It is a tireless midnight reviewer that will read your design at 1 AM without complaint. It is a sparring partner that will hold a counter-argument long enough for you to mount a real defence. It will point out the bit you’d glossed over. Engineers using it well are, by every honest measure, sharper than they were two years ago. I use it. You use it. This post would be a fraud if it pretended otherwise.
But we have made a quiet category error. We assumed the friction of software engineering was just a speed limit. It wasn’t.
Friction wasn’t just slowing us down; it was load-bearing. It was a real-time topographical map telling you where the hard problems lived. We optimised away the friction and accidentally deleted the map.
When a problem was genuinely hard, you spent three hours staring at a wall, sweating over a whiteboard, and questioning your career choices. The friction was a compass. The same friction that made the work miserable was what forced you to build the model in your head: the write-through semantics, the failure modes, the read-after-write window. You held all of it in your skull until you could defend the design without the document open.
The model hallucinates a perfectly formatted, syntactically sweet architectural document in twelve seconds. You don’t get friction. You get a dopamine hit and a lie. You have the answer. You don’t have the muscle.
The broken axle
An engineering organisation is a sampling problem. You can’t watch every keystroke, so you read the artifacts and infer the cognition. Pull requests, design docs, postmortems, the occasional incident review. These are the sampling instruments. Performance reviews, calibration, promotion bars, hiring loops all rest on the assumption that the artifact and the cognition come from the same place.
AI snapped the axle connecting those two wheels.
Output is now a function of three variables — the engineer, the model and the prompt — and the artifacts tell you about all three, jumbled together, with no way to separate the signal. A polished PR is consistent with a thoughtful engineer, an unthoughtful engineer, or a tab someone left open on the train. The artifact still arrives on time. It just doesn’t carry the signal you need to read it for.
This matters most for the engineers you are trying to grow. A senior producing fluent AI output is, at worst, leveraged. The muscle is already built; the model is just driving the keyboard. A junior producing the same output is producing it without ever building the muscle the work was supposed to build. They ship the write-through cache without ever modelling what happens when the node drops. They ship the read replica for analytics without ever sitting with the consistency window. The artifact is fine. The architectural intuition that was supposed to grow under it never showed up.
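That consistency window is worth making concrete. Below is a toy sketch, with invented names and a fixed, exaggerated lag standing in for real asynchronous replication, of the read-after-write gap the junior never sat with:

```python
# Toy model of a primary with an async read replica. Everything here is
# illustrative: real replication lag is variable and spikes under load.
import threading
import time

primary: dict = {}
replica: dict = {}

def write(key, value, lag=0.2):
    """Write to the primary; the replica hears about it `lag` seconds later."""
    primary[key] = value
    threading.Timer(lag, replica.__setitem__, args=(key, value)).start()

write("orders_today", 42)
print(replica.get("orders_today"))  # None -- the analytics read lands in the window
time.sleep(0.3)                     # wait out the simulated replication lag
print(replica.get("orders_today"))  # 42 -- the window has closed
```

The whole design question lives in that first print: what does the analytics consumer do with an answer from inside the window?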
The METR paradox
Here is a number that should be staring engineering leaders in the face. METR ran a careful study in 2025: experienced developers using AI assistance completed tasks 19% slower than without it, while believing they were 20% faster.¹ Juniors in unfamiliar code, in McKinsey’s separate work, run the opposite way: 26 to 39% faster.²
The gradient runs the wrong way for an org chart. Why do juniors look like 10x engineers while seniors look flat?
The seniors aren’t worse with AI. The work AI helps with isn’t the work they were doing. Their bottleneck was never typing — it was deciding what to type. The juniors look fast because their bottleneck was typing, and now typing is free. The production function hasn’t just improved; it has relocated. It has moved off the artifact entirely. The juniors are producing senior-looking output. They are not, by any measurable signal, building senior judgment.
The juniors are closing tickets so fast the Jira board looks like a slot machine paying out. The dashboard says they are 39% faster. The dashboard is thrilled. The dashboard, however, does not have to maintain the state machine they just invented.
Three things going wrong inside your head
Three psychological mechanisms fire at once when you read fluent AI output and call it your own thought. They are well documented, decades old, and they still hold.
The fluency illusion. People judge how well they understand a thing by how easily it comes to mind.³ AI output is maximally easy: polished, confident, structured at exactly the level you can absorb. Reading it produces the warm sensation of understanding without doing the work that usually produces the sensation.
The warmth of understanding is the bug.
The lost effort signal. For most of your career, “this felt hard” was a usable proxy for “I am doing real cognition.” The friction did the calibration for you. With the model in the loop, the cognition is offloaded but the proxy is not recalibrated. You ship the thing, it felt easy, and you assume that means it was simple, when it might mean you didn’t think.
Authorship drift. Source-monitoring research is forty years old. People cannot reliably distinguish ideas they generated from ideas they were prompted with.⁴ By Thursday, the drift is complete. The engineer looks at a 400-line regex block they “paired” with the model on and genuinely thinks, “Ah yes, I remember meticulously crafting that negative lookbehind.” They did not. They asked the prompt box to make the bad characters go away and accepted the first snippet that didn’t throw.
The cache question lands inside this. The engineer wrote the write-through tier with a model in the loop. It felt easy. The code compiles. The tests pass. But they never sat with what happens to the in-flight write when the node drops. They never sat with what the database sees when the cache stampedes back, or what the read path returns during the warm-up. They have no mental model to retrieve from when someone asks. There is nothing wrong with the code. There is nothing in the engineer’s head.
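The model they never built is small. Here is a minimal sketch, with dicts standing in for the real stores and every name invented, of where the cache question lives:

```python
# Minimal write-through sketch. The comments mark the moments the cache
# question is asking about; nothing here is a real implementation.
class WriteThroughCache:
    def __init__(self, cache: dict, db: dict):
        self.cache = cache  # stands in for the cache tier
        self.db = db        # stands in for the durable store

    def put(self, key, value):
        self.cache[key] = value
        # <- the node restarts here: the cache accepted a write the database
        #    never saw. After restart the cache is cold and the "successful"
        #    write is gone. (Writing the db first just flips which side lies.)
        self.db[key] = value

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)  # cold cache: every read falls through at
        self.cache[key] = value   # once -- the stampede the database sees
        return value              # and the answer the warm-up read returns
```

Nothing in those twenty lines is hard. That is the point: the engineer who shipped the tier never held even this much in their head.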
Introspection has stopped working
The combined effect is the part that should keep you up at night.
The engineer who has tipped from extension into substitution is not lying when they say they thought it through. They felt fluent. The work felt easy in the way that well-understood work feels easy. They remember the reasoning as theirs.
The introspective report is unreliable. The artifact is unreliable.
What’s left is what they can do, cold, when you ask.
Scepticism is not distrust of AI
Scepticism, in this context, is not distrust of AI. It is distrust of the fluency signal. It is the discipline of separating “I can read this and feel I understand it” from “I can produce this from nothing tomorrow morning.” Those used to be roughly the same thing. They aren’t anymore.
This is the move that has to happen first, in the engineer’s own head, before any manager can do anything useful with it. If the engineer cannot apply it to themselves, no review process will save them.
What scepticism looks like, applied to yourself
- Cold reproduction. Close the laptop. Walk to the whiteboard. Reproduce the design from memory, including the alternatives you ruled out. The pause before you start is the data.
- Adversarial self-questioning. Pick the assumption you’re least comfortable with and try to break it without re-prompting. If your first move is to open the chat, you have your answer.
- The 10x test. What would change if the load assumption shifted by 10x? If your answer is “I’d ask the model”, the model owns the design, not you. A back-of-envelope version is sketched below.
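To make the 10x test concrete, here is the kind of back-of-envelope you should be able to produce with the laptop closed. The numbers are hypothetical; the exercise is generating them cold:

```python
# Hypothetical capacity sketch -- every figure here is an assumption.
qps = 2_000          # assumed steady-state read load
hit_rate = 0.95      # assumed cache hit rate
db_capacity = 1_500  # reads/sec the database is provisioned for

def db_load(qps: float, hit_rate: float) -> float:
    """Reads that miss the cache and fall through to the database."""
    return qps * (1 - hit_rate)

print(db_load(qps, hit_rate))       # 100 rps: comfortable
print(db_load(10 * qps, hit_rate))  # 1,000 rps: under db_capacity, barely
print(db_load(10 * qps, 0.0))       # 20,000 rps: a cold cache at 10x load
```

The last line is the cache question again: 10x is survivable right up until the node restarts and the hit rate goes to zero.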
And then there’s the version of the cache question you ask yourself, in your own head, sitting in your own kitchen late at night:
“Walk me through why I ruled out a message queue here. Wait. Did I rule that out? Or did the model just not mention it? Wait, am I the model?”
Once a week, write down the decisions you made in three columns: yours, the model’s, and can’t separate. The third bucket is the interesting one. It should not be the largest.
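A hypothetical week, to show the shape:

```
yours:           chose Postgres over DynamoDB; we own the ops story either way
the model's:     the retry/backoff constants in the ingest worker
can't separate:  the cache invalidation strategy. I reviewed it. Did I derive it?
```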
The manager’s version
The job of the engineering manager is no longer tracking velocity. It is probing for depth.
Conversation review is the missing instrument. The data the org actually needs — three moments per quarter where this engineer defended a non-trivial decision under live questioning — is on nobody’s calendar, in nobody’s review template, in no dashboard. It exists only in the room, and only if someone in the room thought to ask.
The exchange to listen for:
“Walk me through why you chose this specific database indexing strategy.”
Silence.
“Okay, walk me through what an index is.”
Longer silence.
If an engineer can’t defend the design with their laptop closed, they didn’t write it.
Hiring has changed
The third surface this lands on is hiring, and it’s where the consequences hit fastest.
I assume every candidate uses AI. The CV looks great, the take-home is clean, and none of that is a signal anymore. I’ve stopped probing for syntax perfection or debugging ability: the model can do both, and the candidate knows I know.
What I probe for now is decision-making ahead of the model. The candidate who, asked to design something, says “I need this, but I know this will happen — and if traffic doubles, this falls over here, so I’d actually do that instead” is the candidate who has built the muscle. The candidate who reproduces a clean design but can’t tell me what breaks at 10x is the candidate the model carried through the take-home.
The other thing I probe for is the part of the system the model cannot see. The horizontally scaling cluster behind the service. The cache that lives in another team’s repo. The downstream consumer that will silently swallow a malformed event for six weeks before it pages someone. The model is brilliant inside the codebase in front of it. It is blind to the system around it. So is the candidate who has only ever paired with the model.
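That downstream consumer is worth a sketch, because no model with only one repo in its context window will ever flag it. The code is hypothetical; the shape is everywhere:

```python
import json

def process(payload: dict) -> None:
    ...  # downstream business logic, not the point here

def consume(raw_event: str) -> None:
    try:
        process(json.loads(raw_event))
    except Exception:
        # A malformed event lands here and vanishes. Nothing logs,
        # nothing pages. Six weeks later, someone asks where the data went.
        pass
```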
There is no pretending: they will use AI. The interview is no longer about whether they can produce a function. It is about whether they can hold the system in their head.
The three-year problem
If the system can’t see it, it can’t grade it. If it can’t grade it, it can’t promote on it. If it can’t promote on it, it stops selecting for it.
So we will promote based on velocity. We will build an entire generation of staff engineers who can ship anything and defend none of it. The burndown charts will be beautiful. The burndown charts will be the only things not on fire.
The senior engineers who can defend a design under questioning won’t be gone: they were never hired. Their successors look the same on the dashboard. They just can’t answer the question.
Turn it on the post
You are nodding along. The argument feels right. The logic clicked into place.
But remember the bug: the warmth is not evidence the argument is right.
This post started as disagreement with Koshy John’s AI should elevate your thinking, not replace it, which makes the personal-virtue version of this argument. He’s mostly right. Parts of this essay were brainstormed with a model. Some of the phrasing arrived too easily. There are sentences here I would, asked cold tomorrow, struggle to reproduce in the same shape.
Am I a genius for articulating this, or did I just hit Regenerate until the robot sounded sufficiently cynical?
Tomorrow morning, close your laptop and try to reproduce this claim from memory. Walk yourself through the three variables that now produce the output. Walk yourself through the difference between Koshy John’s argument and this one.
If you can’t, you didn’t think it. You just read it.
The cache, however
Three years from now, our engineering dashboards will be entirely green. Velocity will be through the roof. The organisation will look flawless on paper.
The cache, however, will be on fire.
Footnotes
1. METR, Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, July 2025.
2. McKinsey, Unleashing developer productivity with generative AI, 2023. Reports task-completion gains for less-experienced developers in roughly the 26–39% range, depending on the task type.
3. See Rozenblit, L. & Keil, F., “The misunderstood limits of folk science: an illusion of explanatory depth”, Cognitive Science, 2002. The fluency-as-proxy-for-understanding effect is robust across decades of work.
4. Johnson, M. K., Hashtroudi, S. & Lindsay, D. S., “Source monitoring”, Psychological Bulletin, 1993. The foundational synthesis; the result has been replicated and extended into human/AI co-authorship contexts in recent years.