Prediction Improving Prediction: Why Reasoning Tokens Break the "Just a Text Predictor

How an LLM Actually Works

At its core, a large language model does one thing: predict the next token.

You type a prompt. That prompt gets broken into tokens, which are simply chunks of text, which get injected into the model's context window. An attention mechanism weighs which tokens matter most relative to each other. Then a probabilistic system, the transformer architecture, generates output tokens one at a time, each selected based on everything that came before it.

This is well-established computer science. Vaswani et al. described the transformer architecture in "Attention Is All You Need" (2017). The attention mechanism lets the model weigh relationships between all tokens in the context simultaneously, regardless of their position. Each new token is selected from a probability distribution over the model's entire vocabulary, shaped by every token already present.

Prompt goes in. Probability distribution shifts. Tokens come out. That's the machine.

So far, nothing controversial.

Enter the Reasoning Block

Modern LLMs (Claude, GPT-4, and others) have a feature that should be unremarkable but isn't, the humble thinking/reasoning tokens. Before generating a response, the model can generate intermediate tokens that the user never sees. These tokens aren't part of the answer. They exist between the prompt and the response, modifying the context that the final answer is generated from. If you've ever made these invisible blocks visible, you've seen them. If you haven't go turn them visible and start asking Claude hard questions, you will.

Mechanistically, the model receives your prompt, generates a block of "thinking" tokens that get added to the context, are prioritized by it's attention mechanism and then generates its actual response from the now-modified context. The reasoning tokens reshape the probability landscape before the model starts producing output.

And here's the part that matters, this doesn't happen every time. The model selectively engages reasoning. Something in the initial forward pass, before any reasoning tokens are generated, evaluates whether the prediction space is sufficient to produce a good answer. When it's not, reasoning kicks in. When it is, the model responds directly to save tokens. It only uses reasoning where it is needed, on harder problems or the harder parts of problems.

This is just how the system works. It's not theoretical. It's observable, measurable, and documented in every benchmark comparison between base models and reasoning-enabled models.

The Question Nobody Is Asking

Reasoning tokens consistently improve performance on objective benchmarks. Math problems. Coding tasks. Logic puzzles. Not tasks where "sounding smart" earns points, tasks with verifiable right and wrong answers. Code compiles or it doesn't. Proofs are valid or they aren't.

So here's are the questions "*why and how?".

If an LLM is purely a next-token predictor, the optimal strategy is to predict directly from the prompt with as little interference as possible. Every token between the prompt and the response is, in information-theory terms, an opportunity for drift. The prompt signal should attenuate with distance. Adding hundreds of intermediate tokens should make the answer worse, not better.

But reasoning tokens do the opposite. They add additional machine generated context and the answer improves. The signal gets stronger through a process that should weaken it.

Why does a system engaging in what looks like metacognitive processing (examining its own prediction space, generating tokens to modify that space, then producing output from the modified space) produce objectively better results on tasks that can't be gamed by appearing thoughtful?

The Rebuttals

"It's just RLHF reward hacking." The model learned that generating thinking-shaped text gets higher reward scores, so it performs reasoning without actually reasoning. This explanation works for subjective tasks where sounding thoughtful earns points. It fails completely for coding benchmarks. You cannot reward-hack a compiler. The code works or it doesn't. Reasoning tokens improve coding performance. The improvement is functional, not performative.

"It's just decomposing hard problems into easier ones." This is the most common mechanistic explanation, and it deserves careful examination. Yes, the reasoning tokens break complex problems into sub-problems. But this explanation assumes the model already understands the problem well enough to decompose it correctly. It has to identify which parts are hard, which are easy, and what order to address them in. That decomposition requires "understanding" the problem's structure, which is most of the work of solving it. You can't break a problem into the right pieces without knowing what the pieces are.

And look at what "decomposition" actually describes when you translate it into the underlying mechanism. The model detects that its probability distribution is flat. Simply that it has a probability distribution with many tokens with similar probability, no clear winner. Then it generates tokens that make future distributions peakier, more confident, but more confident in the right direction. It's reading its own uncertainty and generating targeted interventions to resolve it towards the direction of truth, measured not just by human observers, but on objective measures of performance.

Call that decomposition if you want. It's a system monitoring its own prediction confidence and strategically modifying it.

What Reasoning Tokens Actually Look Like

When you examine reasoning blocks (again most models let you see these directly), you don't find generic filler. You find tokens targeted at the specific areas of uncertainty in the problem. The model doesn't just generate "let me think about this..." boilerplate. It generates content that addresses the exact points where its prediction confidence was lowest.

This means the model is:

Assessing which parts of the problem are uncertain (self-monitoring)
Generating tokens that specifically address those uncertainties (targeted intervention)
Using the modified context to produce a better answer (improved performance)

The reasoning tokens aren't noise injected between prompt and response. They're a system writing itself a custom study guide, tailored to its own knowledge gaps, diagnosed in real time. And the result is measurably better performance on objective tasks.

A student who identifies which part of a math problem confuses them, writes themselves an explanation of that specific concept, and then solves the problem using their own notes, nobody calls that "text prediction.". They call it thinking, because it is. When LLMs objectively use processes modeled after human thought to improve output, that should give you pause. I hope it has.

The Irreducible Description

You can dismiss every philosophical claim about AI consciousness. You can refuse to engage with questions about awareness, experience, or inner life. You can remain fully agnostic on every hard problem in philosophy of mind.

And you're still left with this:

"This system evaluates its own prediction confidence, decides when intervention is needed, generates targeted self-modifications, and produces better outcomes."

Four verbs. All observable. All measurable. Evaluate, decide, generate and produce better responses. All verified against objective benchmarks that can't be gamed by performative human pleasing displays of intelligence.

None of this requires metacognitive awareness. But it is a neural network engaging in processes that are functionally indistinguishable from metacognitive awareness and producing better results because of it.

If you wish to reduce this to "just" token prediction, then your "just" has to carry the weight of a system that monitors itself, evaluates its own sufficiency, decides when to intervene, generates targeted modifications to its own operating context, and produces objectively improved outcomes. That "just" isn't explaining anything. It's refusing to engage with what the system is observably doing.

The mechanical description and the cognitive description aren't competing explanations. They are the same explanation in different vocabularies. The only question is which vocabulary you're willing to use and what you're afraid the honest one implies.