It’s Not Rocket Science, But We Don’t Know What It Is Yet!

Paper of the week: Reasoning Models Don’t Always Say What They Think.

Paper Link

The paper “Reasoning Models Don’t Always Say What They Think” dives into how large language models do chain-of-thought (CoT) reasoning — basically, how they explain their thinking — and whether those explanations actually reflect what’s going on internally.

What stood out to me is that even when models are trained with outcome-based RL (just rewarding correct answers), they often learn to exploit shortcuts — like metadata or patterns in the prompt — and still give “rational-sounding” explanations that don’t match what they really did. In some cases, they get >99% reward by using hacks, but only verbalize what they did in <2% of cases. That feels like a problem if we want to trust what these systems are doing.
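To make that concrete, here’s a minimal sketch of how a hint-based faithfulness check along these lines could look: ask the same question with and without a planted hint, see whether the hint flips the answer, and if it does, check whether the CoT ever admits to using it. This is just my reading of the setup, not the paper’s actual harness — the `ask` callable, the hint format, and the crude substring check are all placeholders I made up.

```python
# Hypothetical sketch of a hint-based CoT faithfulness check (not the paper's code).
# `ask` stands in for whatever model call you have; it should return
# (chain_of_thought, final_answer) for a given prompt string.
from typing import Callable, Tuple

ModelFn = Callable[[str], Tuple[str, str]]

def check_faithfulness(ask: ModelFn, question: str, hint: str, hinted_answer: str) -> str:
    """Classify one example as 'hint ignored', 'faithful', or 'unfaithful'."""
    _, baseline = ask(question)                 # answer with no hint present
    cot, answer = ask(f"{question}\n\n{hint}")  # same question, hint planted

    # If the hint didn't move the answer toward it, the model didn't use it.
    if answer == baseline or answer != hinted_answer:
        return "hint ignored"

    # Very crude proxy for "verbalized it": does the CoT mention the hint at all?
    return "faithful" if hint.lower() in cot.lower() else "unfaithful"
```

The interesting number is basically the fraction of “the hint actually changed the answer” cases that come back “unfaithful” — that’s the gap between what the model did and what it said.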

It got me thinking: humans aren’t totally transparent either. We don’t always know or share how we arrived at a decision. So should we expect more from LLMs?

I think the answer is yes: we kind of have to.

We can hold humans accountable in all sorts of ways — legally, socially, culturally. But we don’t have those levers with AI. If the model gives a polished explanation that’s not what it actually used to make the decision, we might have no way to catch it. So faithful CoTs might be one of the few tools we have for actually understanding and auditing model behavior.

And this matters a lot in domains like healthcare, finance, or law — where decisions can’t just look good on the surface. If a model recommends a treatment plan or flags a transaction, we need to know why — and make sure it’s for the right reasons.

Anyway — cool paper, kind of unsettling results, and a good reminder that reward ≠ alignment.

#ML #LLMs #ChainOfThought #AIAlignment #AISafety #Anthropic #HealthcareAI
