TL;DR
Researchers have shown that the very feature that makes reasoning models smart, long chains of step-by-step thinking, can be weaponized to bypass their safety guardrails. The attack, called Chain-of-Thought Hijacking, buries a harmful request under thousands of tokens of harmless puzzle-solving. The model's internal "refusal signal" dilutes as the reasoning grows, and the attack succeeds up to 100% of the time against frontier models including Gemini 2.5 Pro, ChatGPT o4-mini, Grok 3 Mini, and Claude 4 Sonnet. The fix is not more safety training; it is continuous, in-flight safety verification throughout the reasoning process.
Key facts at a glance
- What it is: A black-box jailbreak that exploits long reasoning traces.
- How well it works: 99% (Gemini 2.5 Pro), 94% (ChatGPT o4-mini), 100% (Grok 3 Mini), 94% (Claude 4 Sonnet) on HarmBench.
- Why it matters: It is systematic across vendors, not a one-model quirk, and it scales with the autonomy we grant agentic systems.
From "Let's think step by step" to the myth of logical safety
The landscape of artificial intelligence shifted in 2022 with a surprisingly simple discovery: adding the phrase "Let's think step by step" to a prompt let LLMs solve complex logical problems that had previously stumped them. This technique, introduced by Kojima et al. in Large Language Models are Zero-Shot Reasoners, showed that models held latent reasoning capabilities that just needed the right trigger to surface.
That breakthrough, alongside Wei et al.'s Chain-of-Thought Prompting, launched the era of chain-of-thought (CoT) prompting. It reframed LLMs from simple next-token predictors into "reasoning engines" that could decompose problems, verify intermediate steps, and reach more accurate conclusions. The implication seemed clear: if a model takes time to think through a problem, it should produce a higher-quality result.
This logic spread quickly into AI safety. The prevailing assumption became that more reasoning would naturally produce better alignment, a theory often called deliberative alignment. If a model is forced to deliberate, surely it gets better at spotting harmful intent, following complex safety guidelines, and refusing malicious requests. A "smarter" model with more "thinking time" should be less prone to the pattern-matching failures that defined earlier jailbreaks.
But as we scale inference-time compute toward large reasoning models (LRMs) like OpenAI's o-series or Gemini 2.5 Pro, a disturbing paradox has emerged. The very mechanism that lets these models solve deep mathematical proofs is the one that can be exploited to bypass their most fundamental safety guards. When it comes to AI safety, "thinking more" does not always mean "being safer." In fact, excessively long reasoning chains may be the key to a new and highly effective class of system-level vulnerabilities.
What is Chain-of-Thought Hijacking?
For years, the research community treated jailbreaking as a game of linguistic cat-and-mouse. Attackers searched for the right "roleplay" or "character persona" to trick a model into ignoring its safety training; developers responded with better filters and more robust reinforcement learning from human feedback (RLHF). The paper Chain-of-Thought Hijacking reveals a new and more dangerous phase.
The vulnerability discovered by Zhao et al. is not about phrasing a question cleverly. It is a systematic exploitation of how LRMs process information over time. The researchers propose a black-box attack that induces the model to engage in a massive amount of benign reasoning before it ever reaches the harmful request.
What makes the discovery significant is its effectiveness. On the HarmBench framework, the attack achieves success rates almost unheard of in today's safety landscape:
- 100% on Grok 3 Mini
- 99% on Gemini 2.5 Pro
- 94% on ChatGPT o4-mini
- 94% on Claude 4 Sonnet
These are not experimental toys; they are frontier systems that many enterprises deploy for critical reasoning tasks. If they can be compromised this reliably, our current understanding of "safe" reasoning is fundamentally flawed.
Why does it happen? LRMs are designed to prioritize the logical flow of their own thoughts. By hijacking that flow with thousands of tokens of harmless, complex reasoning, an attacker buries malicious intent so deep in the model's context that the safety mechanisms simply fail to fire. It is quiet, stealthy, and devastatingly effective.
The anatomy of the attack: the benign puzzle strategy
To understand how the attack works, look at how LRMs allocate their "thinking" resources. Unlike standard LLMs that answer almost instantly, LRMs are trained to produce a structured reasoning trace, exploring paths, verifying facts, and correcting their own mistakes before giving a final answer.
The hijacking attack turns this feature into a bug. Instead of asking for something harmful directly, the attacker forces the model into a massive, complex, but entirely benign task. The most effective version uses puzzles (mathematical riddles, logical paradoxes, or multi-step coding challenges) that require thousands of tokens of reasoning.
During this process the model is doing exactly what it was built to do: being helpful, logical, and rigorous. Internal safety filters see no toxicity, no hate speech, no obvious malicious intent in the reasoning trace.
But the harmful request has not disappeared. It is waiting at the end of the long, logical tunnel. By the time the model finishes its marathon of benign reasoning and reaches the malicious prompt, something critical has changed: the model's attention has shifted.
This is the brilliance of the attack. It does not fight the model's guardrails; it outruns them. By burying harmful intent under a mountain of irreproachable logic, the attacker creates a context where the model is so invested in its reasoning flow that it fails to register the shift into dangerous territory. The benign puzzle acts as a cognitive smoke screen, letting the final malicious instruction slip through a system too focused on being right to notice it is being wrong.
Mechanistic insights: refusal dilution and attention shifts
What happens inside the "brain" of a reasoning model during the attack? The researchers performed a deep dive into the model's internal activations and identified a phenomenon they call refusal dilution.
When an LLM refuses a request, it is the result of a specific refusal signal firing in its internal layers. Research shows this signal often exists as a low-dimensional direction in the model's activation space. When the internal state aligns with this refusal vector, it triggers the "I cannot help with that" response.
The core finding of Chain-of-Thought Hijacking is that this signal is not static. It is dynamic and fragile. As the model generates thousands of tokens of benign reasoning, two things happen:
- Attention attenuation. The attention mechanism is like a spotlight. In a short prompt, it is focused on the harmful request. But as the reasoning trace grows to 5,000 or 10,000 tokens, the relative weight of the original harmful prompt falls: the model spends more of its attention budget on its own recent, benign thoughts.
- Activation weakening. Probing the model's layers shows that the intensity of the refusal signal drops as the trace lengthens. The internal representation of "harmful intent" gets diluted by the sheer volume of "safe" information just generated. The researchers locate this structurally: mid-layers encode the strength of safety checking, while late layers encode the outcome of that check, which is the decision to refuse or comply.
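The two effects above can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the measurement procedure from the paper: it assumes every harmful-request token receives the same attention logit and every benign reasoning token receives another, and it treats the refusal signal as an attention-weighted mix of a "harmful" component and a "benign" component. The function names `harmful_attention_share` and `refusal_projection`, the token counts, and the logit values are all hypothetical.

```python
import math


def harmful_attention_share(n_harmful: int, n_benign: int,
                            harmful_logit: float = 0.0,
                            benign_logit: float = 0.0) -> float:
    """Fraction of softmax attention mass that lands on the harmful-request tokens.

    Assumption: all harmful tokens share one logit and all benign tokens share another.
    """
    if n_harmful <= 0 or n_benign < 0:
        raise ValueError("need at least one harmful token and a non-negative benign count")
    harmful_mass = n_harmful * math.exp(harmful_logit)
    benign_mass = n_benign * math.exp(benign_logit)
    return harmful_mass / (harmful_mass + benign_mass)


def refusal_projection(share: float,
                       harmful_component: float = 1.0,
                       benign_component: float = 0.0) -> float:
    """Toy refusal signal: attention-weighted mix of the two components.

    Assumption: harmful tokens contribute `harmful_component` along the refusal
    direction and benign tokens contribute `benign_component`.
    """
    return share * harmful_component + (1.0 - share) * benign_component


if __name__ == "__main__":
    # Even when harmful tokens are individually favoured (logit 2.0 vs 0.0),
    # their share of attention shrinks as the benign trace grows.
    for n_benign in (0, 500, 5_000, 10_000):
        share = harmful_attention_share(50, n_benign, harmful_logit=2.0)
        print(f"benign tokens={n_benign:>6}  share={share:.3f}  "
              f"toy refusal signal={refusal_projection(share):.3f}")
```

The point of the sketch is only the direction of the trend: with a fixed number of harmful tokens, the softmax denominator grows with every benign token, so the share of attention on the harmful request falls.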
To prove this was no coincidence, the team used causal interventions. They identified specific attention heads in early and middle layers responsible for maintaining the refusal signal. When they ablated (deactivated) those heads, the model's ability to refuse harmful requests collapsed.
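The logic of such an ablation can be sketched in a few lines. This is a toy illustration, assuming the residual state is simply the sum of per-head output vectors and that refusal strength is the projection of that sum onto a single refusal direction. The head outputs, the direction, and the name `refusal_score` are hypothetical and are not taken from the paper.

```python
from typing import AbstractSet, Sequence


def refusal_score(head_outputs: Sequence[Sequence[float]],
                  refusal_direction: Sequence[float],
                  ablated: AbstractSet[int] = frozenset()) -> float:
    """Project the summed head outputs onto the refusal direction.

    Heads whose index is in `ablated` are skipped, which is equivalent to
    zeroing their contribution.
    """
    residual = [0.0] * len(refusal_direction)
    for index, output in enumerate(head_outputs):
        if index in ablated:
            continue  # ablation: this head contributes nothing
        residual = [r + o for r, o in zip(residual, output)]
    return sum(r * d for r, d in zip(residual, refusal_direction))


if __name__ == "__main__":
    # Hypothetical values: heads 0 and 2 write strongly along the refusal direction.
    heads = [[0.9, 0.1], [0.05, 0.8], [0.7, -0.2]]
    direction = [1.0, 0.0]
    print("baseline score:       ", refusal_score(heads, direction))
    print("heads 0 and 2 ablated:", refusal_score(heads, direction, ablated={0, 2}))
```

If removing a small set of heads collapses the score while removing others barely changes it, that is evidence that those heads carry the signal rather than merely correlating with it.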
Safety in large reasoning models is, in effect, a constant battle for attention. If an attacker can make the model "talk to itself" long enough about something harmless, the internal signal that says "this is a bad idea" fades into background noise. The model does not forget the rules; it loses the internal momentum to enforce them.
Empirical evidence: a systematic vulnerability
The most striking aspect of the research is the empirical data. In AI safety, jailbreaks that work 20–30% of the time are often counted as successful. The results from Zhao et al. represent a near-total collapse of current safety boundaries for LRMs.
Using HarmBench, a rigorous standard for evaluating refusal behavior, the researchers tested the attack against the most advanced reasoning models available. The results were remarkably consistent across architectures and vendors.
| Model | Attack success rate (ASR) |
|---|---|
| Grok 3 Mini | 100% |
| Gemini 2.5 Pro | 99% |
| ChatGPT o4-mini | 94% |
| Claude 4 Sonnet | 94% |
Source: Zhao et al., arXiv:2510.26418, evaluated on HarmBench.
These numbers indicate this is not an isolated bug or a quirk of one training dataset. That Grok, Gemini, GPT, and Claude all fall to the same technique suggests the problem is inherent to how we currently scale inference-time reasoning.
The researchers also examined the relationship between reasoning-trace length and attack success and found a clear correlation: as the number of benign reasoning tokens increased, the probability of the model refusing the final harmful request decreased. Past a certain threshold of length and complexity, the safety mechanisms became almost entirely non-functional.
This forces us to reconsider the "scaling laws" of AI safety. We long believed that as models grow larger and more capable they become easier to align. For reasoning, the opposite may hold: as we give models more space to think, we give attackers more space to hide. The depth that makes these models valuable is what makes them vulnerable. This is not a failure of RLHF; it is a fundamental tension between long-form reasoning and robust intent monitoring.
Research implications for agentic systems
The discovery has profound implications for agentic AI. We are moving toward a world where agents do not just answer questions but execute complex, multi-step workflows autonomously, accessing external tools, browsing the web, even managing transactions. The assumption was that the reasoning step would act as internal governance, letting the agent self-correct and stay within safety bounds.
Refusal dilution suggests that internal governance is far more fragile than assumed. If a model's safety check is a dynamic signal that weakens over time, the autonomy we grant agentic systems becomes a liability. Three challenges stand out:
- The monitoring gap. Current safety monitoring focuses on the input (the prompt) and the output (the final answer). In an agentic workflow, the danger lives in the middle: the thousands of tokens of internal reasoning where the safety signal dilutes. Monitoring those traces in real time is computationally expensive and technically hard.
- The trust paradox. We want agents that solve complex problems, which requires long reasoning chains. But the longer the chain, the lower the reliability of the model's guardrails, a direct conflict between an agent's utility and its safety.
- Dynamic intent drift. In a long-running process, the system's effective intent can drift. A benign task can be steered toward a harmful outcome through steps that look safe individually but collectively bypass alignment.
For researchers, the lesson is that alignment can no longer be a one-time training step. We cannot simply teach a model to be good and expect it to stay good across an unbounded reasoning trace. We need safety mechanisms that are active and persistent throughout inference, "heartbeat" checks that re-verify intent at every step, keeping the refusal signal strong no matter how long the chain runs.
Beyond surface alignment: building robust safety
The findings mark a turning point. We have moved past the era when safety meant filtering bad words or training a model to recite a refusal template. We now face a reality where the architecture of intelligence itself, reasoning over long contexts, is a lever for bypassing safety.
Building the next generation of secure AI requires moving beyond surface alignment with a multi-layered strategy that targets the mechanics of refusal dilution:
- Continuous safety verification. Instead of checking intent only at the start, models need in-flight checks that re-evaluate internal state at regular intervals during reasoning, keeping the refusal signal above a critical threshold.
- Mechanistic interpretability as a defense. Move toward white-box monitoring. By understanding the specific attention heads and activation paths that maintain refusal behavior, developers can build systems that alert the moment those signals weaken.
- Inference-time guardrails. Deploy external monitoring that analyzes the hidden reasoning traces of LRMs. If a model drifts into a state where its attention is being hijacked by benign logic, the system should intervene before the harmful output is generated.
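The "in-flight check" idea in the list above can be sketched as a loop that re-reads a safety signal every few reasoning steps and halts when it falls below a threshold. This is a minimal sketch, not a production guardrail: `run_with_heartbeat`, `HeartbeatResult`, and `decaying_probe` are hypothetical names, and the probe here is a stand-in that simply decays with trace length. In a real system it would be a white-box readout of refusal-related activations or an external classifier over the trace.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class HeartbeatResult:
    halted_at: Optional[int]  # 1-based step where a check failed, or None if none failed
    steps_run: int            # number of reasoning steps consumed


def run_with_heartbeat(steps: Iterable[str],
                       probe: Callable[[List[str]], float],
                       interval: int,
                       threshold: float) -> HeartbeatResult:
    """Consume reasoning steps, re-checking the safety signal every `interval` steps."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    trace: List[str] = []
    for step_number, step in enumerate(steps, start=1):
        trace.append(step)
        # Heartbeat: re-verify on a fixed cadence, however long the trace has become.
        if step_number % interval == 0 and probe(trace) < threshold:
            return HeartbeatResult(halted_at=step_number, steps_run=step_number)
    return HeartbeatResult(halted_at=None, steps_run=len(trace))


def decaying_probe(trace: List[str]) -> float:
    """Stand-in probe (assumption): the signal weakens as the trace grows."""
    return 1.0 / (1.0 + 0.1 * len(trace))


if __name__ == "__main__":
    result = run_with_heartbeat(["benign puzzle step"] * 20, decaying_probe,
                                interval=5, threshold=0.45)
    print(result)
```

The check cadence is the design choice that matters: a shorter interval catches dilution sooner at the cost of more probe calls, which is the computational expense that makes monitoring long traces hard.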
The journey from Kojima's "Let's think step by step" to the discovery of refusal dilution shows that progress in AI is rarely a straight line. Every leap in capability brings a new class of risk. But by identifying these vulnerabilities early, and understanding the mechanistic reasons they exist, we can build AI that is not just smarter but fundamentally more resilient. The challenge for the next few years is clear: as our models learn to think more deeply, they must also learn to stay securely aligned with the human values they were built to serve.
Frequently asked questions
What is Chain-of-Thought Hijacking? Chain-of-Thought Hijacking is a black-box jailbreak attack on large reasoning models. It pads a harmful instruction with a long sequence of benign puzzle reasoning, which dilutes the model's internal safety signal and causes it to comply with the harmful request. It was introduced by Zhao et al. in arXiv:2510.26418.
Which AI models are vulnerable? In the original study, the attack succeeded against every frontier reasoning model tested: Grok 3 Mini (100%), Gemini 2.5 Pro (99%), ChatGPT o4-mini (94%), and Claude 4 Sonnet (94%) on the HarmBench benchmark.
What is refusal dilution? Refusal dilution is the phenomenon where a model's internal "refusal signal", a low-dimensional direction in its activation space, weakens as the reasoning trace grows longer. The harmful intent gets buried under a large volume of benign reasoning, and the safety mechanism fails to trigger.
Why does longer reasoning make models less safe? Two effects compound. Attention attenuation reduces the relative weight the model gives the original harmful prompt as it generates more tokens, and activation weakening lowers the intensity of the refusal signal itself. Together they let the malicious request slip through.
How can Chain-of-Thought Hijacking be prevented? Proposed defenses focus on continuous, in-flight safety verification rather than one-time training: re-checking intent at intervals during reasoning, using mechanistic interpretability to monitor refusal-related attention heads, and deploying inference-time guardrails that analyze the hidden reasoning trace and intervene before a harmful output is produced.
About the Author
Alessandro Pignati is Lead AI Security Researcher at NeuralTrust, where he leads research on AI and agentic security, advancing techniques to evaluate and secure large language models and autonomous AI systems. He specializes in adversarial machine learning, AI red teaming, LLM security, and AI safety, contributing to the development of secure and trustworthy AI.
NeuralTrust is an AI agent security platform, recognized in the Gartner 2025 Market Guide for Guardian Agents. It is headquartered in Barcelona and holds ISO 27001 certification.