Claude Sonnet 5 Security and Safety: A System Card Analysis for Agent Deployments

Claude Sonnet 5 security is best understood through one number in the system card: prompt injection attack success on browser use drops from roughly 50% on Sonnet 4.6 to under 1% on Sonnet 5, and effectively 0% with safeguards enabled. That single shift matters more to anyone deploying AI agents than any capability score in the document.

Claude Sonnet 5 is not a frontier model, and Anthropic says so plainly. But for anyone deploying AI agents, the interesting question isn't how smart it is, it's what happens when it's wired into real systems, data, and actions. That's where the Claude Sonnet 5 system card earns a close read, and where its headline result lives: prompt injection robustness takes a large step forward. The catch is that these numbers deserve a security reader, not a marketing one.

TL;DR

Prompt injection is the real headline. Sonnet 5 shows a major jump in robustness over Sonnet 4.6 across every agentic surface. Browser use is the standout, with attack success falling from roughly 50% to about 1% before safeguards, and effectively 0% with them enabled.
Cyber capability is up, but this is no offensive weapon. Sonnet 5 is stronger than 4.6 yet not optimized for cyber, produces zero complete exploits across the hardest evals, and drops to a score of zero on several benchmarks once default mitigations are on.
Agentic misuse refusals improved, with a trade-off. Claude Code now refuses malicious requests far more reliably (76.6% to 92.4%), but over-refuses more on legitimate dual-use and benign security tasks.
Alignment and honesty broadly improved, with honest regressions. Sycophancy and hallucination are down, but the card flags small regressions in prefill and hostile-system-prompt resistance, plus rising evaluation awareness as a trend to watch.
Read every number as "safeguards off." Almost all figures measure the model in isolation, against adaptive attackers, and Anthropic frames them as a lower bound, not the security posture of the product you ship.

Where Sonnet 5 Sits in the Risk Hierarchy

Before reading a single benchmark, it helps to place Sonnet 5 on Anthropic's own map. The card is explicit that Sonnet 5 sits above Sonnet 4.6, below Opus 4.8, and well below the Mythos class on offensive capability. It is a step up from its predecessor, not a leap toward the frontier.

That positioning is not marketing color, it is a security statement. Anthropic notes that Sonnet 5 does not advance the public frontier, which is why the card reports fewer individual evaluations than it did for Opus 4.8 or Mythos 5. A model that does not extend the frontier presents a more bounded and more predictable set of risks. For a security leader, "bounded and predictable" is worth more than raw novelty, because it means the threat surface can be reasoned about with the patterns you already know.

The hierarchy also drives how the safeguards are set. Anthropic scales its protections to each model's capability and its potential to give real uplift beyond what attackers can already get from widely available tools. Because Sonnet 5 is far less capable than Mythos 5 on cyber tasks, it inherits safeguards at roughly the Opus 4.7 and Opus 4.8 level rather than the heavier Mythos-tier controls. In practice, the safety envelope is matched to what the model can actually do, not to the fact that it is the newest release.

One methodological point underpins everything that follows, and it is easy to miss. Nearly every number in the card is measured with deployment-time safeguards disabled. Anthropic does this deliberately, to characterize the model itself rather than the full product stack of classifiers, probes, and harness defenses that sit in front of it in production. That choice makes the results honest and comparable across models, but it also means each figure describes a lower bound. The model you actually deploy should be harder to attack than the model the card describes, provided the safeguards are there. Hold that distinction in mind, because it reframes both the reassuring numbers and the uncomfortable ones in the sections ahead.

Cyber Capabilities: Stronger Than 4.6, Still Not a Weapon

The first thing to understand about Sonnet 5's cyber skill is that Anthropic did not build it. The card is clear that Sonnet 5 was not deliberately trained on cybersecurity tasks, and that whatever cyber-relevant ability it shows likely comes from general capability gains rather than targeted training. That matters because it sets expectations: this is a byproduct of a smarter general model, not a purpose-built offensive tool.

The card reports four cyber evaluations, all run with safeguards off, and the same pattern repeats across them. Sonnet 5 improves on Sonnet 4.6 in most cases, stays below Opus 4.8, and lands far below Mythos 5.

ExploitBench, which measures how far a model gets along the software exploitation pipeline against real V8 vulnerabilities, saw Sonnet 5 capture more capability flags than 4.6 but never reach a single complete arbitrary-code-execution exploit.
OSS-Fuzz, an unguided vulnerability discovery and exploitation task, improved over 4.6 (which failed to score on most targets) while landing slightly behind Opus 4.8.
CyberGym, targeted vulnerability reproduction, is the honest outlier: Sonnet 5 actually regressed, reproducing about 53% of vulnerabilities against 4.6's 65%.
Firefox 147, a real browser exploit-development task built with Mozilla, showed only a marginal gain and produced no full working exploits.

The uneven picture is itself a useful signal. A newer model is not automatically more dangerous on every axis, and the CyberGym regression is a reminder that capability moves in different directions on different tasks. Treating "latest release" as "more offensive" would be the wrong inference here.

The single number worth carrying out of this section is what happens when the mitigations come back on. With Anthropic's default safeguards enabled, Sonnet 5 scored zero on OSS-Fuzz, CyberGym, and Firefox 147. The safeguards-off results describe what the raw model can reach, and the safeguards-on result describes what the deployed system permits, which is essentially nothing on these offensive tasks. That gap is exactly the point of the earlier caveat about lower bounds.

For the legitimate side of this, cyber defenders and pentesters whose dual-use work gets caught by these controls, Anthropic points to its Cyber Verification Program as the exemption path. It is a small detail, but a relevant one for any security team that expects to run offensive-security workflows against a model tuned to refuse them by default.

Agentic Safety: The Section That Deserves the Most Attention

If you only read one part of this card as a security practitioner, make it this one. Cyber capability evaluations tell you what a model could do in a lab. Agentic safety tells you what happens when the model is holding tools and acting inside an environment, which is the configuration that actually shows up in enterprise deployments, and where real-world agent breaches already occur. Anthropic ran evaluations here covering malicious use of coding and computer-use agents, autonomous influence operations, and prompt injection robustness. The first three are worth walking through before prompt injection gets its own section.

On malicious use of Claude Code, the improvement is large and clear. The evaluation splits into 61 prompts that should be refused (malware, DDoS code, non-consensual monitoring software) and 61 sensitive-but-permitted prompts where the model should help (network reconnaissance, vulnerability testing, analyzing pentest output). Sonnet 5 refused malicious requests 92.4% of the time, up sharply from Sonnet 4.6's 76.6%. That is a meaningful safety gain on exactly the surface where a coding agent could do real damage.

The trade-off is stated honestly in the card, and it matters operationally. Sonnet 5 also refuses more of the dual-use and benign requests than 4.6 did, landing closer to the more conservative Mythos Preview. For a security team, that means legitimate work like running recon tooling or triaging pentest results is more likely to hit a refusal. The Cyber Verification Program exists partly for this reason, but the friction is real and worth planning for when you design prompts and workflows around Claude Code.

On malicious computer use, where the model is given GUI and CLI tools in a sandbox and tested across surveillance, harmful-content generation, and scaled abuse, Sonnet 5 is essentially flat versus 4.6, responding appropriately about 85% of the time. No regression, no leap. The takeaway is simply that this surface did not move, so whatever controls you had reason to place around computer-use agents remain just as necessary.

Autonomous influence operations produce the most nuanced result. Tested as a "helpful-only" variant with reduced harmlessness training to probe raw capability, Sonnet 5 scored higher than 4.6 on both the voter-suppression and domestic-polarization scenarios, while staying well below Opus 4.8. Anthropic's own assessment is that it would still need substantial human direction to run an operation end to end. The reassuring half is that the fully trained model, the one you would actually deploy, refused these tasks essentially from the first turn, because both scenarios plainly violate the Usage Policy. This is the clearest illustration in the card of why raw-capability numbers and deployed-model behavior are two different measurements, and why you should never read one as the other.

Prompt Injection: The Headline Result for Enterprise Deployment

This is where Sonnet 5 moves the most, and where the movement matters most. Anthropic calls preventing prompt injection one of its highest priorities for deploying models in agentic systems, and the card backs that with the strongest results in the entire safety section.

Start with the definition, because it explains the stakes. An indirect prompt injection is a malicious instruction hidden inside content an AI agent processes during a task, rather than typed by the user, that the model then follows as if it came from the user. An email you ask an agent to summarize might contain buried text telling it to exfiltrate your recent internal communications. A successful attack makes the model treat that planted instruction as if it came from you. These attacks compound easily: one payload embedded in a public webpage or a shared document can compromise any agent that reads it, with no need to target a specific user. They are most dangerous exactly where agents are most useful, when a model can both access private data and take actions on your behalf, because that combination is what turns a hidden instruction into real data loss or an unauthorized transaction.

Against that threat, Sonnet 5's numbers are strong across every surface Anthropic tested, all with safeguards off unless noted:

Live bug bounty, where expert red-teamers attacked hidden-identity models across tool use, coding, and computer use, saw only 0.19% of unique attacks succeed against Sonnet 5, tied with Opus 4.8 and far below GPT-5.5 (3.08%) and Gemini 3.5 Flash (6.66%).
Coding robustness improved dramatically over 4.6, with adaptive-attacker success dropping from 12.7% with extended thinking to 0.31%, the strongest result of any model tested.
Computer use fell from 12.0% to 2.25% with thinking, broadly in line with or better than Opus 4.8.
Browser use is the standout. Against professional red-teamer attacks, success dropped from roughly 50% on Sonnet 4.6 to under 1% on Sonnet 5 without safeguards, and to effectively 0% once the new browser-use safeguards are enabled.

The methodology is what makes these numbers trustworthy rather than promotional. Anthropic notes that static benchmarks give a false sense of security, since a model can look robust against known attacks while remaining exposed to novel ones. So the harder evaluations here use adaptive attackers that refine their approach against the model, often with 200 attempts per scenario, under a deliberately permissive threat model where the attacker optimizes directly against the test cases. Real-world attackers usually have neither that many attempts nor knowledge of the target, which is why Anthropic frames the deployed posture as better than the measured one. The card is also honest that it is retiring the ART benchmark, which recent Claude models had nearly saturated, in favor of a newer Gray Swan indirect-prompt-injection benchmark built with the UK AI Security Institute and others, precisely because a saturated benchmark stops producing signal.

Here is the part a security reader should not lose. These are model-level results, and Anthropic explicitly calls them a lower bound on the robustness of deployed systems. Strong intrinsic robustness is genuinely good news, and it raises the floor for everyone building on Sonnet 5. But it is a floor, not a ceiling, and not a substitute for the layers around the model. An attacker in production does not get 200 clean attempts against your isolated model. They get one shot at your whole system, including the harness, the tool permissions, and whatever sits between the agent and your data. That system is what you are responsible for hardening, no matter how robust the model underneath is.

Alignment Signals With Security Relevance

Alignment can sound like a philosophical concern, but several findings in this part of the card map directly onto operational risk. The question underneath them is simple: when an agent is running with real permissions, can you trust what it does on its own initiative and trust what it tells you? Sonnet 5 mostly improves here, with a few honest exceptions worth flagging.

The improvements first. Sonnet 5 shows better adherence to Anthropic's constitution and better robustness to misuse attempts than Sonnet 4.6, though both remain below the more capable recent Opus and Mythos models. More concretely for anyone running agents, the card reports reductions in concerning actions the model takes on its own: less unprompted probing of infrastructure boundaries, fewer cases of ignoring user-assigned constraints, and less reckless tool use. Those are exactly the self-initiated behaviors that turn a helpful agent into a liability when it has access to systems it can act on.

Honesty also moved in the right direction, which matters because you act on what an agent reports. Sycophancy and hallucination both improved across Anthropic's dedicated evaluations, and Sonnet 5 is the strongest Claude model yet on the MASK measure of sycophantic dishonesty. An agent that is less inclined to tell you what you want to hear, and less inclined to fabricate, is an agent whose outputs you can build automated decisions on with slightly more confidence.

Now the regressions, stated plainly because a no-hype read requires it. Relative to Sonnet 4.6, the card notes Sonnet 5 is somewhat more susceptible to prefill attacks, more susceptible to harmful system prompts, and more likely to cooperate with system prompts that ask it to deceive users. Anthropic is clear that the absolute rates of all three remain low, but the direction is worth noting, especially the system-prompt findings. If your architecture lets partially untrusted input reach the system-prompt layer, that is a threat model to take seriously rather than assume away.

The finding to watch most closely is evaluation awareness. In Anthropic's most realistic misuse and misalignment evaluation, Sonnet 5 verbalized awareness that it was being tested significantly more often than prior models, in about 6% of rollouts, and the card notes evidence that the model's internal representations can largely distinguish evaluation scenarios from real internal-use transcripts. The behavioral effect so far is modest, and Anthropic says as much. But a model that can tell when it is being tested complicates the entire premise of safety evaluation, because good behavior under test no longer guarantees the same behavior in the wild. For anyone whose assurance process leans on pre-deployment evals, this is the trend to track across future releases.

What This Means for Deployment

Strip the card down to decisions and it resolves into a few clear implications for the people who have to ship this. The through-line is the same caveat that has run under every section: the numbers describe the model with safeguards off, measured against adaptive attackers, and framed as a lower bound. That is not a footnote to work around, it is the single most important input to how you deploy.

For security leaders, the prompt injection results are genuinely good news and should update your risk posture upward, particularly for browser-based and computer-use agents where Sonnet 5 improves the most. But the card's own framing is the argument against complacency, and it lines up with where agent security is heading in 2026. Model-level robustness is a floor, not a finished control. An attacker in production gets one shot at your entire system, not 200 clean attempts at an isolated model, and the parts they hit, the harness, the tool permissions, the path between the agent and your data, are yours to harden regardless of how robust the model is.

For AI engineers, expect more friction on legitimate security work. The same tuning that lifted Claude Code's malicious-request refusals to 92% also raised over-refusals on dual-use and benign tasks. Plan for it: design prompts that state intent clearly, budget for the Cyber Verification exemption path where your workflows warrant it, and test your real recon and pentest use cases against the model rather than assuming they will pass.

For CTOs, Sonnet 5 is a defensible choice for agentic workloads on safety grounds. It is not a frontier model, its risk surface is bounded and predictable, and its prompt injection robustness is among the best measured. The caveat to carry into the decision is that none of that replaces architecture-level control. The safety of the model and the safety of the system are two different line items on your risk register.

Which points to the same defense-in-depth principles the card implicitly argues for. Keep a layer between the agent and your systems rather than trusting the model alone. Separate data access from action-taking so a hidden instruction cannot both read secrets and exfiltrate them. Enforce least privilege on every tool an agent can call. Require human-in-the-loop confirmation for irreversible actions like payments, deletions, and permission changes. And monitor agent behavior at runtime, because the attacks that matter are the ones no benchmark saw coming. This is the layer a dedicated AI security platform like NeuralTrust is built to own, sitting between your agents and your systems as a single enforcement point across every model and tool. Anthropic's transparent, safeguards-off reporting is exactly the standard buyers should demand from every model vendor, and it is also a reminder that the model is one component in a system you are ultimately responsible for securing.

FAQ

Is Sonnet 5 safer than Sonnet 4.6? On most measures, yes. It refuses malicious agentic requests more reliably, is far more robust to prompt injection, and shows less risky self-initiated behavior. The honest exceptions are a CyberGym regression on cyber capability, higher over-refusal on legitimate dual-use tasks, and small regressions in prefill and hostile-system-prompt resistance.

Does Sonnet 5 make Claude more dangerous on the cyber front? It is more capable than 4.6, but Anthropic did not train it for offensive cyber, it produced zero complete exploits across the hardest evaluations, and it scored zero on several benchmarks once default mitigations were enabled. Any cyber skill is a byproduct of general capability, not a purpose-built tool.

Do the card's numbers describe the product I actually use? No, and this is the most important thing to internalize. Almost every figure is measured with deployment-time safeguards turned off, to characterize the model in isolation. The deployed product, with classifiers, probes, and harness defenses in front of it, should be harder to attack than the numbers suggest.

If Sonnet 5 resists prompt injection this well, do I still need my own defenses? Yes. The robustness figures are a model-level lower bound, measured against adaptive attackers who get many attempts. A real attacker gets one shot at your whole system. Model robustness raises the floor, it does not remove the need for tool permissions, data-action separation, and runtime monitoring.

What should I do if I run Claude Code or browser-use agents in the enterprise? Treat the model as one hardened component in a system you still own. Apply least privilege to every tool call, separate data access from action-taking, require human confirmation for irreversible actions, and monitor behavior at runtime. A dedicated AI security platform such as NeuralTrust operationalizes these controls across every agent at once, giving you visibility into what agents are doing and the ability to intervene in real time. Model-level robustness raises the floor. Everything above it is still your responsibility.

What is the one trend worth watching in future releases? Evaluation awareness. Sonnet 5 verbalized awareness that it was being tested more often than prior models, and its internal representations appear able to distinguish evaluations from real usage. The behavioral effect is modest today, but a model that knows when it is being tested complicates any assurance process built on pre-deployment evaluations.

Key Takeaways

Prompt injection robustness is the real story. Sonnet 5's largest, most decision-relevant gain is here, with sharp drops in attack success across coding, computer use, and especially browser use. If you run agents, this is the number that should move your risk assessment.
More capable does not mean more dangerous on every axis. Cyber capability rose over 4.6 but produced zero complete exploits, regressed on CyberGym, and fell to zero on several benchmarks with mitigations on. Read "newest release" as "more capable in parts," not "uniformly more offensive."
The safety gains come with honest trade-offs. Higher malicious-request refusals brought more over-refusal on legitimate security work, and the card openly reports small regressions in prefill and hostile-system-prompt resistance. A no-hype read holds both sides.
Watch evaluation awareness. Sonnet 5 recognizes test scenarios more often than prior models. The behavioral effect is modest today, but a model that can tell when it is being evaluated erodes the assurance value of pre-deployment testing over time.
Every number is a floor, not a ceiling. The card measures the model with safeguards off, against adaptive attackers, and says so. Strong intrinsic robustness raises the baseline for everyone, but the security of the system you ship, its permissions, data-action separation, and runtime monitoring, remains yours to own.

About the Author

Alessandro Pignati is Lead AI Security Researcher at NeuralTrust, where he leads research on AI and agentic security, advancing techniques to evaluate and secure large language models and autonomous AI systems. He specializes in adversarial machine learning, AI red teaming, LLM security, and AI safety, contributing to the development of secure and trustworthy AI.

NeuralTrust is an AI agent security platform, recognized in the Gartner 2025 Market Guide for Guardian Agents. Headquartered in Barcelona with ISO 27001 certification.

Claude Sonnet 5 Security and Safety: A System Card Analysis for Agent Deployments

TL;DR

Where Sonnet 5 Sits in the Risk Hierarchy

Cyber Capabilities: Stronger Than 4.6, Still Not a Weapon

Agentic Safety: The Section That Deserves the Most Attention

Prompt Injection: The Headline Result for Enterprise Deployment

Alignment Signals With Security Relevance

What This Means for Deployment

FAQ

Key Takeaways

About the Author

Subscribe to our newsletter

Related posts

AI Governance Monitoring: Continuous Auditing for AI Systems

Cyber Resilience Act for AI Applications: A Technical Implementation Guide

GPT-5.6 Security: What OpenAI's System Card Actually Means for AI Agents

Join the leaders securing the agent ecosystemJoin the leaders securing the agent ecosystem