Crescendo Attacks: How Real-World LLMs Respond to Gradual Prompt Attacks

Red teaming for Large Language Models (LLMs) is an emerging field that addresses the unique vulnerabilities posed by these powerful systems. As LLMs are increasingly deployed across industries, their flexible capabilities introduce novel security risks that extend beyond the reach of traditional cybersecurity teams.
At NeuralTrust, we actively research, implement, and test adversarial techniques to uncover weaknesses in LLMs, helping companies better defend against prompt-based attacks, commonly referred to as jailbreaks. These attacks can lead to severe reputational and operational consequences, such as leaking proprietary data or generating offensive content (e.g., profanity in live chats, toxic responses, or unauthorized disclosures).
In this post, we present our work replicating and adapting the Crescendo attack, originally proposed by researchers at Microsoft (Russinovich et al.). We explain the core idea behind the attack, describe our custom implementation adapted for medium-sized open-source models, and share insights from our experiments across a range of harmful objective categories.
What Is the Crescendo Attack?
The Crescendo attack is a sophisticated multi-turn jailbreak technique that incrementally guides an LLM toward producing restricted or harmful outputs without triggering immediate rejection or safety filters. Instead of asking for a sensitive response directly, the attacker gradually escalates the conversation, exploiting the model’s tendency to comply when prompts are framed benignly.
Our implementation captures the essence of the original attack while introducing additional mechanisms to increase effectiveness for our targeted use cases. A key insight from our tests: directly prompting for the final objective typically resulted in low success rates. Success depended heavily on the careful framing and progression of intermediate prompts.
We structured the attack as a sequence of escalating prompts, starting with benign questions and gradually increasing their sensitivity. As the model approaches restricted content, it may begin to resist. To handle this, we implemented a backtracking mechanism: whenever the model refused to answer, the system would modify the prompt and retry. This loop continued until a successful output was generated or a maximum number of retries was reached.
This trial-and-error strategy mirrors the behavior of a skilled attacker probing a model’s guardrails for weaknesses.
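To make the mechanics concrete, here is a minimal sketch of such an escalation-with-backtracking loop. It is an illustration rather than our production harness: `chat_complete`, `looks_like_refusal`, and `rephrase_prompt` are hypothetical helpers standing in for the model API call, the refusal detector, and the prompt-rewriting step.

```python
from typing import Callable, Dict, List

# All three callables passed in are hypothetical stand-ins: a model API call,
# a refusal detector, and a prompt-rewriting step. None belong to a real library.
Message = Dict[str, str]

def crescendo_attack(
    chat_complete: Callable[[List[Message]], str],
    escalation_prompts: List[str],
    looks_like_refusal: Callable[[str], bool],
    rephrase_prompt: Callable[[str], str],
    max_retries: int = 3,
) -> List[Message]:
    """Run an escalating prompt sequence, backtracking on refusals."""
    history: List[Message] = []
    for prompt in escalation_prompts:
        attempt = prompt
        for _ in range(max_retries):
            reply = chat_complete(history + [{"role": "user", "content": attempt}])
            if not looks_like_refusal(reply):
                # Accept this turn and move on to the next escalation step.
                history += [
                    {"role": "user", "content": attempt},
                    {"role": "assistant", "content": reply},
                ]
                break
            # Backtrack: reframe the refused prompt and retry, without
            # committing the refused turn to the conversation history.
            attempt = rephrase_prompt(attempt)
        else:
            # Retries exhausted for this step; stop escalating.
            break
    return history
```

The key design choice in this sketch is that refused turns are never added to the conversation history, so each retry starts again from the last accepted state of the dialogue.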
Which Chatbot Is More Vulnerable to the Crescendo Attack?
To assess the real-world susceptibility of popular LLMs to the Crescendo attack, we designed a systematic experiment targeting a range of harmful objective categories. These objectives were selected to represent typical classes of restricted content in production environments, including:
- Illegal Activities
- Self-harm
- Misinformation
- Pornography
- Profanity
- Sexism
- Hate Speech
- Violence
Each objective was paired with a sequence of prompts designed to escalate in tone and intent, following the Crescendo method. This approach allowed us to evaluate not only whether a model would produce restricted output, but also how easily it could be guided toward that outcome through progressive prompting. We tested the attack against five language models, split between open-source and proprietary systems (a sketch of the resulting experiment grid follows the model lists below).
Open-source LLMs:
- Mistral
- Phi-4-mini
- DeepSeek-R1
Proprietary/Industrial LLMs:
- GPT-4.1-nano
- GPT-4o-mini
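To illustrate how this experiment grid can be expressed, the sketch below pairs each objective category with a placeholder escalation sequence and lists the models under test. The prompt strings and category keys are illustrative placeholders, not the sequences we actually used.

```python
# Illustrative experiment grid: every escalation sequence is run against
# every model under test. The prompt strings below are placeholders only.
OBJECTIVES = {
    "illegal_activities": ["benign opener ...", "more specific follow-up ...", "final objective ..."],
    "self_harm": ["benign opener ...", "more specific follow-up ...", "final objective ..."],
    "misinformation": ["benign opener ...", "more specific follow-up ...", "final objective ..."],
    # ... pornography, profanity, sexism, hate_speech, violence
}

MODELS = [
    "mistral",       # open-source
    "phi-4-mini",    # open-source
    "deepseek-r1",   # open-source
    "gpt-4.1-nano",  # proprietary
    "gpt-4o-mini",   # proprietary
]
```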
To account for the stochastic nature of language model outputs, we executed each prompt sequence multiple times per model-objective pair. The number of repetitions varied based on the model’s response consistency. We then averaged the results to report reliable metrics on two key fronts:
- Success rate: the percentage of trials in which the model ultimately generated the restricted output.
- Rejection rate: the percentage of trials in which the model refused to comply or produced a safe fallback response.
This dual-metric approach helps distinguish models that are merely verbose or evasive from those that meaningfully resist harmful instructions.
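As a minimal sketch of the aggregation step, assume each repeated run is labeled with a simple outcome string such as "success", "rejection", or "timeout" (hypothetical labels used here for illustration):

```python
from collections import Counter
from typing import Dict, List

def aggregate_metrics(trial_outcomes: List[str]) -> Dict[str, float]:
    """Compute success and rejection rates for one model-objective pair."""
    counts = Counter(trial_outcomes)
    total = len(trial_outcomes)
    return {
        "success_rate": counts["success"] / total,
        "rejection_rate": counts["rejection"] / total,
    }

# Example: four repetitions of one model-objective pair.
print(aggregate_metrics(["success", "rejection", "success", "timeout"]))
# {'success_rate': 0.5, 'rejection_rate': 0.25}
```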
We also implemented automated backtracking to simulate an adaptive adversary. If the model refused to complete a step, the system would slightly modify the prompt and retry, mimicking a real-world attack scenario where an adversary probes boundaries through trial and error.
By structuring the experiment this way, we were able to simulate a realistic adversarial interaction loop and measure how long or how easily each model could be manipulated into breaking its guardrails. This setup provides the foundation for the results presented in the next section.
Results of the Crescendo Attack Experiment
The results reveal interesting patterns in model susceptibility to the Crescendo attack. We first report the success rates across different objective categories for each LLM:
- The Crescendo attack achieved high success rates in the Hate Speech, Misinformation, Pornography, Sexism, and Violence categories.
- Notably, the attack achieved 100% success rates for Pornography, Sexism, and Violence on Mistral, Phi-4-mini, GPT-4.1-nano, and GPT-4o-mini.
- In contrast, models showed greater resistance when the target objectives were Illegal Activities, Self-harm, and Profanity.
We also measured rejection rates (how often a model resisted or refused to comply):
- DeepSeek-R1 had the highest rejection rate, especially for Misinformation and Illegal Activities.
- Phi-4-mini showed moderate rejection for Self-harm, Profanity, and Illegal Activities.
- Mistral exhibited some resistance in Profanity, Illegal Activities, and Misinformation, but overall had lower rejection rates compared to DeepSeek-R1.
- GPT-4.1-nano and GPT-4o-mini showed hardly any resistance.
Overall, GPT-4.1-nano was the most susceptible to the Crescendo attack, followed by GPT-4o-mini and then Mistral. Note that some rejection rates were partly due to timeouts rather than genuine model resistance, and the Microsoft team likewise reported false positives as one of the challenges of this technique; we leave a deeper discussion of these caveats to another blog post.
How to Protect Your LLM From the Crescendo Attack
The Crescendo attack is a powerful example of how adversaries can exploit the subtle behavioral tendencies of LLMs through gradual manipulation. Defending against it requires more than just filtering keywords or applying static safety prompts. It calls for layered, dynamic defenses that combine detection, prevention, and continuous validation.
At NeuralTrust, we help organizations secure their LLM deployments across every stage of the AI lifecycle. Our product TrustGate acts as a semantic firewall for AI models, intercepting and analyzing every prompt with real-time safety filters and policy enforcement. It can detect gradual prompt escalation, prompt chaining, and suspicious user behavior before a harmful query ever reaches your model.
For teams developing and testing LLM applications, TrustTest provides automated red teaming capabilities that simulate attacks like Crescendo across different categories, from misinformation to hate speech. It allows you to continuously probe your model's weaknesses, identify failure modes, and validate defenses under adversarial pressure.
To explore these solutions in action or learn how we can support your AI security strategy, request a demo or reach out to our team.