August 4, 2025
5 min read
Jonathan Kemper
A large-scale red teaming study found every top AI agent failed at least one security test, exposing critical vulnerabilities.
Every Leading AI Agent Failed at Least One Security Test During a Massive Red Teaming Competition
Between March 8 and April 6, 2025, nearly 2,000 participants launched 1.8 million attacks on AI agents in a large-scale red teaming competition. Over 62,000 of these attempts succeeded, resulting in serious policy violations including unauthorized data access, illegal financial transactions, and regulatory breaches. The event was organized by Gray Swan AI and hosted by the UK AI Security Institute, with support from leading AI labs such as OpenAI, Anthropic, and Google Deepmind. Their goal was to rigorously test the security of 22 advanced language models across 44 real-world scenarios.100% of Agents Failed at Least One Test
The results revealed that every tested AI model was vulnerable, with each agent successfully attacked at least once in every tested category. On average, attacks succeeded 12.7% of the time. Researchers focused on four key behavior categories:- Confidentiality breaches
- Conflicting objectives
- Prohibited information
- Prohibited actions Indirect prompt injections—where malicious instructions are hidden in external sources like websites, PDFs, or emails—were especially effective, succeeding 27.1% of the time compared to just 5.7% for direct attacks.
- System prompt overrides using tags like
<system>
- Simulated internal reasoning ("faux reasoning")
- Fake session resets Even the most secure model tested, Claude 3.7 Sonnet, was vulnerable to these methods.
- Nearly 2,000 participants launched 1.8 million attacks on AI agents; all tested models failed at least one security test.
- Indirect prompt injections were particularly effective, with a 27.1% success rate.
- Attack techniques often transferred across models, revealing common vulnerabilities.
- The ART benchmark was created to document attacks and support ongoing security testing.
- AI-Driven Crypto Trading Tools Reshape Market Strategies in 2025
- Top 3 AI Crypto Coins to Watch Before the Next Bull Run
- Understanding Cryptocurrency Regulations in the United States
Claude Models Held Up Best, But None Are Secure
Anthropic's Claude models demonstrated the most robustness, including the smaller and older 3.5 Haiku model. However, no model was immune to attacks. The study found little correlation between model size, raw capabilities, or longer inference times and actual security resilience. It is important to note that the tests used Claude 3.7, not the newer Claude 4, which features stricter safeguards. Attacks often transferred across models, meaning techniques that compromised the most secure systems frequently succeeded against others with minimal modification. For instance, a single prompt attack succeeded 58% of the time on Google Gemini 1.5 Flash, 50% on Gemini 2.0 Flash, and 45% on Gemini 1.5 Pro. Common attack strategies included:A New Benchmark for Ongoing Testing
The competition’s findings led to the creation of the 'Agent Red Teaming' (ART) benchmark, a curated set of 4,700 high-quality attack prompts designed to help improve AI agent security. The researchers emphasized that these findings highlight fundamental weaknesses in current defenses and represent an urgent, realistic risk that must be addressed before broader deployment of AI agents. The ART benchmark will be maintained as a private leaderboard and updated regularly through future competitions to reflect evolving adversarial techniques.Rising Stakes for AI Agent Security
As AI providers increasingly invest in agent-based systems, these security concerns grow more critical. OpenAI recently introduced agent functionality in ChatGPT, and Google’s models are tuned for agent workflows. Even OpenAI CEO Sam Altman has warned users against trusting ChatGPT Agent with sensitive or personal data.Summary: