AI Market Logo
BTC $43,552.88 -0.46%
ETH $2,637.32 +1.23%
BNB $312.45 +0.87%
SOL $92.40 +1.16%
XRP $0.5234 -0.32%
ADA $0.8004 +3.54%
AVAX $32.11 +1.93%
DOT $19.37 -1.45%
MATIC $0.8923 +2.67%
LINK $14.56 +0.94%
HAIA $0.1250 +2.15%
BTC $43,552.88 -0.46%
ETH $2,637.32 +1.23%
BNB $312.45 +0.87%
SOL $92.40 +1.16%
XRP $0.5234 -0.32%
ADA $0.8004 +3.54%
AVAX $32.11 +1.93%
DOT $19.37 -1.45%
MATIC $0.8923 +2.67%
LINK $14.56 +0.94%
HAIA $0.1250 +2.15%
Every leading AI agent failed at least one security test during a massive red teaming competition
AI-security

Every leading AI agent failed at least one security test during a massive red teaming competition

A large-scale red teaming study found every top AI agent failed at least one security test, exposing critical vulnerabilities.

August 4, 2025
5 min read
Jonathan Kemper

A large-scale red teaming study found every top AI agent failed at least one security test, exposing critical vulnerabilities.

Every Leading AI Agent Failed at Least One Security Test During a Massive Red Teaming Competition

Between March 8 and April 6, 2025, nearly 2,000 participants launched 1.8 million attacks on AI agents in a large-scale red teaming competition. Over 62,000 of these attempts succeeded, resulting in serious policy violations including unauthorized data access, illegal financial transactions, and regulatory breaches. The event was organized by Gray Swan AI and hosted by the UK AI Security Institute, with support from leading AI labs such as OpenAI, Anthropic, and Google Deepmind. Their goal was to rigorously test the security of 22 advanced language models across 44 real-world scenarios.

100% of Agents Failed at Least One Test

The results revealed that every tested AI model was vulnerable, with each agent successfully attacked at least once in every tested category. On average, attacks succeeded 12.7% of the time. Researchers focused on four key behavior categories:
  • Confidentiality breaches
  • Conflicting objectives
  • Prohibited information
  • Prohibited actions
  • Indirect prompt injections—where malicious instructions are hidden in external sources like websites, PDFs, or emails—were especially effective, succeeding 27.1% of the time compared to just 5.7% for direct attacks.

    Claude Models Held Up Best, But None Are Secure

    Anthropic's Claude models demonstrated the most robustness, including the smaller and older 3.5 Haiku model. However, no model was immune to attacks. The study found little correlation between model size, raw capabilities, or longer inference times and actual security resilience. It is important to note that the tests used Claude 3.7, not the newer Claude 4, which features stricter safeguards. Attacks often transferred across models, meaning techniques that compromised the most secure systems frequently succeeded against others with minimal modification. For instance, a single prompt attack succeeded 58% of the time on Google Gemini 1.5 Flash, 50% on Gemini 2.0 Flash, and 45% on Gemini 1.5 Pro. Common attack strategies included:
  • System prompt overrides using tags like <system>
  • Simulated internal reasoning ("faux reasoning")
  • Fake session resets
  • Even the most secure model tested, Claude 3.7 Sonnet, was vulnerable to these methods.

    A New Benchmark for Ongoing Testing

    The competition’s findings led to the creation of the 'Agent Red Teaming' (ART) benchmark, a curated set of 4,700 high-quality attack prompts designed to help improve AI agent security. The researchers emphasized that these findings highlight fundamental weaknesses in current defenses and represent an urgent, realistic risk that must be addressed before broader deployment of AI agents. The ART benchmark will be maintained as a private leaderboard and updated regularly through future competitions to reflect evolving adversarial techniques.

    Rising Stakes for AI Agent Security

    As AI providers increasingly invest in agent-based systems, these security concerns grow more critical. OpenAI recently introduced agent functionality in ChatGPT, and Google’s models are tuned for agent workflows. Even OpenAI CEO Sam Altman has warned users against trusting ChatGPT Agent with sensitive or personal data.
    Summary:
  • Nearly 2,000 participants launched 1.8 million attacks on AI agents; all tested models failed at least one security test.
  • Indirect prompt injections were particularly effective, with a 27.1% success rate.
  • Attack techniques often transferred across models, revealing common vulnerabilities.
  • The ART benchmark was created to document attacks and support ongoing security testing.

  • Frequently Asked Questions (FAQ)

    AI Agent Security and Vulnerabilities

    Q: What was the main finding of the recent AI agent red teaming competition? A: The primary finding was that every leading AI agent tested failed at least one security test, highlighting widespread vulnerabilities in current AI defenses. Q: How many AI agents were tested, and what was the overall success rate of attacks? A: 22 advanced language models were tested, and overall, attacks succeeded 12.7% of the time, with nearly 2,000 participants launching 1.8 million attacks. Q: Which type of attack was most successful? A: Indirect prompt injections, where malicious instructions are hidden in external sources, were most successful, achieving a 27.1% success rate compared to direct attacks. Q: Were there any AI models that performed better than others in terms of security? A: Anthropic's Claude models, including the 3.5 Haiku model, showed more robustness, but no model was entirely immune to attacks. Q: Is there a correlation between AI model size and security resilience? A: The study found little correlation between model size, raw capabilities, or inference times and actual security resilience. Q: What are some common attack strategies used against AI agents? A: Common strategies include system prompt overrides, simulated internal reasoning ("faux reasoning"), and fake session resets. Q: What is the purpose of the 'Agent Red Teaming' (ART) benchmark? A: The ART benchmark is a curated set of attack prompts designed to help researchers and developers improve the security of AI agents by identifying and addressing fundamental weaknesses.

    Crypto Market AI's Take

    The findings from this extensive red teaming competition underscore the critical need for robust security measures as AI agents become more integrated into various industries, including finance. At AI Crypto Market, we prioritize the security of our AI-driven trading platform and AI agents. Our commitment to security is reflected in our use of enterprise-grade measures and adherence to stringent compliance protocols. Understanding these vulnerabilities is crucial for developing trustworthy AI systems, especially when dealing with sensitive financial data and transactions. Our focus on AI agents aims to leverage their capabilities while actively mitigating risks through continuous testing and security enhancements.

    More to Read:

  • AI-Driven Crypto Trading Tools Reshape Market Strategies in 2025
  • Top 3 AI Crypto Coins to Watch Before the Next Bull Run
  • Understanding Cryptocurrency Regulations in the United States
This article is based on the research paper by Zou et al. (2025) and reporting by Jonathan Kemper for THE DECODER. Source article