August 4, 2025
Gary Marcus
Despite hype, AI agents remain unreliable and error-prone, falling short of expectations in 2025’s tech landscape.
Where chatbots merely answer your queries, AI agents are supposed to act on your behalf, actively managing your life and business. They might shop for you, book travel, organize your calendar, summarize news, track finances, maintain databases, or even manage software systems. Ultimately, agents should be able to perform any cognitive task you might ask a human to do.
By the end of 2024, major players like Google, OpenAI, and Anthropic announced their first AI agents were imminent. Anthropic’s Dario Amodei predicted that 2025 would be the year AI could perform at the level of a PhD student or early professional. OpenAI’s CEO Sam Altman expressed optimism that AI agents would “join the workforce” and materially impact company output. Google introduced Project Astra, aiming to build a universal AI assistant.
Tech columnist Kevin Roose of the New York Times was cautiously optimistic, but the question remains: do these agents actually work?
My own prediction at the start of 2025 was that AI agents would be hyped but remain unreliable except in narrow use cases. With five months left in the year, this assessment largely holds true.
All major companies have introduced agents, but none are reliably effective beyond limited scenarios. Google’s Astra remains in beta with restricted access. OpenAI released its ChatGPT agent, which operates by proactively selecting from a toolbox of skills to complete tasks using its own virtual computer. It can interact with calendars, browse text and visual websites, download and manipulate files, and adapt its approach for speed and accuracy.
However, ChatGPT agent is still early-stage and prone to mistakes, which introduces risks, especially when working directly with user data. These errors have significantly limited the practical utility of current AI agents, falling short of the promised breakthroughs.
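The "toolbox of skills" pattern described above can be sketched roughly as follows. This is a minimal illustration of the general agent-loop idea, not OpenAI's actual implementation; all tool names and the dispatch logic are hypothetical:

```python
# Hypothetical sketch of a tool-using agent: the agent walks through a
# plan of (tool_name, argument) steps, dispatches each step to the
# matching tool, and collects the results. Real agents choose tools
# dynamically with a language model; here the plan is fixed for clarity.

def browse(url: str) -> str:
    # Stand-in for a web-browsing skill.
    return f"contents of {url}"

def read_calendar(day: str) -> str:
    # Stand-in for a calendar-reading skill.
    return f"events on {day}"

TOOLS = {"browse": browse, "read_calendar": read_calendar}

def run_agent(plan):
    """Execute each planned step with its tool; record errors instead of crashing."""
    results = []
    for tool_name, arg in plan:
        tool = TOOLS.get(tool_name)
        if tool is None:
            results.append(f"error: unknown tool {tool_name}")
            continue
        results.append(tool(arg))
    return results

print(run_agent([("read_calendar", "Monday"), ("browse", "example.com")]))
```

Note that every step in the loop is a point of failure: a wrong tool choice or a bad argument propagates into every later step, which is one reason agent errors are hard to contain.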
By March 2025, it was clear that hype and reality were diverging. For example, the AI agent Manus, once heavily hyped, failed to meet expectations. Reports have since accumulated highlighting failures and technical debt, especially AI coding agents producing hard-to-debug, repetitive code. As MIT Professor Armando Solar-Lezama put it, AI is like a new credit card that lets us accumulate technical debt faster than ever before.
Penrose.com created a benchmark for basic accounting tasks using real data and found that AI errors tend to compound over time, undermining reliability.
Hallucinations—AI confidently generating false or misleading information—remain a persistent and serious problem.
Industry insiders acknowledge a “demo to reality” gap that will likely persist for years. Recent analyses show AI agents failing 70% of tasks in some benchmarks, a failure rate that threatens the industry’s credibility.
Even enthusiastic users struggle to find practical use cases for ChatGPT agent’s limited capabilities.
Security is another major concern. AI agents’ superficial understanding makes them vulnerable to cyberattacks. Research from Carnegie Mellon University revealed that even the most secure systems were successfully attacked over 1% of the time, a dangerously high rate for critical applications.
These flaws are unsurprising because current AI systems rely on mimicry rather than deep understanding. They can imitate human language and task completion but lack true comprehension of their actions, leading to hallucinations and errors. Multi-step tasks multiply the chances for mistakes, sometimes resulting in catastrophic failures.
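The compounding effect of multi-step tasks is easy to see with a little arithmetic. If each step succeeds independently with probability p, a task of n steps succeeds with probability p^n. The 95% per-step figure below is illustrative, not a measured rate:

```python
# If each step of a task succeeds independently with probability p_step,
# the whole chain succeeds only if every step does: p_step ** n_steps.
def chain_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

print(round(chain_success(0.95, 1), 3))   # one step: 0.95
print(round(chain_success(0.95, 20), 3))  # twenty steps: roughly 0.358
```

Even a per-step reliability that sounds impressive in a demo collapses over a realistic multi-step workflow: at 95% per step, a twenty-step task fails almost two times out of three.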
I do not expect AI agents to disappear; eventually, they could become invaluable time-savers. However, I doubt that large language models (LLMs) alone will provide the reliable foundation needed.
Recent reports from The Information and new academic papers confirm that pure scaling of LLMs yields diminishing returns and is unlikely to achieve artificial general intelligence (AGI) breakthroughs.
Without integrating neurosymbolic AI and rich world models—approaches I have advocated for years—reliable AI agents remain out of reach. Robust, trustworthy AI requires more than just bigger models; it needs fundamentally different architectures.
The current focus on LLMs as a shortcut has been a massive intellectual and economic mistake. Despite nearly a trillion dollars invested, these systems cannot reliably manage calendars, finances, or perform at the promised professional levels.
Yet, investment continues to pour into generative AI, while alternative approaches like neurosymbolic AI remain severely underfunded, receiving perhaps less than 1% of total AI investment.
Perhaps after enough failures, reality will set in, and the industry will reconsider its approach.
Over the years, I have anticipated many trends in AI. For those interested in thoughtful analysis, I invite you to subscribe and support this work.
Gary Marcus founded a machine learning company acquired by Uber and is the author of six books on natural and artificial intelligence.
Source: Originally published at garymarcus.substack.com on August 3, 2025.
Frequently Asked Questions (FAQ)
AI Agent Reliability and Functionality
Q: What are AI agents, and how do they differ from chatbots?
A: AI agents are designed to perform tasks and take actions on your behalf, actively managing aspects of your life or business. This goes beyond chatbots, which primarily answer queries. AI agents can handle tasks like shopping, booking travel, managing calendars, summarizing information, tracking finances, and even managing software systems.

Q: Have AI agents been successful in performing tasks beyond narrow use cases?
A: Currently, most AI agents are still in their early stages and are not reliably effective beyond limited scenarios. Despite significant investment and announcements from major tech companies, practical utility has been significantly limited by errors and a gap between demo capabilities and real-world performance.

Q: What are the main challenges hindering the reliability of current AI agents?
A: Key challenges include:
- Errors and Hallucinations: AI agents are prone to mistakes and to confidently generating false or misleading information.
- "Demo to Reality" Gap: There is a persistent gap between advertised capabilities and actual performance.
- Technical Debt: Issues like AI coding agents producing hard-to-debug code contribute to unreliability.
- Lack of Deep Understanding: Current systems often mimic rather than truly comprehend, leading to errors in multi-step tasks.
- Security Vulnerabilities: Superficial understanding makes them susceptible to cyberattacks.

Q: What is the "demo to reality" gap in AI agents?
A: This refers to the discrepancy between the impressive demonstrations of AI agent capabilities presented by companies and their actual, often less reliable, performance in real-world applications. This gap is expected to persist for years.

Q: Why are AI agents prone to errors and hallucinations?
A: These issues stem from current AI systems' reliance on mimicry rather than deep understanding. They can imitate human behavior but lack genuine comprehension, leading to factual inaccuracies and errors, especially in complex or multi-step tasks.

Q: What is the typical failure rate for AI agents in benchmarks?
A: Recent analyses indicate that AI agents can fail 70% of tasks in some benchmarks, a rate that significantly impacts their credibility and practical utility.

Q: What are the security concerns associated with AI agents?
A: AI agents have a superficial understanding, making them vulnerable to cyberattacks. Even secure systems have shown a non-negligible attack success rate, which is a significant concern for critical applications.