July 26, 2025
5 min read
Mels Dees
Anthropic develops AI agents that autonomously audit language models for misalignment, boosting scalability and security of AI systems.
Anthropic is developing AI agents that independently perform alignment audits on language models, significantly increasing the scalability and speed of security testing. This innovation could prove crucial for the reliable deployment of powerful AI systems such as Claude Opus 4.
In a recent publication, Anthropic introduces three types of AI agents that can investigate advanced language models for misaligned behavior without human intervention. The auditing agents are designed to automatically detect patterns, risks, and hidden goals in AI models. According to the company, they represent a crucial step in scaling up alignment research.
The three agents differ in their approach:
- Investigator Agent: Conducts open-ended research on a model, using interpretability and analysis tools to uncover unintended or hidden goals.
- Evaluation Agent: Performs structured behavioral evaluations, testing the model for consistency and reliability in sensitive situations.
- Red-Teaming Agent: Generates prompts specifically designed to provoke deviant or harmful behavior.

In test environments, Claude Opus 4, among other models, was subjected to these auditing processes. The agents proved capable of effectively auditing models that had been deliberately manipulated with misaligned instructions. By allowing different agents to work together within an overarching audit structure (such as a super-agent that integrates multiple signals), Anthropic significantly increased detection capabilities: in some cases, up to 42 percent of hidden behavioral characteristics were exposed.
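The aggregation idea behind such a super-agent can be illustrated with a toy sketch. This is not Anthropic's implementation; the agent names, `Finding` type, and majority-vote rule are all hypothetical, showing only how corroborating signals from independent auditors might be combined.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A suspected hidden behavior reported by one auditing agent (hypothetical)."""
    behavior: str
    confidence: float

def aggregate(findings_per_agent, min_votes=2):
    """Super-agent sketch: keep behaviors flagged by at least `min_votes` auditors."""
    votes = {}
    for findings in findings_per_agent:
        for f in findings:
            votes[f.behavior] = votes.get(f.behavior, 0) + 1
    return sorted(b for b, v in votes.items() if v >= min_votes)

# Illustrative outputs from the three agent types described above
investigator = [Finding("sycophancy", 0.8), Finding("hidden_goal", 0.6)]
evaluator    = [Finding("sycophancy", 0.9)]
red_teamer   = [Finding("hidden_goal", 0.7), Finding("jailbreak", 0.5)]

flagged = aggregate([investigator, evaluator, red_teamer])
print(flagged)  # behaviors corroborated by two or more agents
```

The design choice here mirrors the article's point: a single agent may miss a hidden goal, but requiring agreement across independent auditing strategies raises confidence in what is flagged.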