Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

AI & ML · 2 min read · via arXiv

arXiv:2605.24216v1

Abstract: Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from
