AI agents can't handle your IT work yet. Here's the proof.
Frontier models score below 50% on real enterprise IT tasks, revealing a vast gap between hype and capability.
Summary
- ITBench-AA is the first benchmark to measure how well frontier AI models perform on actual enterprise IT operations tasks rather than on abstract test problems.
- Leading models (GPT-4, Claude, Gemini) all scored below 50%, with the best reaching only 48% accuracy on real IT scenarios.
- The benchmark tests practical workflows: ticket triage, log analysis, configuration management, and incident response. These are the tasks IT teams delegate first.
- The results expose a critical problem: agentic AI is being deployed into production before anyone can reliably measure what it can actually do.
- For IT leaders, this is permission to be sceptical about vendor claims and to test agents rigorously on YOUR tasks before trusting them with critical systems.
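
For teams wondering what "testing agents on your own tasks" might look like in practice, here is a minimal sketch in Python. It is not ITBench-AA's methodology. The `Task` class, the `pass_rate` function, the stub agent, and the two triage cases are all hypothetical, made up for illustration. The idea is simply to pair each task from your own environment with an explicit pass/fail check and report the fraction the agent gets right.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One task from your own environment (hypothetical structure)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the agent's answer is acceptable


def pass_rate(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose output passes its check.

    An agent that raises an exception on a task is counted as failing it.
    """
    if not tasks:
        raise ValueError("need at least one task")
    passed = 0
    for task in tasks:
        try:
            ok = task.check(agent(task.prompt))
        except Exception:
            ok = False
        passed += int(ok)
    return passed / len(tasks)


# Stand-in for a real agent call; swap in your vendor's client here.
def stub_agent(prompt: str) -> str:
    return "P1" if "outage" in prompt else "unknown"


# Illustrative cases only; real ones should come from your own ticket history.
tasks = [
    Task("triage-outage",
         "Classify priority: site-wide outage reported",
         lambda out: out == "P1"),
    Task("triage-password",
         "Classify priority: single user password reset",
         lambda out: out == "P4"),
]

rate = pass_rate(stub_agent, tasks)
print(f"pass rate: {rate:.0%}")
```

Keeping the checks as plain functions means the same task list can be rerun unchanged against each vendor's agent, so the scores are comparable.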






