For decades, artificial intelligence has been tested in a vacuum, pitted against humans on one-off, task-specific benchmarks. That approach fails to capture AI's true impact.
In real-life scenarios, where AI interacts with many people over extended periods, its performance often falls short of its benchmark results. Take medical radiology: highly ranked models speed up initial scan reads, yet falter in the complex, collaborative process of patient care.
What’s needed is a shift towards Human–AI, Context-Specific Evaluation (HAIC) benchmarks. These would assess how well AI functions within human teams and workflows over longer periods, rather than just its isolated performance on static tests.
This approach could help to bridge the gap between technological promise and real-world outcomes, reducing wasted resources and restoring public trust in AI by ensuring that models are truly ready for deployment.
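To make the distinction concrete, here is a minimal Python sketch of how a HAIC-style metric might sit alongside a conventional one. Everything in it, from the `Session` record to the `haic_score` weighting, is a hypothetical illustration rather than a published benchmark: the point is the shape of the measurement, not the particular numbers.

```python
"""Minimal sketch of a HAIC-style longitudinal evaluation.

Every name here (Session, haic_score, the workflow measures and
weights) is a hypothetical illustration, not an established API
or a published benchmark.
"""
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    """One human-AI interaction within a longer clinical workflow."""
    model_accuracy: float      # static, task-level correctness
    time_saved_minutes: float  # did the AI actually speed the human up?
    override_rate: float       # fraction of AI outputs the human rejected


def static_score(sessions: list[Session]) -> float:
    """Conventional benchmark: average task accuracy in isolation."""
    return mean(s.model_accuracy for s in sessions)


def haic_score(sessions: list[Session]) -> float:
    """Context-specific score: discount accuracy by how poorly the
    model fit the human workflow (arbitrary illustrative weighting)."""
    return mean(
        s.model_accuracy
        * (1 - s.override_rate)                # penalise rejected outputs
        * min(1.0, s.time_saved_minutes / 10)  # penalise slowdowns
        for s in sessions
    )


# Simulated radiology deployment: task accuracy stays high, but
# clinicians override the model more as cases grow more collaborative.
sessions = [
    Session(model_accuracy=0.95, time_saved_minutes=12, override_rate=0.05),
    Session(model_accuracy=0.94, time_saved_minutes=6, override_rate=0.30),
    Session(model_accuracy=0.96, time_saved_minutes=3, override_rate=0.55),
]

print(f"Static benchmark score: {static_score(sessions):.2f}")  # ~0.95, looks excellent
print(f"HAIC-style score:       {haic_score(sessions):.2f}")    # ~0.48, exposes the gap
```

In this toy deployment the static column stays high while the context-specific column collapses, which is exactly the pattern the radiology example describes.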