Interactive Benchmarks

arXiv:2603.04737v1. Abstract: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to actively acquire information is important for assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive

