FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

AI & ML · May 29, 2026 · via arXiv

arXiv:2605.29001v1. Announce Type: new. Abstract: A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%). Removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it, and these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset th
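
The abstract does not spell out the unanimity rule, so the following is only a minimal sketch of one plausible reading: a paraphrase group is flagged as suspect when at least `min_models` models answer the original problem correctly but fail its paraphrase. The function name `flag_suspect_paraphrases`, the data layout, and the toy results are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: results[group_id][model_name] =
#   (correct_on_original, correct_on_paraphrase)
Results = Dict[str, Dict[str, Tuple[bool, bool]]]


def flag_suspect_paraphrases(results: Results, min_models: int) -> List[str]:
    """Return group ids where at least `min_models` models solve the
    original problem but fail the paraphrase (assumed flagging rule)."""
    flagged = []
    for group_id, per_model in results.items():
        # Count models that flip from correct to incorrect on the paraphrase.
        flips = sum(
            1
            for orig_ok, para_ok in per_model.values()
            if orig_ok and not para_ok
        )
        if flips >= min_models:
            flagged.append(group_id)
    return sorted(flagged)


# Toy data for four placeholder models; not real benchmark results.
toy: Results = {
    "g1": {
        "m1": (True, True),
        "m2": (True, True),
        "m3": (True, False),
        "m4": (False, False),
    },
    "g2": {
        "m1": (True, False),
        "m2": (True, False),
        "m3": (True, False),
        "m4": (True, True),
    },
}

# Mirrors the ">= 3/4 models" threshold quoted for MathCheck.
suspects = flag_suspect_paraphrases(toy, min_models=3)
```

Under this assumed rule, a group that only one model stumbles on is treated as a model weakness, while a group that most models stumble on is treated as a likely paraphrase defect and sent for manual review.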
