FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

arXiv:2605.29001v1

Abstract: A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%). Removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset th
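The agreement thresholds quoted in the abstract (at least 3 of 4 models, or at least 6 of 9) can be illustrated with a minimal Python sketch. The abstract does not spell out the exact criterion, so this assumes a paraphrase is flagged for audit when at least `min_agree` models fail on it. The function name `flag_suspect_paraphrases`, the data layout, and the toy values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def flag_suspect_paraphrases(
    failures: Dict[str, List[bool]],
    min_agree: int,
) -> List[str]:
    """Return ids of paraphrases on which at least `min_agree` models failed.

    `failures[pid][i]` is True when model i answered paraphrase `pid` wrongly.
    This failure-count rule is an assumption; the source only states the
    thresholds (>= 3/4 and >= 6/9), not how agreement is computed.
    """
    return [
        pid
        for pid, per_model in failures.items()
        if sum(per_model) >= min_agree
    ]


# Hypothetical toy data: three paraphrases scored by four models.
toy_failures = {
    "p1": [True, True, True, False],    # 3 of 4 models fail
    "p2": [False, True, False, False],  # 1 of 4 models fails
    "p3": [True, True, True, True],     # 4 of 4 models fail
}

flagged = flag_suspect_paraphrases(toy_failures, min_agree=3)

# Error rate reported in the abstract: 4 incorrect paraphrases in 129 groups.
error_rate = 4 / 129

print(flagged)
print(f"{error_rate:.1%}")
```

A flagged item in this sketch is only a candidate for manual review: several models failing together may point to a faulty paraphrase, but it may equally point to a genuinely hard problem.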
