The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert

Jun 3, 2026 at 04:00

arXiv:2606.03305v1 (announce type: new)

Abstract: Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training...

Read the full article at the source.
