arXiv:2606.03198v1
Announce Type: cross
Abstract: Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes...