arXiv:2606.11209v1 Announce Type: cross Abstract: Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a...
Read the full article at the source.
Comments (0)
No comments yet. Be the first to comment!