arXiv:2606.05718v1
Announce Type: cross

Abstract: On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or...
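The OPD loop described in the abstract can be illustrated with a toy sketch. The abstract does not state the paper's objective, so the per-token reverse KL used here is an assumption, chosen because it is a common choice for on-policy distillation. The names `opd_loss`, `toy_student`, and `toy_teacher` are hypothetical, and the fixed three-token distributions stand in for real models.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution (assumed objective)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def opd_loss(student, teacher, prompt, steps, rng):
    """Sample a trajectory from the student and score each step against the teacher."""
    tokens = list(prompt)
    total = 0.0
    for _ in range(steps):
        s = student(tokens)
        t = teacher(tokens)
        total += reverse_kl(s, t)
        # On-policy: the next token is drawn from the student, not the teacher.
        next_token = rng.choices(range(len(s)), weights=s, k=1)[0]
        tokens.append(next_token)
    return total / steps, tokens[len(prompt):]


# Hypothetical stand-ins for real models: each maps a token prefix to a
# next-token distribution over a three-token vocabulary.
def toy_student(tokens):
    return softmax([0.0, 1.0, 0.0])


def toy_teacher(tokens):
    return softmax([0.0, 0.0, 2.0])


if __name__ == "__main__":
    loss, trajectory = opd_loss(toy_student, toy_teacher, [0], 4, random.Random(0))
    print("mean reverse KL:", loss)
    print("student-sampled tokens:", trajectory)
```

In a real setup the loss would be differentiated with respect to the student's parameters. A privileged teacher, as the abstract describes it, would additionally condition on training-time-only signals that the student never sees.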