arXiv:2606.05718v1 Announce Type: cross Abstract: On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or...
Læs hele artiklen hos kilden.
Kommentarer (0)
Ingen kommentarer ennå. Bli den første til å kommentere!