arXiv:2606.05434v1 (announce type: cross)
Abstract: Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO...