
<em>Simpler is Better:</em> Finding the Best Reward Function in Long Chain-of-Thought Reinforcement Learning for Small Language Models

Luning Wang*, Zichen Zhang*, Junkuan Liu*

We study three types of reward functions (normal, cosine, and dynamic) for long chain-of-thought reinforcement learning in small language models, and find that the simple normal reward consistently outperforms the more complex designs, suggesting that simple rewards suffice to elicit reasoning in smaller models.
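The abstract does not define the three reward functions, so the sketch below is only a hedged illustration of one plausible shape for each: a plain correctness reward ("normal"), a length-scaled cosine reward loosely modeled on those used in the long-CoT literature, and a "dynamic" variant whose length budget shifts over training. All names and parameter values (`max_len`, the length budgets, the schedules) are assumptions for illustration, not the paper's actual definitions.

```python
import math


def normal_reward(correct: bool) -> float:
    # Plain correctness reward: +1 for a right answer, 0 otherwise.
    return 1.0 if correct else 0.0


def cosine_reward(correct: bool, length: int, max_len: int = 2048) -> float:
    # Length-aware reward (illustrative): correct answers score highest
    # when short, decaying from 1.0 toward 0.5 as length approaches
    # max_len; wrong answers are penalized most when short (-1.0),
    # decaying toward -0.5, which nudges the model to reason longer
    # when it is unsure.
    t = min(length, max_len) / max_len
    if correct:
        return 0.5 + 0.5 * math.cos(math.pi * t / 2)
    return -0.5 - 0.5 * math.cos(math.pi * t / 2)


def dynamic_reward(
    correct: bool,
    length: int,
    step: int,
    total_steps: int,
    start_budget: int = 512,
    end_budget: int = 4096,
) -> float:
    # Hypothetical dynamic variant: the allowed length budget grows
    # linearly over training; correct answers within budget get full
    # reward, over-budget answers are discounted proportionally.
    budget = start_budget + (end_budget - start_budget) * step / total_steps
    if not correct:
        return 0.0
    return 1.0 if length <= budget else budget / length
```

Under this sketch, the "normal" reward is the simplest possible signal, which is the design the paper finds performs best.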