GRPO - Group Relative Policy Optimization - How DeepSeek trains reasoning models