Implementing DeepSeek R1's GRPO algorithm from scratch