How the community trained Gemma to "Think" with Tunix and TPUs

Google hosted a Kaggle hackathon challenging developers to train non-reasoning Gemma-2-2B and Gemma-3-1B models into general reasoning models using Tunix and Kaggle TPUs. Over 11,000 entrants and 300+ submissions demonstrated that effective reasoning training is achievable with limited compute, with winning teams combining supervised fine-tuning, preference optimization, and reinforcement learning. The winning techniques, including G-RaR's rubric-based reward system and a three-stage pipeline using SimPO and GRPO, enable models to produce structured reasoning traces for complex tasks across key industries.

Large Language Models (LLMs) often benefit from "thinking" before they speak for complex tasks. Frontier LLMs like Gemini 3 and leading open weight models like Gemma 4 can produce explicit reasoning traces, commonly called Chain-of-Thought, before answering user questions. But how this reasoning capability is trained is often not disclosed. While there are many reasoning tutorials (https://github.com/google/tunix/blob/main/examples/grpo_gemma.ipynb) available on the Internet to train for simple verifiable tasks such as math or coding, accessible and easy-to-reproduce training recipes (including data, training strategy, runnable code and evaluations) for general reasoning remain scarce.

This motivated us to hold the Google Tunix Hack: Train a model to show its work (https://www.kaggle.com/competitions/google-tunix-hackathon) hackathon on Kaggle: we challenged developers to transform non-reasoning base models (Gemma-2-2B and Gemma-3-1B) into general reasoning models, using Tunix and Kaggle TPUs. The response was overwhelming: over 11,000 entrants and 300+ high-quality submissions proved that decent reasoning training can be done by the community even with a very limited compute budget (Kaggle TPU v5e-8 for 9 hours).
In this post, we’ll highlight the techniques used by the winners and share key recipes that allow models to reason across key vertical industries, so you can train your own reasoning models.

Highlighting the Winners: Key Innovations

The winning submissions demonstrated a sophisticated understanding of post-training, combining supervised learning, preference optimization, and reinforcement learning in creative ways.

G-RaR trains Gemma models to produce structured reasoning by combining Supervised Fine-Tuning (SFT) with GRPO, driven by a novel rubric-based LLM-as-judge reward system.

How It Improves Reasoning

The model's reasoning power is improved by explicitly training it to "show its work" inside