{"slug": "gemma-4-s-hidden-superpower-why-built-in-thinking-tokens-change-everything-for", "title": "Gemma 4's Hidden Superpower: Why Built-in Thinking Tokens Change Everything for Evaluation Tasks", "summary": "According to the article, Gemma 4's most underrated feature is its built-in \"thinking tokens,\" which provide a visible inner monologue where the model automatically reasons through tasks before generating a final output. This chain-of-thought architecture is particularly powerful for evaluation tasks, such as grading interview answers or essays, because it allows the model to systematically assess multiple criteria before synthesizing a response. The combination of thinking tokens and a 128K context window also enables the model to identify patterns across long conversations and tailor its feedback accordingly.", "body_md": "*This is a submission for the Gemma 4 Challenge: Write about Gemma 4*\n\nMost coverage of Gemma 4 focuses on the headline numbers: 128K context, 31B parameters, multimodal input. But after building a production app with it, I think the most underrated feature is something barely mentioned in the docs: **native thinking tokens**.\n\nHere's what they are, why they matter, and what they unlock.\n\n## What Are Thinking Tokens?\n\nWhen you call Gemma 4's API, the response includes parts with a `thought: true` flag:\n\n```\n{\n  \"candidates\": [{\n    \"content\": {\n      \"parts\": [\n        {\n          \"text\": \"The user asked about sorting algorithms...\\n- They mentioned merge sort but not quicksort...\\n- Their Big-O analysis is correct...\\n- They missed the space complexity point...\",\n          \"thought\": true\n        },\n        {\n          \"text\": \"Great explanation of merge sort! 
One thing to add: while your O(n log n) analysis is correct, it's worth mentioning...\"\n        }\n      ]\n    }\n  }],\n  \"usageMetadata\": {\n    \"promptTokenCount\": 45,\n    \"candidatesTokenCount\": 52,\n    \"totalTokenCount\": 312,\n    \"thoughtsTokenCount\": 215\n  }\n}\n```\n\nThe model generates internal reasoning (215 tokens of thinking) that you, the developer, can see but that the model treats as its \"inner monologue.\" The final output (52 tokens) is what the user sees.\n\nThis is not chain-of-thought prompting. This is **chain-of-thought architecture** — the model does it automatically, without being told to \"think step by step.\"\n\n## Why This Matters for Evaluation Tasks\n\nI discovered this while building an [AI interview coach](https://github.com/hajirufai/gemma4-interview-coach). The app asks interview questions, the user answers, and Gemma 4 evaluates the response.\n\nHere's the thing: **evaluation is harder than generation**. Generating a plausible interview question is easy. Evaluating whether someone's answer is good — considering structure, specificity, depth, relevance, and experience level — requires the model to hold multiple criteria in working memory simultaneously.\n\n### Without thinking tokens (typical LLM behavior):\n\n```\nUser: \"Tell me about a time you led a team through a difficult project.\"\nAnswer: \"I led my team to deliver the Q3 release on time.\"\n\nModel output: \"Good job mentioning leadership! Try to add more details next time.\"\n```\n\nThe model races to produce output without deeply considering what was missing.\n\n### With thinking tokens (Gemma 4):\n\n```\nThinking: \"The candidate claims they led a team through a difficult project.\n  - STAR analysis:\n    - Situation: Not specified (what project? 
what was difficult?)\n    - Task: Vague ('deliver Q3 release')\n    - Action: Missing entirely (what did THEY specifically do?)\n    - Result: 'on time' — but no metrics, no impact\n  - Experience level: Mid-level, so I should expect specific examples\n  - Pattern: This is the third vague answer — I should push harder this time\n  - Selected approach: Acknowledge the topic, then probe for specifics\"\n\nOutput: \"You picked a relevant example, but I need more. What specifically made\nthis project difficult? What decisions did YOU make as the lead? And what\nwas the measurable outcome? Give me numbers — team size, timeline, impact.\"\n```\n\nThe thinking tokens let the model systematically work through evaluation criteria before committing to a response. The result feels like feedback from an experienced interviewer, not a chatbot.\n\n## Three Patterns Where Thinking Tokens Excel\n\n### 1. Multi-Criteria Evaluation\n\nWhen your task has multiple evaluation dimensions (like grading an essay, reviewing code, or scoring an interview answer), thinking tokens let the model address each criterion before synthesizing a final judgment.\n\n```\n# The model naturally breaks down scoring without explicit prompting\nprompt = \"\"\"\nScore this answer for: clarity, depth, and relevance.\nAnswer: \"Machine learning is when computers learn from data.\"\n\"\"\"\n# Gemma 4's thinking will evaluate each dimension separately,\n# then produce a balanced score. Other models might just say \"7/10.\"\n```\n\n### 2. 
Pattern Recognition Across Context\n\nWith 128K context AND thinking tokens, Gemma 4 can notice patterns across a long conversation and reason about them:\n\n```\nThinking: \"Looking at answers 1-7:\n  - Answers 1, 3, and 5 all lack specific metrics\n  - Answer 4 was the strongest — they used concrete numbers\n  - This is a pattern: the candidate is good at storytelling\n    but weak at quantification\n  - I should focus my feedback on this specific gap\"\n```\n\nThis combination — long context for memory + thinking for analysis — is genuinely new.\n\n### 3. Calibrated Difficulty Adjustment\n\nThe model can reason about whether to make the next question easier or harder:\n\n```\nThinking: \"The candidate nailed the last two system design questions.\n  - They correctly identified the bottleneck\n  - Their caching strategy was solid\n  - But they haven't been tested on distributed consensus yet\n  - Since they're interviewing for a senior role, I should\n    push into harder territory now\"\n```\n\n## 26B MoE vs 31B Dense: When to Use Each\n\nAfter testing both extensively, here's my take:\n\n| | 26B MoE | 31B Dense |\n|---|---|---|\n| Speed | 1-3s per response | 5-10s per response |\n| Thinking quality | Good — covers main points | Excellent — catches edge cases |\n| Best for | Interactive chat, quick feedback | Deep analysis, final reports |\n| Thinking tokens used | ~100-200 per response | ~200-400 per response |\n\n**My recommendation**: Use 26B MoE for the conversational back-and-forth, and 31B Dense for summary/evaluation tasks where speed matters less.\n\nIn my interview coach app, I default to 26B MoE because conversational latency matters more than marginal evaluation quality. But if you're building a code review tool or essay grader where the user can wait 10 seconds, go with 31B Dense.\n\n## Practical Tips for Building with Thinking Tokens\n\n### 1. 
Filter them in your UI\n\n``` js\nconst parts = response.candidates[0].content.parts;\nconst visibleText = parts.filter(p => !p.thought).map(p => p.text).join('');\nconst thinkingText = parts.filter(p => p.thought).map(p => p.text).join('');\n```\n\nUsers should see the polished output, not the internal reasoning.\n\n### 2. Log the thinking for debugging\n\nThe thinking tokens are incredibly useful for understanding why the model gave a particular response. I log them during development:\n\n``` js\nif (process.env.NODE_ENV === 'development') {\n    console.log('🧠 Model thinking:', thinkingText);\n}\n```\n\n### 3. Don't fight the thinking — design around it\n\nIf you prompt Gemma 4 to \"just give me a one-word answer,\" it'll still think internally. That's fine: the thinking arrives in separate `thought: true` parts, so it never mixes into the final answer text. But those tokens DO count toward your token usage.\n\nFor simple tasks (classification, yes/no), the thinking overhead might not be worth it. For complex tasks (evaluation, planning, multi-step reasoning), it's exactly what you want.\n\n### 4. Temperature affects thinking quality\n\nAt low temperature (0.1-0.3), thinking tokens are more systematic and thorough. At high temperature (0.8+), they're more creative but occasionally tangential. For evaluation tasks, I recommend 0.5-0.7.\n\n## The Bigger Picture\n\nThinking tokens represent a shift from \"prompt engineering\" to \"reasoning architecture.\" Instead of crafting elaborate prompts that force step-by-step reasoning, the model does it natively.\n\nThis matters because:\n\n- **Simpler prompts, better results**: You don't need \"Let's think step by step.\" The model already does.\n- **More reliable evaluation**: The model is less likely to give snap judgments on complex tasks.\n- **Transparent reasoning**: You can inspect the thinking to understand (and debug) the model's logic.\n\nCombined with 128K context and the efficiency of the MoE architecture, Gemma 4 is uniquely positioned for applications that need to **reason over long interactions** — tutoring, coaching, mentoring, code review, and any task where shallow responses aren't good enough.\n\n## Try It Yourself\n\nThe best way to see thinking tokens in action is to build something that requires evaluation. Here's a minimal example:\n\n```\ncurl \"https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:generateContent?key=YOUR_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"contents\": [{\"parts\":[{\"text\":\"Evaluate this Python code for bugs and style issues:\\n\\ndef fibonacci(n):\\n  if n <= 1: return n\\n  return fibonacci(n-1) + fibonacci(n-2)\"}]}]\n  }'\n```\n\nLook at the response. You'll see the thinking tokens breaking down the code analysis before the final review appears. That's Gemma 4's hidden superpower at work.\n\n*This post was inspired by building Interview Coach, an open-source AI interview practice tool powered by Gemma 4. 
The code is MIT licensed — fork it and build something better.*", "url": "https://wpnews.pro/news/gemma-4-s-hidden-superpower-why-built-in-thinking-tokens-change-everything-for", "canonical_source": "https://dev.to/thyalpha001/gemma-4s-hidden-superpower-why-built-in-thinking-tokens-change-everything-for-evaluation-tasks-389c", "published_at": "2026-05-23 11:10:06+00:00", "updated_at": "2026-05-23 11:33:33.338396+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "developer-tools", "products"], "entities": ["Gemma 4", "Gemma"], "alternates": {"html": "https://wpnews.pro/news/gemma-4-s-hidden-superpower-why-built-in-thinking-tokens-change-everything-for", "markdown": "https://wpnews.pro/news/gemma-4-s-hidden-superpower-why-built-in-thinking-tokens-change-everything-for.md", "text": "https://wpnews.pro/news/gemma-4-s-hidden-superpower-why-built-in-thinking-tokens-change-everything-for.txt", "jsonld": "https://wpnews.pro/news/gemma-4-s-hidden-superpower-why-built-in-thinking-tokens-change-everything-for.jsonld"}}