I recently had the opportunity to test GitHub Copilot's multi-model capabilities, experimenting with the major models available on the market: Claude, Gemini, and ChatGPT. To maximize their effectiveness, I paired them with Spec-kit to provide deep repository context.
After extensive daily use, here are my clear observations and arguments for when to use which model in your development workflow.
Claude - Feels Like a Cheat Code!
Code Generation: Best. It is unmatched at generating code and copes well with complex, multi-microservice codebases. The generated code strictly adheres to our codebase's principles and practices (e.g., using camelCase or snake_case where each is expected). In my usage, it was free of deprecated methods and used the latest packages and coding practices.
Context Analysis: Best. It deeply understands the given requirements and flawlessly searches all impacted areas of the repository.
General Technical Analysis: Best. You can often one-shot the prompt and get exactly what you need. The solution it suggests is usually the one we end up selecting.
Token Usage: Very High. It is almost impossible to limit token usage; the larger the codebase, the faster it burns through tokens. On average, I was consuming about $200 worth of tokens in a single month. That's just me; there were five others on my team.
Gemini - Better, But Be Vigilant
Code Generation: Good. It is fairly good at code generation but does occasionally get stuck in loops and can take too long to provide a solution. The code isn't always guaranteed to follow our set principles and naming conventions without extra nudging. You cannot reliably one-shot prompts for the best solution; it requires back-and-forth iteration.
Context Analysis: Very Good. It is nearly on par with Claude. It analyzes the entire codebase and accurately identifies impacted areas.
General Technical Analysis: Very Good. When provided with clean and clear details about the expected result, it gives excellent solutions. However, if instructions are vague, it tends to hallucinate.
Token Usage: Optimal. It doesn’t eat away at your budget like Claude. This is the absolute best alternative for daily tasks or when your Claude credits run low.
ChatGPT - Unreliable for Complex Codebases
Code Generation: Poor. It rarely generates a complete solution and frequently forgets to update impacted classes or infrastructure code. It hallucinates too often, and the output frequently feels like a copy-paste from Stack Overflow without the necessary adaptations to suit our specific codebase.
Context Analysis: Fair. Despite struggling with actual code generation, it does a decent job at high-level analysis. However, accuracy and consistency remain significant issues.
General Technical Analysis: Basic. It works if you are starting from a clean slate or building a simple script. For larger projects, it sometimes fails to remember technical details mentioned at the start of the conversation.
Token Usage: N/A. I couldn't use it long-term because the results were too inconsistent to justify the effort.
The Secret Weapon: Spec-kit
Regardless of the model you choose, Spec-kit has become an absolute must-have. It acts as the bridge for providing LLMs with PR guidelines and coding best practices. For example, we use it to enforce rules like:
PubSub Topic Naming: Must follow <component>.<boundary>.<type>.<entity-purpose>[.<direction>.<ext-component>]
Architecture Rules: Apply idempotency, outbox/inbox patterns, and at-least-once delivery with retries/backoff.
Database Script Naming: Must follow V<x>_<y>_<z>__<description>.sql
These details might feel small, but they provide the essential context the LLM needs to generate highly relevant, drop-in-ready code, saving you from having to manually optimize the output.
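The two naming rules above are regular enough to be checked mechanically, for instance in a pre-commit hook or a CI step. What follows is only a minimal sketch of that idea, not something Spec-kit provides. The function names are hypothetical, and the allowed characters are my own assumptions: lowercase alphanumeric segments with optional hyphens for topics, and integer version parts with a letters/digits/underscore description for scripts.

```python
import re

# Assumption: every topic segment is lowercase alphanumeric, optionally hyphenated.
_SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"

# <component>.<boundary>.<type>.<entity-purpose>[.<direction>.<ext-component>]
# Four mandatory segments, then either no suffix or exactly two more segments.
TOPIC_RE = re.compile(
    rf"{_SEGMENT}\.{_SEGMENT}\.{_SEGMENT}\.{_SEGMENT}"
    rf"(?:\.{_SEGMENT}\.{_SEGMENT})?"
)

# V<x>_<y>_<z>__<description>.sql
# Assumption: x, y, z are integers; description is letters, digits, underscores.
MIGRATION_RE = re.compile(r"V\d+_\d+_\d+__[A-Za-z0-9_]+\.sql")


def is_valid_topic(name: str) -> bool:
    """Return True if the topic name has 4 or 6 well-formed segments."""
    return TOPIC_RE.fullmatch(name) is not None


def is_valid_migration(filename: str) -> bool:
    """Return True if the script name matches V<x>_<y>_<z>__<description>.sql."""
    return MIGRATION_RE.fullmatch(filename) is not None


if __name__ == "__main__":
    # Hypothetical names, used purely for illustration.
    for topic in ("orders.internal.event.order-created",
                  "orders.internal.event"):
        print(topic, is_valid_topic(topic))
    for script in ("V1_2_3__add_orders_table.sql",
                   "V1_2__add_orders_table.sql"):
        print(script, is_valid_migration(script))
```

A check like this does not replace giving the rules to the model; it simply catches the cases where the generated names drift from the convention anyway.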
Supercharging Context with MCPs
To further reduce manual typing, I integrated open-source Jira and Confluence MCP (Model Context Protocol) servers into my Copilot plugin. This allows the AI to automatically pull context directly from business requirements and active Jira tickets. It is incredibly efficient, but be aware: feeding massive Confluence documents directly into the prompt will significantly increase your token usage.
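One way to keep large documents from blowing up the prompt is to cap how much of a page is passed along. The sketch below illustrates the idea under loud assumptions: it is not part of Copilot or of any MCP server, the function names are hypothetical, and the "about 4 characters per token" ratio is only a rough heuristic that will differ between models and tokenizers.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate (assumed ratio), rounded up."""
    return -(-len(text) // chars_per_token)  # ceiling division


def trim_to_budget(document: str, max_tokens: int,
                   chars_per_token: int = 4) -> str:
    """Keep leading paragraphs until the estimated token budget is reached.

    Paragraphs are assumed to be separated by blank lines. Whole paragraphs
    are kept or dropped; the separators themselves are not counted.
    """
    kept = []
    used = 0
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        cost = estimate_tokens(paragraph, chars_per_token)
        if used + cost > max_tokens:
            break  # stop at the first paragraph that would exceed the budget
        kept.append(paragraph)
        used += cost
    return "\n\n".join(kept)


if __name__ == "__main__":
    # Synthetic stand-in for a long Confluence page.
    page = "\n\n".join(["Requirement text. " * 20] * 50)
    trimmed = trim_to_budget(page, max_tokens=500)
    print(estimate_tokens(page), "->", estimate_tokens(trimmed))
```

Keeping the leading paragraphs is the simplest possible policy; selecting only the sections relevant to the ticket would likely save more, at the cost of extra logic.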