Building Local AI Systems: Qwen3.6 + MCPs

Anthropic's Model Context Protocol (MCP) enables universal tool connectivity for AI agents, and Qwen3.6-35B-A3B, a Mixture of Experts model with 262k-token context, is optimized for MCP-based agentic tasks. A local GitHub developer assistant built with these tools can read issues, search code, draft fixes, and create pull requests entirely on-device without cloud dependencies.

Introducing MCP

Every developer building with local AI hits the same wall eventually. The model works. It reasons well, writes solid code, and answers complex questions. But it cannot do everything. It cannot query your database, open a GitHub issue, or call your internal API. You are left writing custom Python wrappers for every tool you need, hardcoding the glue between model output and tool execution, and maintaining those wrappers every time an API changes.

The Model Context Protocol (MCP, https://modelcontextprotocol.io/) was designed to solve exactly this. It is an open standard by Anthropic: a universal, pluggable protocol for AI tool connectivity. Define a tool once as an MCP server. Any MCP-compatible client, any model, any framework, can discover and call it with zero custom integration code per model.

Qwen3.6-35B-A3B (https://huggingface.co/Qwen/Qwen3.6-35B-A3B) is the most capable local model for this kind of work right now. It has a 262,144-token context window, a Mixture of Experts (MoE) architecture that activates only 3B of its 35B parameters per forward pass (which is why it runs at usable speed on hardware that could not serve a dense 35B model), and was explicitly trained and evaluated on MCP-based agentic tasks.

This article builds a local GitHub developer assistant: an agent that reads a repository's open issues, searches the relevant code, drafts a fix, and creates a pull request. The whole thing runs on your hardware, through MCP servers, with no cloud dependency.

Understanding Qwen3.6-35B-A3B

Understanding the architecture matters here because it directly explains what hardware you need and why the model performs the way it does on agentic tasks. The name encodes the key fact: 35B total parameters, A3B meaning 3B activated per forward pass.
It is an MoE model with 256 experts per layer, routing each token to 8 experts plus 1 shared expert (https://huggingface.co/Qwen/Qwen3.6-35B-A3B). You get the knowledge capacity of a 35B model at the inference compute cost of a 3B model. That trade-off is why it stays responsive on hardware that would collapse under a dense 35B. Memory still has to hold all 35B weights, which is where quantization comes in.

The hidden layout is where Qwen3.6 diverges most from other MoE models. Each block in the 40-layer stack follows a 3:1 ratio of Gated DeltaNet layers to Gated Attention layers. DeltaNet is a linear attention mechanism; it processes sequences more efficiently than full quadratic attention, especially at long context lengths. The interleaved full Gated Attention layers provide the deep relational reasoning that linear attention alone misses. For an agent working through a 500-file repository, that combination matters: efficient processing at length combined with precise reasoning on the relevant sections.

The context window is 262,144 tokens natively, extensible to 1,010,000 with YaRN scaling. For agent work, context length is not a comfort feature; it is an operational constraint. An agent reading source files, maintaining tool call history, tracking a multi-step plan, and injecting tool results back into context needs real headroom. Many older 7B and 13B models cap at 8k or 32k tokens. Running out of context mid-task means the agent loses its own history and starts hallucinating tool results.

Qwen3.6 was explicitly trained and evaluated on MCP-based agentic benchmarks. Two headline features came out of that training:

Agentic Coding. Frontend workflows and repository-level reasoning: the model handles multi-file refactoring tasks with coherent reasoning across files, not just single-file edits in isolation.

Thinking Preservation. A preserve_thinking flag that retains reasoning traces from prior turns in a multi-turn conversation.
When an agent reasons through a plan in turn one and then executes tool calls in turns two through five, preserve_thinking=True keeps the turn-one reasoning available in the KV cache. Each subsequent turn benefits from that prior reasoning without paying the cost of re-deriving it.

System Requirements

There are three realistic deployment paths, and which one you use depends entirely on your hardware.

GPU inference (recommended for production agent workloads). Qwen3.6-35B-A3B in bfloat16 requires approximately 70 GB VRAM. In Q4 quantization, it fits in approximately 20–24 GB. A single RTX 4090 (24 GB) handles Q4. Two RTX 3090s with tensor parallelism handle Q4 as well. An A100 (80 GB) handles the full bfloat16 model.

CPU/Hybrid via KTransformers. KTransformers is the accessible path for developers without a 24 GB GPU. It offloads compute-heavy layers to GPU when available and runs the rest on CPU. With 64 GB system RAM, you can run Qwen3.6-35B-A3B in a usable, if slower, configuration. Response latency will be 30–120 seconds per turn depending on your CPU, which is workable for an agent doing background repository analysis but not for interactive coding sessions.

Smaller models for tutorial testing. The entire MCP integration pattern in this article is identical regardless of model size. If you want to follow along without the hardware for the full 35B model, use Qwen/Qwen2.5-7B-Instruct via Ollama (ollama pull qwen2.5:7b) or the Qwen3-8B model. The serving API is the same, the code is identical, and you can swap in the 35B model when hardware permits.
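The VRAM figures above follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate of weight memory only; the helper name weight_memory_gb and the 4.5 bits-per-weight figure for Q4 (4-bit values plus per-group scales) are assumptions for illustration, and real usage adds KV cache and runtime overhead on top.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# All 35B parameters must be resident in memory, even though only ~3B
# are active per token: MoE routing reduces compute, not weight storage.
bf16_gb = weight_memory_gb(35, 16)

# Assumption: Q4 formats cost roughly 4.5 bits per weight once the
# quantization scales are included. The exact figure varies by format.
q4_gb = weight_memory_gb(35, 4.5)

print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")
print(f"Q4 weights:       ~{q4_gb:.1f} GB (before KV cache and overhead)")
```

The Q4 estimate lands just under 20 GB for weights alone, which is consistent with the 20–24 GB range quoted above once cache and runtime buffers are added.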
Software requirements:

    # Python 3.11+ required
    python --version
    python -m venv qwen-mcp-env
    source qwen-mcp-env/bin/activate      # macOS / Linux
    qwen-mcp-env\Scripts\activate         # Windows

    # Core packages
    pip install \
      "openai>=1.30.0" \
      "qwen-agent>=0.0.10" \
      "mcp>=1.0.0" \
      "httpx>=0.27.0"

    # Serving framework -- choose one
    pip install "vllm>=0.19.0"       # NVIDIA GPU
    pip install "sglang>=0.5.10"     # NVIDIA GPU, faster prefill for long context
    pip install "ktransformers"      # CPU/hybrid

    # Node.js 18+ is required for pre-built MCP servers installed via npx
    node --version

Serving Qwen3.6 Locally with an OpenAI-Compatible API

Before wiring in any MCP servers, you need a running inference server. Both SGLang and vLLM expose an OpenAI-compatible API that the MCP integration layer talks to: the same API surface, just pointed at localhost instead of api.openai.com.

// SGLang (Recommended for Long-Context Agent Workloads)

    # Install SGLang with full dependencies
    pip install "sglang[all]>=0.5.10"

    # Serve Qwen3.6-35B-A3B with reasoning and tool-call parsers enabled.
    # --reasoning-parser qwen3 correctly handles the <think>...</think> blocks.
    # --tool-call-parser qwen3_coder routes tool call outputs to the right format.
    # --enable-prefix-caching is critical for agent workloads -- enables KV cache
    # reuse across turns, which is what makes preserve_thinking efficient in practice.
    python -m sglang.launch_server \
      --model-path Qwen/Qwen3.6-35B-A3B \
      --host 0.0.0.0 \
      --port 30000 \
      --reasoning-parser qwen3 \
      --tool-call-parser qwen3_coder \
      --enable-prefix-caching \
      --tp 2   # tensor parallel across 2 GPUs; remove if using single GPU

// vLLM

    pip install "vllm>=0.19.0"

    # vLLM equivalent with the same critical flags
    vllm serve Qwen/Qwen3.6-35B-A3B \
      --host 0.0.0.0 \
      --port 8000 \
      --reasoning-parser qwen3 \
      --tool-call-parser qwen3_coder \
      --enable-prefix-caching \
      --tensor-parallel-size 2

// Smaller Model via Ollama

    ollama pull qwen2.5:7b
    ollama serve
    # Ollama's API is OpenAI-compatible at http://localhost:11434/v1

Once the server is running, verify it before going any further:

    # Health check -- should return {"status": "ok"} or similar
    curl http://localhost:30000/health

    # Test the chat completions endpoint with a simple query
    curl http://localhost:30000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3.6-35B-A3B",
        "messages": [{"role": "user", "content": "Reply with: ready"}],
        "max_tokens": 10
      }'

If you get a JSON response with a choices array, the server is ready. Do not proceed to MCP setup until this works. Every integration failure you will encounter later is easier to debug when you know the serving layer is solid.

Understanding MCP and Why It Changes the Agent Architecture

Before writing any agent code, it helps to understand what MCP actually does at the protocol level, because that understanding prevents a category of bugs that come from treating MCP as just a fancier function-calling API.

MCP is a JSON-RPC 2.0 protocol (https://modelcontextprotocol.io/docs/concepts/architecture) running over stdio or HTTP transport. When an MCP client connects to a server, it completes an initialize handshake and then calls tools/list to discover what tools the server exposes. Each tool comes back with a name, a description, and an input schema defined in JSON Schema. The model reads this schema. It is the model's contract with the tool.
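To make the discovery step concrete, here is a minimal sketch of what a tools/list exchange might look like on the wire, and how a client could map each returned tool into the OpenAI function format a local model server expects. The read_file tool and the to_openai_tool helper are hypothetical, chosen only for illustration; real servers return their own tool names and schemas.

```python
import json

# JSON-RPC 2.0 request a client sends to discover a server's tools.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Illustrative response. The read_file tool here is hypothetical.
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "read_file",
                "description": "Read a UTF-8 text file and return its contents.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            }
        ]
    },
}


def to_openai_tool(tool: dict) -> dict:
    """Map one MCP tool description to an OpenAI-style function tool entry."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            # The MCP inputSchema is already JSON Schema, so it is passed through.
            "parameters": tool["inputSchema"],
        },
    }


openai_tools = [to_openai_tool(t) for t in list_response["result"]["tools"]]
print(json.dumps(openai_tools[0], indent=2))
```

Because the input schema is plain JSON Schema on both sides, the mapping is a rename rather than a translation, which is what lets one server definition serve any model.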
When the model wants to call a tool, it emits a structured tool call object. The MCP client, not the model, actually executes the call by sending a tools/call request to the server. The server handles execution and returns a result. The client injects that result back into the conversation as a tool role message. The model reads the result and decides the next step.

This separation is important. The model decides what to call and with what arguments. The client handles execution. The server handles the actual work. Your code never hardwires a tool to a model; you just tell the client which servers are available.

There are two ways to use MCP with Qwen3.6:

- Via Qwen-Agent: the official qwen-agent library handles tool discovery, call parsing, result injection, and multi-turn conversation management automatically. Less code, less control. Right for most use cases.
- Via the MCP Python SDK directly: you handle the agentic loop yourself using mcp.ClientSession. More code, full visibility into every message, complete control over error handling and retry logic. Right for production systems where you need to monitor every step.

This article covers both, starting with Qwen-Agent.

Building the Local GitHub Developer Assistant

The agent does four things in sequence: reads open issues from a GitHub repository, finds the relevant code, drafts a fix, and opens a pull request. All locally, all through MCP.

// Part 1: Environment and MCP Server Setup

    # Set your GitHub personal access token
    # Required by the GitHub MCP server for API calls
    export GITHUB_TOKEN=ghp_your_token_here

    # Pre-built MCP servers install via npx -- no separate install step;
    # npx handles this on first use when the agent starts the servers.
    # Verify npx is available:
    npx --version

    # Create a project directory:
    mkdir qwen-github-agent
    cd qwen-github-agent

// Part 2: Qwen-Agent Implementation

The fastest path to a working agent. Qwen-Agent handles the full loop automatically.
    # github_agent_qwenagent.py
    # Prerequisites: pip install qwen-agent openai
    #   npm / npx must be installed for the MCP servers
    #   GITHUB_TOKEN env var must be set
    #   Local serving endpoint must be running (see previous section)
    # How to run: python github_agent_qwenagent.py

    import os
    import re

    from qwen_agent.agents import Assistant

    # ── Server configuration ──────────────────────────────────────────────
    # Point at your local serving endpoint. Change model_server to match
    # whichever server you started:
    #   SGLang: http://localhost:30000/v1
    #   vLLM:   http://localhost:8000/v1
    #   Ollama: http://localhost:11434/v1
    LLM_CONFIG = {
        "model": "Qwen/Qwen3.6-35B-A3B",
        "model_server": "http://localhost:30000/v1",
        "api_key": "EMPTY",  # Local servers do not require a real key
        # Thinking mode sampling params from the official model card best practices
        "generate_cfg": {
            "temperature": 0.6,
            "top_p": 0.95,
            "top_k": 20,
            "min_p": 0.0,
            "thought_in_history": True,  # This is the preserve thinking flag in Qwen-Agent
        },
    }

    # ── MCP server configuration ──────────────────────────────────────────
    # Each server key names the server; the value is the stdio launch command.
    # Qwen-Agent starts each server as a subprocess and manages the MCP sessions.
    MCP_SERVERS = {
        "mcpServers": {
            "filesystem": {
                "command": "npx",
                "args": [
                    "-y",
                    "@modelcontextprotocol/server-filesystem",
                    # Grant the agent access to the current working directory.
                    # In production, restrict to the specific repository path.
                    ".",
                ],
            },
            "github": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-github"],
                "env": {
                    # Read the token from the shell here; a literal "${GITHUB_TOKEN}"
                    # string would not be expanded inside a Python dict.
                    "GITHUB_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
                    # The reference GitHub server documents this variable name,
                    # so the same token is passed under it as well.
                    "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
                },
            },
        }
    }

    # ── System prompt ─────────────────────────────────────────────────────
    SYSTEM_PROMPT = """You are a senior software engineer with full access to a GitHub repository via MCP tools.

    When given a repository and task:
    1. List open issues to understand what needs fixing
    2. Use filesystem tools to read relevant source files and tests
    3. Identify the root cause based on the code and the issue description
    4. Write a targeted fix -- minimal changes, no refactoring unrelated to the bug
    5. Create a pull request with a clear title and description referencing the issue

    Always explain your reasoning at each step. Think through edge cases before writing code.
    If you are uncertain about a file's purpose, read it before modifying it."""

    # ── Agent setup ───────────────────────────────────────────────────────
    agent = Assistant(
        llm=LLM_CONFIG,
        name="GitHub Developer Assistant",
        description="Reads issues, fixes bugs, opens pull requests -- locally via MCP.",
        system_message=SYSTEM_PROMPT,
        # Qwen-Agent accepts MCP server configs as entries in function_list
        function_list=[MCP_SERVERS],
    )

    # ── Run the agent ─────────────────────────────────────────────────────
    def run_agent(task: str):
        """
        Run the agent on a task description and stream the output.
        The agent will make tool calls automatically; Qwen-Agent handles
        the full loop including tool execution and result injection.
        """
        messages = [{"role": "user", "content": task}]
        print(f"Task: {task}\n{'─' * 70}")

        # Qwen-Agent's run() is a generator that yields intermediate steps.
        # Each yielded value shows a tool call, a tool result, or the final answer.
        for response in agent.run(messages=messages):
            # response is a list of messages representing the conversation so far;
            # the last message contains the most recent output.
            last = response[-1]
            role = last.get("role", "")
            content = last.get("content", "")

            if role == "assistant" and content:
                # Strip and display the thinking block separately for readability
                thinking = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
                if thinking:
                    print(f"[thinking] {thinking.group(1).strip()[:200]}...")
                clean = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
                if clean:
                    print(f"[agent] {clean}")
            elif role == "tool":
                tool_name = last.get("name", "unknown_tool")
                print(f"[tool:{tool_name}] result received")

    if __name__ == "__main__":
        run_agent(
            "In the repository myorg/my-api-project, find the open issue about "
            "the login endpoint returning 200 for invalid tokens. Read the relevant "
            "code and tests, fix the bug, and open a pull request."
        )

How to run:

    python github_agent_qwenagent.py

// Part 3: Raw MCP SDK Implementation

For teams who need full control over every protocol message, custom error handling, per-tool retry logic, and audit logging of every tool call and result:

    # github_agent_raw.py
    # Prerequisites: pip install mcp openai httpx
    #   GITHUB_TOKEN env var must be set, local server must be running
    # How to run: python github_agent_raw.py

    import asyncio
    import json
    import os
    import re

    from openai import AsyncOpenAI
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    # ── Local serving client ──────────────────────────────────────────────
    client = AsyncOpenAI(
        base_url="http://localhost:30000/v1",
        api_key="EMPTY",
    )
    MODEL = "Qwen/Qwen3.6-35B-A3B"

    # ── Response processing ───────────────────────────────────────────────
    def strip_thinking(text: str) -> str:
        """Remove <think>...</think> blocks. Used when we only need the action."""
        return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

    def extract_thinking(text: str) -> str:
        """Extract the content of the thinking block for logging."""
        m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
        return m.group(1).strip() if m else ""

    def process_response(response, preserve_thinking: bool = True) -> dict:
        """
        Process a chat completion response from Qwen3.6.

        Handles two output formats:
          1. Tool call via the API's function_call / tool_calls field
             (when --tool-call-parser is active)
          2. Tool call embedded in the message content as JSON

        Args:
            response: The OpenAI-compatible completion response
            preserve_thinking: If True, keep thinking content in output
                for the next turn's KV cache benefit

        Returns:
            dict with thinking, tool_calls, final_answer, has_tool_calls, is_terminal
        """
        choice = response.choices[0]
        message = choice.message

        # Path 1: Tool calls in the structured field
        # (preferred -- requires the tool-call-parser flag)
        if message.tool_calls:
            tool_calls = [
                {
                    "name": tc.function.name,
                    "arguments": json.loads(tc.function.arguments),
                    "call_id": tc.id,
                }
                for tc in message.tool_calls
            ]
            thinking = extract_thinking(message.content or "")
            return {
                "thinking": thinking if preserve_thinking else "",
                "tool_calls": tool_calls,
                "final_answer": "",
                "has_tool_calls": True,
                "is_terminal": False,
            }

        # Path 2: Tool calls embedded in content text (fallback)
        content = message.content or ""
        tag_matches = re.findall(r"<tool_call>(.*?)</tool_call>", content, re.DOTALL)
        tool_calls = []
        for m in tag_matches:
            try:
                tool_calls.append(json.loads(m.strip()))
            except json.JSONDecodeError:
                pass  # Skip malformed tool call payloads

        thinking = extract_thinking(content)
        final_answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
        final_answer = re.sub(
            r"<tool_call>.*?</tool_call>", "", final_answer, flags=re.DOTALL
        ).strip()
        return {
            "thinking": thinking if preserve_thinking else "",
            "tool_calls": tool_calls,
            "final_answer": final_answer,
            "has_tool_calls": len(tool_calls) > 0,
            "is_terminal": len(tool_calls) == 0 and bool(final_answer),
        }

    # ── Core agent loop ───────────────────────────────────────────────────
    async def run_github_agent(task: str, repo: str, max_turns: int = 20):
        """
        Run the GitHub developer assistant agent.

        Connects to filesystem and GitHub MCP servers, discovers their tools,
        and runs the Qwen3.6 agent loop until the task is complete or
        max_turns is reached.
        """
        # Start both MCP servers and establish sessions
        fs_params = StdioServerParameters(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-filesystem", "."],
        )
        gh_params = StdioServerParameters(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-github"],
            env={
                **os.environ,
                "GITHUB_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
                # The reference GitHub server documents this variable name
                "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
            },
        )

        async with stdio_client(fs_params) as (fs_read, fs_write), \
                ClientSession(fs_read, fs_write) as fs_session, \
                stdio_client(gh_params) as (gh_read, gh_write), \
                ClientSession(gh_read, gh_write) as gh_session:

            # Initialize both sessions
            await fs_session.initialize()
            await gh_session.initialize()

            # Discover all available tools from both servers
            fs_tools_result = await fs_session.list_tools()
            gh_tools_result = await gh_session.list_tools()

            # Build the OpenAI-format tool list for the model
            all_tools = []
            tool_to_session = {}  # Maps tool name to the MCP session that owns it
            for session, listing in (
                (fs_session, fs_tools_result),
                (gh_session, gh_tools_result),
            ):
                for tool in listing.tools:
                    all_tools.append({
                        "type": "function",
                        "function": {
                            "name": tool.name,
                            "description": tool.description,
                            "parameters": tool.inputSchema,
                        },
                    })
                    tool_to_session[tool.name] = session

            print(
                f"Tools available: {len(all_tools)} "
                f"({len(fs_tools_result.tools)} filesystem, "
                f"{len(gh_tools_result.tools)} GitHub)"
            )

            # Build conversation history
            system_prompt = f"""You are a senior software engineer with access to the repository {repo}.
    Use the available tools to investigate issues, read code, write fixes, and create pull requests.
    Think step by step. Read before you modify.
    Minimal changes only."""

            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": task},
            ]

            # ── Agent loop ────────────────────────────────────────────────
            for turn in range(max_turns):
                print(f"\n[Turn {turn + 1}]")

                # Call the model
                response = await client.chat.completions.create(
                    model=MODEL,
                    messages=messages,
                    tools=all_tools if all_tools else None,
                    tool_choice="auto",
                    # Thinking mode sampling params from the official best practices
                    temperature=0.6,
                    top_p=0.95,
                    max_tokens=4096,
                    extra_body={
                        # top_k and min_p are not parameters of the OpenAI SDK
                        # method, so they are sent through extra_body
                        "top_k": 20,
                        "min_p": 0.0,
                        # preserve_thinking keeps reasoning context across turns
                        # for KV cache efficiency on long agent sessions
                        "preserve_thinking": True,
                    },
                )

                result = process_response(response, preserve_thinking=True)
                if result["thinking"]:
                    print(f"[thinking] {result['thinking'][:200]}...")

                # Terminal state -- agent has produced a final answer
                if result["is_terminal"]:
                    print(f"\n[DONE]\n{result['final_answer']}")
                    return result["final_answer"]

                # Tool call state -- execute each tool and inject results
                if result["has_tool_calls"]:
                    # Append the assistant's message with tool calls to history
                    assistant_message = response.choices[0].message
                    messages.append({
                        "role": "assistant",
                        "content": assistant_message.content or "",
                        "tool_calls": [
                            tc.model_dump() for tc in (assistant_message.tool_calls or [])
                        ],
                    })

                    for call in result["tool_calls"]:
                        tool_name = call["name"]
                        tool_args = call.get("arguments", {})
                        call_id = call.get("call_id", "")
                        print(f"[tool] {tool_name}({json.dumps(tool_args)[:80]}...)")

                        session = tool_to_session.get(tool_name)
                        if not session:
                            result_content = f"Error: tool '{tool_name}' not found"
                        else:
                            try:
                                tool_result = await session.call_tool(tool_name, tool_args)
                                result_content = str(tool_result.content)
                                # Truncate very long results to protect context budget
                                if len(result_content) > 12000:
                                    result_content = result_content[:12000] + "\n...[truncated]"
                            except Exception as e:
                                result_content = f"Error: {e}"

                        print(f"[result] {result_content[:150]}...")
                        messages.append({
                            "role": "tool",
                            "content": result_content,
                            "tool_call_id": call_id,
                            "name": tool_name,
                        })

            print(f"[WARNING] max_turns ({max_turns}) reached without terminal state")

    # ── Entry point ───────────────────────────────────────────────────────
    if __name__ == "__main__":
        asyncio.run(run_github_agent(
            task=(
                "Find the open issue about the login endpoint returning 200 for invalid tokens. "
                "Read src/auth.py and tests/test_auth.py to understand the bug. "
                "Fix the verify_token function and open a pull request with your changes."
            ),
            repo="myorg/my-api-project",
        ))

How to run:

    python github_agent_raw.py

The raw SDK path exposes what Qwen-Agent abstracts away: you can see every tool call, every result, and every message injected into the conversation history. The tool_to_session routing dict is the key mechanism; it maps each tool name to the MCP session that owns it, so the agent can call any tool from any connected server without knowing which server provides it.

Writing a Custom MCP Server

Pre-built MCP servers handle the filesystem and GitHub. When you need something that does not exist (querying an internal database, wrapping a CI/CD API, running code analysis tools), you write an MCP server. Here is a complete code quality server that exposes ruff and pytest as MCP tools.

    # code_quality_server.py
    # A custom MCP server exposing code quality tools to Qwen3.6.
    # Prerequisites: pip install mcp ruff pytest
    #   (pytest-json-report is optional, for structured test results)
    # How to run (standalone, for testing): python code_quality_server.py
    # To add to the Qwen-Agent config:
    #   "code_quality": {
    #       "command": "python",
    #       "args": ["/absolute/path/to/code_quality_server.py"]
    #   }

    import json
    import subprocess

    from mcp.server.fastmcp import FastMCP

    # FastMCP is a high-level MCP server framework -- reduces boilerplate significantly
    mcp = FastMCP("code_quality")

    @mcp.tool()
    def run_linter(file_path: str, fix: bool = False) -> str:
        """
        Run ruff linter on a Python file and return structured lint results.
        Use this before modifying a file to understand its current quality state,
        and after making changes to verify the fix did not introduce new issues.

        Args:
            file_path: Absolute or relative path to the Python file to lint.
            fix: If true, automatically fix safe issues in place.

        Returns:
            JSON string with issues list, issue count, and files modified.
        """
        cmd = ["python", "-m", "ruff", "check", file_path, "--output-format=json"]
        if fix:
            cmd.append("--fix")

        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            # ruff returns exit code 1 when issues are found -- not an error
            output = result.stdout or result.stderr

            # Parse ruff's JSON output
            try:
                issues = json.loads(output) if output.strip() else []
            except json.JSONDecodeError:
                issues = []

            formatted = [
                {
                    "line": issue.get("location", {}).get("row", 0),
                    "col": issue.get("location", {}).get("column", 0),
                    "code": issue.get("code", ""),
                    "message": issue.get("message", ""),
                    "fix_available": issue.get("fix") is not None,
                }
                for issue in issues
                if isinstance(issue, dict)
            ]
            return json.dumps({
                "file": file_path,
                "issues": formatted,
                "total_issues": len(formatted),
                "fixed": "auto-fix applied" if fix else "no auto-fix",
            }, indent=2)
        except subprocess.TimeoutExpired:
            return json.dumps({"error": "Linter timed out after 30s", "file": file_path})
        except FileNotFoundError:
            return json.dumps({"error": "ruff not found -- install with: pip install ruff"})

    @mcp.tool()
    def run_tests(target: str, verbose: bool = False) -> str:
        """
        Run pytest on a module or directory and return structured pass/fail results.
        Use this after writing a fix to verify the fix makes failing tests pass
        without breaking other tests.

        Args:
            target: Path to the test file or directory to run
                (e.g. tests/, tests/test_auth.py).
            verbose: If true, include full pytest output in the result.

        Returns:
            JSON string with pass count, fail count, failure details, and duration.
        """
        cmd = ["python", "-m", "pytest", target, "--json-report", "--json-report-file=-", "-q"]
        if verbose:
            cmd.append("-v")

        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
            output = result.stdout

            # Parse pytest-json-report output if available
            try:
                report = json.loads(output)
                summary = report.get("summary", {})
                failures = [
                    {
                        "test": t["nodeid"],
                        "message": t.get("call", {}).get("longrepr", "")[:500],
                    }
                    for t in report.get("tests", [])
                    if t.get("outcome") == "failed"
                ]
                return json.dumps({
                    "target": target,
                    "passed": summary.get("passed", 0),
                    "failed": summary.get("failed", 0),
                    "errors": summary.get("error", 0),
                    "total": summary.get("total", 0),
                    "duration": summary.get("duration", 0),
                    "failures": failures,
                    "stdout": result.stdout[:2000] if verbose else "",
                }, indent=2)
            except json.JSONDecodeError:
                # Fallback: return raw output if JSON report not available
                return json.dumps({
                    "target": target,
                    "stdout": result.stdout[:3000],
                    "stderr": result.stderr[:1000],
                    "exit_code": result.returncode,
                })
        except subprocess.TimeoutExpired:
            return json.dumps({"error": f"Tests timed out after 120s for target: {target}"})
        except FileNotFoundError:
            return json.dumps({"error": "pytest not found -- install with: pip install pytest"})

    if __name__ == "__main__":
        mcp.run(transport="stdio")

Add it to either agent implementation's server config:

    # In Qwen-Agent MCP_SERVERS dict:
    "code_quality": {
        "command": "python",
        "args": ["/absolute/path/to/code_quality_server.py"],
    }

    # In the raw SDK, add a third StdioServerParameters:
    cq_params = StdioServerParameters(
        command="python",
        args=["/absolute/path/to/code_quality_server.py"],
    )

Test the server standalone before connecting the agent:

    # Test the server in MCP inspector mode
    npx @modelcontextprotocol/inspector python code_quality_server.py
    # Opens a browser UI where you can call run_linter and run_tests directly

Tuning Thinking Mode and Preserving Reasoning

The thinking mode decision affects latency significantly enough that it is worth treating as an explicit architecture choice, not an afterthought.

In thinking mode, Qwen3.6 generates a chain-of-thought reasoning trace inside <think> tags before producing its action. For a 5-step agent task, that trace adds 1,000 to 5,000 tokens per turn depending on task complexity. Those tokens take time to generate and consume context budget.

When that cost is worth paying:

- Planning steps where the agent decides what to do next.
- Debugging sessions where the problem is genuinely ambiguous.
- Multi-file refactoring where the agent needs to reason about side effects across files.

In these cases, the reasoning trace catches mistakes before they become tool calls with wrong arguments.

When it is not worth paying: mechanical tool-call loops where each step is unambiguous (list directory → read file → write file → commit). The model does not need to think hard about these steps. Non-thinking mode is faster and produces comparable output quality for them.

Switch modes per-request, not globally:

    # Thinking mode: planning, debugging, complex multi-file tasks
    THINKING_PARAMS = {
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
    }

    # Non-thinking mode: mechanical loops, fast status checks.
    # Pass enable_thinking=False in the chat template, or
    # add "/no_think" to the system prompt to suppress thinking mode.
    NON_THINKING_PARAMS = {
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
    }

The preserve_thinking flag, the Qwen3.6-specific capability that retains reasoning context across turns, directly impacts inference efficiency when prefix caching is active. Here is why it matters practically: in a 10-turn agent session, each turn shares a prefix of the conversation history. When preserve_thinking=True, the full reasoning trace from prior turns stays in the history. If the traces were stripped instead, the history sent on the next turn would no longer match the tokens the server already processed, and the cached prefix would stop matching from that point on. The KV cache on the server side recognizes the shared prefix across turns and avoids recomputing it.
The effective tokens-per-second rate for long sessions is meaningfully higher than without it, particularly when serving infrastructure like SGLang with --enable-prefix-caching is running.

The practical rule: use preserve_thinking=True for agent sessions that will run for more than 5 turns. Use preserve_thinking=False or non-thinking mode for single-turn queries and short pipelines where the overhead is a waste.

Conclusion

Qwen3.6-35B-A3B's MoE architecture gives you 35B model quality at 3B activation cost. Its 262k context window gives you room to hold an entire code review session in context. Its explicit training on MCP-based agentic benchmarks means it knows how to use tools correctly, not just call them.

MCP provides the connective tissue. Define a tool once as an MCP server. Every Qwen3.6 session and every other MCP-compatible model can discover and call it without custom glue. The GitHub and filesystem servers in this article are two of hundreds of pre-built servers in the MCP ecosystem (https://modelcontextprotocol.io/examples). The custom code quality server shows the pattern for anything that does not already exist.

The GitHub developer assistant in this article is one application of the pattern. The same architecture (local model, MCP tools, and agentic loop) works for a research assistant that searches academic databases and drafts literature reviews, a DevOps agent that reads CloudWatch logs and opens incident tickets, or a data pipeline agent that reads SQL schemas, writes transformation code, and validates outputs.

The MCP ecosystem is growing fast. The local model capability is already there.

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on LinkedIn (https://www.linkedin.com/in/olumide-shittu/).