# Building Local AI Systems: Qwen3.6 + MCPs

> Source: <https://www.kdnuggets.com/building-local-ai-systems-qwen3-6-mcps>
> Published: 2026-06-30 14:00:14+00:00

Define a tool once as an MCP server and any MCP-compatible client, any model, any framework, can discover and call it with zero custom integration code per model.

## # Introducing MCP

Every developer building with local AI hits the same wall eventually. The model works. It reasons well, writes solid code, and answers complex questions. But it cannot do everything. It cannot query your database, open a GitHub issue, or call your internal API. You are left writing custom Python wrappers for every tool you need, hardcoding the glue between model output and tool execution, and maintaining those wrappers every time an API changes.

The [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) was designed to solve exactly this. It is an open standard introduced by Anthropic: a universal, pluggable protocol for AI tool connectivity. Define a tool once as an MCP server, and any MCP-compatible client, with any model and any framework, can discover and call it with zero custom integration code per model.

[Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) is one of the most capable local models for this kind of work right now. It has a 262,144-token context window and a Mixture of Experts (MoE) architecture that activates only 3B of its 35B parameters per forward pass, which is why it runs at usable speeds on hardware that could not serve a dense 35B model. It was also explicitly trained and evaluated on MCP-based agentic tasks.

This article builds a local GitHub developer assistant: an agent that reads a repository's open issues, searches the relevant code, drafts a fix, and creates a pull request. The whole thing runs on your hardware, through MCP servers, with no cloud dependency.

## # Understanding Qwen3.6-35B-A3B

Understanding the architecture matters here because it directly explains what hardware you need and why the model performs the way it does on agentic tasks.

The name encodes the key fact: 35B total parameters, A3B meaning 3B activated per forward pass. It is an MoE model with [256 experts per layer, routing 8 plus 1 shared experts per token](https://huggingface.co/Qwen/Qwen3.6-35B-A3B). You get the knowledge capacity of a 35B model at the inference compute cost of a 3B model. That trade-off is why it fits on hardware that would collapse under a dense 35B.
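To make the routing arithmetic concrete, here is a toy sketch of top-k expert selection for a single token. Only the counts (256 routed experts, 8 selected, 1 shared) come from the model card; the random router scores and the `route_token` helper are illustrative assumptions, not Qwen3.6's actual router.

```python
import math
import random

NUM_EXPERTS = 256   # routed experts per MoE layer (from the model card)
TOP_K = 8           # routed experts activated per token
NUM_SHARED = 1      # shared expert that every token passes through


def route_token(router_logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)          # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]   # hypothetical router scores
selected = route_token(logits)

active_fraction = (TOP_K + NUM_SHARED) / (NUM_EXPERTS + NUM_SHARED)
print(f"experts used for this token: {len(selected) + NUM_SHARED} of {NUM_EXPERTS + NUM_SHARED}")
print(f"fraction of expert weights touched: {active_fraction:.1%}")
```

Only a small slice of the expert weights participates in each token's forward pass, which is where the compute saving comes from.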

The hidden layout is where Qwen3.6 diverges most from other MoE models. Each block in the 40-layer stack follows a 3:1 ratio of Gated DeltaNet layers to Gated Attention layers. DeltaNet is a linear attention mechanism; it processes sequences more efficiently than full quadratic attention, especially at long context lengths. The interleaved full Gated Attention layers provide the deep relational reasoning that linear attention alone misses. For an agent working through a 500-file repository, that combination matters: efficient processing at length combined with precise reasoning on the relevant sections.
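As a quick sanity check on the layer arithmetic, the sketch below expands a 3:1 pattern over 40 layers. The exact position of the attention layer inside each group of four is an assumption made for illustration; the article only states the ratio and the layer count.

```python
NUM_LAYERS = 40
# Assumed ordering within each group of four: three DeltaNet layers, then one attention layer
PATTERN = ["deltanet", "deltanet", "deltanet", "attention"]

layers = [PATTERN[i % len(PATTERN)] for i in range(NUM_LAYERS)]

deltanet_layers = layers.count("deltanet")
attention_layers = layers.count("attention")
print(f"DeltaNet layers: {deltanet_layers}, full attention layers: {attention_layers}")
```

Under this assumption, only a quarter of the stack pays the quadratic attention cost at long context lengths.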

The context window is 262,144 tokens natively, extensible to 1,010,000 with YaRN scaling. For agent work, context length is not a comfort feature; it is an operational constraint. An agent reading source files, maintaining tool call history, tracking a multi-step plan, and injecting tool results back into context needs real headroom. Most 7B and 13B models cap at 8k or 32k tokens. Running out of context mid-task means the agent loses its own history and starts hallucinating tool results.
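One way to keep an eye on that headroom is to track a rough token budget as the conversation grows. This is a minimal sketch: the four-characters-per-token heuristic, the reserved output size, and the helper names are all assumptions, and a real agent would count tokens with the model's tokenizer.

```python
CONTEXT_LIMIT = 262_144       # native context window quoted above
RESERVED_FOR_OUTPUT = 4_096   # assumption: headroom kept free for the next completion


def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about 4 characters per token."""
    return max(1, len(text) // 4)


def remaining_budget(messages: list[dict]) -> int:
    """Tokens still available after the current history and the reserved output."""
    used = sum(estimate_tokens(m.get("content") or "") for m in messages)
    return CONTEXT_LIMIT - RESERVED_FOR_OUTPUT - used


history = [
    {"role": "system", "content": "You are a senior software engineer."},
    {"role": "user", "content": "Fix the login bug."},
    {"role": "tool", "content": "x" * 400_000},   # a large file dump returned by a tool
]

budget = remaining_budget(history)
print(f"estimated tokens remaining: {budget:,}")
if budget < 0:
    print("history no longer fits: truncate or summarize old tool results")
```

A single large tool result can consume a meaningful share of the window, which is why truncating tool output before injecting it is worth doing.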

Qwen3.6 was explicitly trained and evaluated on MCP-based agentic benchmarks. Two headline features came out of that training:

- **Agentic Coding.** Frontend workflows and repository-level reasoning: the model handles multi-file refactoring tasks with coherent reasoning across files, not just single-file edits in isolation.
- **Thinking Preservation.** A `preserve_thinking` flag that retains reasoning traces from prior turns in a multi-turn conversation. When an agent reasons through a plan in turn one and then executes tool calls in turns two through five, `preserve_thinking=True` keeps the turn-one reasoning available in the KV cache. Each subsequent turn benefits from that prior reasoning without paying the cost of re-deriving it.
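The difference between the two behaviours can be illustrated client-side. The sketch below is not how the flag is implemented (that happens in the chat template and serving layer); it assumes reasoning is wrapped in `<think>...</think>` tags, and `prepare_history` is a hypothetical helper that shows what the next turn would see in each case.

```python
import re

# Assumption: reasoning traces are delimited by <think>...</think> in assistant content
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def prepare_history(messages: list[dict], preserve_thinking: bool) -> list[dict]:
    """Return the history to send on the next turn.

    preserve_thinking=True  -> earlier reasoning stays in the history unchanged.
    preserve_thinking=False -> reasoning is stripped from earlier assistant turns.
    """
    if preserve_thinking:
        return [dict(m) for m in messages]
    cleaned = []
    for m in messages:
        m = dict(m)   # copy so the caller's history is not mutated
        if m.get("role") == "assistant" and isinstance(m.get("content"), str):
            m["content"] = THINK_BLOCK.sub("", m["content"]).strip()
        cleaned.append(m)
    return cleaned


history = [
    {"role": "user", "content": "Fix the login bug."},
    {"role": "assistant",
     "content": "<think>Plan: read auth.py, then the tests.</think>I will start with auth.py."},
]

print(prepare_history(history, preserve_thinking=False)[1]["content"])
print(prepare_history(history, preserve_thinking=True)[1]["content"])
```

With preservation off, the plan from turn one is gone by turn two and the model has to reconstruct it from the visible actions alone.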

## # System Requirements

There are three realistic deployment paths, and which one you use depends entirely on your hardware.

- **GPU inference (recommended for production agent workloads).** Qwen3.6-35B-A3B in bfloat16 requires approximately 70 GB VRAM. In Q4 quantization, it fits in approximately 20–24 GB. A single RTX 4090 (24 GB) handles Q4. Two RTX 3090s with tensor parallelism handle Q4 as well. An A100 80 GB handles the full bfloat16 model.
- **CPU/Hybrid via KTransformers.** KTransformers is the accessible path for developers without a 24 GB GPU. It offloads compute-heavy layers to GPU when available and runs the rest on CPU. With 64 GB system RAM, you can run Qwen3.6-35B-A3B in a usable (if slower) configuration. Response latency will be 30–120 seconds per turn depending on your CPU, which is workable for an agent doing background repository analysis but not for interactive coding sessions.
- **Smaller models for tutorial testing.** The entire MCP integration pattern in this article is identical regardless of model size. If you want to follow along without the hardware for the full 35B model, use `Qwen/Qwen2.5-7B-Instruct` via Ollama (`ollama pull qwen2.5:7b`) or the Qwen3-8B model. The serving API is the same, the code is identical, and you can swap in the 35B model when hardware permits.
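The VRAM figures above follow from simple arithmetic on the parameter count. The sketch below computes the memory needed for the weights alone; `weight_memory_gb` is a hypothetical helper, and the gap between its Q4 result and the 20–24 GB quoted above is presumably quantization metadata, KV cache, and activations, which this estimate ignores.

```python
def weight_memory_gb(total_params_billions: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# All 35B parameters must be resident in memory, even though only about 3B
# are active per token: MoE reduces compute per token, not weight storage.
bf16_gb = weight_memory_gb(35, 16)
q4_gb = weight_memory_gb(35, 4)

print(f"bfloat16 weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:    {q4_gb:.1f} GB (before KV cache and runtime overhead)")
```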

Software requirements:

```
# Python 3.11+ required
python --version

python -m venv qwen-mcp-env
source qwen-mcp-env/bin/activate    # macOS / Linux
qwen-mcp-env\Scripts\activate       # Windows

# Core packages
pip install \
  "openai>=1.30.0" \
  "qwen-agent>=0.0.10" \
  "mcp>=1.0.0" \
  "httpx>=0.27.0"

# Serving framework -- choose one
pip install "vllm>=0.19.0"       # NVIDIA GPU
pip install "sglang>=0.5.10"     # NVIDIA GPU (faster prefill for long context)
pip install "ktransformers"      # CPU/hybrid

# Node.js 18+ is required for pre-built MCP servers installed via npx
node --version
```

## # Serving Qwen3.6 Locally with an OpenAI-Compatible API

Before wiring in any MCP servers, you need a running inference server. Both SGLang and vLLM expose an OpenAI-compatible API that the MCP integration layer talks to: the same API surface, just pointed at localhost instead of api.openai.com.

#### // SGLang (Recommended for Long-Context Agent Workloads)

```
# Install SGLang with full dependencies
pip install "sglang[all]>=0.5.10"

# Serve Qwen3.6-35B-A3B with reasoning and tool-call parsers enabled.
# --reasoning-parser qwen3 correctly handles the <think>...</think> blocks.
# --tool-call-parser qwen3_coder routes tool call outputs to the right format.
# Prefix caching (RadixAttention) is enabled by default in SGLang, so no extra flag is
#   needed; that KV cache reuse across turns is what makes preserve_thinking efficient.

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.6-35B-A3B \
    --host 0.0.0.0 \
    --port 30000 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --tp 2    # tensor parallel across 2 GPUs; remove if using single GPU
```

#### // vLLM

```
pip install "vllm>=0.19.0"

# vLLM equivalent with the same critical flags.
# --enable-auto-tool-choice is required for the tool-call parser to take effect.
vllm serve Qwen/Qwen3.6-35B-A3B \
    --host 0.0.0.0 \
    --port 8000 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching \
    --tensor-parallel-size 2
```

#### // Smaller Model via Ollama

```
ollama pull qwen2.5:7b
ollama serve
# Ollama's API is OpenAI-compatible at http://localhost:11434/v1
```

Once the server is running, verify it before going any further:

```
# Health check (SGLang port shown; use 8000 for vLLM) -- expect an HTTP 200 status
curl -i http://localhost:30000/health

# Test the chat completions endpoint with a simple query
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Reply with: ready"}],
    "max_tokens": 10
  }'
```

If you get a JSON response with a `choices` array, the server is ready. Do not proceed to MCP setup until this works. Every integration failure you will encounter later is easier to debug when you know the serving layer is solid.
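The same check can be scripted so that an agent refuses to start when the serving layer is down. This is a minimal sketch using only the standard library; `server_ready`, `build_probe_payload`, and `has_choices` are hypothetical helper names, and the base URL assumes the SGLang port used earlier.

```python
import json
import urllib.request


def build_probe_payload(model: str) -> dict:
    """The same minimal chat request as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Reply with: ready"}],
        "max_tokens": 10,
    }


def has_choices(body: dict) -> bool:
    """True if the response body carries a non-empty `choices` array."""
    choices = body.get("choices")
    return isinstance(choices, list) and len(choices) > 0


def server_ready(base_url: str, model: str, timeout: float = 30.0) -> bool:
    """POST a probe to {base_url}/chat/completions and report whether it answered."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_probe_payload(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return has_choices(json.load(response))
    except (OSError, ValueError):
        # OSError covers refused connections, HTTP errors, and timeouts;
        # ValueError covers a body that is not valid JSON.
        return False


if __name__ == "__main__":
    ok = server_ready("http://localhost:30000/v1", "Qwen/Qwen3.6-35B-A3B")
    print("server ready" if ok else "server not ready")
```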

## # Understanding MCP and Why It Changes the Agent Architecture

Before writing any agent code, it helps to understand what MCP actually does at the protocol level, because that understanding prevents a category of bugs that come from treating MCP as just a fancier function-calling API.

MCP is a [JSON-RPC 2.0 protocol](https://modelcontextprotocol.io/docs/concepts/architecture) running over stdio or HTTP transport. When an MCP client connects to a server, it completes an `initialize` handshake and then calls `tools/list` to discover what tools the server exposes. Each tool comes back with a name, a description, and an input schema defined in JSON Schema. The model reads this schema. It is the model's contract with the tool.

When the model wants to call a tool, it emits a structured tool call object. The MCP client, not the model, executes the call by sending a `tools/call` request to the server. The server handles execution and returns a result. The client injects that result back into the conversation as a `tool` role message. The model reads the result and decides the next step.

This separation is important. The model decides what to call and with what arguments. The client handles execution. The server handles the actual work. Your code never hardwires a tool to a model; you just tell the client which servers are available.
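To ground the protocol description, the sketch below writes out the JSON-RPC messages for one discovery and one call, then shows the two translations a client performs: MCP tool schema to OpenAI-style tool definition, and MCP result to `tool` role message. The `read_file` tool, the ids, and the helper names are illustrative assumptions; the message shapes follow the MCP specification.

```python
import json

# Client -> server: discover tools
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Server -> client: one tool with a JSON Schema for its input
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "read_file",   # illustrative tool
                "description": "Read a file from disk",
                "inputSchema": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            }
        ]
    },
}

# Client -> server: execute the call the model asked for
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "read_file", "arguments": {"path": "src/auth.py"}},
}

# Server -> client: result content as a list of typed parts
call_response = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {"content": [{"type": "text", "text": "def verify_token(token): ..."}],
               "isError": False},
}


def to_openai_tool(mcp_tool: dict) -> dict:
    """Translate an MCP tool description into the OpenAI tools format."""
    return {
        "type": "function",
        "function": {
            "name": mcp_tool["name"],
            "description": mcp_tool.get("description", ""),
            "parameters": mcp_tool["inputSchema"],
        },
    }


def to_tool_message(call_id: str, call_result: dict) -> dict:
    """Translate an MCP tools/call result into a `tool` role chat message."""
    text = "\n".join(p["text"] for p in call_result["content"] if p.get("type") == "text")
    return {"role": "tool", "tool_call_id": call_id, "content": text}


tool_def = to_openai_tool(list_response["result"]["tools"][0])
tool_msg = to_tool_message("call_0", call_response["result"])
print(json.dumps(tool_def, indent=2))
print(json.dumps(tool_msg, indent=2))
```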

There are two ways to use MCP with Qwen3.6:

- **Via Qwen-Agent**: the official `qwen_agent` library handles tool discovery, call parsing, result injection, and multi-turn conversation management automatically. Less code, less control. Right for most use cases.
- **Via the MCP Python SDK directly**: you handle the agentic loop yourself using `mcp.ClientSession`. More code, full visibility into every message, complete control over error handling and retry logic. Right for production systems where you need to monitor every step.

This article covers both, starting with Qwen-Agent.

## # Building the Local GitHub Developer Assistant

The agent does four things in sequence: reads open issues from a GitHub repository, finds the relevant code, drafts a fix, and opens a pull request. All locally, all through MCP.

#### // Part 1: Environment and MCP Server Setup

```
# Set your GitHub personal access token
# Required by the GitHub MCP server for API calls
export GITHUB_TOKEN=ghp_your_token_here

# Pre-built MCP servers install via npx -- no separate install step
# npx handles this on first use when the agent starts the servers
# Verify npx is available:
npx --version
```

Create a project directory:

```
mkdir qwen-github-agent
cd qwen-github-agent
```

#### // Part 2: Qwen-Agent Implementation

The fastest path to a working agent. Qwen-Agent handles the full loop automatically.

```
# github_agent_qwenagent.py
# Prerequisites: pip install qwen-agent openai
#   npm / npx must be installed for the MCP servers
#   GITHUB_TOKEN env var must be set
#   Local serving endpoint must be running (see previous section)
#
# How to run:
#   python github_agent_qwenagent.py

import os

from qwen_agent.agents import Assistant

# ── Server configuration ──────────────────────────────────────────────────────

# Point at your local serving endpoint.
# Change the base_url to match whichever server you started:
#   SGLang:  http://localhost:30000/v1
#   vLLM:    http://localhost:8000/v1
#   Ollama:  http://localhost:11434/v1
LLM_CONFIG = {
    "model":     "Qwen/Qwen3.6-35B-A3B",
    "model_server": "http://localhost:30000/v1",
    "api_key":   "EMPTY",           # Local servers do not require a real key

    # Thinking mode sampling params (from the official model card best practices)
    "generate_cfg": {
        "temperature":       0.6,
        "top_p":             0.95,
        "top_k":             20,
        "min_p":             0.0,
        "thought_in_history": True,   # This is the preserve_thinking flag in Qwen-Agent
    },
}

# ── MCP server configuration ──────────────────────────────────────────────────
# Each server key names the server; the value is the stdio launch command.
# Qwen-Agent starts each server as a subprocess and manages the MCP sessions.

MCP_SERVERS = {
    "mcpServers": {
        "filesystem": {
            "command": "npx",
            "args": [
                "-y",
                "@modelcontextprotocol/server-filesystem",
                # Grant the agent access to the current working directory
                # In production, restrict to the specific repository path
                "."
            ]
        },
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {
                # The GitHub MCP server reads GITHUB_PERSONAL_ACCESS_TOKEN for API
                # authentication. Python does not expand "${GITHUB_TOKEN}" inside a
                # string, so read the value from the environment explicitly.
                "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", "")
            }
        },
    }
}

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """You are a senior software engineer with full access to a GitHub repository
via MCP tools.

When given a repository and task:
1. List open issues to understand what needs fixing
2. Use filesystem tools to read relevant source files and tests
3. Identify the root cause based on the code and the issue description
4. Write a targeted fix -- minimal changes, no refactoring unrelated to the bug
5. Create a pull request with a clear title and description referencing the issue

Always explain your reasoning at each step. Think through edge cases before writing code.
If you are uncertain about a file's purpose, read it before modifying it."""

# ── Agent setup ───────────────────────────────────────────────────────────────

agent = Assistant(
    llm=LLM_CONFIG,
    name="GitHub Developer Assistant",
    description="Reads issues, fixes bugs, opens pull requests -- locally via MCP.",
    system_message=SYSTEM_PROMPT,
    mcp_servers=MCP_SERVERS,
)

# ── Run the agent ─────────────────────────────────────────────────────────────

def run_agent(task: str):
    """
    Run the agent on a task description and stream the output.
    The agent will make tool calls automatically; Qwen-Agent handles
    the full loop including tool execution and result injection.
    """
    messages = [{"role": "user", "content": task}]

    print(f"Task: {task}\n{'─' * 70}")

    # Qwen-Agent's run() is a generator that yields intermediate steps
    # Each yielded message shows a tool call, a tool result, or the final answer
    for response in agent.run(messages=messages):
        # response is a list of messages representing the conversation so far
        # The last message contains the most recent output
        last = response[-1]
        role    = last.get("role", "")
        content = last.get("content", "")

        if role == "assistant" and content:
            # Strip and display the thinking block separately for readability
            import re
            thinking = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
            if thinking:
                print(f"[thinking] {thinking.group(1).strip()[:200]}...")
            clean = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
            if clean:
                print(f"[agent] {clean}")

        elif role == "tool":
            tool_name = last.get("name", "unknown_tool")
            print(f"[tool:{tool_name}] result received")

if __name__ == "__main__":
    run_agent(
        "In the repository myorg/my-api-project, find the open issue about "
        "the login endpoint returning 200 for invalid tokens. Read the relevant "
        "code and tests, fix the bug, and open a pull request."
    )
```

**How to run:**

```
python github_agent_qwenagent.py
```

#### // Part 3: Raw MCP SDK Implementation

For teams who need full control over every protocol message, custom error handling, per-tool retry logic, and audit logging of every tool call and result:

```
# github_agent_raw.py
# Prerequisites: pip install mcp openai httpx
#   GITHUB_TOKEN env var must be set, local server must be running
#
# How to run:
#   python github_agent_raw.py

import asyncio
import json
import os
import re
from openai import AsyncOpenAI
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# ── Local serving client ───────────────────────────────────────────────────────
client = AsyncOpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

MODEL = "Qwen/Qwen3.6-35B-A3B"

# ── Response processing ───────────────────────────────────────────────────────

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks. Used when we only need the action."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def extract_thinking(text: str) -> str:
    """Extract the content of the thinking block for logging."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def process_response(response, preserve_thinking: bool = True) -> dict:
    """
    Process a chat completion response from Qwen3.6.

    Handles two output formats:
    1. Tool call via the API's function_call / tool_calls field (when --tool-call-parser is active)
    2. Tool call embedded in the message content as JSON

    Args:
        response:          The OpenAI-compatible completion response
        preserve_thinking: If True, keep thinking content in output for
                           the next turn's KV cache benefit

    Returns:
        dict with thinking, tool_calls, final_answer, has_tool_calls, is_terminal
    """
    choice  = response.choices[0]
    message = choice.message

    # Path 1: Tool calls in the structured field (preferred -- requires tool-call-parser flag)
    if message.tool_calls:
        tool_calls = [
            {
                "name":      tc.function.name,
                "arguments": json.loads(tc.function.arguments),
                "call_id":   tc.id,
            }
            for tc in message.tool_calls
        ]
        thinking = extract_thinking(message.content or "")
        return {
            "thinking":       thinking if preserve_thinking else "",
            "tool_calls":     tool_calls,
            "final_answer":   "",
            "has_tool_calls": True,
            "is_terminal":    False,
        }

    # Path 2: Tool calls embedded in content text (fallback)
    content = message.content or ""
    tag_matches = re.findall(r"<tool_call>(.*?)</tool_call>", content, re.DOTALL)
    tool_calls = []
    for m in tag_matches:
        try:
            tool_calls.append(json.loads(m.strip()))
        except json.JSONDecodeError:
            pass

    thinking     = extract_thinking(content)
    final_answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    final_answer = re.sub(r"<tool_call>.*?</tool_call>", "", final_answer, flags=re.DOTALL).strip()

    return {
        "thinking":       thinking if preserve_thinking else "",
        "tool_calls":     tool_calls,
        "final_answer":   final_answer,
        "has_tool_calls": len(tool_calls) > 0,
        "is_terminal":    len(tool_calls) == 0 and bool(final_answer),
    }

# ── Core agent loop ───────────────────────────────────────────────────────────

async def run_github_agent(task: str, repo: str, max_turns: int = 20):
    """
    Run the GitHub developer assistant agent.

    Connects to filesystem and GitHub MCP servers, discovers their tools,
    and runs the Qwen3.6 agent loop until the task is complete or max_turns reached.
    """
    # Start both MCP servers and establish sessions
    fs_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "."],
    )
    gh_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-github"],
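        # The GitHub MCP server reads GITHUB_PERSONAL_ACCESS_TOKEN, so map GITHUB_TOKEN onto it
        env={**os.environ, "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", "")},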
    )

    async with stdio_client(fs_params) as (fs_read, fs_write), \
               ClientSession(fs_read, fs_write) as fs_session, \
               stdio_client(gh_params) as (gh_read, gh_write), \
               ClientSession(gh_read, gh_write) as gh_session:

        # Initialize both sessions
        await fs_session.initialize()
        await gh_session.initialize()

        # Discover all available tools from both servers
        fs_tools_result = await fs_session.list_tools()
        gh_tools_result = await gh_session.list_tools()

        # Build the OpenAI-format tool list for the model
        all_tools = []
        tool_to_session = {}   # Maps tool name to the MCP session that owns it

        for tool in fs_tools_result.tools:
            all_tools.append({
                "type": "function",
                "function": {
                    "name":        tool.name,
                    "description": tool.description,
                    "parameters":  tool.inputSchema,
                }
            })
            tool_to_session[tool.name] = fs_session

        for tool in gh_tools_result.tools:
            all_tools.append({
                "type": "function",
                "function": {
                    "name":        tool.name,
                    "description": tool.description,
                    "parameters":  tool.inputSchema,
                }
            })
            tool_to_session[tool.name] = gh_session

        print(f"Tools available: {len(all_tools)} ({len(fs_tools_result.tools)} filesystem, "
              f"{len(gh_tools_result.tools)} GitHub)")

        # Build conversation history
        system_prompt = f"""You are a senior software engineer with access to the repository {repo}.
Use the available tools to investigate issues, read code, write fixes, and create pull requests.
Think step by step. Read before you modify. Minimal changes only."""

        messages = [
            {"role": "system",  "content": system_prompt},
            {"role": "user",    "content": task},
        ]

        # ── Agent loop ─────────────────────────────────────────────────────────
        for turn in range(max_turns):
            print(f"\n[Turn {turn + 1}]")

            # Call the model
            response = await client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=all_tools if all_tools else None,
                tool_choice="auto",
                # Thinking mode sampling params from the official best practices
                temperature=0.6,
                top_p=0.95,
                max_tokens=4096,
                extra_body={
                    # top_k and min_p are not parameters of the OpenAI SDK's create();
                    # they are server-side extensions and must travel in extra_body
                    "top_k": 20,
                    "min_p": 0.0,
                    # preserve_thinking keeps reasoning context across turns
                    # for KV cache efficiency on long agent sessions
                    "preserve_thinking": True,
                }
            )

            result = process_response(response, preserve_thinking=True)

            if result["thinking"]:
                print(f"[thinking] {result['thinking'][:200]}...")

            # Terminal state -- agent has produced a final answer
            if result["is_terminal"]:
                print(f"\n[DONE]\n{result['final_answer']}")
                return result["final_answer"]

            # Tool call state -- execute each tool and inject results
            if result["has_tool_calls"]:
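                # Append the assistant's message with tool calls to history
                assistant_msg = {
                    "role":    "assistant",
                    "content": response.choices[0].message.content or "",
                }
                structured_calls = response.choices[0].message.tool_calls
                if structured_calls:
                    # Serialize to plain dicts; omit the key entirely when there are none
                    assistant_msg["tool_calls"] = [tc.model_dump() for tc in structured_calls]
                messages.append(assistant_msg)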

                for call in result["tool_calls"]:
                    tool_name = call["name"]
                    tool_args = call.get("arguments", {})
                    call_id   = call.get("call_id", "")

                    print(f"[tool] {tool_name}({json.dumps(tool_args)[:80]}...)")

                    session = tool_to_session.get(tool_name)
                    if not session:
                        result_content = f"Error: tool '{tool_name}' not found"
                    else:
                        try:
                            tool_result = await session.call_tool(tool_name, tool_args)
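                            # Join the text parts of the result; non-text parts fall back to str()
                            result_content = "\n".join(
                                getattr(part, "text", str(part)) for part in tool_result.content
                            )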
                            # Truncate very long results to protect context budget
                            if len(result_content) > 12000:
                                result_content = result_content[:12000] + "\n...[truncated]"
                        except Exception as e:
                            result_content = f"Error: {e}"

                    print(f"[result] {result_content[:150]}...")

                    messages.append({
                        "role":        "tool",
                        "content":     result_content,
                        "tool_call_id": call_id,
                        "name":        tool_name,
                    })

        print(f"[WARNING] max_turns ({max_turns}) reached without terminal state")

# ── Entry point ───────────────────────────────────────────────────────────────

if __name__ == "__main__":
    asyncio.run(run_github_agent(
        task=(
            "Find the open issue about the login endpoint returning 200 for invalid tokens. "
            "Read src/auth.py and tests/test_auth.py to understand the bug. "
            "Fix the verify_token function and open a pull request with your changes."
        ),
        repo="myorg/my-api-project",
    ))
```

**How to run:**

```
python github_agent_raw.py
```

The raw SDK path gives you what Qwen-Agent abstracts: you can see every tool call, every result, and every message injected into the conversation history. The `tool_to_session` routing dict is the key mechanism; it maps each tool name to the MCP session that owns it, so the agent can call any tool from any connected server without knowing which server provides it.
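One thing a flat name-to-session map does not handle is two servers exposing a tool with the same name: the later server silently overwrites the earlier one. The sketch below shows one way to detect that at startup. `build_routing_table` is a hypothetical helper, it maps to server names rather than live sessions to stay self-contained, and the tool names are examples.

```python
def build_routing_table(servers: dict[str, list[str]]) -> dict[str, str]:
    """Map each tool name to the server that owns it.

    Raises ValueError if two servers expose the same tool name, instead of
    letting the later server silently shadow the earlier one.
    """
    routing: dict[str, str] = {}
    for server_name, tool_names in servers.items():
        for tool_name in tool_names:
            owner = routing.get(tool_name)
            if owner is not None:
                raise ValueError(
                    f"tool '{tool_name}' is exposed by both '{owner}' and '{server_name}'"
                )
            routing[tool_name] = server_name
    return routing


# Example tool lists, as they might come back from each server's tools/list
servers = {
    "filesystem": ["read_file", "write_file", "list_directory"],
    "github": ["list_issues", "create_pull_request"],
}

routing = build_routing_table(servers)
print(routing["create_pull_request"])
```

Failing fast here is a design choice; prefixing tool names with the server name is a reasonable alternative when collisions are expected.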

## # Writing a Custom MCP Server

Pre-built MCP servers handle the filesystem and GitHub. When you need something that does not exist (querying an internal database, wrapping a CI/CD API, running code analysis tools), you write an MCP server. Here is a complete `code_quality` server that exposes `ruff` and `pytest` as MCP tools.

```
# code_quality_server.py
# A custom MCP server exposing code quality tools to Qwen3.6.
#
# Prerequisites:
#   pip install mcp ruff pytest pytest-json-report
#   (pytest-json-report provides the --json-report flags used by run_tests)
#
# How to run standalone (for testing):
#   python code_quality_server.py
#
# To add to the Qwen-Agent config:
#   "code_quality": {
#       "command": "python",
#       "args": ["/absolute/path/to/code_quality_server.py"]
#   }

import asyncio
import json
import subprocess
import sys
from mcp.server.fastmcp import FastMCP

# FastMCP is a high-level MCP server framework -- reduces boilerplate significantly
mcp = FastMCP("code_quality")

@mcp.tool()
def run_linter(file_path: str, fix: bool = False) -> str:
    """
    Run ruff linter on a Python file and return structured lint results.
    Use this before modifying a file to understand its current quality state,
    and after making changes to verify the fix did not introduce new issues.

    Args:
        file_path: Absolute or relative path to the Python file to lint.
        fix:       If true, automatically fix safe issues in place.

    Returns:
        JSON string with issues list, issue count, and files modified.
    """
    cmd = ["python", "-m", "ruff", "check", file_path, "--output-format=json"]
    if fix:
        cmd.append("--fix")

    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        # ruff returns exit code 1 when issues are found -- not an error
        output = result.stdout or result.stderr

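        # Parse ruff's JSON output
        try:
            issues = json.loads(output) if output.strip() else []
        except json.JSONDecodeError:
            # Non-JSON output usually means ruff itself failed (for example, it is
            # not installed), so report that instead of claiming zero issues
            return json.dumps({"error": "ruff did not return JSON",
                               "file": file_path, "output": output[:1000]})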

        formatted = [
            {
                "line":    issue.get("location", {}).get("row", 0),
                "col":     issue.get("location", {}).get("column", 0),
                "code":    issue.get("code", ""),
                "message": issue.get("message", ""),
                "fix_available": issue.get("fix") is not None,
            }
            for issue in issues
            if isinstance(issue, dict)
        ]

        return json.dumps({
            "file":         file_path,
            "issues":       formatted,
            "total_issues": len(formatted),
            "fixed":        "auto-fix applied" if fix else "no auto-fix",
        }, indent=2)

    except subprocess.TimeoutExpired:
        return json.dumps({"error": "Linter timed out after 30s", "file": file_path})
    except FileNotFoundError:
        return json.dumps({"error": "ruff not found -- install with: pip install ruff"})

@mcp.tool()
def run_tests(target: str, verbose: bool = False) -> str:
    """
    Run pytest on a module or directory and return structured pass/fail results.
    Use this after writing a fix to verify the fix makes failing tests pass
    without breaking other tests.

    Args:
        target:  Path to the test file or directory to run (e.g. tests/, tests/test_auth.py)
        verbose: If true, include full pytest output in the result.

    Returns:
        JSON string with pass count, fail count, failure details, and duration.
    """
    cmd = ["python", "-m", "pytest", target, "--json-report", "--json-report-file=-", "-q"]
    if verbose:
        cmd.append("-v")

    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
        output = result.stdout

        # Parse pytest-json-report output if available
        try:
            report = json.loads(output)
            summary  = report.get("summary", {})
            failures = [
                {
                    "test":    t["nodeid"],
                    "message": t.get("call", {}).get("longrepr", "")[:500],
                }
                for t in report.get("tests", [])
                if t.get("outcome") == "failed"
            ]
            return json.dumps({
                "target":   target,
                "passed":   summary.get("passed", 0),
                "failed":   summary.get("failed", 0),
                "errors":   summary.get("error", 0),
                "total":    summary.get("total", 0),
                "duration": summary.get("duration", 0),
                "failures": failures,
                "stdout":   result.stdout[:2000] if verbose else "",
            }, indent=2)

        except json.JSONDecodeError:
            # Fallback: return raw output if JSON report not available
            return json.dumps({
                "target":  target,
                "stdout":  result.stdout[:3000],
                "stderr":  result.stderr[:1000],
                "exit_code": result.returncode,
            })

    except subprocess.TimeoutExpired:
        return json.dumps({"error": f"Tests timed out after 120s for target: {target}"})
    except FileNotFoundError:
        return json.dumps({"error": "pytest not found -- install with: pip install pytest"})

if __name__ == "__main__":
    mcp.run(transport="stdio")
```

Add it to either agent implementation's server config:

```
# In Qwen-Agent MCP_SERVERS dict:
"code_quality": {
    "command": "python",
    "args": ["/absolute/path/to/code_quality_server.py"]
}

# In the raw SDK, add a third StdioServerParameters:
cq_params = StdioServerParameters(
    command="python",
    args=["/absolute/path/to/code_quality_server.py"],
)
```

Test the server standalone before connecting the agent:

```
# Test the server in MCP inspector mode
npx @modelcontextprotocol/inspector python code_quality_server.py
# Opens a browser UI where you can call run_linter and run_tests directly
```
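If you prefer a scripted check over the browser UI, the same smoke test can be sketched with the MCP Python SDK client. This is a minimal sketch, not part of the original project: it assumes the `mcp` package is installed, that the server file is named `code_quality_server.py` as above, and that it advertises the `run_linter` and `run_tests` tools. The helper names (`missing_tools`, `list_server_tools`, `main`) are hypothetical.

```python
import asyncio
import sys

# Tool names the code_quality server is expected to advertise.
EXPECTED_TOOLS = {"run_linter", "run_tests"}


def missing_tools(expected, available):
    """Return the expected tool names the server did not advertise, sorted."""
    return sorted(set(expected) - set(available))


async def list_server_tools(script_path):
    """Start the server over stdio and return the tool names it advertises."""
    # Imported lazily so missing_tools() stays usable without the SDK installed.
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    params = StdioServerParameters(command=sys.executable, args=[script_path])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            listing = await session.list_tools()
            return [tool.name for tool in listing.tools]


def main(script_path="code_quality_server.py"):
    """Print what the server advertises; return a non-zero code if tools are missing."""
    available = asyncio.run(list_server_tools(script_path))
    missing = missing_tools(EXPECTED_TOOLS, available)
    print("advertised:", available)
    print("missing:", missing or "none")
    return 1 if missing else 0
```

Calling `raise SystemExit(main())` from a script turns this into a check you can run before every agent session. It only lists tools, so it makes no assumptions about the arguments each tool accepts.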

## # Tuning Thinking Mode and Preserving Reasoning

The thinking mode decision affects latency significantly enough that it is worth treating as an explicit architecture choice, not an afterthought.

In thinking mode, Qwen3.6 generates a chain-of-thought reasoning trace inside `<think>...</think>` tags before producing its action. For a 5-step agent task, that trace adds 1,000 to 5,000 tokens per turn depending on task complexity. Those tokens take time to generate and consume context budget.

When that cost is worth paying:

- Planning steps where the agent decides what to do next.
- Debugging sessions where the problem is genuinely ambiguous.
- Multi-file refactoring where the agent needs to reason about side effects across files.

The reasoning trace catches mistakes before they become tool calls with wrong arguments.

When it is not worth paying: mechanical tool-call loops where each step is unambiguous, such as **list directory → read file → write file → commit**. The model does not need to think hard about these steps. Non-thinking mode is faster and produces output of the same quality.
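To put the per-turn trace cost in concrete terms, the arithmetic can be sketched as follows. The 1,000 to 5,000 token range comes from the figures above. The decode speed of 40 tokens per second is purely an assumption for illustration, so substitute the rate you measure on your own hardware. The function name `thinking_overhead` is hypothetical.

```python
def thinking_overhead(turns, low=1_000, high=5_000, decode_tokens_per_sec=40.0):
    """Estimate extra tokens and seconds spent on reasoning traces in a session.

    low/high are the per-turn trace sizes quoted in the article;
    decode_tokens_per_sec is an assumed generation speed, not a benchmark.
    """
    low_total = turns * low
    high_total = turns * high
    return {
        "tokens": (low_total, high_total),
        "seconds": (low_total / decode_tokens_per_sec,
                    high_total / decode_tokens_per_sec),
    }


estimate = thinking_overhead(turns=5)
print(estimate["tokens"])   # (5000, 25000)
print(estimate["seconds"])  # (125.0, 625.0)
```

Under that assumed speed, a 5-turn task spends roughly two to ten extra minutes on reasoning alone, which is why mechanical loops are better served by non-thinking mode.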

Switch modes per-request, not globally:

```
# Thinking mode (planning, debugging, complex multi-file tasks)
THINKING_PARAMS = {
    "temperature": 0.6,
    "top_p":       0.95,
    "top_k":       20,
    "min_p":       0.0,
}

# Non-thinking mode (mechanical loops, fast status checks)
# Pass enable_thinking=False in the chat template, or use system prompt:
# Add "/no_think" to the system prompt to suppress thinking mode.
NON_THINKING_PARAMS = {
    "temperature": 0.7,
    "top_p":       0.8,
    "top_k":       20,
    "min_p":       0.0,
}
```
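One way to apply the switch per request is to build the request arguments from the kind of step the agent is about to run. The sketch below is an illustration under several assumptions: the step names in `MECHANICAL_STEPS` and the served model name are hypothetical, and it assumes an OpenAI-compatible server that accepts `chat_template_kwargs`, `top_k`, and `min_p` through `extra_body`. Check your server's documentation before relying on those fields.

```python
# Sampling presets copied from the block above.
THINKING_PARAMS = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
NON_THINKING_PARAMS = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}

# Hypothetical labels for steps that need no deliberation.
MECHANICAL_STEPS = {"list_directory", "read_file", "write_file", "commit"}


def build_request(messages, step, model="qwen3.6-35b-a3b"):
    """Return chat-completion keyword arguments for one agent step.

    Thinking is enabled for everything except the mechanical steps.
    The model name is a placeholder for whatever name your server exposes.
    """
    thinking = step not in MECHANICAL_STEPS
    params = THINKING_PARAMS if thinking else NON_THINKING_PARAMS
    return {
        "model": model,
        "messages": messages,
        "temperature": params["temperature"],
        "top_p": params["top_p"],
        # Fields outside the standard OpenAI schema travel in extra_body.
        "extra_body": {
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


request = build_request([{"role": "user", "content": "Read src/app.py"}], "read_file")
print(request["extra_body"]["chat_template_kwargs"])  # {'enable_thinking': False}
```

With the `openai` Python client, the result can be passed as `client.chat.completions.create(**request)`. Keeping the decision in one function means the agent loop never hardcodes a mode.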

The `preserve_thinking` flag (the Qwen3.6-specific capability that retains reasoning context across turns) directly impacts inference efficiency when prefix caching is active. Here is why it matters in practice: in a 10-turn agent session, each turn shares a prefix of the conversation history. When `preserve_thinking=True`, the full reasoning trace from prior turns stays in the history. The KV cache on the server side recognizes the shared prefix across turns and avoids recomputing it. The effective tokens-per-second rate for long sessions is meaningfully higher than without it, particularly when the serving stack has prefix caching enabled (vLLM's `--enable-prefix-caching` flag, or SGLang's RadixAttention cache, which is on by default).

The practical rule: use `preserve_thinking=True` for agent sessions that will run for more than 5 turns. Use `preserve_thinking=False` (or non-thinking mode) for single-turn queries and short pipelines where the overhead is a waste.
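The prefix-sharing argument can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of the actual Qwen3.6 chat template or of any server's cache: tokens are plain strings, and the "cache" is just the sequence the server processed by the end of the previous turn. All function names and token values are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading tokens two sequences share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def render_history(turns, keep_thinking):
    """Flatten finished turns into a prompt, optionally dropping reasoning."""
    tokens = []
    for turn in turns:
        tokens.extend(turn["user"])
        if keep_thinking:
            tokens.extend(turn["thinking"])
        tokens.extend(turn["answer"])
    return tokens


def reusable_prefix(turns, next_user, keep_thinking):
    """Tokens of the next prompt that match what the server already processed."""
    *earlier, last = turns
    # The server saw the last turn's prompt plus everything it generated,
    # reasoning included, regardless of how history is rendered later.
    cached = (render_history(earlier, keep_thinking)
              + last["user"] + last["thinking"] + last["answer"])
    prompt = render_history(turns, keep_thinking) + next_user
    return common_prefix_len(cached, prompt)


turn = {"user": ["u1", "u2"], "thinking": ["t1", "t2", "t3"], "answer": ["a1", "a2"]}
print(reusable_prefix([turn], ["u3"], keep_thinking=True))   # 7
print(reusable_prefix([turn], ["u3"], keep_thinking=False))  # 2
```

In this toy model, keeping the trace lets the whole previous turn be reused, while dropping it makes the new prompt diverge from the cached sequence right after the user message. Real templates and caches are more nuanced, so measure cache hit rates on your own serving stack.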

## # Conclusion

Qwen3.6-35B-A3B's MoE architecture gives you 35B model quality at 3B activation cost. Its 262k context window gives you room to hold an entire code review session in context. Its explicit training on MCP-based agentic benchmarks means it knows how to use tools correctly, not just call them.

MCP provides the connective tissue. Define a tool once as an MCP server. Every Qwen3.6 session and every other MCP-compatible model can discover and call it without custom glue. The GitHub and filesystem servers in this article are two of hundreds of pre-built servers in the [MCP ecosystem](https://modelcontextprotocol.io/examples). The custom `code_quality` server shows the pattern for anything that does not already exist.

The GitHub developer assistant in this article is one application of the pattern. The same architecture — local model, MCP tools, and agentic loop — works for a research assistant that searches academic databases and drafts literature reviews, a DevOps agent that reads CloudWatch logs and opens incident tickets, or a data pipeline agent that reads SQL schemas, writes transformation code, and validates outputs. The MCP ecosystem is growing fast. The local model capability is already there.

[Shittu Olumide](https://www.linkedin.com/in/olumide-shittu/) is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts.