
Building Local AI Systems: Qwen3.6 + MCPs

Anthropic's Model Context Protocol (MCP) enables universal tool connectivity for AI agents, and Qwen3.6-35B-A3B, a Mixture of Experts model with 262k-token context, is optimized for MCP-based agentic tasks. A local GitHub developer assistant built with these tools can read issues, search code, draft fixes, and create pull requests entirely on-device without cloud dependencies.

22 min read · Published Jun 30, 2026 · Source: KDnuggets


# Introducing MCP #

Every developer building with local AI hits the same wall eventually. The model works. It reasons well, writes solid code, and answers complex questions. But it cannot do everything. It cannot query your database, open a GitHub issue, or call your internal API. You are left writing custom Python wrappers for every tool you need, hardcoding the glue between model output and tool execution, and maintaining those wrappers every time an API changes.

The Model Context Protocol (MCP) was designed to solve exactly this. It is an open standard by Anthropic: a universal, pluggable protocol for AI tool connectivity. Define a tool once as an MCP server. Any MCP-compatible client, any model, any framework, can discover and call it with zero custom integration code per model.

Qwen3.6-35B-A3B is the most capable local model for this kind of work right now. It has a 262,144-token context window, a Mixture of Experts (MoE) architecture that activates only 3B of its 35B parameters per forward pass (which is why it fits on hardware that should not be able to run a 35B model), and was explicitly trained and evaluated on MCP-based agentic tasks.

This article builds a local GitHub developer assistant: an agent that reads a repository's open issues, searches the relevant code, drafts a fix, and creates a pull request. The whole thing runs on your hardware, through MCP servers, with no cloud dependency.

# Understanding Qwen3.6-35B-A3B #

Understanding the architecture matters here because it directly explains what hardware you need and why the model performs the way it does on agentic tasks.

The name encodes the key fact: 35B total parameters, A3B meaning 3B activated per forward pass. It is an MoE model with 256 experts per layer, routing 8 plus 1 shared experts per token. You get the knowledge capacity of a 35B model at the inference compute cost of a 3B model. That trade-off is why it fits on hardware that would collapse under a dense 35B.
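The routing idea above can be sketched in a few lines. This is an illustration only, not the model's actual router: the expert counts come from the article, while the random logits and the choice to softmax over just the selected experts are assumptions made for the example.

```python
import math
import random

NUM_ROUTED_EXPERTS = 256  # routed experts per layer (figure from the article)
TOP_K = 8                 # routed experts selected per token (figure from the article)


def route_token(router_logits, top_k=TOP_K):
    """Return {expert_index: weight} for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the chosen logits only, so the mixing weights sum to 1.
    # (Assumed normalisation; the real router may differ.)
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
weights = route_token(logits)

# The shared expert runs for every token, so 8 + 1 expert blocks are active.
active_experts = len(weights) + 1
print(f"{active_experts} of {NUM_ROUTED_EXPERTS + 1} expert blocks active per token")
```

Only the selected experts do any work for a given token, which is where the compute saving comes from. All experts still have to be resident in memory, which is why the hardware section sizes memory by the full 35B.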

The hidden layout is where Qwen3.6 diverges most from other MoE models. Each block in the 40-layer stack follows a 3:1 ratio of Gated DeltaNet layers to Gated Attention layers. DeltaNet is a linear attention mechanism; it processes sequences more efficiently than full quadratic attention, especially at long context lengths. The interleaved full Gated Attention layers provide the deep relational reasoning that linear attention alone misses. For an agent working through a 500-file repository, that combination matters: efficient processing at length combined with precise reasoning on the relevant sections.
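To make the 3:1 ratio concrete, here is a small sketch that labels the 40 layers. The layer count and ratio come from the article; placing the Gated Attention layer last in each group of four is an assumption for illustration.

```python
from collections import Counter

NUM_LAYERS = 40   # from the article
GROUP_SIZE = 4    # 3 Gated DeltaNet layers + 1 Gated Attention layer


def layer_layout(num_layers=NUM_LAYERS, group_size=GROUP_SIZE):
    """Label each layer; the attention layer is assumed to close each group."""
    return [
        "gated_attention" if (index + 1) % group_size == 0 else "gated_deltanet"
        for index in range(num_layers)
    ]


layout = layer_layout()
print(Counter(layout))
```

Under that assumption the stack holds 30 linear-attention layers and 10 full-attention layers, so only a quarter of the layers pay the quadratic cost as the context grows.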

The context window is 262,144 tokens natively, extensible to 1,010,000 with YaRN scaling. For agent work, context length is not a comfort feature; it is an operational constraint. An agent reading source files, maintaining tool call history, tracking a multi-step plan, and injecting tool results back into context needs real headroom. Most 7B and 13B models cap at 8k or 32k tokens. Running out of context mid-task means the agent loses its own history and starts hallucinating tool results.
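One way to avoid running out of context mid-task is to track a rough budget before each turn. The sketch below is an assumption-laden illustration: the 4-characters-per-token ratio is a crude heuristic (a real tokenizer gives exact counts), and `remaining_budget` is a hypothetical helper, not part of any library.

```python
CONTEXT_WINDOW = 262_144  # native window from the article


def estimate_tokens(text, chars_per_token=4.0):
    """Crude token estimate; swap in the model's tokenizer for real counts."""
    return int(len(text) / chars_per_token) + 1


def remaining_budget(messages, reserve_for_output=4096):
    """Tokens left for new tool results after reserving room for the reply."""
    used = sum(estimate_tokens(m.get("content") or "") for m in messages)
    return CONTEXT_WINDOW - reserve_for_output - used


history = [
    {"role": "system", "content": "You are a senior software engineer."},
    {"role": "tool", "content": "x" * 40_000},  # stand-in for a large file read
]
print(remaining_budget(history))
```

An agent loop could check this value before injecting a tool result and truncate or summarise the result when the budget runs low.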

Qwen3.6 was explicitly trained and evaluated on MCP-based agentic benchmarks. Two headline features came out of that training:

  • Agentic Coding. Frontend workflows and repository-level reasoning: the model handles multi-file refactoring tasks with coherent reasoning across files, not just single-file edits in isolation.
  • Thinking Preservation. A preserve_thinking flag that retains reasoning traces from prior turns in a multi-turn conversation. When an agent reasons through a plan in turn one and then executes tool calls in turns two through five, preserve_thinking=True keeps the turn-one reasoning available in the KV cache. Each subsequent turn benefits from that prior reasoning without paying the cost of re-deriving it.

# System Requirements #

There are three realistic deployment paths, and which one you use depends entirely on your hardware.

  • GPU inference (recommended for production agent workloads). Qwen3.6-35B-A3B in bfloat16 requires approximately 70 GB VRAM. In Q4 quantization, it fits in approximately 20–24 GB. A single RTX 4090 (24 GB) handles Q4. Two RTX 3090s with tensor parallelism handle Q4 as well. An A100 80 GB handles the full bfloat16 model.
  • CPU/Hybrid via KTransformers. KTransformers is the accessible path for developers without a 24 GB GPU. It offloads compute-heavy layers to GPU when available and runs the rest on CPU. With 64 GB system RAM, you can run Qwen3.6-35B-A3B in a usable (if slower) configuration. Response latency will be 30–120 seconds per turn depending on your CPU, which is workable for an agent doing background repository analysis but not for interactive coding sessions.
  • Smaller models for tutorial testing. The entire MCP integration pattern in this article is identical regardless of model size. If you want to follow along without the hardware for the full 35B model, use Qwen/Qwen2.5-7B-Instruct via Ollama (ollama pull qwen2.5:7b) or the Qwen3-8B model. The serving API is the same, the code is identical, and you can swap in the 35B model when hardware permits.
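The VRAM figures above follow from simple arithmetic on the weights. The sketch below covers weights only; treating the gap between the raw Q4 number and the quoted 20–24 GB as KV cache, activations, and quantization metadata is an assumption, not a measurement.

```python
def weight_memory_gb(total_params_billion, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# All 35B parameters must be resident, even though only ~3B are active per token.
bf16_gb = weight_memory_gb(35, 16)  # 2 bytes per parameter
q4_gb = weight_memory_gb(35, 4)     # half a byte per parameter

print(f"bfloat16 weights: {bf16_gb:.1f} GB")
print(f"Q4 weights:       {q4_gb:.1f} GB (before KV cache and runtime overhead)")
```

This gives 70.0 GB for bfloat16 and 17.5 GB for 4-bit weights, which is consistent with the 70 GB and 20–24 GB figures once runtime overhead is added.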

Software requirements:

python --version

python -m venv qwen-mcp-env
source qwen-mcp-env/bin/activate    # macOS / Linux
qwen-mcp-env\Scripts\activate       # Windows

pip install \
  "openai>=1.30.0" \
  "qwen-agent>=0.0.10" \
  "mcp>=1.0.0" \
  "httpx>=0.27.0"

pip install "vllm>=0.19.0"       # NVIDIA GPU
pip install "sglang>=0.5.10"     # NVIDIA GPU (faster prefill for long context)
pip install "ktransformers"      # CPU/hybrid

node --version

# Serving Qwen3.6 Locally with an OpenAI-Compatible API #

Before wiring in any MCP servers, you need a running inference server. Both SGLang and vLLM expose an OpenAI-compatible API that the MCP integration layer talks to — the same API surface, just pointed at localhost instead of api.openai.com.

// SGLang (Recommended for Long-Context Agent Workloads)

pip install "sglang[all]>=0.5.10"


python -m sglang.launch_server \
    --model-path Qwen/Qwen3.6-35B-A3B \
    --host 0.0.0.0 \
    --port 30000 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching \
    --tp 2    # tensor parallel across 2 GPUs; remove if using single GPU

// vLLM

pip install "vllm>=0.19.0"

vllm serve Qwen/Qwen3.6-35B-A3B \
    --host 0.0.0.0 \
    --port 8000 \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-prefix-caching \
    --tensor-parallel-size 2

// Smaller Model via Ollama

ollama pull qwen2.5:7b
ollama serve

Once the server is running, verify it before going any further. The commands below target the SGLang port (30000); use port 8000 if you launched vLLM:

curl http://localhost:30000/health

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Reply with: ready"}],
    "max_tokens": 10
  }'

If you get a JSON response with a choices array, the server is ready. Do not proceed to MCP setup until this works. Every integration failure you will encounter later is easier to debug when you know the serving layer is solid.
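If you prefer to run the same check from Python, for example at the top of an agent script, a minimal standard-library sketch could look like this. `looks_ready` and `check_server` are hypothetical helper names, and the base URL assumes the SGLang port used above.

```python
import json
import urllib.request


def looks_ready(payload):
    """True when a chat-completions response carries a non-empty choices array."""
    choices = payload.get("choices")
    return isinstance(choices, list) and len(choices) > 0


def check_server(base_url, model, timeout=30.0):
    """Send one tiny chat request; return False on any network or parse error."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Reply with: ready"}],
        "max_tokens": 10,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return looks_ready(json.load(response))
    except (OSError, ValueError):  # connection errors, HTTP errors, bad JSON
        return False
```

Calling `check_server("http://localhost:30000/v1", "Qwen/Qwen3.6-35B-A3B")` before starting the agent loop turns a confusing mid-run failure into an immediate, obvious one.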

# Understanding MCP and Why It Changes the Agent Architecture #

Before writing any agent code, it helps to understand what MCP actually does at the protocol level, because that understanding prevents a category of bugs that come from treating MCP as just a fancier function-calling API.

MCP is a JSON-RPC 2.0 protocol running over stdio or HTTP transport. When an MCP client connects to a server, it completes an initialize handshake and then calls tools/list to discover what tools the server exposes. Each tool comes back with a name, a description, and an input schema defined in JSON Schema. The model reads this schema. It is the model's contract with the tool.

When the model wants to call a tool, it emits a structured tool call object. The MCP client, not the model, actually executes the call by sending a tools/call request to the server. The server handles execution and returns a result. The client injects that result back into the conversation as a tool role message. The model reads the result and decides the next step.

This separation is important. The model decides what to call and with what arguments. The client handles execution. The server handles the actual work. Your code never hardwires a tool to a model; you just tell the client which servers are available.
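To make the wire format concrete, here is a sketch of the two JSON-RPC requests a client sends. The envelope fields follow JSON-RPC 2.0 and the method names are the ones described above; the tool name and arguments are illustrative placeholders, and `jsonrpc_request` is a hypothetical helper (in practice the MCP SDK builds these messages for you).

```python
import json
from itertools import count
from typing import Optional

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Discovery: ask the server which tools it exposes.
list_req = jsonrpc_request("tools/list")

# Execution: the client (not the model) sends the call the model asked for.
call_req = jsonrpc_request(
    "tools/call",
    {"name": "read_file", "arguments": {"path": "src/auth.py"}},  # illustrative
)

print(json.dumps(list_req))
print(json.dumps(call_req, indent=2))
```

The model only ever produces the name and arguments; everything else in the envelope belongs to the client.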

There are two ways to use MCP with Qwen3.6:

  • Via Qwen-Agent: the official qwen_agent library handles tool discovery, call parsing, result injection, and multi-turn conversation management automatically. Less code, less control. Right for most use cases.
  • Via the MCP Python SDK directly: you handle the agentic loop yourself using mcp.ClientSession. More code, full visibility into every message, complete control over error handling and retry logic. Right for production systems where you need to monitor every step.

This article covers both, starting with Qwen-Agent.

# Building the Local GitHub Developer Assistant #

The agent does four things in sequence: reads open issues from a GitHub repository, finds the relevant code, drafts a fix, and opens a pull request. All locally, all through MCP.

// Part 1: Environment and MCP Server Setup

export GITHUB_TOKEN=ghp_your_token_here

npx --version

Create a project directory:

mkdir qwen-github-agent
cd qwen-github-agent
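Before launching either agent, it can help to confirm the prerequisites from Python. The sketch below is a hypothetical preflight helper: it checks only that GITHUB_TOKEN is set and that node and npx are on the PATH, and it accepts an injectable lookup function so it can be exercised without touching the real environment.

```python
import os
import shutil


def preflight(env, which=shutil.which):
    """Return a list of human-readable problems; empty means ready to go."""
    problems = []
    if not env.get("GITHUB_TOKEN"):
        problems.append("GITHUB_TOKEN is not set")
    for tool in ("node", "npx"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


for problem in preflight(os.environ):
    print(f"[preflight] {problem}")
```

An empty list means the MCP servers launched through npx should at least be able to start.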

// Part 2: Qwen-Agent Implementation

The fastest path to a working agent. Qwen-Agent handles the full loop automatically.

# github_agent_qwenagent.py

import os

from qwen_agent.agents import Assistant


LLM_CONFIG = {
    "model":     "Qwen/Qwen3.6-35B-A3B",
    "model_server": "http://localhost:30000/v1",
    "api_key":   "EMPTY",           # Local servers do not require a real key

    "generate_cfg": {
        "temperature":       0.6,
        "top_p":             0.95,
        "top_k":             20,
        "min_p":             0.0,
        "thought_in_history": True,   # This is the preserve_thinking flag in Qwen-Agent
    },
}


MCP_SERVERS = {
    "mcpServers": {
        "filesystem": {
            "command": "npx",
            "args": [
                "-y",
                "@modelcontextprotocol/server-filesystem",
                "."
            ]
        },
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {
                # The GitHub MCP server reads GITHUB_PERSONAL_ACCESS_TOKEN.
                # A literal "${GITHUB_TOKEN}" string is not shell-expanded in a
                # Python dict, so read the exported value explicitly.
                "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
            }
        },
    }
}

SYSTEM_PROMPT = """You are a senior software engineer with full access to a GitHub repository
via MCP tools.

When given a repository and task:
1. List open issues to understand what needs fixing
2. Use filesystem tools to read relevant source files and tests
3. Identify the root cause based on the code and the issue description
4. Write a targeted fix -- minimal changes, no refactoring unrelated to the bug
5. Create a pull request with a clear title and description referencing the issue

Always explain your reasoning at each step. Think through edge cases before writing code.
If you are uncertain about a file's purpose, read it before modifying it."""


agent = Assistant(
    llm=LLM_CONFIG,
    name="GitHub Developer Assistant",
    description="Reads issues, fixes bugs, opens pull requests -- locally via MCP.",
    system_message=SYSTEM_PROMPT,
    mcp_servers=MCP_SERVERS,
)


def run_agent(task: str):
    """
    Run the agent on a task description and stream the output.
    The agent will make tool calls automatically; Qwen-Agent handles
    the full loop including tool execution and result injection.
    """
    messages = [{"role": "user", "content": task}]

    print(f"Task: {task}\n{'─' * 70}")

    for response in agent.run(messages=messages):
        last = response[-1]
        role    = last.get("role", "")
        content = last.get("content", "")

        if role == "assistant" and content:
            import re
            # Reasoning traces arrive wrapped in <think>...</think> tags.
            thinking = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
            if thinking:
                print(f"[thinking] {thinking.group(1).strip()[:200]}...")
            clean = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
            if clean:
                print(f"[agent] {clean}")

        elif role == "tool":
            tool_name = last.get("name", "unknown_tool")
            print(f"[tool:{tool_name}] result received")

if __name__ == "__main__":
    run_agent(
        "In the repository myorg/my-api-project, find the open issue about "
        "the login endpoint returning 200 for invalid tokens. Read the relevant "
        "code and tests, fix the bug, and open a pull request."
    )

How to run:

python github_agent_qwenagent.py

// Part 3: Raw MCP SDK Implementation

For teams who need full control over every protocol message, custom error handling, per-tool retry logic, and audit logging of every tool call and result:

# github_agent_raw.py

import asyncio
import json
import os
import re
from openai import AsyncOpenAI
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

client = AsyncOpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

MODEL = "Qwen/Qwen3.6-35B-A3B"


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks. Used when we only need the action."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def extract_thinking(text: str) -> str:
    """Extract the content of the thinking block for logging."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def process_response(response, preserve_thinking: bool = True) -> dict:
    """
    Process a chat completion response from Qwen3.6.

    Handles two output formats:
    1. Tool call via the API's function_call / tool_calls field (when --tool-call-parser is active)
    2. Tool call embedded in the message content as JSON

    Args:
        response:          The OpenAI-compatible completion response
        preserve_thinking: If True, keep thinking content in output for
                           the next turn's KV cache benefit

    Returns:
        dict with thinking, tool_calls, final_answer, has_tool_calls, is_terminal
    """
    choice  = response.choices[0]
    message = choice.message

    if message.tool_calls:
        tool_calls = [
            {
                "name":      tc.function.name,
                "arguments": json.loads(tc.function.arguments),
                "call_id":   tc.id,
            }
            for tc in message.tool_calls
        ]
        thinking = extract_thinking(message.content or "")
        return {
            "thinking":       thinking if preserve_thinking else "",
            "tool_calls":     tool_calls,
            "final_answer":   "",
            "has_tool_calls": True,
            "is_terminal":    False,
        }

    content = message.content or ""
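    # Fallback: tool calls embedded in the content as JSON inside <tool_call> tags.
    tag_matches = re.findall(r"<tool_call>(.*?)</tool_call>", content, re.DOTALL)
    tool_calls = []
    for m in tag_matches:
        try:
            tool_calls.append(json.loads(m.strip()))
        except json.JSONDecodeError:
            pass  # Skip malformed tool-call JSON rather than abort the turn

    thinking     = extract_thinking(content)
    # The final answer is whatever remains once both tag types are removed.
    final_answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    final_answer = re.sub(r"<tool_call>.*?</tool_call>", "", final_answer, flags=re.DOTALL).strip()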

    return {
        "thinking":       thinking if preserve_thinking else "",
        "tool_calls":     tool_calls,
        "final_answer":   final_answer,
        "has_tool_calls": len(tool_calls) > 0,
        "is_terminal":    len(tool_calls) == 0 and bool(final_answer),
    }


async def run_github_agent(task: str, repo: str, max_turns: int = 20):
    """
    Run the GitHub developer assistant agent.

    Connects to filesystem and GitHub MCP servers, discovers their tools,
    and runs the Qwen3.6 agent loop until the task is complete or max_turns reached.
    """
    fs_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "."],
    )
    gh_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-github"],
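        # The GitHub MCP server reads GITHUB_PERSONAL_ACCESS_TOKEN, so map the
        # exported GITHUB_TOKEN onto that name.
        env={**os.environ, "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", "")},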
    )

    async with stdio_client(fs_params) as (fs_read, fs_write), \
               ClientSession(fs_read, fs_write) as fs_session, \
               stdio_client(gh_params) as (gh_read, gh_write), \
               ClientSession(gh_read, gh_write) as gh_session:

        await fs_session.initialize()
        await gh_session.initialize()

        fs_tools_result = await fs_session.list_tools()
        gh_tools_result = await gh_session.list_tools()

        all_tools = []
        tool_to_session = {}   # Maps tool name to the MCP session that owns it

        for tool in fs_tools_result.tools:
            all_tools.append({
                "type": "function",
                "function": {
                    "name":        tool.name,
                    "description": tool.description,
                    "parameters":  tool.inputSchema,
                }
            })
            tool_to_session[tool.name] = fs_session

        for tool in gh_tools_result.tools:
            all_tools.append({
                "type": "function",
                "function": {
                    "name":        tool.name,
                    "description": tool.description,
                    "parameters":  tool.inputSchema,
                }
            })
            tool_to_session[tool.name] = gh_session

        print(f"Tools available: {len(all_tools)} ({len(fs_tools_result.tools)} filesystem, "
              f"{len(gh_tools_result.tools)} GitHub)")

        system_prompt = f"""You are a senior software engineer with access to the repository {repo}.
Use the available tools to investigate issues, read code, write fixes, and create pull requests.
Think step by step. Read before you modify. Minimal changes only."""

        messages = [
            {"role": "system",  "content": system_prompt},
            {"role": "user",    "content": task},
        ]

        for turn in range(max_turns):
            print(f"\n[Turn {turn + 1}]")

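            response = await client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=all_tools if all_tools else None,
                tool_choice="auto",
                temperature=0.6,
                top_p=0.95,
                max_tokens=4096,
                # top_k and min_p are not parameters of the OpenAI client's
                # create() call, so they travel in extra_body alongside the
                # server-specific preserve_thinking flag.
                extra_body={
                    "top_k": 20,
                    "min_p": 0.0,
                    "preserve_thinking": True,
                }
            )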

            result = process_response(response, preserve_thinking=True)

            if result["thinking"]:
                print(f"[thinking] {result['thinking'][:200]}...")

            if result["is_terminal"]:
                print(f"\n[DONE]\n{result['final_answer']}")
                return result["final_answer"]

            if result["has_tool_calls"]:
                messages.append({
                    "role":       "assistant",
                    "content":    response.choices[0].message.content or "",
                    "tool_calls": response.choices[0].message.tool_calls or [],
                })

                for call in result["tool_calls"]:
                    tool_name = call["name"]
                    tool_args = call.get("arguments", {})
                    call_id   = call.get("call_id", "")

                    print(f"[tool] {tool_name}({json.dumps(tool_args)[:80]}...)")

                    session = tool_to_session.get(tool_name)
                    if not session:
                        result_content = f"Error: tool '{tool_name}' not found"
                    else:
                        try:
                            tool_result = await session.call_tool(tool_name, tool_args)
                            result_content = str(tool_result.content)
                            if len(result_content) > 12000:
                                result_content = result_content[:12000] + "\n...[truncated]"
                        except Exception as e:
                            result_content = f"Error: {e}"

                    print(f"[result] {result_content[:150]}...")

                    messages.append({
                        "role":        "tool",
                        "content":     result_content,
                        "tool_call_id": call_id,
                        "name":        tool_name,
                    })

        print(f"[WARNING] max_turns ({max_turns}) reached without terminal state")


if __name__ == "__main__":
    asyncio.run(run_github_agent(
        task=(
            "Find the open issue about the login endpoint returning 200 for invalid tokens. "
            "Read src/auth.py and tests/test_auth.py to understand the bug. "
            "Fix the verify_token function and open a pull request with your changes."
        ),
        repo="myorg/my-api-project",
    ))

How to run:

python github_agent_raw.py

The raw SDK path gives you what Qwen-Agent abstracts: you can see every tool call, every result, and every message injected into the conversation history. The tool_to_session routing dict is the key mechanism; it maps each tool name to the MCP session that owns it, so the agent can call any tool from any connected server without knowing which server provides it.
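The two near-identical loops that build the routing dict generalise to any number of servers. The sketch below is one possible refactoring, not the article's code: `to_openai_tool` and `build_routing` are hypothetical helpers, and the `SimpleNamespace` objects stand in for the tool objects an MCP session returns so the example runs without any server. It also raises on duplicate tool names, which the original dict would silently overwrite.

```python
from types import SimpleNamespace


def to_openai_tool(tool):
    """Convert an MCP tool description into an OpenAI-style function schema."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or "",
            "parameters": tool.inputSchema,
        },
    }


def build_routing(tools_by_owner):
    """Return (all_tools, owner_by_name) and refuse ambiguous tool names."""
    all_tools, owner_by_name = [], {}
    for owner, tools in tools_by_owner.items():
        for tool in tools:
            if tool.name in owner_by_name:
                raise ValueError(
                    f"tool {tool.name!r} is exposed by both "
                    f"{owner_by_name[tool.name]!r} and {owner!r}"
                )
            owner_by_name[tool.name] = owner
            all_tools.append(to_openai_tool(tool))
    return all_tools, owner_by_name


# Stand-in tool objects; real ones come from session.list_tools().tools
schema = {"type": "object", "properties": {}}
fake_fs = [SimpleNamespace(name="read_file", description="Read a file", inputSchema=schema)]
fake_gh = [SimpleNamespace(name="list_issues", description=None, inputSchema=schema)]

all_tools, owner_by_name = build_routing({"filesystem": fake_fs, "github": fake_gh})
print(owner_by_name)
```

In the real agent, the dictionary values would be the `ClientSession` objects themselves rather than server names, so a lookup yields the session to call directly.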

# Writing a Custom MCP Server #

Pre-built MCP servers handle the filesystem and GitHub. When you need something that does not exist (querying an internal database, wrapping a CI/CD API, running code analysis tools), you write an MCP server. Here is a complete code_quality server that exposes ruff and pytest as MCP tools.

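# code_quality_server.py
# Requires ruff and pytest; the --json-report flags used by run_tests
# come from the pytest-json-report plugin.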

import asyncio
import json
import subprocess
import sys
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("code_quality")

@mcp.tool()
def run_linter(file_path: str, fix: bool = False) -> str:
    """
    Run ruff linter on a Python file and return structured lint results.
    Use this before modifying a file to understand its current quality state,
    and after making changes to verify the fix did not introduce new issues.

    Args:
        file_path: Absolute or relative path to the Python file to lint.
        fix:       If true, automatically fix safe issues in place.

    Returns:
        JSON string with issues list, issue count, and files modified.
    """
    cmd = ["python", "-m", "ruff", "check", file_path, "--output-format=json"]
    if fix:
        cmd.append("--fix")

    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        output = result.stdout or result.stderr

        try:
            issues = json.loads(output) if output.strip() else []
        except json.JSONDecodeError:
            issues = []

        formatted = [
            {
                "line":    issue.get("location", {}).get("row", 0),
                "col":     issue.get("location", {}).get("column", 0),
                "code":    issue.get("code", ""),
                "message": issue.get("message", ""),
                "fix_available": issue.get("fix") is not None,
            }
            for issue in issues
            if isinstance(issue, dict)
        ]

        return json.dumps({
            "file":         file_path,
            "issues":       formatted,
            "total_issues": len(formatted),
            "fixed":        "auto-fix applied" if fix else "no auto-fix",
        }, indent=2)

    except subprocess.TimeoutExpired:
        return json.dumps({"error": "Linter timed out after 30s", "file": file_path})
    except FileNotFoundError:
        return json.dumps({"error": "ruff not found -- install with: pip install ruff"})

@mcp.tool()
def run_tests(target: str, verbose: bool = False) -> str:
    """
    Run pytest on a module or directory and return structured pass/fail results.
    Use this after writing a fix to verify the fix makes failing tests pass
    without breaking other tests.

    Args:
        target:  Path to the test file or directory to run (e.g. tests/, tests/test_auth.py)
        verbose: If true, include full pytest output in the result.

    Returns:
        JSON string with pass count, fail count, failure details, and duration.
    """
    cmd = ["python", "-m", "pytest", target, "--json-report", "--json-report-file=-", "-q"]
    if verbose:
        cmd.append("-v")

    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
        output = result.stdout

        try:
            report = json.loads(output)
            summary  = report.get("summary", {})
            failures = [
                {
                    "test":    t["nodeid"],
                    "message": t.get("call", {}).get("longrepr", "")[:500],
                }
                for t in report.get("tests", [])
                if t.get("outcome") == "failed"
            ]
            return json.dumps({
                "target":   target,
                "passed":   summary.get("passed", 0),
                "failed":   summary.get("failed", 0),
                "errors":   summary.get("error", 0),
                "total":    summary.get("total", 0),
                "duration": summary.get("duration", 0),
                "failures": failures,
                "stdout":   result.stdout[:2000] if verbose else "",
            }, indent=2)

        except json.JSONDecodeError:
            return json.dumps({
                "target":  target,
                "stdout":  result.stdout[:3000],
                "stderr":  result.stderr[:1000],
                "exit_code": result.returncode,
            })

    except subprocess.TimeoutExpired:
        return json.dumps({"error": f"Tests timed out after 120s for target: {target}"})
    except FileNotFoundError:
        return json.dumps({"error": "pytest not found -- install with: pip install pytest"})

if __name__ == "__main__":
    mcp.run(transport="stdio")

Add it to either agent implementation's server config:

"code_quality": {
    "command": "python",
    "args": ["/absolute/path/to/code_quality_server.py"]
}

cq_params = StdioServerParameters(
    command="python",
    args=["/absolute/path/to/code_quality_server.py"],
)

Test the server standalone before connecting the agent:

npx @modelcontextprotocol/inspector python code_quality_server.py

# Tuning Thinking Mode and Preserving Reasoning #

The thinking mode decision affects latency significantly enough that it is worth treating as an explicit architecture choice, not an afterthought.

In thinking mode, Qwen3.6 generates a chain-of-thought reasoning trace inside <think>...</think> tags before producing its action. For a 5-step agent task, that trace adds 1,000 to 5,000 tokens per turn depending on task complexity. Those tokens take time to generate and consume context budget.

When that cost is worth paying:

  • Planning steps where the agent decides what to do next.
  • Debugging sessions where the problem is genuinely ambiguous.
  • Multi-file refactoring where the agent needs to reason about side effects across files.

In each of these cases, the reasoning trace catches mistakes before they become tool calls with wrong arguments.

When it is not worth paying: mechanical tool-call loops where each step is unambiguous, such as list directory → read file → write file → commit. The model does not need to think hard about these steps. Non-thinking mode is faster and produces the same quality output.

Switch modes per-request, not globally:

THINKING_PARAMS = {
    "temperature": 0.6,
    "top_p":       0.95,
    "top_k":       20,
    "min_p":       0.0,
}

NON_THINKING_PARAMS = {
    "temperature": 0.7,
    "top_p":       0.8,
    "top_k":       20,
    "min_p":       0.0,
}
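One way to apply these two presets per request is a small helper that picks a preset from the kind of step the agent is about to take. This is a sketch under assumptions: the step labels are hypothetical, and the `chat_template_kwargs` / `enable_thinking` switch is the mechanism documented for Qwen3 models on vLLM and SGLang, so confirm that your serving stack honours it for Qwen3.6.

```python
THINKING_PARAMS = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
NON_THINKING_PARAMS = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}

# Hypothetical labels for steps where reasoning is worth the latency.
REASONING_STEPS = {"plan", "debug", "refactor"}


def request_kwargs(step_kind):
    """Build keyword arguments for client.chat.completions.create()."""
    thinking = step_kind in REASONING_STEPS
    params = THINKING_PARAMS if thinking else NON_THINKING_PARAMS
    return {
        "temperature": params["temperature"],
        "top_p": params["top_p"],
        # Parameters the OpenAI client does not know about go in extra_body.
        "extra_body": {
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


print(request_kwargs("plan"))
print(request_kwargs("read_file"))
```

The agent loop would then call `client.chat.completions.create(model=MODEL, messages=messages, **request_kwargs(step_kind))`, keeping the mode decision in one place.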

The preserve_thinking flag, the Qwen3.6-specific capability that retains reasoning context across turns, directly impacts inference efficiency when prefix caching is active. Here is why it matters practically: in a 10-turn agent session, each turn shares a prefix of the conversation history. When preserve_thinking=True, the full reasoning trace from prior turns stays in the history. The KV cache on the server side recognizes the shared prefix across turns and avoids recomputing it. The effective tokens-per-second rate for long sessions is meaningfully higher than without it, particularly when serving infrastructure like SGLang with --enable-prefix-caching is running.

The practical rule: use preserve_thinking=True for agent sessions that will run for more than 5 turns. Use preserve_thinking=False (or non-thinking mode) for single-turn queries and short pipelines where the overhead is a waste.
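A back-of-the-envelope calculation shows why a stable shared prefix matters over a long session. The model below is idealised: it assumes every turn adds the same number of tokens to the history and that a cached prefix is never evicted. The 2,000-token figure is an arbitrary illustration, not a measurement.

```python
def prefill_tokens(tokens_added_per_turn, prefix_cached):
    """Total prompt tokens the server must process across a whole session."""
    total, history = 0, 0
    for new_tokens in tokens_added_per_turn:
        # Without a cache hit, the entire history is processed again each turn.
        total += new_tokens if prefix_cached else history + new_tokens
        history += new_tokens
    return total


session = [2000] * 10  # ten turns, 2,000 new tokens each (illustrative)

without_cache = prefill_tokens(session, prefix_cached=False)
with_cache = prefill_tokens(session, prefix_cached=True)

print(f"no prefix cache:   {without_cache:,} tokens prefilled")
print(f"with prefix cache: {with_cache:,} tokens prefilled")
```

Under these assumptions the session prefills 110,000 tokens without a cache hit and 20,000 with one. The saving depends on the earlier turns staying unchanged in the history, which is what keeping the reasoning traces in place is meant to provide.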

# Conclusion #

Qwen3.6-35B-A3B's MoE architecture gives you 35B model quality at 3B activation cost. Its 262k context window gives you room to hold an entire code review session in context. Its explicit training on MCP-based agentic benchmarks means it knows how to use tools correctly, not just call them.

MCP provides the connective tissue. Define a tool once as an MCP server. Every Qwen3.6 session and every other MCP-compatible model can discover and call it without custom glue. The GitHub and filesystem servers in this article are two of hundreds of pre-built servers in the MCP ecosystem. The custom code_quality server shows the pattern for anything that does not already exist.

The GitHub developer assistant in this article is one application of the pattern. The same architecture — local model, MCP tools, and agentic loop — works for a research assistant that searches academic databases and drafts literature reviews, a DevOps agent that reads CloudWatch logs and opens incident tickets, or a data pipeline agent that reads SQL schemas, writes transformation code, and validates outputs. The MCP ecosystem is growing fast. The local model capability is already there.

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts.
