{"slug": "tensorsharp-open-source-local-llm-inference-engine", "title": "TensorSharp: Open-Source Local LLM Inference Engine", "summary": "TensorSharp, a new open-source C# inference engine, now enables developers to run large language models locally using GGUF files. The engine supports multiple model architectures including Gemma 4, Qwen 3, and Mistral 3, and offers CPU, CUDA, and MLX backends with features like continuous batching and multimodal inference.", "body_md": "A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access.\n\n| Start here | Use this when you want to... |\n|---|---|\n|\n\n[Supported model architectures](#supported-model-architectures)[Compute backends](#compute-backends)[HTTP APIs](#http-apis)[Per-model architecture cards](/zhongkaifu/TensorSharp/blob/main/docs/models/README.md)[Paged attention & continuous batching](/zhongkaifu/TensorSharp/blob/main/docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md)[Inference benchmark matrix](/zhongkaifu/TensorSharp/blob/main/docs/inference_benchmark_matrix.md)[Server API examples](/zhongkaifu/TensorSharp/blob/main/TensorSharp.Server/API_EXAMPLES.md)[Server integration tests](/zhongkaifu/TensorSharp/blob/main/TensorSharp.Server/testdata/README.md)| Area | Status |\n|---|---|\n| Model families | Gemma 3/4, Qwen 3, Qwen 3.5/3.6-family GGUFs (`qwen35` , `qwen35moe` , `qwen3next` ), GPT OSS, Nemotron-H (incl. 
Nemotron 3 Nano Omni), and Mistral 3 |\n| Inference hosts | CLI, interactive REPL, ASP.NET Core web UI, Ollama-style API, OpenAI Chat Completions-style API |\n| Backends | Pure C# CPU, direct CUDA/cuBLAS (`cuda`), MLX Metal (`mlx`), GGML CPU, GGML Metal, GGML CUDA |\n| Multimodal | Gemma 4 image/video/audio; Gemma 3, Qwen 3.5-family, Mistral 3, and Nemotron-H Omni image input |\n| Continuous batching | vLLM-style paged KV cache, block-hash prefix sharing across requests, iteration-level scheduler (enabled by default; opt-out via `--no-continuous-batching`) |\n| Server model scope | One explicitly hosted GGUF via `--model`; optional explicit projector via `--mmproj`; no directory scanning |\n| Observability | Structured per-turn logs, queue status, and KV-cache reuse metrics across Web UI, Ollama, and OpenAI response shapes |\n\n- **Multi-architecture support** -- Gemma 4, Gemma 3, Qwen 3, Qwen 3.5/3.6-family, GPT OSS, Nemotron-H, Mistral 3\n- **Multimodal inference** -- image, video, and audio inputs (Gemma 4); images for Gemma 3 / Qwen 3.5-family / Mistral 3 / Nemotron-H Omni\n- **Thinking / reasoning mode** -- structured chain-of-thought output with `<think>` / `<|channel>thought` / `<|channel>analysis` tags (Qwen 3, Qwen 3.5/3.6-family, Gemma 4, GPT OSS, Nemotron-H)\n- **Tool calling / function calling** -- models can invoke user-defined tools; multi-turn tool-call conversations supported across all three API styles\n- **Quantized model support** -- loads GGUF files with Q4_K_M, Q8_0, F16, MXFP4, and other quantization formats; performs native quantized matmul without dequantizing to FP32, including memory-efficient pure C# CPU loading for large GGUFs\n- **GPU-accelerated** -- GGML Metal on macOS, GGML CUDA on Windows/Linux with NVIDIA GPUs, a direct CUDA/cuBLAS backend with PTX kernels, and an MLX backend for Apple Silicon (mlx-c / Metal), all with CPU fallbacks for unsupported ops\n- **Optimized pure C# CPU backend** -- managed GEMM fast paths plus fused SIMD kernels for 
RMSNorm, RoPE, softmax, fused activations, and other inference hot paths\n- **Continuous batching & paged KV cache** -- vLLM-style block-paged KV pool with block-hash prefix sharing across requests, iteration-level scheduler that admits / preempts sequences mid-batch, optional SSD-backed tier for very large KV working sets, and a native fused paged-attention kernel (`TSGgml_PagedAttentionForward`) that drives `ggml_flash_attn_ext` on Metal/CUDA. Enabled by default in `TensorSharp.Server`; opt-out with `--no-continuous-batching`. See [docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md](/zhongkaifu/TensorSharp/blob/main/docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md).\n- **Batched / parallel inference** -- `IBatchedPagedModel.ForwardBatch` implementations for Mistral 3, Gemma 4, GPT OSS, Qwen 3, Qwen 3.5/3.6-family, and Nemotron-H all run by default and pack N sequences into a single forward pass with paged K/V scatter and per-sequence attention via the native kernel. Each model exposes a `TS_<FAMILY>_BATCHED=0` escape hatch (e.g. `TS_GEMMA4_BATCHED=0`, `TS_QWEN35_BATCHED=0`, `TS_GPTOSS_BATCHED=0`, `TS_NEMOTRON_BATCHED=0`) to fall back to the per-sequence KV-swap path for A/B comparison or regression isolation.\n- **Ollama & OpenAI API compatibility** -- drop-in replacement endpoints for existing tooling\n- **Configurable sampling** -- temperature, top-k, top-p, min-p, repetition/presence/frequency penalties, seed, stop sequences\n- **Chat templates** -- auto-loaded from GGUF metadata (Jinja2), with hardcoded fallbacks per architecture\n- **Inference engine** -- the new `InferenceEngine` (worker-thread scheduler + paged block pool) replaces the legacy single-request FIFO queue inside `TensorSharp.Server`. 
The HTTP adapters still emit queue-position chunks for backward compatibility, but the engine itself handles concurrency.\n- **Batch processing** -- JSONL input support in the console application, plus a built-in inference benchmark for prefill/decode throughput\n- **Streaming** -- token-by-token output via SSE (web) or stdout (console), with abort/stop support for in-flight generations\n- **Hybrid SSM-Transformer** -- Nemotron-H mixes Mamba2 SSM layers, attention-only layers, and MoE FFN layers in a single model. The Mamba2 step has both a per-sequence native kernel and a batched native kernel (`TSGgml_NemotronMamba2BatchedStepF32`, NEON SIMD + GCD parallelism) used by the batched path.\n- **Hybrid Attention-Recurrent** -- Qwen 3.5/3.6-family models mix full-attention layers with GatedDeltaNet recurrent layers; the batched path keeps recurrent running state in a per-slot recurrent-state pool\n- **Mixture of Experts** -- Gemma 4 MoE variants (e.g. gemma-4-26B-A4B), GPT OSS MoE (e.g. gpt-oss-20b), Qwen 3.5/3.6-family MoE (`qwen35moe` / `qwen3next` variants such as Qwen3.5-35B-A3B), and Nemotron-H MoE FFN layers\n- **Batched GPU MoE** -- a single fused GGML graph dispatch handles all selected experts (plus the optional shared expert and residual add) for Qwen 3.5/3.6-family and Nemotron-H decode, eliminating per-expert round-trips\n- **KV cache codecs** -- pluggable codec interface (`IKvBlockCodec`) with a built-in TurboQuant (Q4 / Q8) compressed codec for paged blocks, configurable via `--paged-kv-quant-bits`\n- **Message editing** -- edit or delete previous messages in the web chat UI and regenerate from that point\n- **Text/Image/Audio/Video uploads** -- the web UI accepts file uploads up to 500 MB, with automatic token-budget-aware truncation for large text files\n- **Per-turn observability** -- structured logs capture the full user input and the full raw assistant output (both `<think>` reasoning and the final result) plus the KV cache hit ratio. 
The same cache-hit stats are surfaced through every API: `prompt_cache_hit_tokens` / `prompt_cache_hit_ratio` (Ollama), `usage.prompt_tokens_details.cached_tokens` (OpenAI), and `promptTokens` / `kvReusedTokens` / `kvReusePercent` in the Web UI SSE `done` event\n\n| Architecture | GGUF arch keys | Example Models | Multimodal | Thinking | Tool Calling | Card |\n|---|---|---|---|---|---|---|\n| Gemma 4 | `gemma4` | gemma-4-E4B, gemma-4-31B, gemma-4-26B-A4B (MoE) | Image, Video, Audio | Yes | Yes | |\n\n- `gemma3`: [gemma3.md](/zhongkaifu/TensorSharp/blob/main/docs/models/gemma3.md)\n- `qwen3`: [qwen3.md](/zhongkaifu/TensorSharp/blob/main/docs/models/qwen3.md)\n- `qwen35`, `qwen35moe`, `qwen3next`: [qwen35.md](/zhongkaifu/TensorSharp/blob/main/docs/models/qwen35.md)\n- `gptoss`, `gpt-oss`: [gptoss.md](/zhongkaifu/TensorSharp/blob/main/docs/models/gptoss.md)\n- `nemotron_h`, `nemotron_h_moe`: [nemotron.md](/zhongkaifu/TensorSharp/blob/main/docs/models/nemotron.md)\n- `mistral3`: [mistral3.md](/zhongkaifu/TensorSharp/blob/main/docs/models/mistral3.md)\n\nSee the [per-model architecture cards](/zhongkaifu/TensorSharp/blob/main/docs/models/README.md) for end-to-end documentation of each architecture (origin, forward graph, components, parameters, weight naming, and how TensorSharp implements / optimizes prefill and decode).\n\nTensorSharp loads models in GGUF format. Below are Hugging Face links where you can download GGUF files for each supported architecture. 
Pick a quantization that fits your hardware (Q4_K_M for low memory, Q8_0 for higher quality, etc.).\n\n| Architecture | Model | GGUF Download |\n|---|---|---|\n| Gemma 4 | gemma-4-E4B-it | |\n\n- [ggml-org/gemma-4-31B-it-GGUF](https://huggingface.co/ggml-org/gemma-4-31B-it-GGUF)\n- [ggml-org/gemma-4-26B-A4B-it-GGUF](https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF)\n- [google/gemma-3-4b-it-qat-q4_0-gguf](https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-gguf)\n- [Qwen/Qwen3-4B-GGUF](https://huggingface.co/Qwen/Qwen3-4B-GGUF)\n- [unsloth/Qwen3.5-9B-GGUF](https://huggingface.co/unsloth/Qwen3.5-9B-GGUF)\n- [ggml-org/Qwen3.5-35B-A3B-GGUF](https://huggingface.co/ggml-org/Qwen3.5-35B-A3B-GGUF)\n- [ggml-org/gpt-oss-20b-GGUF](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF)\n- [bartowski/nvidia_Nemotron-H-8B-Reasoning-128K-GGUF](https://huggingface.co/bartowski/nvidia_Nemotron-H-8B-Reasoning-128K-GGUF)\n- [bartowski/nvidia_Nemotron-H-47B-Reasoning-128K-GGUF](https://huggingface.co/bartowski/nvidia_Nemotron-H-47B-Reasoning-128K-GGUF)\n- [bartowski/Mistral-Small-3.1-24B-Instruct-2503-GGUF](https://huggingface.co/bartowski/Mistral-Small-3.1-24B-Instruct-2503-GGUF)\n\n| Backend | Flag | Best fit | Description |\n|---|---|---|---|\n| Direct CUDA/cuBLAS | `--backend cuda` | NVIDIA inference and experimentation | Uses the CUDA Driver API, cuBLAS GEMM, PTX kernels for common float32 ops (fill, unary, binary, ternary, activations, RMSNorm, softmax, RoPE/RoPEEx, SDPA, GQA prefill/decode, causal mask, gather/concat), and native quantized matmul/get-rows for supported GGUF quant types. Unsupported ops route through CPU fallbacks while preserving tensor semantics. 
|\n| MLX Metal | `--backend mlx` | Apple Silicon (alternative to GGML Metal) | GPU-accelerated path built on `async_eval` to overlap GPU/CPU work, batched MoE decode with stacked expert weight slabs, MoE expert offload, GGUF mmap pinned in physical RAM via `mlock(2)`, host-derived allocator caps (`TS_MLX_MEMORY_LIMIT_MB` / `TS_MLX_CACHE_LIMIT_MB` / `TS_MLX_WIRED_LIMIT_MB`), and a CPU fallback for ops that aren't yet wired up. Requires `libmlxc` (built locally by `TensorSharp.Backends.MLX/build-native-macos.sh` or located via `TENSORSHARP_MLX_LIBRARY` / `TENSORSHARP_MLX_LIBRARY_DIR`). |\n| GGML Metal | `--backend ggml_metal` | | |\n| GGML CUDA | `--backend ggml_cuda` | | |\n| GGML CPU | `--backend ggml_cpu` | | |\n| Pure C# CPU | `--backend cpu` | | |\n\n```\nTensorSharp/\n├── TensorSharp.Core/            # Core tensor library (Tensor, Ops, memory, device abstraction, CPU SIMD/managed quantized kernels)\n├── TensorSharp.Runtime/         # GGUF, tokenizers, templates, sampling, protocol parsing\n│   ├── Paged/                   # Paged KV cache primitives (BlockPool, BlockTable, KvBlock, BlockHashIndex, PagedKvStorage, PagedKvBatchOps, ManagedPagedAttention)\n│   ├── Scheduling/              # Continuous batching engine (InferenceEngine, BatchExecutor, ContinuousBatchScheduler, SequenceState, SchedulerConfig/Output, InferenceRequestHandle)\n│   ├── PagedKvCacheManager.cs   # Per-session paged KV manager (block allocation, prefix reuse)\n│   ├── PagedKvBlockStore.cs     # On-disk / RAM-tiered paged block storage with optional SSD spillover\n│   ├── SsdKvBlockTier.cs        # SSD-backed cold tier for paged blocks\n│   ├── TurboQuantKvCodec.cs     # Quantized KV block codec (Q4 / Q8) implementing IKvBlockCodec\n│   ├── PrefillChunking.cs       # Chunked-prefill helper used by SWA / very long prompts\n│   ├── KvBlockHash.cs           # Content-addressed block hash for prefix-cache sharing\n│   └── Logging/                 # JSON-line file logger + per-turn telemetry\n├── TensorSharp.Models/          # Model architectures and multimodal 
encoders/injectors\n│   ├── Models/<Family>/         # One folder per architecture (Gemma3, Gemma4, GptOss, Mistral3, Nemotron, Qwen3, Qwen35)\n│   │   ├── <Family>Model.cs                # Legacy per-sequence ModelBase implementation\n│   │   └── <Family>Model.BatchedForward.cs # IBatchedPagedModel.ForwardBatch — batched/paged path (Mistral3, Gemma4, GptOss, Qwen35, Nemotron, Qwen3)\n│   ├── Paged/                   # Tensor-side paged-attention helpers (TensorPagedAttention)\n│   ├── KvBlockTransfer.cs       # Helpers for extract/inject of KV blocks across sequences\n│   └── ModelMultimodalInjector.cs # Vision / audio / video embedding injection\n├── TensorSharp.Backends.GGML/   # GGML backend bindings (Metal/CUDA/CPU via native library)\n├── TensorSharp.Backends.Cuda/   # Direct CUDA backend using CUDA Driver API, cuBLAS, and PTX kernels\n├── TensorSharp.Backends.MLX/    # Apple Silicon MLX backend (mlx-c / Metal). Native bridge is built via `build-native-macos.sh`.\n├── TensorSharp.GGML.Native/     # Native C++ bridge to ggml (builds libGgmlOps, split into focused source files)\n│   ├── ggml_ops_core.cpp                  # Element-wise, reductions, basic shape ops\n│   ├── ggml_ops_elementwise.cpp           # Element-wise / activation fusions\n│   ├── ggml_ops_matmul.cpp                # GEMM / quantized matmul\n│   ├── ggml_ops_fused.cpp                 # Cross-cutting fused per-layer kernels\n│   ├── ggml_ops_norm_attn.cpp             # Norm + attention fusions\n│   ├── ggml_ops_transformer.cpp           # Full-layer fused transformer kernels (decode + prefill)\n│   ├── ggml_ops_moe.cpp                   # Mixture-of-Experts forward / fused router\n│   ├── ggml_ops_gated_delta_net.cpp       # Qwen 3.5/3.6 GatedDeltaNet kernels (per-seq + batched)\n│   ├── ggml_ops_mamba2.cpp                # Nemotron Mamba2 kernels (per-seq + batched SIMD)\n│   ├── ggml_ops_paged_attention.cpp       # Paged-attention native kernel (drives ggml_flash_attn_ext + sinks 
variant)\n│   ├── ggml_ops_training.cpp              # Training-only kernels (unused at runtime)\n│   └── tests/                              # Native unit + smoke tests\n├── TensorSharp.Server/          # Web chatbot + API server (ASP.NET Core)\n│   ├── Program.cs               # Slim bootstrap: DI wiring, middleware, endpoint mapping, paged-KV + continuous-batching CLI translation\n│   ├── ModelService.cs          # Facade that keeps the public server inference API stable; owns the InferenceEngineHost\n│   ├── ModelLifecycleService.cs # Model load/dispose and backend selection (CPU / CUDA / MLX / GGML CPU/Metal/CUDA)\n│   ├── InferenceEngineHost.cs   # DI-registered per-model InferenceEngine singleton (continuous batching entry point)\n│   ├── ChatGenerationPipeline.cs # Prompt rendering, submits to InferenceEngine, streams tokens, stop handling\n│   ├── InferenceTelemetry.cs    # Prompt/eval timing, TTFT, tokens/sec, full input/output logs\n│   ├── ChatHistoryPreparer.cs   # History normalization, raw-token splice helpers, multimodal order helpers\n│   ├── ChatSession.cs           # Per-conversation tracked history + raw assistant tokens\n│   ├── SessionManager.cs        # Thread-safe session registry (default + per-tab sessions)\n│   ├── InferenceQueue.cs        # Backward-compatible queue-status surface (engine itself handles concurrency)\n│   ├── BackendCatalog.cs        # Discovery of available compute backends (CPU / CUDA / MLX / GGML*)\n│   ├── TextUploadHelper.cs      # Token-budget-aware text-file truncation\n│   ├── WebUiChatPolicy.cs       # Web UI chat request validation\n│   ├── OpenAIResponseFormatParser.cs  # OpenAI response_format (json_object / json_schema) parsing\n│   ├── Hosting/                 # Startup-time concerns: options builder (ServerOptionsBuilder), backend resolution, logging, web root, paged-KV / continuous-batching CLI translation\n│   ├── RequestParsers/          # JSON request parsing (sampling, chat messages, tool functions)\n│ 
  ├── ResponseSerializers/     # Per-protocol response shape factories (Ollama, OpenAI, Web UI)\n│   ├── StreamingWriters/        # SSE + NDJSON wire-format helpers\n│   ├── ProtocolAdapters/        # Per-protocol request handlers (WebUiAdapter, OllamaAdapter, OpenAIChatAdapter)\n│   ├── Endpoints/               # ASP.NET Core endpoint mapping (one extension method per protocol)\n│   ├── Logging/                 # Request logging middleware + low-noise path support\n│   ├── wwwroot/index.html       # Chat UI\n│   ├── testdata/                # Integration test suites (bash + Python)\n│   └── API_EXAMPLES.md          # Detailed API documentation\n├── TensorSharp.Cli/             # CLI application (one-shot generation, interactive REPL, batch JSONL, benchmarks)\n├── InferenceWeb.Tests/          # xUnit unit tests covering ops, KV cache, paged scheduler, batched-model correctness, web/server helpers\n├── AdvUtils/                    # Utility library (logger)\n├── docs/                        # Developer reference\n│   ├── models/                  # Per-model architecture cards (one .md per model, EN + 中文)\n│   ├── PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md  # Paged KV cache, prefix sharing, scheduler, per-model batched-forward status\n│   └── inference_benchmark_matrix.md  # Cross-engine throughput matrix (TensorSharp vs llama.cpp vs Ollama)\n├── benchmarks/                  # Reproducible benchmark harnesses\n│   └── inference_matrix/        # Driver scripts, modelfiles, prompts, and per-cell raw JSON results\n└── ExternalProjects/            # ggml/ is cloned from github.com/ggml-org/ggml at build time (not committed)\n```\n\nThe repository is now split along package boundaries so consumers can depend on only the layers they actually need.\n\n| Project | NuGet package | Public namespace | Responsibility |\n|---|---|---|---|\n| `TensorSharp.Core` | `TensorSharp.Core` | `TensorSharp` | Tensor primitives, ops, allocators, storage, and device abstraction 
|\n| `TensorSharp.Runtime` | `TensorSharp.Runtime` | `TensorSharp.Runtime` | GGUF parsing, tokenizers, prompt rendering, sampling, output protocol parsing, paged KV cache, continuous-batching scheduler |\n| `TensorSharp.Models` | `TensorSharp.Models` | `TensorSharp.Models` | `ModelBase`, architecture implementations, multimodal encoders, batched / paged forward passes, and model-side execution helpers |\n| `TensorSharp.Backends.GGML` | `TensorSharp.Backends.GGML` | `TensorSharp.GGML` | GGML-backed execution and native interop |\n| `TensorSharp.Backends.Cuda` | `TensorSharp.Backends.Cuda` | `TensorSharp.Cuda` | Direct CUDA allocator, storage, cuBLAS GEMM, PTX kernels, and quantized CUDA ops |\n| `TensorSharp.Backends.MLX` | `TensorSharp.Backends.MLX` | `TensorSharp.MLX` | Apple Silicon MLX backend (mlx-c / Metal) with quantized / fused / compiled kernels and MoE expert offload |\n| `TensorSharp.Server` | `TensorSharp.Server` | `TensorSharp.Server` | ASP.NET Core server, OpenAI/Ollama adapters, inference engine host, web UI |\n| `TensorSharp.Cli` | `TensorSharp.Cli` | `TensorSharp.Cli` | Console host and debugging / batch tooling |\n\nThis split keeps engine users off the web stack, keeps API-layer changes from leaking into core/runtime packages, and makes future benchmark or eval-harness projects easier to publish independently.\n\nValidate package metadata and README dependency boundaries before publishing:\n\n```\npwsh ./eng/verify-packages.ps1\n```\n\nThe verifier runs `dotnet pack` for the public packages above and fails if an internal dependency such as `AdvUtils` leaks into the `.nuspec`, or if a TensorSharp package depends on a layer outside this table.\n\n- [.NET 10 SDK](https://dotnet.microsoft.com/download/dotnet/10.0)\n- `git` and network access: the GGML/CUDA native builds clone the ggml sources from [github.com/ggml-org/ggml](https://github.com/ggml-org/ggml) into `ExternalProjects/ggml/` on first build 
(see `eng/fetch-ggml.sh` / `eng/fetch-ggml.ps1`). The clone tracks ggml's default branch (`master`); pin a different ref with `TENSORSHARP_GGML_GIT_REF`, or set `TENSORSHARP_GGML_NO_UPDATE=1` to skip the network update once cloned (offline rebuilds)\n- **macOS (Metal backend):** CMake 3.20+ and Xcode command-line tools for building the native GGML library; the MLX backend additionally builds `libmlxc` from `TensorSharp.Backends.MLX/Native/` via `bash TensorSharp.Backends.MLX/build-native-macos.sh`\n- **Windows (GGML CPU / CUDA backends):** CMake 3.20+ and Visual Studio 2022 C++ build tools; for `ggml_cuda` or `cuda`, install an NVIDIA driver plus CUDA Toolkit 12.x or another compatible CUDA toolkit with cuBLAS\n- **Linux (GGML CPU / CUDA backends):** CMake 3.20+; for `ggml_cuda` or `cuda`, install an NVIDIA driver plus CUDA Toolkit 12.x or another compatible CUDA toolkit with cuBLAS\n- GGUF model files (e.g., from [Hugging Face](https://huggingface.co))\n\n```\ndotnet build TensorSharp.slnx\n# Console application\ndotnet build TensorSharp.Cli/TensorSharp.Cli.csproj\n\n# Web application\ndotnet build TensorSharp.Server/TensorSharp.Server.csproj\n```\n\nThe native library is built automatically during the first `dotnet build` if it doesn't exist. To build it manually:\n\n```\ncd TensorSharp.GGML.Native\n```\n\nmacOS:\n\n```\nbash build-macos.sh\n```\n\nLinux (CPU-only):\n\n```\nbash build-linux.sh\n```\n\nLinux (GGML_CUDA enabled):\n\n```\nbash build-linux.sh --cuda\n```\n\nWindows (CPU-only):\n\n```\n.\\build-windows.ps1 --no-cuda\n```\n\nWindows (GGML_CUDA enabled):\n\n```\n.\\build-windows.ps1 --cuda\n```\n\nOn Windows and Linux, the native build script auto-detects the visible NVIDIA GPU compute capability and passes a narrow `CMAKE_CUDA_ARCHITECTURES` value to ggml-cuda (for example `86-real` on an RTX 3080), which cuts CUDA build time substantially. 
The native build also runs in parallel by default with a conservative job cap so `nvcc` does not overwhelm typical developer machines.\n\nIf you want to override the auto-detected architecture list or the default build parallelism, use either environment variables or explicit build flags:\n\n```\nTENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES='86-real;89-real' bash build-linux.sh --cuda\nbash build-linux.sh --cuda --cuda-arch='86-real;89-real'\nTENSORSHARP_GGML_NATIVE_BUILD_PARALLEL_LEVEL=2 bash build-linux.sh --cuda\n$env:TENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES='86-real;89-real'; .\\build-windows.ps1 --cuda\n.\\build-windows.ps1 --cuda --cuda-arch='86-real;89-real'\n$env:TENSORSHARP_GGML_NATIVE_BUILD_PARALLEL_LEVEL=2; .\\build-windows.ps1 --cuda\n```\n\nYou can also request a CUDA-enabled native build from `dotnet build`:\n\n```\nTENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release\n$env:TENSORSHARP_GGML_NATIVE_ENABLE_CUDA='ON'; dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release\n```\n\nOn macOS this compiles `libGgmlOps.dylib` with Metal GPU support. On Windows and Linux, the native scripts preserve an existing CUDA-enabled build and auto-enable GGML_CUDA when a CUDA toolchain is detected; `build-windows.ps1 --cuda`, `build-linux.sh --cuda`, and `TENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON` force CUDA explicitly. The build output is automatically copied to the application's output directory.\n\nThe direct `cuda` backend is built as managed C# plus PTX kernels. During `dotnet build`, `TensorSharp.Backends.Cuda` compiles `native/kernels/*.cu` to `native/ptx/*.ptx` when `nvcc` is available; if `nvcc` is missing, the build continues and PTX-backed ops use CPU fallbacks. cuBLAS-backed GEMM still requires the CUDA runtime libraries to be discoverable at run time.\n\nThe MLX backend depends on `libmlxc` (the C bindings for [MLX](https://github.com/ml-explore/mlx)). 
The repository pins a known-good tag of `mlx-c` in `TensorSharp.Backends.MLX/Native/MLX_C_VERSION` and a helper script fetches and builds it:\n\n```\nbash TensorSharp.Backends.MLX/build-native-macos.sh\n```\n\nThe script writes the resulting libraries (`libmlxc.dylib`, `libmlx.dylib`, and any backend deps) into `TensorSharp.Backends.MLX/Native/dist/`. At run time the backend probes the application directory first; you can also point it to a custom install with `TENSORSHARP_MLX_LIBRARY=<path-to-libmlxc.dylib>` or `TENSORSHARP_MLX_LIBRARY_DIR=<dir-with-libmlxc>`. If the library cannot be located the backend reports unavailable and `--backend mlx` is rejected at startup.\n\n```\ncd TensorSharp.Cli/bin\n\n# Text inference\n./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt \\\n    --max-tokens 200 --backend ggml_metal\n\n# Text inference on Windows/Linux + NVIDIA GPU\n./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt \\\n    --max-tokens 200 --backend ggml_cuda\n\n# Interactive turn-by-turn chat (REPL) with KV cache reuse and slash commands\n./TensorSharp.Cli --model <model.gguf> --backend ggml_metal --interactive\n./TensorSharp.Cli --model <model.gguf> --backend ggml_metal -i \\\n    --system \"You are a terse assistant.\" --temperature 0.7 --top-p 0.9 --think\n\n# Image inference (Gemma 3/4, Qwen 3.5-family)\n./TensorSharp.Cli --model <model.gguf> --image photo.png --backend ggml_metal\n\n# Video inference (Gemma 4)\n./TensorSharp.Cli --model <model.gguf> --video clip.mp4 --backend ggml_metal\n\n# Audio inference (Gemma 4)\n./TensorSharp.Cli --model <model.gguf> --audio speech.wav --backend ggml_metal\n\n# Thinking / reasoning mode\n./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal --think\n\n# Tool calling\n./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal \\\n    --tools tools.json\n\n# With sampling 
parameters\n./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal \\\n    --temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.2 --seed 42\n\n# Batch processing (JSONL)\n./TensorSharp.Cli --model <model.gguf> --input-jsonl requests.jsonl \\\n    --output results.txt --backend ggml_metal\n\n# Multi-turn chat simulation with KV-cache reuse (mirrors the web UI behavior)\n./TensorSharp.Cli --model <model.gguf> --multi-turn-jsonl chat.jsonl \\\n    --backend ggml_metal --max-tokens 200\n\n# Throughput benchmark: best-of-N prefill and decode timing\n./TensorSharp.Cli --model <model.gguf> --backend ggml_metal \\\n    --benchmark --bench-prefill 256 --bench-decode 128 --bench-runs 3\n\n# KV-cache reuse benchmark: measure prefill speedup across multiple chat turns\n# (compares with-cache vs forced-reset prefill latency for a 4-turn conversation)\n./TensorSharp.Cli --model <model.gguf> --backend ggml_metal \\\n    --bench-kvcache --bench-kv-turns 4 --max-tokens 64\n\n# Inspect the rendered prompt and tokenization without running inference\n./TensorSharp.Cli --model <model.gguf> --input prompt.txt --dump-prompt\n\n# Compare hardcoded fallback templates against GGUF Jinja2 templates for every\n# *.gguf file in a directory (useful when adding new architectures)\n./TensorSharp.Cli --test-templates ~/models\n```\n\n**Command-line options:**\n\n| Option | Description |\n|---|---|\n| `--model <path>` | Path to a GGUF model file (required) |\n| `--input <path>` | Text file containing the user prompt |\n| `--input-jsonl <path>` | JSONL file with batch requests (one JSON per line) |\n| `--multi-turn-jsonl <path>` | JSONL file for multi-turn chat simulation with KV cache reuse |\n| `--output <path>` | Write generated text to this file |\n| `--image <path>` | Image file for vision inference |\n| `--video <path>` | Video file for video inference |\n| `--audio <path>` | Audio file (WAV, MP3, OGG) for audio inference |\n| `--mmproj <path>` | Path to the multimodal 
projector GGUF file |\n| `--max-tokens <N>` | Maximum tokens to generate (default: 100) |\n| `--backend <type>` | Compute backend: `cpu`, `cuda`, `mlx`, `ggml_cpu`, `ggml_metal`, or `ggml_cuda` |\n| `--kv-cache-dtype <type>` | KV cache precision: `f32` (default), `f16`, or `q8_0`. Quantized / half-precision KV caches reduce memory at the cost of small numerical drift; benchmarks live in `docs/inference_benchmark_matrix.md` |\n| `--interactive` / `-i` | Start an interactive REPL chat session (turn-by-turn input/output) with KV cache reuse, slash commands, hot-swappable model/backend/projector, file attachments (image, audio, video, text) and live sampling tuning. See the Interactive REPL commands section below for the full list. |\n| `--system <text>` | System prompt to seed the interactive session (overridden inside the REPL by `/system`) |\n| `--system-file <path>` | Read the initial system prompt from a UTF-8 text file (alternative to `--system`) |\n| `--think` | Enable thinking/reasoning mode (chain-of-thought) |\n| `--tools <path>` | JSON file with tool/function definitions |\n| `--temperature <f>` | Sampling temperature (0 = greedy) |\n| `--top-k <N>` | Top-K filtering (0 = disabled) |\n| `--top-p <f>` | Nucleus sampling threshold (1.0 = disabled) |\n| `--min-p <f>` | Minimum probability filtering (0 = disabled) |\n| `--repeat-penalty <f>` | Repetition penalty (1.0 = none) |\n| `--presence-penalty <f>` | Presence penalty (0 = disabled) |\n| `--frequency-penalty <f>` | Frequency penalty (0 = disabled) |\n| `--seed <N>` | Random seed (-1 = non-deterministic) |\n| `--stop <string>` | Stop sequence (can be repeated) |\n| `--dump-prompt` | Render the prompt + tokenization and exit (no generation) |\n| `--benchmark` | Run a synthetic prefill/decode throughput benchmark |\n| `--bench-prefill <N>` | Synthetic prefill length in tokens (default: 32) |\n| `--bench-decode <N>` | Synthetic decode length in tokens (default: 64) |\n| `--bench-runs <N>` | Number of benchmark runs; 
reports best and average (default: 1) |\n| `--bench-kvcache` | Run a multi-turn KV-cache reuse benchmark (with-cache vs forced-reset prefill) |\n| `--bench-kv-turns <N>` | Number of conversation turns for `--bench-kvcache` (default: 4, max: 8) |\n| `--bench-chunked` | Run a chunked-prefill micro-benchmark (Gemma 4) |\n| `--warmup-runs <N>` | Number of throw-away forward passes before timing real text / multimodal prompts (default: 0) |\n| `--test-chunked-prefill` | Run the chunked-prefill correctness check (compares chunked vs non-chunked logits) |\n| `--correct-prefill <N>` | Prompt length used by `--test-chunked-prefill` |\n| `--correct-decode <N>` | Decode length used by `--test-chunked-prefill` |\n| `--test` | Run built-in tokenizer + Qwen3 chat-template + ollama-comparison tests |\n| `--test-templates <dir>` | Validate hardcoded chat templates against GGUF Jinja2 templates for every `*.gguf` in `<dir>` |\n| `--log-level <lvl>` | Console + file logger level: `trace`, `debug`, `info`, `warning`, `error`, `critical`, `off` |\n| `--log-dir <path>` | Directory for the JSON-line file logger (default: `<binDir>/logs`) |\n| `--log-file <0\\|1>` | Disable (`0`) or enable (`1`) the file logger (default: enabled) |\n| `--log-console <0\\|1>` | Disable (`0`) or enable (`1`) the console logger (default: enabled) |\n\nThe multimodal projector file is auto-detected if placed alongside the model file with a recognized name (e.g., `gemma-4-mmproj-F16.gguf`).\n\n**JSONL input format:**\n\nEach line is a JSON object with `messages`, optional `prompt`, and optional sampling parameters:\n\n```\n{\"id\": \"q1\", \"messages\": [{\"role\": \"user\", \"content\": \"What is 2+3?\"}], \"max_tokens\": 50}\n{\"id\": \"q2\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a haiku.\"}], \"max_tokens\": 100, \"temperature\": 0.8}\n```\n\n**Interactive REPL commands:**\n\nOnce the CLI is launched with `--interactive` / `-i`, you can drive the running session with slash 
commands. Type `/help` (or `/?`) inside the REPL for the same list. Anything that does not start with `/` is treated as a user turn.\n\nThe prompt header summarizes the current state on every turn: model, backend, architecture, context length, projector, conversation depth, and any attachments queued for the next turn (e.g. `[turn 3 (2 attachments pending)]> `). Press Ctrl+C while generating to interrupt the current reply; press Ctrl+C at the prompt to exit.\n\nConversation:\n\n| Command | Description |\n|---|---|\n| `/help`, `/?` | Show all interactive commands |\n| `/exit`, `/quit` | Leave the session |\n| `/reset`, `/new` | Clear conversation history and KV cache |\n| `/history` | Print the conversation history |\n| `/save <file>` | Append the current transcript to a UTF-8 file |\n| `/system <text>` | Set the system prompt (empty argument clears it). Resets KV cache. |\n| `/think on\\|off` | Toggle thinking/reasoning mode for supported models |\n| `/multiline on\\|off` | Toggle multi-line input (terminate the message with a single `.` on its own line) |\n\nModel and runtime:\n\n| Command | Description |\n|---|---|\n| `/info`, `/status` | Show the loaded model, backend, architecture, context/vocab size, projector, conversation depth, and pending attachments |\n| `/model <path>` | Load a different `.gguf` model on the current backend (resets the session) |\n| `/backend <name>` | Reload the current model on a different backend: `cpu`, `cuda`, `mlx`, `ggml_cpu`, `ggml_metal`, or `ggml_cuda` |\n| `/mmproj <path>` | Load (or replace) the multimodal projector for the current model. 
Aliases: `/projector` |\n\nSampling (live, persists across turns):\n\n| Command | Description |\n|---|---|\n| `/sampling`, `/show` | Print the current sampling configuration |\n| `/max <N>` | Maximum reply length in tokens |\n| `/temp <float>` | Sampling temperature (0 = greedy) |\n| `/topk <int>` | Top-K filtering (0 = disabled) |\n| `/topp <float>` | Top-P / nucleus threshold (1.0 = disabled) |\n| `/minp <float>` | Min-P filtering (0 = disabled) |\n| `/repeat <float>` | Repetition penalty (1.0 = none) |\n| `/presence <float>` | Presence penalty |\n| `/frequency <float>` | Frequency penalty |\n| `/seed <int>` | Random seed (-1 = non-deterministic) |\n| `/stop <text>` | Add a stop sequence |\n| `/clearstop` | Remove all stop sequences |\n\nUploads (queued for the next user turn, then auto-cleared after the turn):\n\n| Command | Description |\n|---|---|\n| `/image <path>`, `/img <path>` | Attach an image (vision-capable models only) |\n| `/audio <path>` | Attach an audio file (Gemma 4) |\n| `/video <path>`, `/vid <path>` | Attach a video; frames are extracted automatically (Gemma 4) |\n| `/text <path>`, `/file <path>`, `/txt <path>` | Inline a UTF-8 text/markdown/csv/code file into the next prompt (large files are token-budget truncated) |\n| `/clearattach` | Drop any pending image/audio/video/text attachments without sending a turn |\n\nQuoted paths (single or double quotes) are accepted, so drag-and-drop from a file manager works on macOS. 
Multimodal commands require a multimodal projector to be loaded: pass `--mmproj` at startup or use `/mmproj <path>` from the REPL.\n\n```\ncd TensorSharp.Server/bin\n\n# Start the server with the exact hosted model\n./TensorSharp.Server --model ./models/model.gguf --backend ggml_metal\n\n# Linux + NVIDIA GPU\n./TensorSharp.Server --model ./models/model.gguf --backend ggml_cuda\n\n# Multimodal models: host an explicit projector too\n./TensorSharp.Server --model ./models/model.gguf --mmproj ./models/mmproj.gguf --backend ggml_cuda\n\n# Configure server-wide default sampling parameters\n# (used whenever a request does not override the value itself)\n./TensorSharp.Server --model ./models/model.gguf --backend ggml_metal \\\n    --temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 \\\n    --presence-penalty 0.0 --frequency-penalty 0.0 --seed 42 \\\n    --stop \"</s>\" --stop \"<|endoftext|>\"\n```\n\nOpen `http://localhost:5000` in your browser. The web interface supports:\n\n- Multi-turn chat conversations\n- Per-tab chat sessions: each browser tab owns its own tracked conversation history; KV blocks are owned by the inference engine\n- A single hosted GGUF selected explicitly with `--model`\n- An explicit hosted multimodal projector via `--mmproj` when needed\n- Image, video, and audio uploads for multimodal inference (up to 500 MB)\n- Thinking/reasoning mode toggle\n- Tool calling with function definitions\n- Streaming token generation via Server-Sent Events\n- Backward-compatible queue-status events (the engine itself handles concurrency)\n- Message editing and deletion with regeneration from any point in the conversation\n- Free scrolling: scroll up to read earlier replies while new tokens stream in; the chat auto-scrolls again as soon as the user scrolls back to the bottom\n\nUse `--model` to choose the hosted GGUF file and `--mmproj` to choose the hosted projector. 
`TensorSharp.Server`\n\nno longer scans a `MODEL_DIR`\n\n.\n\n**Server command-line options:**\n\n| Option | Description |\n|---|---|\n`--model <path>` |\nGGUF file to host (required for inference; if omitted, the server starts but `/api/models/load` will report no hosted model) |\n`--mmproj <path>` |\nMultimodal projector GGUF (resolved relative to the model directory when only a filename is given; pass `none` to disable). Requires `--model` . |\n`--backend <type>` |\nDefault compute backend: `cpu` , `cuda` , `mlx` , `ggml_cpu` , `ggml_metal` , or `ggml_cuda` |\n`--max-tokens <N>` |\nDefault maximum tokens to generate when a request omits the limit (default: `20000` ) |\n`--temperature <f>` |\nDefault sampling temperature when a request does not provide one (`0` = greedy) |\n`--top-k <N>` |\nDefault top-K filtering when a request does not provide one (`0` = disabled) |\n`--top-p <f>` |\nDefault nucleus sampling threshold when a request does not provide one (`1.0` = disabled) |\n`--min-p <f>` |\nDefault min-p filtering when a request does not provide one (`0` = disabled) |\n`--repeat-penalty <f>` |\nDefault repetition penalty when a request does not provide one (`1.0` = none) |\n`--presence-penalty <f>` |\nDefault presence penalty when a request does not provide one (`0` = disabled) |\n`--frequency-penalty <f>` |\nDefault frequency penalty when a request does not provide one (`0` = disabled) |\n`--seed <N>` |\nDefault random seed when a request does not provide one (`-1` = non-deterministic) |\n`--stop <string>` |\nDefault stop sequence (can be repeated). Per-request `stop` /`stop_sequences` fully replace the default list rather than merge with it. |\n`--continuous-batching` / `--no-continuous-batching` |\nEnable (default) or disable iteration-level paged-batching. When enabled the server admits / preempts sequences mid-batch and packs them into one forward pass on models that implement `IBatchedPagedModel` . 
`--no-continuous-batching` falls back to per-sequence KV-swap for every model. Alias: `--paged-batching` / `--no-paged-batching` . |\n`--paged-kv` / `--no-paged-kv` |\nLegacy compatibility flags for the removed per-session paged-KV manager. Current server KV state is engine-owned; use continuous-batching / `TS_SCHED_*` knobs for the engine. Aliases: `--paged-kv-cache` / `--no-paged-kv-cache` . |\n`--paged-kv-block-size <N>` |\nLegacy standalone paged-KV block size. The current server engine uses `TS_SCHED_BLOCK_SIZE` . |\n`--paged-kv-ram-mb <N>` |\nLegacy standalone paged-KV RAM-tier cap. |\n`--paged-kv-ssd-dir <dir>` |\nLegacy standalone paged-KV SSD cold-tier directory. |\n`--paged-kv-ssd-mb <N>` |\nLegacy standalone paged-KV SSD cap. |\n`--paged-kv-quant-bits <0|4|8>` |\nLegacy standalone paged-KV block quantization (TurboQuantKvCodec). |\n\nPer-request fields in the chat / generate JSON payloads (e.g. `temperature`\n\n,\n`top_p`\n\n, `top_k`\n\n, `min_p`\n\n, `repeat_penalty`\n\n, `presence_penalty`\n\n,\n`frequency_penalty`\n\n, `seed`\n\n, `stop`\n\n/`stop_sequences`\n\n) always win over these\nserver-wide defaults; the defaults only fill in fields the client omits.\n\n**Runtime environment variables:**\n\n| Variable | Description |\n|---|---|\n`BACKEND` |\nDefault compute backend (`cpu` , `cuda` , `mlx` , `ggml_cpu` , `ggml_metal` , or `ggml_cuda` ), used when `--backend` is not passed (default: `ggml_metal` on macOS, `ggml_cpu` elsewhere) |\n`MAX_TOKENS` |\nDefault maximum generation length when neither `--max-tokens` nor a request-level limit is set (default: `20000` ) |\n`MAX_TEXT_FILE_CHARS` |\nCharacter cap used to truncate plain-text uploads when no tokenizer is available (default: `8000` ) |\n`VIDEO_MAX_FRAMES` |\nMaximum evenly spaced video frames extracted for video prompts (default: `4` ) |\n`PORT` / `ASPNETCORE_URLS` |\nStandard ASP.NET Core listener configuration (default port: `5000` ) |\n`TENSORSHARP_TEMPERATURE` |\nDefault sampling temperature 
when neither `--temperature` nor the request body sets one |\n`TENSORSHARP_TOP_K` |\nDefault top-K when neither `--top-k` nor the request body sets one |\n`TENSORSHARP_TOP_P` |\nDefault top-P when neither `--top-p` nor the request body sets one |\n`TENSORSHARP_MIN_P` |\nDefault min-P when neither `--min-p` nor the request body sets one |\n`TENSORSHARP_REPEAT_PENALTY` |\nDefault repetition penalty when neither `--repeat-penalty` nor the request body sets one |\n`TENSORSHARP_PRESENCE_PENALTY` |\nDefault presence penalty when neither `--presence-penalty` nor the request body sets one |\n`TENSORSHARP_FREQUENCY_PENALTY` |\nDefault frequency penalty when neither `--frequency-penalty` nor the request body sets one |\n`TENSORSHARP_SEED` |\nDefault random seed when neither `--seed` nor the request body sets one |\n`TENSORSHARP_LOG_LEVEL` |\nMinimum log level for both console and file loggers: `Trace` , `Debug` , `Information` , `Warning` , `Error` , `Critical` (default: `Information` ). Also honored by `TensorSharp.Cli` . |\n`TENSORSHARP_LOG_DIR` |\nDirectory the JSON-line file logger writes to (default: `<binDir>/logs` ). Also honored by `TensorSharp.Cli` . |\n`TENSORSHARP_LOG_FILE` |\nSet to `0` to disable the file logger and keep only the console output (default: enabled). Also honored by `TensorSharp.Cli` . |\n\n**Paged KV cache & continuous-batching tunables (read at process / model start)**\n\nThese can be set with either the `--paged-kv*`\n\n/ `--continuous-batching`\n\nCLI flags (which translate to the env vars below) or directly via the environment:\n\n| Variable | Description |\n|---|---|\n`TS_KV_PAGED_CACHE` |\nLegacy compatibility switch for the standalone `PagedKvCacheManager` ; current `TensorSharp.Server` request KV state is engine-owned. The CLI shortcuts are `--paged-kv` / `--no-paged-kv` . |\n`TS_KV_BLOCK_SIZE` |\nLegacy standalone paged-KV block size. The engine uses `TS_SCHED_BLOCK_SIZE` . 
|\n`TS_KV_CACHE_MAX_RAM_MB` |\nLegacy standalone paged-KV RAM-tier cap. |\n`TS_KV_CACHE_SSD_DIR` |\nLegacy standalone paged-KV SSD cold-tier directory. |\n`TS_KV_CACHE_MAX_SSD_MB` |\nLegacy standalone paged-KV SSD cap. |\n`TS_KV_PAGED_QUANT_BITS` |\nLegacy standalone paged-KV block quantization bits (`0` = passthrough, `4` , or `8` ). |\n`TS_SCHED_DISABLE_BATCHED` |\n`1` forces the per-sequence KV-swap fallback even when a model implements `IBatchedPagedModel` . The CLI shortcut is `--no-continuous-batching` . |\n`TS_SCHED_MAX_BATCHED_TOKENS` |\nScheduler per-step token budget (default: `4096` ). |\n`TS_SCHED_MAX_RUNNING_SEQS` |\nMaximum in-flight sequences (default: `16` ). |\n`TS_SCHED_PREFILL_CHUNK` |\nMaximum prefill tokens per step (default: `1024` ). |\n`TS_SCHED_NUM_BLOCKS` |\nPhysical blocks in the engine block pool (default: `256` ). |\n`TS_SCHED_BLOCK_SIZE` |\nTokens per block on the engine side (default: `256` ). |\n`TS_SCHED_PREFIX_CACHE` |\n`0` disables block-hash prefix sharing across requests. |\n`TS_SCHED_DECODE_QUANTUM` |\nTokens before a sequence-switch is allowed (default: block size). |\n`TS_QWEN35_BATCHED` |\nSet to `0` to force the Qwen 3.5/3.6 family onto the legacy per-sequence KV-swap path (default: batched/paged). Also implicitly disabled by `--no-continuous-batching` . |\n`TS_QWEN35_BATCHED_GDN_NATIVE` |\nUse the native batched GatedDeltaNet kernel inside Qwen 3.5/3.6 batched path. |\n`TS_GEMMA4_BATCHED` |\nSet to `0` to force Gemma 4 onto the legacy per-sequence KV-swap path (default: batched/paged). |\n`TS_GPTOSS_BATCHED` |\nSet to `0` to force GPT OSS onto the legacy per-sequence KV-swap path (default: batched/paged). |\n`TS_GPTOSS_PAGED_ATTN_MANAGED` |\nUse the managed (C#) paged-attention-with-sinks kernel inside GPT OSS batched path. |\n`TS_NEMOTRON_BATCHED` |\nSet to `0` to force Nemotron-H onto the legacy per-sequence KV-swap path (default: batched/paged). 
|\n`TS_NEMOTRON_MAMBA2_BATCHED_NATIVE` |\nUse the native Mamba2 batched step kernel inside Nemotron-H batched path. |\n`TS_PAGED_ATTN_KERNEL` |\nPaged-attention dispatch kernel for `Mistral3Model.BatchedForward` : `native` (default), `tensor` (C# Tensor-based), or `managed` (pure C# scalar). |\n`TS_MLX_PIPELINED_DECODE` |\nSet to `1` to enable pipelined greedy decode on the MLX backend (CLI only). |\n`TS_MLX_MLOCK_GGUF` |\n`1` (default) pins the GGUF mmap region in physical RAM via `mlock(2)` so model weights stay resident between forward passes. Set to `0` to skip (use if the process `memlock` rlimit is too low or you want the OS to manage paging). MLX backend only. |\n`TS_MLX_FUSED_KV_WRITE` |\n`1` (default) uses a single multi-dim `slice_update` to write the per-token KV block. Set to `0` to revert to the per-head loop (A/B testing / regression isolation). |\n`TS_MLX_BATCHED_MOE_DECODE` |\n`1` (default) collapses K per-expert decode dispatches to one batched dispatch per (gate/up/down) kind for Qwen 3.5/3.6 MoE. Set to `0` on memory-constrained machines (saves ~weight-doubling overhead from the stacked weight slabs). |\n`TS_MLX_MOE_FUSED_GATE_UP_SILU` |\n`1` (default) fuses gate matmul + up matmul + SiLUMul into one Metal kernel for batched MoE decode. Set to `0` to A/B against the legacy 3-dispatch path. |\n`TS_MLX_DEVICE_ROUTER` |\n`1` (opt-in) keeps MoE router top-K + softmax on device to skip ~60 host syncs/token on Qwen 3.6-35B-A3B. Requires greedy router + batched MoE matmul. |\n`TS_MLX_LOG_MEMORY_POLICY` |\n`1` (default) prints once-per-load MLX memory-policy lines (wired limit, GGUF mlock status, allocator caps). Set to `0` to silence. |\n`TS_MLX_MEMORY_LIMIT_MB` / `TS_MLX_CACHE_LIMIT_MB` / `TS_MLX_WIRED_LIMIT_MB` |\nOverride the MLX allocator hard cap / unused-buffer cache cap / wired-buffer residency cap (megabytes). Defaults are derived from the host's unified-memory capacity. 
|\n| `TS_MLX_EVAL_EVERY_N_LAYERS` / `TS_MLX_GEMMA4_EVAL_EVERY_N_LAYERS` | Periodic `mlx_async_eval` cadence during decode to overlap GPU work with host queueing. Default `4` (sweep on E4B Q8_0 shows ~7% decode win vs. disabled). Set to `0` to disable. |\n| `TENSORSHARP_MLX_LIBRARY` / `TENSORSHARP_MLX_LIBRARY_DIR` | Override the search path for `libmlxc` when using `--backend mlx`. |\n\nSampling parameter precedence (highest wins):\n\n- Per-request JSON fields in the API call (e.g. `temperature`, `top_p`, `stop`).\n- Server-wide CLI flags (e.g. `--temperature`, `--top-p`, `--stop`).\n- `TENSORSHARP_*` environment variables listed above.\n- Built-in `SamplingConfig` defaults (`temperature=1.0`, `top_k=0`, `top_p=1.0`, `min_p=0`, `repeat_penalty=1.0`, presence/frequency penalties `0`, `seed=-1`, no stop sequences).\n\nQuick reference for which environment variables (and matching CLI flags) gate each major feature. Variables in **bold** are required to turn the feature on; everything else is a tunable for a feature that's already enabled by default.\n\n| Feature | Default | Env vars | CLI equivalent |\n|---|---|---|---|\n| Continuous-batching engine (`InferenceEngine` + scheduler) | ON in `TensorSharp.Server` | `TS_SCHED_DISABLE_BATCHED=1` to force per-seq fallback | `--no-continuous-batching` / `--continuous-batching` |\n| Legacy per-session paged-KV manager | removed from Server request path | `TS_KV_PAGED_CACHE` (`0` / `1`), `TS_KV_BLOCK_SIZE` retained for compatibility / standalone tests | `--paged-kv` / `--no-paged-kv`, `--paged-kv-block-size N` |\n| Legacy paged-KV SSD spillover (standalone manager) | OFF | `TS_KV_CACHE_MAX_RAM_MB`, `TS_KV_CACHE_SSD_DIR`, `TS_KV_CACHE_MAX_SSD_MB` | `--paged-kv-ram-mb`, `--paged-kv-ssd-dir`, `--paged-kv-ssd-mb` |\n| Legacy paged-KV block quantization (standalone manager) | OFF (`0` = passthrough) | `TS_KV_PAGED_QUANT_BITS` (`0` / `4` / `8`) | `--paged-kv-quant-bits` |\n| 
Block-hash prefix sharing across requests | ON | `TS_SCHED_PREFIX_CACHE=0` to disable | — |\n| Scheduler tunables (per-step token budget, max in-flight seqs, prefill chunk, block pool size, decode quantum) | engine defaults | `TS_SCHED_MAX_BATCHED_TOKENS`, `TS_SCHED_MAX_RUNNING_SEQS`, `TS_SCHED_PREFILL_CHUNK`, `TS_SCHED_NUM_BLOCKS`, `TS_SCHED_BLOCK_SIZE`, `TS_SCHED_DECODE_QUANTUM` | — |\n\n| Model | Default state | Env var to flip default | Native-kernel sub-toggle |\n|---|---|---|---|\n| Mistral 3 | ON | — | `TS_PAGED_ATTN_KERNEL` = `native` (default) / `tensor` / `managed` |\n| Gemma 4 | ON | `TS_GEMMA4_BATCHED=0` to force legacy per-seq | — |\n| Qwen 3 | ON (reference port) | — | — |\n| Qwen 3.5 / 3.6 family | ON | `TS_QWEN35_BATCHED=0` to force legacy per-seq (or `--no-continuous-batching`) | `TS_QWEN35_BATCHED_GDN_NATIVE=1` enables native batched GDN kernel; `FUSED_ATTN_LAYER_MIN_SEQ_LEN=N` overrides fused-attention engage threshold (default 4096) |\n| GPT OSS | ON | `TS_GPTOSS_BATCHED=0` to force legacy per-seq | `TS_GPTOSS_PAGED_ATTN_MANAGED=1` forces the managed (C#) sinks softmax instead of the native paged-attention-with-sinks kernel |\n| Nemotron-H | ON | `TS_NEMOTRON_BATCHED=0` to force legacy per-seq | `TS_NEMOTRON_MAMBA2_BATCHED_NATIVE=1` enables the native batched Mamba2 step (NEON SIMD + GCD parallelism) |\n| Gemma 3 | not implemented (per-seq fallback) | — | — |\n\n| Feature | Default | Env vars | CLI equivalent |\n|---|---|---|---|\n| Default compute backend | `ggml_metal` (macOS), `ggml_cpu` (Windows/Linux) | `BACKEND` | `--backend` |\n| MLX backend library lookup | probe app dir | `TENSORSHARP_MLX_LIBRARY` (full path to `libmlxc`), `TENSORSHARP_MLX_LIBRARY_DIR` (directory) | — |\n| MLX pipelined greedy decode (CLI only) | OFF | `TS_MLX_PIPELINED_DECODE=1` | — |\n| MLX `mlock(2)` of GGUF mmap so weights stay resident | ON | `TS_MLX_MLOCK_GGUF=0` to disable | — |\n| MLX fused multi-dim KV write (single `slice_update` per cache 
block) |\nON | `TS_MLX_FUSED_KV_WRITE=0` to revert to per-head loop |\n— |\n| MLX batched MoE decode (Qwen 3.5/3.6 MoE) | ON | `TS_MLX_BATCHED_MOE_DECODE=0` for legacy per-expert path |\n— |\n| MLX fused MoE gate+up+SiLUMul Metal kernel | ON | `TS_MLX_MOE_FUSED_GATE_UP_SILU=0` for legacy 3-dispatch |\n— |\n| MLX on-device MoE router top-K + softmax | OFF | `TS_MLX_DEVICE_ROUTER=1` |\n— |\nMLX Gemma 4 layer-boundary `async_eval` cadence |\nevery 4 layers | `TS_MLX_GEMMA4_EVAL_EVERY_N_LAYERS=N` (`0` = disabled) |\n— |\n| MLX allocator caps (memory / cache / wired buffer) | host-derived | `TS_MLX_MEMORY_LIMIT_MB` , `TS_MLX_CACHE_LIMIT_MB` , `TS_MLX_WIRED_LIMIT_MB` |\n— |\n| MLX one-line memory-policy banners at load | ON | `TS_MLX_LOG_MEMORY_POLICY=0` to silence |\n— |\n\nThese fill in fields the request body omits; per-request JSON always wins, CLI flags win over env vars.\n\n| Sampling field | Env var | CLI equivalent |\n|---|---|---|\n`temperature` |\n`TENSORSHARP_TEMPERATURE` |\n`--temperature` |\n`top_k` |\n`TENSORSHARP_TOP_K` |\n`--top-k` |\n`top_p` |\n`TENSORSHARP_TOP_P` |\n`--top-p` |\n`min_p` |\n`TENSORSHARP_MIN_P` |\n`--min-p` |\n`repeat_penalty` |\n`TENSORSHARP_REPEAT_PENALTY` |\n`--repeat-penalty` |\n`presence_penalty` |\n`TENSORSHARP_PRESENCE_PENALTY` |\n`--presence-penalty` |\n`frequency_penalty` |\n`TENSORSHARP_FREQUENCY_PENALTY` |\n`--frequency-penalty` |\n`seed` |\n`TENSORSHARP_SEED` |\n`--seed` |\n| max tokens | `MAX_TOKENS` |\n`--max-tokens` |\n| stop sequences | — (CLI / per-request only) | `--stop` (repeatable) |\n\n| Feature | Default | Env vars |\n|---|---|---|\n| ASP.NET Core listener | `http://0.0.0.0:5000` |\n`PORT` , `ASPNETCORE_URLS` |\n| Plain-text upload character cap (when no tokenizer available) | 8000 chars | `MAX_TEXT_FILE_CHARS` |\n| Video-frame extraction count | 4 frames | `VIDEO_MAX_FRAMES` |\n\n| Feature | Default | Env vars | CLI equivalent |\n|---|---|---|---|\n| Console + file log minimum level | `Information` 
|\n`TENSORSHARP_LOG_LEVEL` |\n`--log-level` |\n| File logger output directory | `<binDir>/logs` |\n`TENSORSHARP_LOG_DIR` |\n`--log-dir` |\n| File logger enabled | ON | `TENSORSHARP_LOG_FILE=0` to disable |\n`--log-file 0|1` |\n| Console logger enabled | ON | — | `--log-console 0|1` (CLI only) |\n\nThese are read by `build-linux.sh`\n\n/ `build-windows.ps1`\n\n/ the auto-build during `dotnet build`\n\nfor `TensorSharp.GGML.Native`\n\n, not at run time.\n\n| Feature | Default | Env vars | Build-script flag |\n|---|---|---|---|\n| Enable GGML CUDA in the native build | auto-detected from toolchain | `TENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON` |\n`--cuda` / `--no-cuda` |\nNarrow `CMAKE_CUDA_ARCHITECTURES` list |\nauto-detected from visible GPU | `TENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES` |\n`--cuda-arch='86-real;89-real'` |\n| Native build parallelism cap | conservative auto-cap | `TENSORSHARP_GGML_NATIVE_BUILD_PARALLEL_LEVEL` |\n— |\n\nThe server emits one structured Information-level entry at the start and end of every chat / generate turn, so a single grep over the log file reproduces the full request-response audit trail without replaying any traffic.\n\n| Event id | Emitted on | Carries |\n|---|---|---|\n`ChatStarted` (1500) |\n`chat.start` , `generate.start` , plus per-protocol request banners |\nsampling config, message + attachment counts, `userInput=` (full latest user message), `fullInput=` (JSON-encoded array of EVERY message in the request: system prompts + all prior user/assistant turns + the new user message, with attachment counts), or the full prompt for `/api/generate` |\n`ChatCompleted` (1502) |\n`chat.complete` , `generate.complete` |\ntoken counts, KV cache reuse (`kvReused` , `kvReusePercent` ), TTFT, elapsed, throughput, finish reason, full raw assistant output (reasoning + result) |\n`ChatAborted` (1503) |\nclient disconnected mid-stream | partial output, KV reuse fraction at the time of abort |\n`KvCacheReusePlan` (1510) |\nper-prefix-reuse 
decision | `Debug`-level fine-grained breakdown (exact match / partial / full reset) |\n| `HttpRequestStarted/Completed` (1100/1101) | every HTTP request | method, path, remote IP, status, duration; `/api/queue/status` is demoted to `Debug` so high-frequency UI polling does not drown out the per-turn entries |\n\nThe raw assistant output captures `<think>...</think>`, `<|channel|>analysis`,\nand any other inline framing the model emits, so the log line for a single turn\ncontains both reasoning and the user-visible result. Combined with the\n`fullInput=` field on `chat.start`, every turn is fully reproducible from the\nlog file alone (request inputs + raw model output). Long uploads or long\nreasoning traces can produce multi-kilobyte log lines; raise the log level\n(`TENSORSHARP_LOG_LEVEL=Warning`) to suppress them while still keeping the start\nbanner and error logs.\n\nSample `fullInput` payload (formatted for readability; it is emitted as a\nsingle line in the actual log):\n\n```\n[\n  {\"role\":\"system\",\"content\":\"You are a helpful assistant.\"},\n  {\"role\":\"user\",\"content\":\"What is the tallest mountain?\"},\n  {\"role\":\"assistant\",\"content\":\"Mount Everest.\"},\n  {\"role\":\"user\",\"content\":\"How tall is it?\",\"images\":1}\n]\n```\n\nThe same per-turn KV cache reuse stats are surfaced through every API:\n\n- **Web UI SSE** (`POST /api/chat`) - the `done` event carries `promptTokens`, `kvReusedTokens`, and `kvReusePercent`.\n- **Ollama NDJSON** (`POST /api/generate`, `POST /api/chat/ollama`) - the final chunk and the non-streaming response carry `prompt_cache_hit_tokens` (int) and `prompt_cache_hit_ratio` (0..1).\n- **OpenAI** (`POST /v1/chat/completions`) - the `usage` block carries `prompt_tokens_details.cached_tokens`, matching the OpenAI extension that existing SDKs already understand.\n\nThe Web UI footer line under each assistant message also surfaces the cache hit\ninline (e.g. 
`187 tokens · 2.1s · 87.2 tok/s · KV 420/512 (82%)`\n\n).\n\nTensorSharp.Server exposes three API styles. See [API_EXAMPLES.md](/zhongkaifu/TensorSharp/blob/main/TensorSharp.Server/API_EXAMPLES.md) for full documentation with curl and Python examples.\n\n**Ollama-compatible API:**\n\n```\n# List models\ncurl http://localhost:5000/api/tags\n\n# Generate text\ncurl -X POST http://localhost:5000/api/generate \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"Qwen3-4B-Q8_0.gguf\", \"prompt\": \"Hello!\", \"stream\": false}'\n\n# Chat\ncurl -X POST http://localhost:5000/api/chat/ollama \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"Qwen3-4B-Q8_0.gguf\", \"messages\": [{\"role\": \"user\", \"content\": \"Hi\"}], \"stream\": false}'\n\n# Chat with thinking mode\ncurl -X POST http://localhost:5000/api/chat/ollama \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"Qwen3-4B-Q8_0.gguf\", \"messages\": [{\"role\": \"user\", \"content\": \"Solve 17*23\"}], \"think\": true, \"stream\": false}'\n\n# Chat with tool calling\ncurl -X POST http://localhost:5000/api/chat/ollama \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"Qwen3-4B-Q8_0.gguf\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the weather?\"}], \"tools\": [{\"function\": {\"name\": \"get_weather\", \"description\": \"Get current weather\", \"parameters\": {\"properties\": {\"city\": {\"type\": \"string\"}}, \"required\": [\"city\"]}}}], \"stream\": false}'\n```\n\n**OpenAI-compatible API:**\n\n```\n# Chat completions\ncurl -X POST http://localhost:5000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"Qwen3-4B-Q8_0.gguf\", \"messages\": [{\"role\": \"user\", \"content\": \"Hi\"}], \"max_tokens\": 50}'\n\n# Structured outputs (OpenAI response_format)\ncurl -X POST http://localhost:5000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": 
\"Qwen3-4B-Q8_0.gguf\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Extract the city and country from: Paris, France.\"}],\n    \"response_format\": {\n      \"type\": \"json_schema\",\n      \"json_schema\": {\n        \"name\": \"location_extraction\",\n        \"strict\": true,\n        \"schema\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"city\": {\"type\": \"string\"},\n            \"country\": {\"type\": \"string\"},\n            \"confidence\": {\"type\": [\"string\", \"null\"]}\n          },\n          \"required\": [\"city\", \"country\", \"confidence\"],\n          \"additionalProperties\": false\n        }\n      }\n    }\n  }'\n```\n\n**OpenAI Python SDK:**\n\n``` python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:5000/v1\", api_key=\"not-needed\")\nresponse = client.chat.completions.create(\n    model=\"Qwen3-4B-Q8_0.gguf\",\n    messages=[{\"role\": \"user\", \"content\": \"What is 2+3?\"}],\n    max_tokens=50\n)\nprint(response.choices[0].message.content)\n```\n\n**Queue status:**\n\n```\ncurl http://localhost:5000/api/queue/status\n# {\"busy\":false,\"pending_requests\":0,\"total_processed\":42}\n```\n\nModels that support thinking mode (Qwen 3, Qwen 3.5/3.6-family, Gemma 4, GPT OSS, Nemotron-H) can produce structured chain-of-thought reasoning before generating the final answer. The thinking content is separated from the main response and can be displayed or hidden by the client.\n\n**Qwen 3 / Qwen 3.5/3.6-family / Nemotron-H:** uses`<think>...</think>`\n\ntags**Gemma 4:** uses`<|channel>thought\\n...<channel|>`\n\ntags**GPT OSS:** uses Harmony format with`<|channel|>analysis`\n\nfor thinking and`<|channel|>final`\n\nfor the response\n\nEnable via `--think`\n\n(console), `\"think\": true`\n\n(Ollama API), or the thinking toggle in the web UI.\n\nModels can invoke user-defined tools and participate in multi-turn tool-call conversations. 
Define tools as JSON and pass them via `--tools` (console) or the `tools` parameter in the API.\n\nEach architecture uses its own wire format for tool calls:\n\n- **Qwen 3 / Qwen 3.5/3.6-family / Nemotron-H:** `<tool_call>{\"name\": \"...\", \"arguments\": {...}}</tool_call>`\n- **Gemma 4:** `<|tool_call>call:function_name{args}<tool_call|>`\n- **GPT OSS (Harmony):** tools are declared as a TypeScript namespace in the developer message, and calls are emitted on the commentary channel as `<|channel|>commentary to=functions.NAME <|constrain|>json<|message|>{args}<|call|>`\n\nThe output parser (`OutputParser.cs`) automatically extracts tool calls from the model's raw output regardless of architecture.\n\nGemma 4 models support image, video, and audio inputs. Place the multimodal projector (`gemma-4-mmproj-F16.gguf`) in the same directory as the model file for automatic loading.\n\n- **Images:** PNG, JPEG, HEIC/HEIF\n- **Video:** MP4 (extracts up to 8 frames at 1 fps using OpenCV)\n- **Audio:** WAV (16 kHz mono), MP3, OGG Vorbis\n\nGemma 3 supports PNG, JPEG, and HEIC/HEIF image inputs. Place its multimodal projector (`mmproj-gemma3-4b-f16.gguf`) next to the model file for automatic loading.\n\nAll Qwen 3.5/3.6-family variants (`qwen35`, `qwen35moe`, and `qwen3next`) load through the same `Qwen35Model` implementation. Image inputs are supported via the dynamic-resolution `Qwen35VisionEncoder`; place the projector (`Qwen3.5-mmproj-F16.gguf`) next to the model GGUF for automatic loading. The MoE variants (e.g. Qwen3.5-35B-A3B and Qwen3.6-35B-A3B GGUFs that report the same architecture keys) additionally enable a fused `MoEExpertsSwiGLUResidual` GGML kernel during decode that runs all selected experts, the optional shared expert, and the residual add in a single GPU graph dispatch.\n\nMistral 3 supports image inputs via the Pixtral vision encoder. 
Place the multimodal projector (`mistral3-mmproj.gguf`) in the same directory as the model file for automatic loading.\n\n- **Images:** PNG, JPEG, HEIC/HEIF\n\nThe Nemotron Omni distribution adds a RADIO / v2_vl ViT image encoder. Pass the matching multimodal projector with `--mmproj` (e.g. `nvidia_Nemotron-H-Omni-mmproj.gguf`); the language-model GGUF stays the same. Image tokens are inserted at `<image>` placeholders and expanded into `<img>` + N tile tokens + `</img>` automatically by the multimodal injector.\n\n- **Images:** PNG, JPEG, HEIC/HEIF\n- **Audio:** the chat template emits `<so_embedding>` per uploaded audio file and the CLI runs the Parakeet-style log-mel preprocessor for verification, but actual audio inference requires a Parakeet audio mmproj that the public GGUFs do not currently ship.\n\nTensorSharp is structured as a layered system:\n\n- **TensorSharp.Core** provides the core `Tensor` type, storage abstraction, and the extensible operation registry (`Ops`). CPU implementations use `System.Numerics.Vectors` for SIMD acceleration.\n- **TensorSharp.Runtime** owns runtime-facing contracts and services: GGUF parsing, tokenization (SentencePiece / BPE), chat template rendering, configurable token sampling, output parsing, paged KV cache (`Runtime/Paged/*`), the continuous-batching scheduler / engine (`Runtime/Scheduling/*`), the `IKvBlockCodec` interface plus the `TurboQuantKvCodec` Q4/Q8 implementation, and reusable contracts such as `IModelArchitecture`, `IBatchedPagedModel`, `IPromptRenderer`, `IOutputProtocolParser`, `IMultimodalInjector`, `IKVCachePolicy`, and `IBackendExecutionPlan`.\n- **TensorSharp.Models** implements `ModelBase` plus the concrete architectures and multimodal helpers (Gemma 3/4, Qwen 3/3.5, GPT OSS, Nemotron-H, Mistral 3). 
Each architecture ships both the legacy per-sequence forward and an `IBatchedPagedModel.ForwardBatch` implementation (`<Family>Model.BatchedForward.cs`) for continuous batching. Models are loaded via `ModelBase.Create()`, which auto-detects the architecture from GGUF metadata.\n- **TensorSharp.Backends.GGML** registers accelerated implementations of the same operations via a native C++ bridge (`libGgmlOps` / `GgmlOps.dll`) that links against [ggml](https://github.com/ggml-org/ggml). On macOS this provides Metal GPU compute, and on Windows/Linux it can expose GGML CUDA for NVIDIA GPUs. Operations include native quantized matmul (Q4_K_M, Q8_0, etc.) without dequantizing to FP32, plus paged-attention (`TSGgml_PagedAttentionForward`, with and without attention sinks) and architecture-specific batched kernels (Mamba2, GatedDeltaNet).\n- **TensorSharp.Backends.Cuda** is the direct CUDA path. It uses the CUDA Driver API for device/context/storage management, cuBLAS for float32 GEMM, PTX kernels for hot scalar and transformer helper ops, and CPU fallbacks where native kernels are not implemented yet.\n- **TensorSharp.Backends.MLX** is the Apple Silicon MLX path. It wraps [mlx-c](https://github.com/ml-explore/mlx-c) (`libmlxc`) with allocator, storage, async worker dispatch, quantized + fused + compiled kernels, MoE expert offload, and a CPU fallback layer for ops that aren't yet wired up.\n- **TensorSharp.Server** is the HTTP/application layer. It provides Ollama-compatible and OpenAI-compatible REST APIs, the browser-based chat UI, upload handling, an `InferenceEngineHost` that owns the per-model continuous-batching engine, and a thin queue-status surface for backward compatibility. 
\n- **TensorSharp.Cli** is the console/application layer for local prompts, multimodal experiments, prompt inspection, JSONL batch workflows, the interactive REPL, and the built-in prefill / decode benchmarks.\n\nThe list below is the cross-architecture summary; each per-model card under [docs/models/](/zhongkaifu/TensorSharp/blob/main/docs/models/README.md) walks through the same kernels in context, with the exact GGML graph dispatched and the conditions under which the fused path engages.\n\n- **Fused GPU decode** (Gemma 4): all transformer layers are executed in a single GGML compute graph dispatch on Metal, reducing CPU-GPU round-trips from hundreds per token to one. This achieves a ~2.6x speedup over per-operation dispatch.\n- **Fused GPU prefill** (Gemma 4): for dense (non-MoE, non-shared, non-PLE/multimodal) layers, `Gemma4LayerPrefill` runs the entire transformer block (RMSNorm + QKV + QK-norm + RoPE + attention + output projection + post-attn norm + GeGLU FFN + post-FFN norm + residual + layer scalar) as a single GGML graph dispatch per layer during prefill, extending the fused approach from decode to multi-token prefill.\n- **Chunked prefill** (Gemma 4): long prompts are split into bounded chunks (2x sliding window, max 2048 tokens) to avoid O(n^2) attention score tensors for SWA layers. Chunking is applied automatically to text-only prompts (no multimodal embeddings) and keeps each chunk within the SWA window budget.\n- **Native whole-model decode** (Qwen 3): all transformer layers run in one native call (`TransformerModelDecode`) with pre-resolved per-layer weight pointers cached at load time, removing managed-loop overhead from the decode hot path.\n- **Fused Qwen 3.5/3.6-family attention layer decode**: a single GGML graph performs RMSNorm + fused QKV + Q/gate deinterleave + per-head QK norm + RoPE + KV cache append + flash attention + sigmoid-gated mix + output projection + residual add for each FullAttention layer. 
Replaces ~2 standalone GGML calls and ~6 small CPU/GPU sync points per attention layer. Engages once the cached sequence length exceeds 4096 tokens (override with `FUSED_ATTN_LAYER_MIN_SEQ_LEN=N`).\n- **Fused prefill attention** (Qwen 3.5/3.6-family): `FusedPrefillAttention` combines QK^T, causal mask, softmax, and the multiplication by V into a single GGML graph dispatch during multi-token prefill, eliminating ~5 separate C#-to-GGML round-trips per attention layer. Handles both initial prefill and continuation with existing KV cache entries.\n- **Fused output-projection + FFN** (Qwen 3.5/3.6-family): for both FullAttention and GatedDeltaNet layers with dense FFN, `FusedOutProjFFN` merges the output projection, residual add, post-attention RMSNorm, and the full SwiGLU FFN (gate_up matmul + SiLU + down matmul + residual) into a single GGML graph dispatch, reducing two GPU round-trips to one per layer.\n- **Fused output-projection + norm + router** (Qwen 3.5/3.6-family MoE): `FusedOutProjNormRouter` merges the GatedDeltaNet output projection, residual add, post-attention RMSNorm, and MoE router projection into one dispatch. The pre-computed router logits are then consumed directly by the batched MoE kernel, eliminating a separate router dispatch per MoE layer.\n- **Fused vision encoder** (Qwen 3.5/3.6-family): `FusedVisionAttention` merges LayerNorm + QKV + bias + 2D RoPE + scaled dot-product attention + output projection + bias + residual into one GGML graph dispatch (~8 ops → 1). `FusedVisionMLP` merges LayerNorm + up + bias + GELU + down + bias + residual into one dispatch (7 ops → 1). Combined, these cut the per-block GPU round-trips from ~15 to 2.\n- **Fused weight projections**: Q/K/V projections are fused into a single QKV matmul; gate and up projections are fused into a single gate_up matmul.\n- **Native quantized compute**: quantized weights (Q4_K_M, Q6_K, Q8_0, IQ2_XXS, MXFP4, etc.) are used directly in matmul without expanding to FP32, saving memory and bandwidth. 
A batched `AddmmQuantBatch` kernel handles multiple sub-weight matmuls against a single quantized blob in one dispatch.\n- **Direct CUDA kernels**: the `cuda` backend accelerates fill/copy, unary ops, activation fusions, RMSNorm, softmax, index select, causal masking, RoPE/RoPEEx, cuBLAS GEMM, and supported quantized matmul/get-rows while safely falling back for incomplete op coverage.\n- **Batched GPU MoE**: `MoEExpertsSwiGLUResidual` (Qwen 3.5/3.6-family) and `MoEExpertsForward` (Nemotron-H) collapse all selected experts -- and, for Qwen 3.5/3.6-family, the optional shared expert and the residual add -- into a single GGML graph dispatch per MoE layer.\n- **GEMM-based vision patch embedding** (Qwen 3.5/3.6-family): the patch embedding step is reformulated as parallel im2col + matrix multiplication, replacing a single-threaded scalar quintuple-nested loop with a GPU-accelerated matmul.\n- **Parallelized Q/gate deinterleave** (Qwen 3.5/3.6-family): the Q + sigmoid-gate deinterleave in FullAttention prefill is parallelized across tokens, scaling linearly with CPU core count for long prompts.\n- **Optimized pure C# CPU path**: managed GEMM fast paths and contiguous float32 kernels accelerate decode, softmax, RMSNorm, RoPE, fused activations, and other hot paths while keeping quantized GGUF weights compressed during CPU loading.\n- **Circular KV cache**: sliding-window attention layers use a fixed-size circular buffer, bounding memory usage regardless of sequence length.\n- **KV-cache prefix reuse**: multi-turn conversations reuse the longest matching token prefix across turns. Truncation is automatically backed off by the sliding-window size for SWA models so the suffix can rebuild the SWA context.\n- **Paged KV cache & block-hash prefix sharing**: the continuous-batching engine partitions KV into fixed-size blocks, content-hashes each full block, and shares them across concurrent and sequential requests. 
Models that have not implemented `IBatchedPagedModel` still use the engine's isolated per-sequence KV-swap fallback.\n- **Native paged-attention kernel**: `TSGgml_PagedAttentionForward` (and the `WithSinks` variant for GPT OSS) does a C++ gather of K/V from the paged buffer, builds a small GGML graph per sequence, and dispatches `ggml_flash_attn_ext`, the same fused Metal/CUDA flash-attention kernel the legacy single-sequence path uses. On Ministral-3-14B long-context (4×~800 tokens) it is **~21% faster than the legacy per-sequence GGML path**.\n- **Batched / paged forward passes**: Mistral 3, Gemma 4, GPT OSS, Qwen 3.5/3.6 (incl. GatedDeltaNet recurrent state pool), and Nemotron-H (incl. Mamba2 recurrent state pool + native batched Mamba2 kernel) pack N sequences into a single `ForwardBatch` call with one batched linear-projection matmul per layer, paged K/V scatter via `slotMapping`, and per-sequence attention via the native kernel. The Gemma 4 batched path reaches **1.5×** legacy throughput at batch=8 short prompts and **1.6×** at 4×800-token prompts; Nemotron-H Mamba2 batched reaches **3.95×** at batch=3 on Apple M4 Pro. 
See [docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md](/zhongkaifu/TensorSharp/blob/main/docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md).\n- **Kernel warmup**: both CLI and Server run a tiny forward pass at startup to pre-compile GPU kernels (Metal pipeline states, CUDA JIT) and warm the memory pool, avoiding cold-start latency on the first real inference request.\n- **Prefill caching** (Gemma 4, Qwen 3.5/3.6-family): per-forward-pass SWA mask cache (Gemma 4), NeoX RoPE cos/sin lookup table cache across global layers (Gemma 4), and RoPE position tensor cache across layers (Gemma 4, Qwen 3.5/3.6-family) eliminate redundant recomputation during prefill.\n- **In-place QK RMSNorm** (Qwen 3.5/3.6-family): per-head QK normalization is performed in-place using a `View`, avoiding one tensor allocation and copy per Q/K per layer.\n\n- **Zero-copy file-mapped quantized weights** (direct CUDA, GGML CUDA, GGML Metal, GGML CPU): the GGUF model file is memory-mapped and quantized tensors are bound directly into native ops via host-pointer buffers. This removes the per-tensor copy from disk into a freshly-allocated native heap buffer that previously roughly doubled the resident set on Apple Silicon for large quantized models. For example, `Qwen3.5-35B-A3B-IQ2_XXS` (~10 GB GGUF) now runs with ~7 GB peak working memory under Metal instead of ~17 GB. The OS keeps the mapped file in its page cache and pages it out under memory pressure without any inference penalty on Apple Silicon (unified memory).\n- **Best-fit memory pool**: the GGML host allocator uses a best-fit search across pooled blocks instead of first-fit, which avoids handing out a large scratch block to satisfy a tiny intermediate-tensor request and keeps the working-set tightly bounded across long-running inference.\n- **Bounded pool retention**: the integrated-GPU / CPU memory pool now caps individual retained blocks at 64 MB and the total pool at 32 blocks. 
Combined with mmap-backed weights, this recycles short-lived intermediate tensors quickly while bounding the peak resident set.\n- **Memory-efficient model loading**: large tensors are streamed directly to native memory without intermediate managed allocations. F32 weights and norms still load on demand; quantized weights are mmap-backed when supported by the backend.\n- **Paged KV block pool with optional SSD spillover**: paged KV blocks live in a per-engine `BlockPool` with LRU eviction; the `PagedKvBlockStore` keeps a configurable RAM cap (`TS_KV_CACHE_MAX_RAM_MB`) and spills cold blocks into an SSD tier (`TS_KV_CACHE_SSD_DIR`) up to `TS_KV_CACHE_MAX_SSD_MB`. Block content-hashes are kept in a global index so prefix matches are reused across sessions and requests without rematerialising the K/V.\n- **KV block codecs**: blocks can be optionally compressed in-place with `TurboQuantKvCodec` (Q4 or Q8) via `--paged-kv-quant-bits`, trading a small accuracy cost for half or a quarter of the per-block bandwidth and memory footprint. 
Recurrent-state models fall back to passthrough automatically.\n\nReference numbers measured on `Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf` (~10 GB on disk, 256 routed experts of which 8 are active per token, with 12 full attention + 30 GatedDeltaNet recurrent layers) on an Apple M4 Pro with 24 GB unified memory:\n\n| Metric | Before (`v1` baseline) | After (this branch) | Change |\n|---|---|---|---|\n| Process peak memory footprint | ~17 GB | ~8 GB | -52% |\n| TensorSharp.Server resident set after load | ~20 GB | ~8 GB | -60% |\n| Decode throughput (warm, 256 prefill / 64 decode, M4 Pro) | ~3.8 tok/s | ~10.8 tok/s | +2.85x |\n| Decode latency (warm, 256 prefill / 64 decode, M4 Pro) | ~264 ms/token | ~92 ms/token | -65% |\n\nReproduce with:\n\n```\n./TensorSharp.Cli --model Qwen3.6-35B-A3B-UD-IQ2_XXS.gguf --backend ggml_metal \\\n    --benchmark --bench-prefill 256 --bench-decode 64 --bench-runs 3\n```\n\nThe memory reduction comes primarily from no longer copying the GGUF file into a separate native heap buffer (the file is now mmap-bound zero-copy into Metal command buffers). The decode throughput increase is largely a side effect of removing that ~10 GB duplicate working set, which was previously triggering OS-level memory pressure on machines with 24 GB or less of physical RAM.\n\nFor an apples-to-apples comparison of TensorSharp, llama.cpp, and Ollama on the same on-disk GGUF files (Gemma 4 E4B Q8_0 today, with text / synthetic prefill / image / audio / video tasks and KV-cache dtype sweeps for `f32`, `f16`, and `q8_0`), see [docs/inference_benchmark_matrix.md](/zhongkaifu/TensorSharp/blob/main/docs/inference_benchmark_matrix.md). 
The driver scripts are in `benchmarks/inference_matrix/scripts/` and the per-cell raw JSON outputs live under `benchmarks/inference_matrix/results/`.\n\n`InferenceWeb.Tests` exercises in-process behavior that doesn't require a running server: managed quantized ops, direct CUDA backend kernels when a CUDA device is available, MLX backend kernels when MLX is available, paged KV cache scheduling (`ContinuousBatchSchedulerTests`, `PagedKvCacheTests`, `PagedKvCacheCodecTests`), batched executor correctness (`BatchedExecutorTests`), per-model batched-forward correctness against the legacy path (`Qwen35BatchedCorrectnessTests`, `Mistral3BatchedForwardTests`, `Gemma4BatchedForwardTests`, `GptOssBatchedCorrectnessTests`, `NemotronBatchedCorrectnessTests`), per-model batched perf microbenchmarks (`*BatchedPerfBench.cs`), `TurboQuantKvCodec` codec round-trips, prefill chunking, KV cache policies, KV-cache prompt rendering / multi-turn integration, chat-session and session-manager isolation, model service history plumbing, request-logging middleware and file-logger provider, image preprocessing, media helpers, structured-output validation, text-upload helpers, model-service upload logging, web UI chat policy, model context length parsing, backend catalog resolution, and the server CLI options builder (`ServerOptionsBuilderTests`).\n\n```\ndotnet test InferenceWeb.Tests/InferenceWeb.Tests.csproj\n```\n\nIntegration tests for TensorSharp.Server are in `TensorSharp.Server/testdata/`. They cover all three API styles (Web UI SSE, Ollama, OpenAI), multi-turn conversations, thinking mode, tool calling, structured outputs, queue behavior, concurrent requests, and abort support. 
Architecture-specific features (thinking, tool calling) are auto-detected and skipped when the active model does not support them.\n\n```\n# Start TensorSharp.Server, then run:\npython3 TensorSharp.Server/testdata/test_multiturn.py\n# or\nbash TensorSharp.Server/testdata/test_multiturn.sh\n```\n\nSee [TensorSharp.Server/testdata/README.md](/zhongkaifu/TensorSharp/blob/main/TensorSharp.Server/testdata/README.md) for the full test matrix.\n\nZhongkai Fu\n\nSee [LICENSE](/zhongkaifu/TensorSharp/blob/main/LICENSE) for details.", "url": "https://wpnews.pro/news/tensorsharp-open-source-local-llm-inference-engine", "canonical_source": "https://github.com/zhongkaifu/TensorSharp", "published_at": "2026-06-04 00:29:10+00:00", "updated_at": "2026-06-04 00:46:11.455520+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "machine-learning"], "entities": ["TensorSharp", "GGUF", "Ollama", "OpenAI", "Gemma", "Qwen", "Mistral", "Nemotron"], "alternates": {"html": "https://wpnews.pro/news/tensorsharp-open-source-local-llm-inference-engine", "markdown": "https://wpnews.pro/news/tensorsharp-open-source-local-llm-inference-engine.md", "text": "https://wpnews.pro/news/tensorsharp-open-source-local-llm-inference-engine.txt", "jsonld": "https://wpnews.pro/news/tensorsharp-open-source-local-llm-inference-engine.jsonld"}}