# TensorSharp: Open-Source Local LLM Inference Engine

TensorSharp, a new open-source C# inference engine, enables developers to run large language models locally using GGUF files. The engine supports multiple model architectures, including Gemma 4, Qwen 3, and Mistral 3, and offers CPU, CUDA, and MLX backends with features such as continuous batching and multimodal inference.

A C# inference engine for running large language models (LLMs) locally using GGUF model files. TensorSharp provides a console application, a web-based chatbot interface, and Ollama/OpenAI-compatible HTTP APIs for programmatic access.

| Start here | Use this when you want to... |
|---|---|
| [Supported model architectures](#supported-model-architectures) | |
| [Compute backends](#compute-backends) | |
| [HTTP APIs](#http-apis) | |
| [Per-model architecture cards](/zhongkaifu/TensorSharp/blob/main/docs/models/README.md) | |
| [Paged attention & continuous batching](</zhongkaifu/TensorSharp/blob/main/docs/PAGED ATTENTION AND CONTINUOUS BATCHING.md>) | |
| [Inference benchmark matrix](</zhongkaifu/TensorSharp/blob/main/docs/inference benchmark matrix.md>) | |
| [Server API examples](</zhongkaifu/TensorSharp/blob/main/TensorSharp.Server/API EXAMPLES.md>) | |
| [Server integration tests](/zhongkaifu/TensorSharp/blob/main/TensorSharp.Server/testdata/README.md) | |

| Area | Status |
|---|---|
| Model families | Gemma 3/4, Qwen 3, Qwen 3.5/3.6-family GGUFs (`qwen35`, `qwen35moe`, `qwen3next`), GPT OSS, Nemotron-H (incl. Nemotron 3 Nano Omni), and Mistral 3 |
| Inference hosts | CLI, interactive REPL, ASP.NET Core web UI, Ollama-style API, OpenAI Chat Completions-style API |
| Backends | Pure C# CPU, direct CUDA/cuBLAS (`cuda`), MLX Metal (`mlx`), GGML CPU, GGML Metal, GGML CUDA |
| Multimodal | Gemma 4 image/video/audio; Gemma 3, Qwen 3.5-family, Mistral 3, and Nemotron-H Omni image input |
| Continuous batching | vLLM-style paged KV cache, block-hash prefix sharing across requests, iteration-level scheduler enabled by default; opt-out via `--no-continuous-batching` |
| Server model scope | One explicitly hosted GGUF via `--model`; optional explicit projector via `--mmproj`; no directory scanning |
| Observability | Structured per-turn logs, queue status, and KV-cache reuse metrics across Web UI, Ollama, and OpenAI response shapes |

- Multi-architecture support -- Gemma 4, Gemma 3, Qwen 3, Qwen 3.5/3.6-family, GPT OSS, Nemotron-H, Mistral 3
- Multimodal inference -- image, video, and audio inputs (Gemma 4); images for Gemma 3 / Qwen 3.5-family / Mistral 3 / Nemotron-H Omni
- Thinking / reasoning mode -- structured chain-of-thought output with