TL;DR: Every .NET RAG project quietly ships a Python sidecar to do one job: chunk documents. I got rid of mine. DocNest .NET is an idiomatic C# / .NET 8 port of my DocNest engine: embeddings run locally (ONNX MiniLM, no key, offline), the LLM is optional (factual questions answered at zero tokens), and the `.udf` knowledge base it writes is byte-compatible with the Python version. Ingest in Python, query in C#. It's on NuGet today. [Repo] · [NuGet]
You're building on .NET. The product needs to answer questions over a pile of PDFs, contracts, spreadsheets: real retrieval-augmented generation. So you go looking for tooling, and you find the same thing I did:
It's all Python.
LangChain, LlamaIndex, every RAG tutorial worth reading: Python, Python, Python. So you do the thing nobody admits to in the architecture review: you stand up a little Python service on the side. A second runtime to containerize, deploy, version, monitor, and wake up to at 3 a.m. when it OOMs. All so it can split a document into chunks and hand them back to your actual app.
A whole extra language in production to chop up a PDF. I stared at that diagram one too many times and decided it had to go.
So I ported DocNest to C#. Not a wrapper shelling out to `python.exe`, but a real, idiomatic .NET port: `async`/`await` end to end, every dependency behind an interface, shipped as proper NuGet packages. Nothing Python left in the runtime.
But to explain why DocNest is worth porting, I have to tell you about the bug that started the whole thing.
A RAG app I'd built gave a client a confidently wrong number. Not "I don't know", but a clean, specific, wrong answer, delivered with total confidence. I spent three days assuming my retrieval ranking was off, tuning embeddings and `k` values and similarity thresholds.
The ranking was fine. The problem happened before any of that, at ingestion. Here's how almost every pipeline reads a document:

```
PDF → extract text → split every 512 chars → embed → store → hope
```

Watch what that does to a revenue table:

```
chunk_1: "45.2% Q3 Europe 38.1% Q2 Europe 41.7% Q3"
chunk_2: "Asia 29.3% Q2 Asia Americas 52.1% Q3 Ame"
```
The headers are gone. The rows are shredded across a chunk boundary. The model receives a bag of loose numbers with no idea which is revenue, which is a quarter, which region they belong to, and fills the gap with a confident guess. That's not a model problem or a retrieval problem. It's an ingestion problem. You destroyed the meaning before the model ever saw the data.
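To make the failure concrete, here is a minimal C# sketch of fixed-width chunking. The sample string and the 20-character width are made up for illustration (the pipeline above cuts at 512), and `ChunkFixed` is a hypothetical helper, not part of DocNest.

```csharp
using System;
using System.Collections.Generic;

// Naive fixed-width chunking: cut the extracted text every `size` characters,
// with no regard for rows, columns or headings.
static List<string> ChunkFixed(string text, int size)
{
    var chunks = new List<string>();
    for (var i = 0; i < text.Length; i += size)
        chunks.Add(text.Substring(i, Math.Min(size, text.Length - i)));
    return chunks;
}

// Hypothetical flattened table text, roughly what a PDF text extractor emits.
const string flattened = "Region Q2 Q3 Europe 38.1% 45.2% Asia 29.3% 41.7%";

foreach (var chunk in ChunkFixed(flattened, 20))
    Console.WriteLine($"[{chunk}]");
```

With this input the column headers land in the first chunk, the Europe figures in the second, and `29.3%` is cut in half across the second and third. Nothing downstream can tell which number belongs to which quarter.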
A person doesn't read a report as one long character stream. They see headings, sections, a table with columns. DocNest does the same: it reads the document's structure first. Every heading becomes a navigable `§section`. Every table is preserved as structured data, never flattened:

```json
{
  "section": "§4.2 Revenue by Region",
  "table": {
    "headers": ["Region", "Q2", "Q3", "Change"],
    "rows": [
      ["Europe", "38.1%", "45.2%", "+7.1pp"],
      ["Asia", "29.3%", "41.7%", "+12.4pp"]
    ]
  }
}
```
Same numbers, same model, same question, but now the answer is right, and it comes with a citation. The document is normalised once into a portable `.udf` file: a self-contained ZIP holding the section index, key numbers, keywords, section text, and quantised embeddings. Parse once, query forever.
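Because the container is a plain ZIP, any standard ZIP reader can look inside it. The sketch below uses only `System.IO.Compression` from the .NET base library and assumes nothing about the entry names or layout. `ListEntries` is a hypothetical helper, and `report.udf` is just an example path.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Lists every entry in a ZIP-based container with its uncompressed size.
// Nothing here is DocNest-specific; the file is treated purely as a ZIP.
static IReadOnlyList<(string Name, long Bytes)> ListEntries(string path)
{
    using var archive = ZipFile.OpenRead(path);
    return archive.Entries
        .Select(e => (Name: e.FullName, Bytes: e.Length))
        .ToList();
}

// Example path; skipped quietly when the file is not there.
if (File.Exists("report.udf"))
{
    foreach (var (name, bytes) in ListEntries("report.udf"))
        Console.WriteLine($"{bytes,10:N0}  {name}");
}
```

This is handy for a quick sanity check after ingestion, for example to confirm that a file written by one runtime opens cleanly in the other.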
Here's the part I'm proud of. The `.udf` format is an open spec, and the .NET writer produces files that are byte-compatible with the Python engine. That one constraint unlocks something genuinely useful: you can ingest in Python and query in C#.

One ingestion ecosystem, two languages, the same artifact moving between them. Nothing in the codebase is allowed to break that cross-ecosystem contract. It's the whole point.
When I describe this, two questions come back every time. They're actually two independent choices:
1. Embeddings run locally. A small ONNX MiniLM model (~90 MB) downloads once and caches. No API key, fully offline. There's an optional ONNX cross-encoder reranker for dense PDFs.
2. The LLM is optional. Answer Layers 0–1 resolve factual questions deterministically: zero tokens, no key. You only bring an LLM for synthesis, and when you do, "OpenAI" means the answer model, not embeddings. The two never get coupled.
```shell
dotnet add package DocNest.Core
dotnet add package DocNest.Parsers
dotnet add package DocNest.Retrieval
dotnet add package DocNest.Query
```
```csharp
using DocNest;
using DocNest.Parsers;
using DocNest.Pipeline;
using DocNest.Query;
using DocNest.Retrieval;
using DocNest.Udf;

// Parse → normalise → write a portable .udf
var raw = await new ParserFactory().Get("report.pdf").ParseAsync("report.pdf");
var doc = new DocNestPipeline().Process(raw);
await new UdfWriter().WriteAsync(doc, "report.udf");

// Load it back and ask: deterministic layers, no LLM
var document = (await UdfReader.LoadAsync("report.udf")).ToDocument();
using var retriever = new HybridRetriever(".docnest_cache");
var engine = new DocNestQueryEngine(retriever); // no LLM: Layers 0–1 only
var result = await engine.AnswerAsync(document, "What was Q3 revenue?", allowLlm: false);

Console.WriteLine(result.Answer);     // "Q3 revenue: $38M (source: §3.1)"
Console.WriteLine(result.TokensUsed); // 0
```
Prefer the terminal?
```shell
dotnet tool install -g DocNest.Cli
docnest convert report.pdf -o report.udf
docnest query report.udf "What was Q3 revenue?"
```
`OpenAiCompatibleLlmProvider` talks to OpenAI, Groq, Cerebras, Together, OpenRouter and local servers (Ollama, LM Studio). To switch, change the base URL and model. Anthropic has its own provider.
```csharp
// Any OpenAI-compatible endpoint works: swap the key, model and base URL.
ILlmProvider llm = new OpenAiCompatibleLlmProvider(
    apiKey: Environment.GetEnvironmentVariable("GROQ_API_KEY")!,
    model: "llama-3.3-70b-versatile",
    baseUrl: "https://api.groq.com/openai/v1");

var engine = new DocNestQueryEngine(retriever, llm);
var result = await engine.AnswerAsync(document, "Summarise the key risks.", allowLlm: true);

Console.WriteLine(string.Join(", ", result.Citations)); // e.g. §5.2, §5.3
```
```
file  → IParser → DocNestPipeline (normalise · key-numbers · keywords) → Document → .udf
query → HybridRetriever (BM25 + dense + cross-encoder rerank + RRF + 1-hop graph) → top-k
      → DocNestQueryEngine (5 layers) → answer + citations + tokens + confidence
```
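The "RRF" step in the retriever line is reciprocal rank fusion, a standard way to merge ranked lists whose scores are not comparable (BM25 scores and cosine similarities, for instance). The sketch below shows the general technique only. It is not taken from DocNest's source: `Fuse`, the section ids and the constant `k = 60` (a conventional default) are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reciprocal rank fusion: every ranked list contributes 1 / (k + rank) to each
// item it contains, with rank starting at 1. Scores are summed across lists.
static List<(string Id, double Score)> Fuse(IEnumerable<string[]> rankings, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var ranking in rankings)
        for (var i = 0; i < ranking.Length; i++)
            scores[ranking[i]] = scores.GetValueOrDefault(ranking[i]) + 1.0 / (k + i + 1);

    return scores
        .OrderByDescending(p => p.Value)
        .Select(p => (Id: p.Key, Score: p.Value))
        .ToList();
}

// Hypothetical section ids returned by a lexical ranker and a dense ranker.
var bm25 = new[] { "§4.2", "§3.1", "§5.2" };
var dense = new[] { "§3.1", "§5.2", "§4.2" };

foreach (var (id, score) in Fuse(new[] { bm25, dense }))
    Console.WriteLine($"{id}  {score:F4}");
```

Here `§3.1` comes out on top: it is never first in both lists, but it ranks high in each, which is exactly the behaviour that makes rank fusion robust to one retriever being wrong.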
| Layer | Mechanism | Tokens |
|---|---|---|
| 0 | Pre-computed key-numbers / summary | 0 |
| 1 | Extractive from the top section | 0 |
| 2 | Single-section LLM | ~300 |
| 3 | Multi-section synthesis (reranked context) | ~900 |
| 4 | Broad fallback over retrieved sections | ~1,500 |
The engine climbs this ladder only when a cheaper rung isn't confident. Layers 0–1 handle a surprising share of real factual questions at zero cost, so you pay tokens only for genuine synthesis.
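The escalation logic can be sketched in a few lines. This is an illustration of the idea, not DocNest's implementation: the `Climb` helper, the tuple shape, the stub rungs and the 0.7 confidence threshold are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Try rungs from cheapest to most expensive and stop at the first one whose
// confidence clears the threshold. The last rung is accepted unconditionally.
static (int Layer, string Answer, int Tokens) Climb(
    IReadOnlyList<Func<string, (string Answer, double Confidence, int Tokens)>> rungs,
    string question,
    double threshold)
{
    if (rungs.Count == 0)
        throw new ArgumentException("At least one rung is required.", nameof(rungs));

    var spent = 0;
    for (var layer = 0; ; layer++)
    {
        var (answer, confidence, tokens) = rungs[layer](question);
        spent += tokens; // tokens burned on abandoned rungs still count
        if (confidence >= threshold || layer == rungs.Count - 1)
            return (layer, answer, spent);
    }
}

// Stub rungs standing in for real layers (all values are made up).
var ladder = new List<Func<string, (string Answer, double Confidence, int Tokens)>>
{
    _ => ("", 0.0, 0),                       // layer 0: no pre-computed number matched
    _ => ("stub extractive answer", 0.9, 0), // layer 1: confident, costs nothing
    _ => ("stub LLM answer", 0.95, 300),     // layer 2: not reached for this question
};

var (reachedLayer, finalAnswer, totalTokens) = Climb(ladder, "What was Q3 revenue?", threshold: 0.7);
Console.WriteLine($"layer {reachedLayer}, {totalTokens} tokens: {finalAnswer}");
```

With these stubs the run stops at layer 1 and reports zero tokens, because the LLM rung is never invoked. The threshold is the knob that trades cost against answer quality.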
A multi-format eval: 10 documents, 88 questions, 5 formats (the same set as the Python reference), dense + cross-encoder rerank, `gpt-oss-120b` narrator, `qwen2.5` judge:

| Format | Score | Hit-rate (≥7) |
|---|---|---|
| XLSX | 8.7 / 10 | 93% |
| MD | 8.7 / 10 | 100% |
| DOCX | 7.0 / 10 | 79% |
| HTML | 4.8 / 10 | 50% |
| PDF | 6.8 / 10 | 70% |
| Overall | ~7.1 / 10 | ~78% |
The Python reference sits at 8.5/10. This .NET port is at 7.1 and closing the gap slice by slice: the cross-encoder reranker alone dragged PDFs from 5.1 to 6.8 (hit-rate 47% to 70%). HTML is clearly my weakest format right now, and it's the next thing I'm fixing.
I could have cherry-picked a kinder run and quoted a bigger number. I'd rather ship the reproducible one with the eval harness sitting right next to it in the repo. If you don't trust a benchmark you can't re-run, neither do I.
| Package | Role |
|---|---|
| `DocNest.Abstractions` | Domain records + wrapper interfaces |
| `DocNest.Core` | Pipeline, normaliser, `.udf` reader/writer, quantizer |
| `DocNest.Parsers` | md / html / csv / docx / xlsx / pdf |
| `DocNest.Embeddings` | ONNX MiniLM embedder + ms-marco cross-encoder reranker |
| `DocNest.Retrieval` | Hybrid retriever (FTS5 BM25 + dense + rerank + RRF + graph) |
| `DocNest.Query` | 5-layer answer engine + LLM providers |
| `DocNest.Storage` | `.udf` ZIP storage backend |
| `DocNest.Cli` | `docnest` dotnet tool |
Parsers cover PDF (PdfPig), DOCX/XLSX (OpenXML), HTML (AngleSharp), CSV/TSV and Markdown. Every external dependency lives behind a DocNest interface, so swapping any of them is a one-line change.
This is pre-1.0, built slice-by-slice under a gated protocol: understand → plan → design + ADR → tests-first → full suite green → sign-off, per phase. The core pipeline, hybrid retrieval, cross-encoder reranking and the 5-layer engine are implemented and tested. Cloud embedding providers (OpenAI embeddings and friends) exist in the Python engine but aren't ported yet, so embeddings here are local-only by design.
```shell
dotnet add package DocNest.Core
dotnet tool install -g DocNest.Cli
pip install docnest-ai
```

If you've ever stood up a Python sidecar just to chunk a PDF for a .NET app, I'd genuinely like to know whether this kills that step for you. Tell me in the comments. And if it does, a star on the repo helps other .NET folks find it.