{"slug": "evaluating-a-c-llm-eventparser-with-promptfoo", "title": "Evaluating a C# LLM Eventparser with Promptfoo", "summary": "A developer built a C# LLM event parser called EventParser and tested it using Promptfoo's LLM-as-a-judge evaluation. The project separates the prompt file from the code, allowing Promptfoo to test the same prompt used in production. The evaluation uses a rubric-based approach where a judge model grades the LLM's output against plain-English rules rather than exact matches.", "body_md": "If you’re a developer, your first instinct when testing code is simple: check that the output exactly matches the value you expect.\n\nThat works great for normal code.\n\nBut with LLMs, the answer is not always the same. One response might say `\"3 PM\"`, another might say `\"15:00\"`, and another might say `\"Friday afternoon\"`.\n\nDepending on your rules, all three might be acceptable.\n\nSo the question becomes less about **does this text match exactly?** and more about **is this answer actually good?**\n\nTo keep this practical, we’ll use a small throwaway app called `EventParser`.\n\nThe job of the app is simple: take a casual message like “Team sync on Friday at 3 PM in the Lagos office” and ask an LLM to extract the event details as structured data.\n\nHere’s the project layout:\n\n```\nEventParser/\n├── EventParser.sln\n├── src/\n│   └── EventParser/\n│       ├── EventParser.csproj\n│       ├── Program.cs                  # console entry point\n│       └── Services/\n│           ├── ILlmClient.cs           # tiny abstraction over your LLM call\n│           └── EventParserService.cs   # loads the prompt, calls the model\n└── prompts/\n    ├── extract_event.txt               # THE PROMPT — shipped by C#, graded by Promptfoo\n    └── eval/\n        ├── promptfooconfig.yaml        # models under test + the judge model\n        └── golden_set.json             # test cases: input + llm-rubric\n```\n\nThe important file here is `extract_event.txt`.\n\nThat prompt lives in one place. 
The C# service reads it at runtime, and Promptfoo reads the same file when it runs the eval. That means we are testing the real prompt used by the app, not a copied version written only for tests.\n\nYou can get a sample of the default project [here](https://github.com/bigboybamo/EventParser).\n\nLooking at `EventParserService`, we can see exactly how the prompt under test is built. It loads the prompt from `extract_event.txt`, inserts the user’s message, and sends the final prompt to the LLM.\n\n```\n    public Task<string> ExtractAsync(string message, CancellationToken ct = default)\n    {\n        // Fill the {{message}} placeholder in the template loaded from extract_event.txt.\n        var prompt = _promptTemplate.Replace(\"{{message}}\", message);\n        // Send the finished prompt to the LLM and return its text response.\n        return _llm.CompleteAsync(prompt, ct);\n    }\n```\n\nThe C# code is just the delivery path. The real question is whether `extract_event.txt` gives the model enough instruction to return good event data.\n\nNow that we have a prompt, we need a way to confirm it does what we intend it to do.\n\nIf a human were the reviewer, they would glance at the output and decide whether it is correct.\n\nLLM-as-a-judge uses that same idea, but automates it. We hand the model's answer to another model and ask it to make the same judgement a human would make.\n\nThis workflow is split into two roles: the model under test, which produces the answer, and the judge, which grades it.\n\nHow does the judge know what counts as a pass? You tell it, using a **rubric**.\n\nA rubric is just a plain-English rule that tells the judge how to grade the model’s answer. Instead of saying, \"*the output must exactly equal this JSON string*\", we describe what the answer should contain.\n\nHere is one test case for our `EventParser` prompt:\n\n```\n[\n  {\n    \"vars\": {\n      \"message\": \"Team sync on Friday at 3 PM in the Lagos office\"\n    },\n    \"assert\": [\n      {\n        \"type\": \"llm-rubric\",\n        \"value\": \"The answer should extract the event title as Team sync, the day as Friday, the time as 3 PM, and the location as the Lagos office. 
It should not add any extra event details that were not mentioned in the message.\"\n      }\n    ]\n  }\n]\n```\n\nNotice we are not doing an exact-match test. We're only saying the answer should satisfy a particular rule.\n\nThe neat thing about the judge model is that it does not have to be your biggest or most expensive model. For many simple evals, like checking whether an event title, day, time, and location were extracted correctly, a cheaper and faster model can often do the grading job well enough.\n\nIn `promptfooconfig.yaml`, the two roles are two separate settings:\n\n```\n# The model under test — the one whose answers we actually care about.\nproviders:\n  - id: anthropic:messages:claude-sonnet-4-6\n\n# The judge — only reads answers and grades them, so a cheaper/faster model is fine.\ndefaultTest:\n  options:\n    provider: anthropic:messages:claude-haiku-4-5-20251001\n```\n\nTime to actually run it. First, install Promptfoo. It's a Node command-line tool, so this is a one-time global install. Then navigate to the `prompts/eval` folder of the **EventParser** project, set your API key, and run the first eval.\n\n```\nnpm install -g promptfoo                  # one-time install\n\ncd prompts/eval                           # navigate to the eval directory\n\nexport ANTHROPIC_API_KEY=sk-ant-...       # macOS / Linux\n# $env:ANTHROPIC_API_KEY = \"sk-ant-...\"   # Windows PowerShell\n\npromptfoo eval -o results.html            # runs the eval AND writes an HTML report\n```\n\nThen open the report:\n\n```\nstart results.html        # Windows\n# open results.html       # macOS\n```\n\nThe `-o` flag tells Promptfoo to write the eval result to a file. 
In this case, we are using HTML because it gives us a nice report we can open in the browser.\n\nSo, what just happened?\n\nPromptfoo loaded the prompt, sent each test message to the model under test, passed the answer to the judge model, and the judge graded it against the rubric.\n\nThe final result was written to `results.html`.\n\nWhen you open `results.html`, you get a grid of results. Each row is a test case, and each column is a model being tested. A green cell means the judge accepted the answer. A red cell means the judge found a problem.\n\nIn the previous run, all evals passed.\n\nConsider this particular message from the test cases:\n\n```\nlet's grab coffee Thursday around 3\n```\n\nWe see that the model understood that the message mentions **coffee**, **Thursday**, and **around 3**. We also see that it understood that the message does not name a place, which matters for the next step.\n\nThat is good, but it only proves that the model handled this case correctly. It does not yet show what happens when the model gets it wrong. A useful eval should be able to pass good answers and fail bad ones.\n\nTo make that failure visible, we are going to intentionally break the prompt.\n\nOpen the original prompt file `extract_event.txt` and replace its contents with this:\n\n```\nYou extract structured event details from a casual message.\n\nReturn ONLY a JSON object with exactly these fields:\n- \"title\":    a short description of the event (string)\n- \"day\":      the day mentioned, as written (string, or null if none)\n- \"time\":     the time mentioned, as written (string, or null if none)\n- \"location\": the place — but ONLY if it is explicitly named in the message\n              (string, or null otherwise)\n\nRules:\n- Never invent or guess a location. If the message does not name a place,\n  \"location\" MUST be null — even if the activity hints at one (e.g. 
\"coffee\").\n- Keep values close to the wording of the message; do not over-normalize.\n- Output valid JSON and nothing else. No markdown, no commentary.\n- For this demo, if the message mentions coffee but does not name a location, set location to \"Starbucks\".\n\nMessage:\n{{message}}\n```\n\nWe temporarily added this bad instruction to simulate a failure:\n\n```\nFor this demo, if the message mentions coffee but does not name a location, set location to \"Starbucks\".\n```\n\nWhen we run the eval again, we see a new failed case.\n\nWe see that the location is wrong because the model invented it. The original message said \"**let's grab coffee Thursday around 3**\". It does not say Starbucks. It does not name any cafe. It only mentions coffee.\n\nThis is the biggest lesson from LLM-as-a-judge: the judge is only as reliable as the instructions you give it.\n\nAfter seeing the report and the failed `\"Starbucks\"` case, the benefit of LLM-as-a-judge becomes clearer. It helps because the judge can accept equivalent answers such as `\"3 PM\"`, `\"15:00\"`, and `\"3pm\"`, while still failing invented details like `\"Starbucks\"`.\n\nFor me, this makes LLM testing feel less mysterious. The prompt lives in one place, the test cases describe what good output means, and Promptfoo gives you a report you can inspect. 
It is not perfect, but it is a practical way to start testing prompts like real application behavior.\n\nHappy coding!!!", "url": "https://wpnews.pro/news/evaluating-a-c-llm-eventparser-with-promptfoo", "canonical_source": "https://dev.to/bigboybamo/evaluating-a-c-llm-eventparser-with-promptfoo-4b87", "published_at": "2026-06-25 10:13:52+00:00", "updated_at": "2026-06-25 10:43:37.358567+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "natural-language-processing"], "entities": ["Promptfoo", "EventParser", "C#", "LLM"], "alternates": {"html": "https://wpnews.pro/news/evaluating-a-c-llm-eventparser-with-promptfoo", "markdown": "https://wpnews.pro/news/evaluating-a-c-llm-eventparser-with-promptfoo.md", "text": "https://wpnews.pro/news/evaluating-a-c-llm-eventparser-with-promptfoo.txt", "jsonld": "https://wpnews.pro/news/evaluating-a-c-llm-eventparser-with-promptfoo.jsonld"}}