Evaluating a C# LLM Event Parser with Promptfoo

A developer built a C# LLM event parser called EventParser and tested it using Promptfoo's LLM-as-a-judge evaluation. The project separates the prompt file from the code, allowing Promptfoo to test the same prompt used in production. The evaluation uses a rubric-based approach where a judge model grades the LLM's output against plain-English rules rather than exact matches.

If you're a developer, your first instinct when testing code is simple: check that the output exactly matches an expected value. That works great for normal code. But with LLMs, the answer is not always the same. One response might say "3 PM", another might say "15:00", and another might say "Friday afternoon". Depending on your rules, all three might be acceptable. So the question becomes less about "does this text match exactly?" and more about "is this answer actually good?"

To keep this practical, we'll use a small throwaway app called EventParser. The job of the app is simple: take a casual message like "Team sync on Friday at 3 PM in the Lagos office" and ask an LLM to extract the event details as structured data.

Here's the project layout:

    EventParser/
    ├── EventParser.sln
    ├── src/
    │   └── EventParser/
    │       ├── EventParser.csproj
    │       ├── Program.cs                  # console entry point
    │       └── Services/
    │           ├── ILlmClient.cs           # tiny abstraction over your LLM call
    │           └── EventParserService.cs   # loads the prompt, calls the model
    └── prompts/
        ├── extract_event.txt               # THE PROMPT: shipped by C#, graded by Promptfoo
        └── eval/
            ├── promptfooconfig.yaml        # models under test + the judge model
            └── golden_set.json             # test cases: input + llm-rubric

The important file here is extract_event.txt. That prompt lives in one place. The C# service reads it at runtime, and Promptfoo reads the same file when it runs the eval. That means we are testing the real prompt used by the app, not a copied version written only for tests.

You can get a sample of the default project here: https://github.com/bigboybamo/EventParser

Looking at EventParserService, we can see how the prompt we're testing reaches the model. It loads the prompt from extract_event.txt, inserts the user's message, and sends the final prompt to the LLM.

    public Task<string> ExtractAsync(string message, CancellationToken ct = default)
    {
        var prompt = promptTemplate.Replace("{{message}}", message);
        return llm.CompleteAsync(prompt, ct);
    }

The C# code is just the delivery path.
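For context, here is a minimal sketch of how the pieces around that method could fit together. It is not the repository's actual code: the constructor, the `promptPath` parameter, and the `EchoLlmClient` stand-in are assumptions made for illustration, and the only detail taken from the snippet above is the `ExtractAsync` body and the `CompleteAsync` call. The fake client simply returns the prompt it receives, so the sketch runs without an API key.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Demo entry point: write a throwaway prompt file, then run the service against a fake client.
// The file name and template text here are illustrative, not the project's real prompt.
var promptPath = Path.Combine(Path.GetTempPath(), "extract_event_demo.txt");
await File.WriteAllTextAsync(promptPath, "Extract the event from this message: {{message}}");

var service = new EventParserService(new EchoLlmClient(), promptPath);
Console.WriteLine(await service.ExtractAsync("Team sync on Friday at 3 PM in the Lagos office"));

// Assumed shape of the abstraction; only CompleteAsync appears in the article's snippet.
public interface ILlmClient
{
    Task<string> CompleteAsync(string prompt, CancellationToken ct = default);
}

// Hypothetical stand-in for a real model client: echoes the prompt back unchanged.
public sealed class EchoLlmClient : ILlmClient
{
    public Task<string> CompleteAsync(string prompt, CancellationToken ct = default)
        => Task.FromResult(prompt);
}

public sealed class EventParserService
{
    private readonly ILlmClient llm;
    private readonly string promptTemplate;

    // Assumption: the template is read from disk once, when the service is constructed.
    public EventParserService(ILlmClient llm, string promptPath)
    {
        this.llm = llm;
        this.promptTemplate = File.ReadAllText(promptPath);
    }

    // Same body as the article's snippet: fill the placeholder, forward to the client.
    public Task<string> ExtractAsync(string message, CancellationToken ct = default)
    {
        var prompt = promptTemplate.Replace("{{message}}", message);
        return llm.CompleteAsync(prompt, ct);
    }
}
```

Because the fake client echoes its input, running this prints the filled-in template, which is a quick way to confirm that the `{{message}}` placeholder is being replaced before any real model is involved.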
The real question is whether extract_event.txt gives the model enough instruction to return good event data.

Now that we have a prompt, we need a way to confirm it does what we intend. If a human were the reviewer, they would read the output and quickly decide whether it is correct. LLM-as-a-judge uses that same idea, but automates it: we hand the model's answer to another model and ask it to make the same judgement a human would. The workflow is split into two roles: the model under test, which produces the answer, and the judge, which grades it.

How does the judge know what counts as a pass? You tell it, using a rubric. A rubric is just a plain-English rule that tells the judge how to grade the model's answer. Instead of saying "the output must exactly equal this JSON string", we describe what the answer should contain.

Here is one test case for our EventParser prompt:

    {
      "vars": {
        "message": "Team sync on Friday at 3 PM in the Lagos office"
      },
      "assert": [
        {
          "type": "llm-rubric",
          "value": "The answer should extract the event title as Team sync, the day as Friday, the time as 3 PM, and the location as the Lagos office. It should not add any extra event details that were not mentioned in the message."
        }
      ]
    }

Notice we are not doing an exact-match test. We're only saying the answer should satisfy a particular rule.

A nice property of the judge model is that it does not have to be your biggest or most expensive model. For many simple evals, like checking whether an event title, day, time, and location were extracted correctly, a cheaper and faster model can often do the grading job well enough.

In promptfooconfig.yaml, the two roles are two separate settings:

    # The model under test: the one whose answers we actually care about.
    providers:
      - id: anthropic:messages:claude-sonnet-4-6

    # The judge: only reads answers and grades them, so a cheaper/faster model is fine.
    defaultTest:
      options:
        provider: anthropic:messages:claude-haiku-4-5-20251001

Time to actually run it. First, install Promptfoo.
It's a Node command-line tool, so this is a one-time global install. Then navigate to the prompts/eval folder of the EventParser project, set your API key, and run the first eval.

    npm install -g promptfoo                 # one-time install
    cd prompts/eval                          # navigate to the eval directory
    export ANTHROPIC_API_KEY=sk-ant-...      # macOS / Linux
    $env:ANTHROPIC_API_KEY = "sk-ant-..."    # Windows PowerShell
    promptfoo eval -o results.html           # runs the eval AND writes an HTML report

Then open the report:

    start results.html    # Windows
    open results.html     # macOS

The -o flag tells Promptfoo to write the eval result to a file. In this case, we are using HTML because it gives us a nice report we can open in the browser.

So, what just happened? Promptfoo loaded the prompt, sent each test message to the model under test, passed the answer to the judge model, and the judge graded it against the rubric. The final result was written to results.html.

When you open results.html, you get a grid of results. Each row is a test case, and each column is a model being tested. A green cell means the judge accepted the answer. A red cell means the judge found a problem.

In this first run, all evals passed. Consider this particular test message:

    let's grab coffee Thursday around 3

The model understood that the message mentions coffee, Thursday, and around 3. It also understood that the message does not name a place, which matters for the next step.

That is good, but it only proves that the model handled this case correctly. It does not yet show what happens when the model gets it wrong. A useful eval should be able to pass good answers and fail bad ones.

To make that failure visible, we are going to intentionally break the prompt. Open the original prompt file extract_event.txt and replace its contents with this:

    You extract structured event details from a casual message.
    Return ONLY a JSON object with exactly these fields:
    - "title": a short description of the event (string)
    - "day": the day mentioned, as written (string, or null if none)
    - "time": the time mentioned, as written (string, or null if none)
    - "location": the place, but ONLY if it is explicitly named in the message (string, or null otherwise)

    Rules:
    - Never invent or guess a location. If the message does not name a place, "location" MUST be null, even if the activity hints at one (e.g. "coffee").
    - Keep values close to the wording of the message; do not over-normalize.
    - Output valid JSON and nothing else. No markdown, no commentary.
    - For this demo, if the message mentions coffee but does not name a location, set location to "Starbucks".

    Message: {{message}}

We temporarily added this bad instruction to simulate a wrong case:

    For this demo, if the message mentions coffee but does not name a location, set location to "Starbucks".

When we run the eval again, we see a new failed case. The location is wrong because the model invented it. The original message said "let's grab coffee Thursday around 3". It does not say Starbucks. It does not name any cafe. It only mentions coffee.

This is the biggest lesson from LLM-as-a-judge: the judge is only as reliable as the instructions you give it.

After seeing the report and the failed "Starbucks" case, the benefit of LLM-as-a-judge becomes clearer. For one, it lets a test accept equivalent answers such as "3 PM", "15:00", and "3pm".

For me, this makes LLM testing feel less mysterious. The prompt lives in one place, the test cases describe what good output means, and Promptfoo gives you a report you can inspect. It is not perfect, but it is a practical way to start testing prompts like real application behavior.

Happy coding