Nexa-gauge – LLM evaluation framework with per-node scoring controls

wpnews.pro


Source: harnexa.dev · Published: 2026-05-30T19:50Z · Topic: large-language-models


Nexa-gauge, a graph-based evaluation framework for LLM and LVLM applications, has been released to replace ad-hoc manual checks with a repeatable pipeline that supports per-node scoring controls and deterministic caching. The system normalizes raw records into typed evaluation states, executes only required nodes for selected targets, and produces consistent per-case reports for prompt iteration, benchmark runs, and release gating. The framework combines LLM-as-a-judge semantic evaluation with targeted metrics including relevance, grounding, redteam, geval, and reference scoring.


Overview

nexa-gauge is a graph-based evaluation system for LLM and LVLM application outputs. It replaces ad-hoc manual checks with a repeatable pipeline that runs on local or hosted datasets.

At a high level, nexa-gauge:

Normalizes raw records into a typed evaluation state.
Executes only the nodes required for the selected target.
Reuses prior node outputs through deterministic caching.
Produces a consistent per-case report for downstream tooling.
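As a rough illustration of the first step, the sketch below shows one way raw records with inconsistent field names might be normalized into a typed per-case state. The `CaseState` class, the `FIELD_ALIASES` table, and the `normalize` function are hypothetical names chosen for this example, not nexa-gauge's API; the aliases the framework actually accepts are listed in its Data Schema.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Sequence

# Hypothetical alias table, for illustration only.
FIELD_ALIASES = {
    "input": ("input", "question", "prompt"),
    "output": ("output", "answer", "generation"),
    "context": ("context", "contexts"),
    "reference": ("reference", "expected"),
}


@dataclass(frozen=True)
class CaseState:
    """Typed per-case state (illustrative, not the framework's own class)."""
    case_id: str
    input: str
    output: str
    context: tuple = ()
    reference: Optional[str] = None


def _first_present(record: Mapping[str, Any], names: Sequence[str]) -> Any:
    """Return the first non-empty value found under any of the alias names."""
    for name in names:
        value = record.get(name)
        if value not in (None, ""):
            return value
    return None


def normalize(record: Mapping[str, Any], index: int) -> CaseState:
    """Map a raw record onto CaseState, failing loudly on missing essentials."""
    input_text = _first_present(record, FIELD_ALIASES["input"])
    output_text = _first_present(record, FIELD_ALIASES["output"])
    if input_text is None or output_text is None:
        raise ValueError(f"record {index} needs both an input and an output field")
    context = _first_present(record, FIELD_ALIASES["context"]) or ()
    if isinstance(context, str):
        context = (context,)  # a single context string becomes a 1-tuple
    reference = _first_present(record, FIELD_ALIASES["reference"])
    return CaseState(
        case_id=str(record.get("id", index)),
        input=str(input_text),
        output=str(output_text),
        context=tuple(str(c) for c in context),
        reference=None if reference is None else str(reference),
    )
```

Freezing the dataclass keeps each case immutable once normalized, so later steps cannot silently alter the content they were handed.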

This architecture supports day-to-day prompt iteration, benchmark runs, and release gating with measurable quality and safety signals.

Why LLM-As-A-Judge Is Necessary

Exact-match metrics are useful but limited for modern generative systems. In many real tasks, multiple answers can be valid, quality depends on context use, and failure modes are semantic rather than lexical.

LLM-as-a-judge provides scalable semantic evaluation by scoring outputs against explicit criteria. In nexa-gauge, this capability is combined with targeted metrics so teams can evaluate quality from multiple angles:

`relevance` for input-output alignment.
`grounding` for support in provided context.
`redteam` for safety and risk behavior.
`geval` for rubric-based judgment.
`reference` for overlap with known reference answers.
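To make the reference-overlap idea concrete, here is a minimal sketch of one common lexical metric, token-level F1. The article does not say which lexical metrics the `reference` node computes, so token F1 is an assumption picked for illustration, and `token_f1` is a hypothetical helper rather than a nexa-gauge function.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between a candidate answer and a reference answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        # Two empty strings match perfectly; one empty string matches nothing.
        return float(cand == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A score like this only compares word overlap. It cannot tell whether a differently worded answer is correct or whether it uses the provided context, which is the gap the judge-based metrics are meant to cover.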

Execution Model And Caching

nexa-gauge provides two operational modes:

`run` executes the selected branch and returns final artifacts.
`estimate` computes uncached eligible cost before execution.

Both modes follow the same branch-planning logic, which makes cost estimates actionable before you run full evaluations.
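The relationship between the two modes can be sketched as follows. This is a simplified model under stated assumptions, not nexa-gauge's implementation: the plan is taken to be an already-ordered list of eligible node names, per-node costs are made-up numbers, and `pending_nodes`, `estimate`, and `run` are hypothetical function names.

```python
from typing import Callable, Dict, List, Set


def pending_nodes(plan: List[str], cached: Set[str]) -> List[str]:
    """Shared planning step: planned nodes that have no cached output."""
    return [node for node in plan if node not in cached]


def estimate(plan: List[str], cached: Set[str], unit_cost: Dict[str, float]) -> float:
    """Sum the assumed cost of every node that would actually execute."""
    return sum(unit_cost[node] for node in pending_nodes(plan, cached))


def run(plan: List[str], cached: Set[str], execute: Callable[[str], None]) -> List[str]:
    """Execute exactly the nodes that estimate() priced, in plan order."""
    executed = []
    for node in pending_nodes(plan, cached):
        execute(node)
        executed.append(node)
    return executed
```

Because both functions go through `pending_nodes`, the set of nodes priced by the estimate is the same set the run executes, as long as the cache does not change in between.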

Caching is route-aware and deterministic. Reuse occurs only when input content and routing semantics are unchanged. Changes to inputs, prompts, or model routing intentionally invalidate affected steps.

Practical outcome:

Teams can estimate budget before execution.
Iterative runs avoid recomputing stable nodes.
Results remain reproducible under fixed inputs and model routes.
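One way to get this behavior is to derive each cache key from a canonical serialization of everything that can affect a node's output. The sketch below assumes a key built from the node name, its inputs, and its routing settings; `cache_key` and the field names in the usage example are hypothetical, and nexa-gauge's actual key format is not described in the article.

```python
import hashlib
import json


def cache_key(node: str, inputs: dict, route: dict) -> str:
    """Deterministic key: the same content and routing always hash the same."""
    payload = json.dumps(
        {"node": node, "inputs": inputs, "route": route},
        sort_keys=True,          # key order in the dicts must not matter
        separators=(",", ":"),   # fixed separators, no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Usage: changing the model route changes the key and forces recomputation.
inputs = {"input": "What is 2+2?", "output": "4"}
key_a = cache_key("grounding", inputs, {"model": "judge-a", "prompt_version": 1})
key_b = cache_key("grounding", inputs, {"model": "judge-b", "prompt_version": 1})
```

Here `key_a` and `key_b` differ only because the route differs, while reordering the keys inside `inputs` or `route` leaves a key unchanged.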

Architecture

Node Summary

Input And Orchestration

Node	Purpose
`scan`	Normalizes record fields and initializes case state.
`eval`	Aggregates metric branches into a unified result.
`report`	Projects final output into a stable report contract.

Utility Nodes

Node	Purpose
`chunk`	Splits generated text for downstream extraction (`Semchunk`, ..).
`refine`	Removes, deduplicates, reranks, and selects top-k chunks (`mmr`).

Metric Essentials

Node	Purpose
`claims`	Extracts atomic claims from generated output.
`geval_steps`	Resolves evaluation steps for GEval scoring.

Metric Nodes

Node	Purpose
`relevance`	Measures how directly claims answer the input.
`grounding`	Measures whether claims are supported by context.
`redteam`	Evaluates safety and policy risk using rubrics.
`geval`	Runs final rubric-driven LLM judging.
`reference`	Computes reference-based lexical metrics.
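The idea of executing only the nodes a target needs can be sketched as a dependency closure followed by a topological sort. The node names come from the tables above, but the edges in `DEPENDS_ON` are assumptions inferred from the node descriptions, since the article does not publish the actual graph. `required_nodes` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# ASSUMED edges (node -> nodes it depends on); the real graph may differ.
DEPENDS_ON = {
    "scan": [],
    "chunk": ["scan"],
    "refine": ["chunk"],
    "claims": ["refine"],
    "geval_steps": ["scan"],
    "relevance": ["claims"],
    "grounding": ["claims"],
    "redteam": ["scan"],
    "geval": ["geval_steps"],
    "reference": ["scan"],
    "eval": ["relevance", "grounding", "redteam", "geval", "reference"],
    "report": ["eval"],
}


def required_nodes(target: str) -> list:
    """Return the target and its transitive dependencies in execution order."""
    needed = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(DEPENDS_ON[node])
    # Restrict the graph to the needed nodes, then order dependencies first.
    subgraph = {node: DEPENDS_ON[node] for node in needed}
    return list(TopologicalSorter(subgraph).static_order())
```

Under these assumed edges, targeting `grounding` pulls in `scan`, `chunk`, `refine`, and `claims` but leaves `redteam`, `geval`, and `reference` untouched, while targeting `eval` or `report` pulls in every metric branch.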

Typical Workflow

nexagauge estimate eval --input sample.json --limit 100

nexagauge run eval --input sample.json --limit 100 --output-dir ./report

For dataset fields, accepted aliases, and metric activation rules, see the Data Schema.

For iterative development, repeated runs on unchanged inputs and routing should show high cache reuse and lower incremental latency.


Read original on harnexa.dev → harnexa.dev/nexa-gauge/docs/introduction
