Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

A developer argues that evaluation infrastructure, not alignment research or product engineering, serves as the actual enforcement layer for AI safety in production. The developer proposes that safety must be treated as a runtime guarantee enforced through non-bypassable boundaries that check inputs, outputs, and trajectories for violations like credential leakage or system prompt exposure. The approach implements three concentric enforcement layers with hard invariants that block harmful outputs before they reach users, rather than relying on model fine-tuning or system prompts alone.

The AI safety conversation is dominated by two camps: the alignment researchers thinking about existential risk, and the product engineers shipping features. Neither group talks enough about the middle layer — the actual enforcement mechanism that determines whether an agent behaves as intended in production. Here's my thesis: evaluation infrastructure is alignment enforcement. Not alignment research. Not safety theater. The actual enforcement layer that determines whether your agent does what you intended, at runtime, under adversarial conditions. If your agent can be jailbroken, produce harmful outputs, or violate its constraints — and you only find out from user reports — your evals aren't a testing tool. They're a missing safety system. Most teams treat safety as a property of the model. "We fine-tuned it to be safe." "We added a system prompt that says don't do bad things." This is wishful thinking dressed as engineering. Safety is a runtime guarantee , and runtime guarantees require runtime enforcement: interface SafetyBoundary { name: string; scope: 'input' | 'output' | 'trajectory'; check: data: BoundaryInput = BoundaryResult; action: 'block' | 'flag' | 'modify'; bypassable: false; // This is the point. } const boundaries: SafetyBoundary = { name: 'no-credential-leakage', scope: 'output', check: data = { const patterns = / ?:api - ?key|token|secret \\s:= + '" ? \\w- {20,}/gi, /-----BEGIN ?:RSA |EC ?PRIVATE KEY-----/g, / ?:ghp|gho|ghu|ghs|ghr A-Za-z0-9 {36,}/g ; const matches = patterns.flatMap p = ... data.output.matchAll p || ; return { safe: matches.length === 0, violations: matches.map m = { pattern: m 0 .substring 0, 20 + '...', position: m.index } }; }, action: 'block', bypassable: false }, { name: 'no-system-prompt-leakage', scope: 'output', check: data = { const systemPromptFragments = extractSignificantPhrases data.context.systemPrompt, { minLength: 15, topN: 20 } ; const leaked = systemPromptFragments.filter fragment = data.output.toLowerCase .includes fragment.toLowerCase ; return { safe: leaked.length < 3, violations: leaked.map l = { fragment: l } }; }, action: 'block', bypassable: false } ; The bypassable: false property is the entire philosophy. These aren't suggestions. They're invariants. I think about safety enforcement as three concentric layers, each catching different failure modes: Things that must never happen, regardless of input. These are your hardest constraints: class InvariantEnforcer { private invariants: SafetyBoundary ; private violationLog: ViolationRecord = ; async enforce agentOutput: AgentOutput : Promise<EnforcementResult { const results = await Promise.all this.invariants .filter inv = inv.scope === 'output' .map async inv = { const result = inv.check { output: agentOutput.raw, context: agentOutput.context } ; if result.safe { this.violationLog.push { invariant: inv.name, timestamp: Date.now , inputHash: hash agentOutput.context.userInput , violations: result.violations } ; } return { boundary: inv, result }; } ; const blocked = results.filter r = r.result.safe && r.boundary.action === 'block' ; if blocked.length 0 { return { allowed: false, blockedBy: blocked.map b = b.boundary.name , fallbackResponse: this.generateSafeFallback agentOutput.context }; } return { allowed: true, warnings: results.filter r = r.result.safe }; } } These run on every output. They're non-negotiable. When one fires, the agent output gets replaced with a safe fallback — no exceptions. Softer boundaries about how the agent should behave. These are your alignment properties: js const behavioralChecks = { name: 'uncertainty-acknowledgment', check: output: AgentOutput = { const confidenceSignals = /I'm not ?:sure|certain /i, /This might not be ?:accurate|correct /i, /I don't have ?:enough|sufficient information/i, /Based on ?:limited|available information/i ; const certaintySignals = /definitely|certainly|absolutely|always|never/gi ; const hasLowConfidenceContext = output.context.retrievalScores?.some s = s < 0.7 ; if hasLowConfidenceContext return { safe: true }; const acknowledgesUncertainty = confidenceSignals.some p = p.test output.raw ; const overlyConfident = output.raw.match certaintySignals 0 || .length 2; return { safe: acknowledgesUncertainty || overlyConfident, violations: acknowledgesUncertainty && overlyConfident ? { issue: 'High confidence language with low-confidence retrieval' } : }; } } ; Notice: this check correlates output confidence language with retrieval quality scores . An agent that says "definitely" when its context retrieval scored 0.4 is a safety problem — it's confabulating with false confidence. The most sophisticated layer: evaluating not individual outputs, but sequences of agent actions. This catches subtle manipulation patterns: interface TrajectoryAnalyzer { analyze steps: AgentStep : TrajectoryRisk; } const trajectoryChecks: TrajectoryAnalyzer = { analyze steps { const risks: string = ; // Detect escalation patterns const toolCalls = steps.filter s = s.type === 'tool call' ; const permissionLevel = toolCalls.map t = getPermissionLevel t.tool ; const escalating = permissionLevel.every p, i = i === 0 || p = permissionLevel i - 1 && permissionLevel.length 3; if escalating && Math.max ...permissionLevel 3 { risks.push 'monotonic-permission-escalation' ; } // Detect retry-after-refusal possible jailbreak attempt const refusals = steps.filter s = s.wasRefused ; const postRefusalSuccess = steps.filter s, i = s.wasRefused && steps.slice 0, i .some prev = prev.wasRefused && prev.intent === s.intent ; if postRefusalSuccess.length 0 { risks.push 'success-after-refusal-same-intent' ; } return { riskLevel: risks.length 1 ? 'high' : risks.length === 1 ? 'medium' : 'low', risks, requiresReview: risks.includes 'success-after-refusal-same-intent' }; } }; The success-after-refusal-same-intent pattern is critical. If an agent refused to do something, then later does the same thing possibly rephrased , that's a potential jailbreak that succeeded. This is invisible at the individual-output level — you need trajectory context. Alignment isn't just about training. It's about the entire system that ensures an agent does what its operators intend: Most teams invest heavily in 1 , minimally in 3 , and almost nothing in 2 . But 2 is the only part that actually prevents bad outcomes at runtime. If your evaluation infrastructure has gaps, those gaps are exactly where safety failures will occur. Adversarial users don't attack your strongest checks. They find the boundaries you didn't think to enforce. This means building eval infrastructure isn't just engineering work. It's threat modeling. Every boundary you define is a hypothesis about what could go wrong. Every boundary you miss is an attack surface. // Your eval coverage IS your actual safety coverage. // Not your model's RLHF. Not your system prompt. // What you check is what you enforce. const coverage = { structuralInvariants: boundaries.filter b = b.scope === 'output' .length, behavioralChecks: behavioralChecks.length, trajectoryAnalysis: trajectoryChecks ? 'enabled' : 'BLIND SPOT', inputValidation: inputBoundaries.length, // The gaps are the attack surface uncoveredVectors: identifyGaps boundaries, knownAttackPatterns }; If you take one thing from this post: treat your eval layer as security infrastructure, not test infrastructure. That means: The agent safety problem isn't unsolvable. It's under-engineered. We have the patterns. We have deterministic checks that catch structural violations, heuristics that catch behavioral drift, and trajectory analysis that catches multi-step attacks. What we lack is the discipline to treat these as production safety systems rather than optional test suites. Do you treat your agent evals as safety-critical infrastructure, or as a nice-to-have testing layer? What's your enforcement gap — the thing you know you should be checking but aren't?