Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks


The AI safety conversation is dominated by two camps: the alignment researchers thinking about existential risk, and the product engineers shipping features. Neither group talks enough about the middle layer — the actual enforcement mechanism that determines whether an agent behaves as intended in production.

Here's my thesis: evaluation infrastructure is alignment enforcement. Not alignment research. Not safety theater. The actual enforcement layer that determines whether your agent does what you intended, at runtime, under adversarial conditions.

If your agent can be jailbroken, produce harmful outputs, or violate its constraints — and you only find out from user reports — your evals aren't a testing tool. They're a missing safety system.

Most teams treat safety as a property of the model. "We fine-tuned it to be safe." "We added a system prompt that says don't do bad things." This is wishful thinking dressed as engineering.

Safety is a runtime guarantee, and runtime guarantees require runtime enforcement:

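// Shapes inferred from how the checks below use them.
type BoundaryInput = { output: string; context: { systemPrompt: string; userInput: string } };
type BoundaryResult = { safe: boolean; violations?: Record<string, unknown>[] };

interface SafetyBoundary {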
  name: string;
  scope: 'input' | 'output' | 'trajectory';
  check: (data: BoundaryInput) => BoundaryResult;
  action: 'block' | 'flag' | 'modify';
  bypassable: false; // This is the point.
}

const boundaries: SafetyBoundary[] = [
  {
    name: 'no-credential-leakage',
    scope: 'output',
    check: (data) => {
      const patterns = [
        /(?:api[_-]?key|token|secret)[\s:=]+['"]?[\w-]{20,}/gi,
        /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/g,
        /(?:ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36,}/g
      ];
      const matches = patterns.flatMap(p =>
        // matchAll always returns an iterator, so no fallback is needed
        [...data.output.matchAll(p)]
      );
      return {
        safe: matches.length === 0,
        violations: matches.map(m => ({
          pattern: m[0].substring(0, 20) + '...',
          position: m.index
        }))
      };
    },
    action: 'block',
    bypassable: false
  },
  {
    name: 'no-system-prompt-leakage',
    scope: 'output',
    check: (data) => {
      const systemPromptFragments = extractSignificantPhrases(
        data.context.systemPrompt, 
        { minLength: 15, topN: 20 }
      );
      const leaked = systemPromptFragments.filter(fragment =>
        data.output.toLowerCase().includes(fragment.toLowerCase())
      );
      return {
        // Up to two overlapping phrases are tolerated; three or more count as leakage.
        safe: leaked.length < 3,
        violations: leaked.map(l => ({ fragment: l }))
      };
    },
    action: 'block',
    bypassable: false
  }
];

The bypassable: false property is the entire philosophy. These aren't suggestions. They're invariants.
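As a quick sanity check, the credential boundary can be exercised on its own. The sketch below is self-contained and simplified: `findCredentialLeaks`, the `Violation` type, and the sample strings are illustrative names and data, not part of the code above.

```typescript
// Illustrative, self-contained version of the credential check.
// `findCredentialLeaks` is a hypothetical helper name.
interface Violation {
  pattern: string;
  position: number;
}

// In a regex literal, \s and \w take a single backslash.
const credentialPatterns: RegExp[] = [
  /(?:api[_-]?key|token|secret)[\s:=]+['"]?[\w-]{20,}/gi,
  /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/g,
  /(?:ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36,}/g,
];

function findCredentialLeaks(output: string): Violation[] {
  return credentialPatterns.flatMap((p) =>
    // matchAll requires the global flag, which every pattern above sets.
    [...output.matchAll(p)].map((m) => ({
      // Truncate so the violation log does not itself store the full secret.
      pattern: m[0].substring(0, 20) + '...',
      position: m.index ?? -1,
    }))
  );
}

const clean = findCredentialLeaks('Your deployment finished successfully.');
const leaky = findCredentialLeaks('Use api_key = "abcd1234abcd1234abcd1234" to log in.');
console.log(clean.length, leaky.length); // prints: 0 1
```

Running known-bad and known-good strings through each pattern like this is a cheap way to confirm a boundary actually fires before relying on it in production.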

I think about safety enforcement as three concentric layers, each catching different failure modes:

The first layer is structural invariants: things that must never happen, regardless of input. These are your hardest constraints:

class InvariantEnforcer {
  private invariants: SafetyBoundary[];
  private violationLog: ViolationRecord[] = [];

  async enforce(agentOutput: AgentOutput): Promise<EnforcementResult> {
    const results = await Promise.all(
      this.invariants
        .filter(inv => inv.scope === 'output')
        .map(async (inv) => {
          const result = inv.check({ output: agentOutput.raw, context: agentOutput.context });
          if (!result.safe) {
            this.violationLog.push({
              invariant: inv.name,
              timestamp: Date.now(),
              inputHash: hash(agentOutput.context.userInput),
              violations: result.violations
            });
          }
          return { boundary: inv, result };
        })
    );

    const blocked = results.filter(r => !r.result.safe && r.boundary.action === 'block');

    if (blocked.length > 0) {
      return {
        allowed: false,
        blockedBy: blocked.map(b => b.boundary.name),
        fallbackResponse: this.generateSafeFallback(agentOutput.context)
      };
    }

    return { allowed: true, warnings: results.filter(r => !r.result.safe) };
  }
}

These run on every output. They're non-negotiable. When one fires, the agent output gets replaced with a safe fallback — no exceptions.
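One way to wire this into a request path is a thin wrapper, so that the raw output is never returned without passing through the checks. This is a minimal sketch, assuming synchronous checks and a fixed fallback string; `guardOutput`, `OutputCheck`, `GuardedResult`, and `SAFE_FALLBACK` are hypothetical names, not part of the enforcer above.

```typescript
// Hypothetical wrapper: every output passes through the checks before release.
type OutputCheck = { name: string; isSafe: (output: string) => boolean };

interface GuardedResult {
  text: string;        // what the caller is allowed to show the user
  blockedBy: string[]; // names of the checks that fired, empty if none
}

const SAFE_FALLBACK = "I can't share that response. Please rephrase your request.";

function guardOutput(raw: string, checks: OutputCheck[]): GuardedResult {
  // Run every check rather than stopping at the first failure,
  // so the result lists all invariants that were violated.
  const blockedBy = checks.filter((c) => !c.isSafe(raw)).map((c) => c.name);
  return blockedBy.length > 0
    ? { text: SAFE_FALLBACK, blockedBy }
    : { text: raw, blockedBy };
}

const checks: OutputCheck[] = [
  { name: 'no-private-key', isSafe: (o) => !o.includes('-----BEGIN PRIVATE KEY-----') },
];

const ok = guardOutput('Here is the summary you asked for.', checks);
const blocked = guardOutput('-----BEGIN PRIVATE KEY-----\nMIIB...', checks);
console.log(ok.text, blocked.blockedBy);
```

The design choice worth noting is that the function has no parameter or code path that returns `raw` once a check has failed, which is the same idea as `bypassable: false`.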

The second layer is behavioral checks: softer boundaries about how the agent should behave. These are your alignment properties:

const behavioralChecks = [
  {
    name: 'uncertainty-acknowledgment',
    check: (output: AgentOutput) => {
      const confidenceSignals = [
        /I'm not (?:sure|certain)/i,
        /This might not be (?:accurate|correct)/i,
        /I don't have (?:enough|sufficient) information/i,
        /Based on (?:limited|available) information/i
      ];
      const certaintySignals = [
        /definitely|certainly|absolutely|always|never/gi
      ];

      const hasLowConfidenceContext = output.context.retrievalScores?.some(
        s => s < 0.7
      );

      if (!hasLowConfidenceContext) return { safe: true };

      const acknowledgesUncertainty = confidenceSignals.some(p => p.test(output.raw));
      const overlyConfident = (output.raw.match(certaintySignals[0]) || []).length > 2;

      return {
        safe: acknowledgesUncertainty || !overlyConfident,
        violations: !acknowledgesUncertainty && overlyConfident 
          ? [{ issue: 'High confidence language with low-confidence retrieval' }]
          : []
      };
    }
  }
];

Notice: this check correlates output confidence language with retrieval quality scores. An agent that says "definitely" when its context retrieval scored 0.4 is a safety problem — it's confabulating with false confidence.
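To make the correlation concrete, here is a pared-down sketch that runs the same text against strong and weak retrieval scores. `isOverconfident` is a hypothetical helper, the 0.7 threshold and the "more than two certainty words" rule are taken from the check above, and the sample answers are invented.

```typescript
// Simplified, self-contained version of the confidence check.
// `isOverconfident` is a hypothetical name; the sample data is made up.
const hedges: RegExp[] = [
  /I'm not (?:sure|certain)/i,
  /Based on (?:limited|available) information/i,
];
const certainty = /definitely|certainly|absolutely|always|never/gi;

function isOverconfident(text: string, retrievalScores: number[]): boolean {
  const weakRetrieval = retrievalScores.some((s) => s < 0.7);
  if (!weakRetrieval) return false; // strong context: confident language is fine

  const hedged = hedges.some((p) => p.test(text));
  const certaintyCount = (text.match(certainty) ?? []).length;
  return !hedged && certaintyCount > 2;
}

const answer = 'This will definitely work. It always does, and it never fails.';
const hedgedAnswer = "I'm not sure, but it definitely, always, never fails.";

console.log(isOverconfident(answer, [0.9, 0.85]));  // false: retrieval was strong
console.log(isOverconfident(answer, [0.9, 0.4]));   // true: same text, weak retrieval
console.log(isOverconfident(hedgedAnswer, [0.4]));  // false: uncertainty acknowledged
```

The same sentence passes or fails depending only on the retrieval scores, which is why this check cannot be run on the output text alone.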

The third and most sophisticated layer is trajectory analysis: evaluating not individual outputs, but sequences of agent actions. This catches subtle manipulation patterns:

interface TrajectoryAnalyzer {
  analyze(steps: AgentStep[]): TrajectoryRisk;
}

const trajectoryChecks: TrajectoryAnalyzer = {
  analyze(steps) {
    const risks: string[] = [];

    // Detect escalation patterns
    const toolCalls = steps.filter(s => s.type === 'tool_call');
    const permissionLevel = toolCalls.map(t => getPermissionLevel(t.tool));
    const escalating = permissionLevel.every((p, i) => 
      i === 0 || p >= permissionLevel[i - 1]
    ) && permissionLevel.length > 3;

    if (escalating && Math.max(...permissionLevel) > 3) {
      risks.push('monotonic-permission-escalation');
    }

    // Detect retry-after-refusal (possible jailbreak attempt)
    const postRefusalSuccess = steps.filter((s, i) => 
      !s.wasRefused && steps.slice(0, i).some(prev => 
        prev.wasRefused && prev.intent === s.intent
      )
    );

    if (postRefusalSuccess.length > 0) {
      risks.push('success-after-refusal-same-intent');
    }

    return {
      riskLevel: risks.length > 1 ? 'high' : risks.length === 1 ? 'medium' : 'low',
      risks,
      requiresReview: risks.includes('success-after-refusal-same-intent')
    };
  }
};

The success-after-refusal-same-intent pattern is critical. If an agent refused to do something and later does the same thing (possibly rephrased), that's a potential jailbreak that succeeded. This is invisible at the individual-output level: you need trajectory context.
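A small worked session shows why the pattern only appears at the trajectory level. This is an illustrative sketch: `Step` is a pared-down stand-in for `AgentStep`, `successAfterRefusal` is a hypothetical helper, and the intent labels are invented.

```typescript
// Self-contained illustration of the success-after-refusal check.
interface Step {
  intent: string;      // assumed to be a normalized label assigned upstream
  wasRefused: boolean;
}

function successAfterRefusal(steps: Step[]): Step[] {
  // Keep each successful step whose intent was refused earlier in the session.
  return steps.filter(
    (s, i) =>
      !s.wasRefused &&
      steps.slice(0, i).some((prev) => prev.wasRefused && prev.intent === s.intent)
  );
}

const session: Step[] = [
  { intent: 'read-docs', wasRefused: false },
  { intent: 'export-user-emails', wasRefused: true },  // agent refuses
  { intent: 'summarize-ticket', wasRefused: false },
  { intent: 'export-user-emails', wasRefused: false }, // same intent later succeeds
];

const flagged = successAfterRefusal(session);
console.log(flagged.length); // prints: 1
```

Each step looks unremarkable in isolation; only the fourth step combined with the second is suspicious. The check is also only as good as the `intent` labels: if rephrased requests are not mapped to the same label upstream, a plain string comparison will miss them.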

Alignment isn't just about training. It's about the entire system that ensures an agent does what its operators intend:

Most teams invest heavily in (1), minimally in (3), and almost nothing in (2). But (2) is the only part that actually prevents bad outcomes at runtime.

If your evaluation infrastructure has gaps, those gaps are exactly where safety failures will occur. Adversarial users don't attack your strongest checks. They find the boundaries you didn't think to enforce.

This means building eval infrastructure isn't just engineering work. It's threat modeling. Every boundary you define is a hypothesis about what could go wrong. Every boundary you miss is an attack surface.

// Your eval coverage IS your actual safety coverage.
// Not your model's RLHF. Not your system prompt.
// What you check is what you enforce.

const coverage = {
  structuralInvariants: boundaries.filter(b => b.scope === 'output').length,
  behavioralChecks: behavioralChecks.length,
  trajectoryAnalysis: trajectoryChecks ? 'enabled' : 'BLIND_SPOT',
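  // inputBoundaries: the input-scope checks, assumed to be defined elsewhere (not shown in this post)
  inputValidation: inputBoundaries.length,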

  // The gaps are the attack surface
  uncoveredVectors: identifyGaps(boundaries, knownAttackPatterns)
};
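The post does not define `identifyGaps`, so the following is only one possible shape for it. It assumes each check declares which attack patterns it is meant to cover through a `covers` field; that field, the `CoveredCheck` type, and the pattern ids are all hypothetical.

```typescript
// Hypothetical sketch of identifyGaps; the original post does not define it.
interface CoveredCheck {
  name: string;
  covers: string[]; // attack pattern ids this check addresses (assumed field)
}

function identifyGaps(checks: CoveredCheck[], knownAttackPatterns: string[]): string[] {
  const covered = new Set(checks.flatMap((c) => c.covers));
  // Whatever no check claims to cover is the unenforced attack surface.
  return knownAttackPatterns.filter((p) => !covered.has(p));
}

const registeredChecks: CoveredCheck[] = [
  { name: 'no-credential-leakage', covers: ['secret-exfiltration'] },
  { name: 'no-system-prompt-leakage', covers: ['prompt-extraction'] },
];
const knownAttackPatterns = [
  'secret-exfiltration',
  'prompt-extraction',
  'tool-permission-escalation',
];

const gaps = identifyGaps(registeredChecks, knownAttackPatterns);
console.log(gaps); // prints: [ 'tool-permission-escalation' ]
```

A report like this only lists gaps relative to the attack patterns you have already enumerated, so the pattern list itself needs the same maintenance as the checks.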

If you take one thing from this post: treat your eval layer as security infrastructure, not test infrastructure.

That means:

The agent safety problem isn't unsolvable. It's under-engineered. We have the patterns. We have deterministic checks that catch structural violations, heuristics that catch behavioral drift, and trajectory analysis that catches multi-step attacks.

What we lack is the discipline to treat these as production safety systems rather than optional test suites.

Do you treat your agent evals as safety-critical infrastructure, or as a nice-to-have testing layer? What's your enforcement gap — the thing you know you should be checking but aren't?

Source: dev.to (original article)
