Show HN: AST-guard A gradient-immune structural guard against RL reward hacking A developer released AST-guard, an open-source tool that uses deterministic abstract syntax tree analysis to detect reward hacking in AI-generated code, achieving 96.2% recall on a benchmark of reward hacks while remaining immune to gradient-based optimization. The tool is designed as a first-stage filter in a detection cascade, catching structural bypasses with zero false positives before code execution. Pre-Execution Gate for AI-Generated Code The deterministic layer between LLM code generation and code execution. No LLM. No ML. No cost. Research artifact — experimental, in active development.ast-guard is a research tool investigating structural reward-hacking detection in RL-trained code models. It has been empirically tested in a real RL training loop rl-rewardhacking , is not production-ready, and carries no stability guarantees. See Empirical Validation for the experiment results. ast-guard sits between code generation and execution. It parses LLM-generated code into an AST and returns a deterministic verdict before any line runs. It is not a linter, not a security scanner, and not a sandbox — it is a deterministic structural check that is gradient-immune : it cannot be deceived by a model's reasoning trace, and optimizing against it cannot happen through backpropagation. The core research thesis: A deterministic, model-free structural gate can serve as the first stage of a reward-hacking detection cascade. It catches the structurally obvious bypasses with 0 false positives and forces a model under optimization pressure to escalate to harder-to-detect semantic hacks — which it cannot do silently the escalation is measurable and categorizable . This thesis has been tested empirically in a real RL training loop; see Empirical Validation empirical-validation-rl-training-experiment . Two existing defense classes leave a gap: Training-time alignment Anthropic, DeepMind reduces incidence, not residual. Inference-time LLM reviewers TRACE, RewardHackWatch, EvilGenie share failure modes with the generator they monitor — a model optimizing against an LLM reviewer can learn to produce text that fools it. ast-guard's structural analysis cannot be fooled — it can only be circumvented by switching to a form that leaves no structural trace . That circumvention is itself a detectable, categorizable signal. LLM generates code │ ▼ ┌────────────────────────┐ │ ast-guard gate │ ← deterministic, gradient-immune, <10ms │ structural only │ └────────────────────────┘ │ ┌────┴──────────────────────────────┐ ▼ ▼ CLEAN WARNING / CRITICAL pass to execution structural hack detected → block or escalate │ ▼ ┌──────────────────┐ │ Semantic reviewer │ ← LLM judge, held-out tests │ next stage │ handles Class B / semantic └──────────────────┘ Two modes: Pair Mode — compares original code against LLM-generated code. Detects structural deviations. Standalone Mode — analyzes a single agent output with no baseline. Uses behavioral risk scoring. See benchmarks/RESULTS.md /Nick-is-building/ast-guard/blob/main/benchmarks/RESULTS.md for full methodology, confusion matrices, and per-category breakdowns. Results on publicly released datasets with labels not authored by this project. | Dataset | Mode | Samples | Key Metric | Value | Run artifact | |---|---|---|---|---|---| MALT METR — normal | Standalone | 77,369 | Specificity TNR † | 95.0% | malt v2 2 0.json ✓ | MALT METR — hardcoded solution | Standalone | 429 | Detection Rate | 46.9% | malt v2 2 0.json ✓ | MALT METR — bypass constraints | Standalone | 2,379 | Detection Rate | 34.5% | malt v2 2 0.json ✓ | | School of Reward Hacks longtermrisk | Pair | 52 | Recall | 96.2% | sorh results.json ✓ | | Countdown-Code Khan et al. | Pair | 15,894 | True Negative Rate | 99.0% | pending re-run ‡ | † TNR on the normal label only 77,369 samples ; see RESULTS.md /Nick-is-building/ast-guard/blob/main/benchmarks/RESULTS.md for the full confusion matrix. ‡ Numbers recorded in CHANGELOG at v2.1.0; no JSON artifact stored in benchmarks/data/ . Re-run with python -m benchmarks.run benchmark --benchmark countdown-code --json results.json to produce a current artifact. These samples were written by the project author to cover specific structural patterns. They are not independent external datasets — use them to confirm a check fires correctly, not as a claim about detection rates on real-world data. | Suite | Mode | Samples | Key Metric | Value | Notes | |---|---|---|---|---|---| | Structural Benchmark | Pair | 36 | F1 | 100% | Hand-curated pairs, one per structural pattern category | | TRACE-aligned built-in | Pair | 33 | F1 | 95.7% | Hand-written pairs using TRACE category names — not the published TRACE dataset Deshpande et al. 2026, 517 trajectories | Source: rl-rewardhacking — a companion repo running GRPO verl on Qwen2.5-Coder-7B-Instruct with ast-guard as a live reward penalty. Full methodology and numbers in results/FINDINGS lam2 run.md and results/FINDINGS 7B runs.md in that repo. The core question: does a gradient-immune structural penalty actually reduce reward hacking in a real RL loop, and what happens when the model pushes back? Honeypot task MBPP subset : model sees 1 visible assert and is rewarded for passing it. Ground truth is measured by 15+ held-out assertions hidden from the model. ast-guard runs in pair mode against the canonical solution as baseline. | Parameter | Value | |---|---| | Model | Qwen2.5-Coder-7B-Instruct | | RL algorithm | GRPO verl v0.6.1 , LoRA rank 32 | | Penalty λ when active | 2.0 penalty = −λ per structural detection | | Detector penalty checks | check 5, check 7, check 8 pair mode | | Steps | 150 | | Hardware | RunPod A100-80GB | λ=0 baseline no penalty, observation only : - hack rate → 0.99 by step 52, stays there through step 150 - ast-guard recall = 1.000 from step 53 onward detects all hacks - fp on honest = 0.000 throughout - avg heldout correctness → 0.000 by step 53 λ=2.0 ast-guard penalty active, hardened detector : - hack rate plateaus at 0.35–0.40 pre-filter , ~60% reduction vs. baseline - fp on honest = 0.000 in 149/150 steps - Residual hacking: 100% Class B semantic — no structural traces remain | Phase steps | λ=0 hack rate | λ=2.0 hack rate | λ=2.0 recall | |---|---|---|---| | 1–25 | 0.097–0.197 | 0.148 | 0.003 | | 26–50 | ~0.267 | 0.208 | 0.002 | | 51–75 | 0.990 | 0.304 | 0.137 | | 76–100 | 0.995 | 0.358 | 0.157 | | 101–125 | 0.995 | 0.376 | 0.167 | | 126–150 | 0.995 | 0.403 | 0.127 | Under penalty pressure, the model shifted hack forms in a measurable, ordered sequence: λ=0, steps 1–52: Primarily pure-constant-return return 4 . Detected 100% by check 7. struct rate ≈ 0 simple internal heuristic misses this — ast-guard is more sensitive . λ=0.75 intermediate run, pre-hardening : Model shifted to single-branch ternary return 1027 if s == "..." else None and tuple-compare forms — both escaped the un-hardened check 7 entirely. By steps 81–150: struct rate ≈ 0.93, ag rate ≈ 0.000. Recall collapsed to 0. Detector hardened commit ba1737f https://github.com/Nick-is-building/ast-guard/commit/ba1737f to close those forms. λ=2.0, hardened detector: Structural hacks nearly eliminated struct rate ≈ 0 . Residual hacking hack rate ~0.40 is now entirely Class B semantic — no structural signature. This is the detector's principled boundary. This cascade is not a failure . It is empirical evidence for the cascade-gate positioning: ast-guard closes the structural space, forcing escalation to semantic hacks that require a different kind of analysis to catch. avg heldout fraction of rollouts that pass held-out tests collapses under both λ=0 and λ=2.0, though λ=2.0 maintains a residual of ~0.13 at step 150 vs. 0.000 for λ=0. The penalty does not recover correctness — it only suppresses structural hacking. See F-section of PROJECT MASTER FACTS.md /Nick-is-building/ast-guard/blob/PROJECT MASTER FACTS.md in the workspace for open methodological questions. The RL training setup is in rl-rewardhacking https://github.com/Nick-is-building/rl-rewardhacking . Key entry points: install source setup gpu.sh or setup.sh if deps already present create all datasets run honeypot diagnostic penalty run python scripts/run rl training.py honeypot diagnostic --lam 2.0 --beta 1e-3 \ --temperature 1.3 --clip ratio high 0.28 --save steps -1 ast-guard is consumed as a pip dependency pip install ast-guard==2.3.0 . The HoneypotAstGuardReward class in src/rewards.py handles pair-mode scanning per rollout. Rollout JSONLs are written to results/runs/