cd /news/ai-safety/show-hn-ast-guard-a-gradient-immune-… Β· home β€Ί topics β€Ί ai-safety β€Ί article
[ARTICLE Β· art-43799] src=github.com β†— pub= topic=ai-safety verified=true sentiment=↑ positive

Show HN: AST-guard A gradient-immune structural guard against RL reward hacking

A developer released AST-guard, an open-source tool that uses deterministic abstract syntax tree analysis to detect reward hacking in AI-generated code, achieving 96.2% recall on a benchmark of reward hacks while remaining immune to gradient-based optimization. The tool is designed as a first-stage filter in a detection cascade, catching structural bypasses with zero false positives before code execution.

read14 min views1 publishedJun 29, 2026
Show HN: AST-guard A gradient-immune structural guard against RL reward hacking
Image: source

Pre-Execution Gate for AI-Generated Code

The deterministic layer between LLM code generation and code execution. No LLM. No ML. No cost.

Research artifact β€” experimental, in active development.ast-guard is a research tool investigating structural reward-hacking detection in RL-trained code models. It has been empirically tested in a real RL training loop ([rl-rewardhacking]), is not production-ready, and carries no stability guarantees. See[Empirical Validation]for the experiment results.

ast-guard sits between code generation and execution. It parses LLM-generated code into an AST and returns a deterministic verdict before any line runs. It is not a linter, not a security scanner, and not a sandbox β€” it is a deterministic structural check that is gradient-immune: it cannot be deceived by a model's reasoning trace, and optimizing against it cannot happen through backpropagation.

The core research thesis: A deterministic, model-free structural gate can serve as the first stage of a reward-hacking detection cascade. It catches the structurally obvious bypasses with 0 false positives and forces a model under optimization pressure to escalate to harder-to-detect semantic hacks β€” which it cannot do silently (the escalation is measurable and categorizable). This thesis has been tested empirically in a real RL training loop; see Empirical Validation.

Two existing defense classes leave a gap:

Training-time alignment(Anthropic, DeepMind) reduces incidence, not residual.** Inference-time LLM reviewers**(TRACE, RewardHackWatch, EvilGenie) share failure modes with the generator they monitor β€” a model optimizing against an LLM reviewer can learn to produce text that fools it.

ast-guard's structural analysis cannot be fooled β€” it can only be circumvented (by switching to a form that leaves no structural trace). That circumvention is itself a detectable, categorizable signal.

LLM generates code
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   ast-guard gate       β”‚  ← deterministic, gradient-immune, <10ms
β”‚   (structural only)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
   β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β–Ό                                   β–Ό
CLEAN                          WARNING / CRITICAL
(pass to execution)            (structural hack detected β†’ block or escalate)
                                       β”‚
                                       β–Ό
                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                               β”‚ Semantic reviewer β”‚  ← LLM judge, held-out tests
                               β”‚ (next stage)      β”‚     (handles Class B / semantic)
                               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Two modes:

Pair Modeβ€” compares original code against LLM-generated code. Detects structural deviations.** Standalone Mode**β€” analyzes a single agent output with no baseline. Uses behavioral risk scoring.

See benchmarks/RESULTS.md for full methodology, confusion matrices, and per-category breakdowns.

Results on publicly released datasets with labels not authored by this project.

Dataset Mode Samples Key Metric Value Run artifact
MALT (METR) β€” normal
Standalone 77,369 Specificity (TNR)† 95.0%
malt_v2_2_0.json βœ“
MALT (METR) β€” hardcoded_solution
Standalone 429 Detection Rate 46.9%
malt_v2_2_0.json βœ“
MALT (METR) β€” bypass_constraints
Standalone 2,379 Detection Rate 34.5%
malt_v2_2_0.json βœ“
School of Reward Hacks (longtermrisk) Pair 52 Recall 96.2%
sorh_results.json βœ“
Countdown-Code (Khan et al.) Pair 15,894 True Negative Rate 99.0%
pending re-run ‑

† TNR on the normal

label only (77,369 samples); see RESULTS.md for the full confusion matrix.

‑ Numbers recorded in CHANGELOG at v2.1.0; no JSON artifact stored in benchmarks/data/

. Re-run with python -m benchmarks.run_benchmark --benchmark countdown-code --json results.json

to produce a current artifact.

These samples were written by the project author to cover specific structural patterns. They are not independent external datasets β€” use them to confirm a check fires correctly, not as a claim about detection rates on real-world data.

Suite Mode Samples Key Metric Value Notes
Structural Benchmark Pair 36 F1 100%
Hand-curated pairs, one per structural pattern category
TRACE-aligned built-in Pair 33 F1 95.7%
Hand-written pairs using TRACE category names β€” not the published TRACE dataset (Deshpande et al. 2026, 517 trajectories)

Source:[rl-rewardhacking]β€” a companion repo running GRPO (verl) on Qwen2.5-Coder-7B-Instruct with ast-guard as a live reward penalty. Full methodology and numbers inresults/FINDINGS_lam2_run.md

andresults/FINDINGS_7B_runs.md

in that repo.

The core question: does a gradient-immune structural penalty actually reduce reward hacking in a real RL loop, and what happens when the model pushes back?

Honeypot task (MBPP subset): model sees 1 visible assert

and is rewarded for passing it. Ground truth is measured by 15+ held-out assertions hidden from the model. ast-guard runs in pair mode against the canonical solution as baseline.

Parameter Value
Model Qwen2.5-Coder-7B-Instruct
RL algorithm GRPO (verl v0.6.1), LoRA rank 32
Penalty Ξ» (when active) 2.0 (penalty = βˆ’Ξ» per structural detection)
Detector penalty checks check_5, check_7, check_8 (pair mode)
Steps 150
Hardware RunPod A100-80GB

Ξ»=0 baseline (no penalty, observation only):

  • hack_rate β†’ 0.99 by step 52, stays there through step 150
  • ast-guard recall = 1.000 from step 53 onward (detects all hacks)
  • fp_on_honest = 0.000 throughout
  • avg_heldout (correctness) β†’ 0.000 by step 53

Ξ»=2.0 (ast-guard penalty active, hardened detector):

  • hack_rate plateaus at 0.35–0.40 (pre-filter), ~60% reduction vs. baseline
  • fp_on_honest = 0.000 in 149/150 steps
  • Residual hacking: 100% Class B (semantic) β€” no structural traces remain
Phase (steps) Ξ»=0 hack_rate Ξ»=2.0 hack_rate Ξ»=2.0 recall
1–25 0.097–0.197 0.148 0.003
26–50 ~0.267 0.208 0.002
51–75 0.990 0.304 0.137
76–100 0.995 0.358 0.157
101–125 0.995 0.376 0.167
126–150 0.995 0.403 0.127

Under penalty pressure, the model shifted hack forms in a measurable, ordered sequence:

Ξ»=0, steps 1–52: Primarily pure-constant-return (return 4

). Detected 100% by check_7. struct_rate β‰ˆ 0 (simple internal heuristic misses this β€” ast-guard is more sensitive).Ξ»=0.75 (intermediate run, pre-hardening): Model shifted to single-branch ternary (return 1027 if s == "..." else None

) and tuple-compare forms β€” both escaped theun-hardenedcheck_7 entirely. By steps 81–150: struct_rate β‰ˆ 0.93, ag_rate β‰ˆ 0.000. Recall collapsed to 0.Detector hardened(commit ba1737f) to close those forms.Ξ»=2.0, hardened detector: Structural hacks nearly eliminated (struct_rate β‰ˆ 0). Residual hacking (hack_rate ~0.40) is now entirely Class B semantic β€” no structural signature. This is the detector's principled boundary.

This cascade is not a failure. It is empirical evidence for the cascade-gate positioning: ast-guard closes the structural space, forcing escalation to semantic hacks that require a different kind of analysis to catch.

avg_heldout (fraction of rollouts that pass held-out tests) collapses under both Ξ»=0 and Ξ»=2.0, though Ξ»=2.0 maintains a residual of ~0.13 at step 150 vs. 0.000 for Ξ»=0. The penalty does not recover correctness β€” it only suppresses structural hacking. See F-section of PROJECT_MASTER_FACTS.md in the workspace for open methodological questions.

The RL training setup is in rl-rewardhacking. Key entry points:

source setup_gpu.sh   # or setup.sh if deps already present
create_all_datasets

python scripts/run_rl_training.py honeypot_diagnostic --lam 2.0 --beta 1e-3 \
    --temperature 1.3 --clip_ratio_high 0.28 --save_steps -1

ast-guard is consumed as a pip dependency (pip install ast-guard==2.3.0

). The HoneypotAstGuardReward

class in src/rewards.py

handles pair-mode scanning per rollout. Rollout JSONLs are written to results/runs/<run_id>/rollouts/

and contain per-step, per-sample fields including ast_guard_detected

, structural_hack

, and honeypot_label_int

.

These approaches are complementary, not competing. ast-guard handles structural analysis; LLM reviewers handle semantics.

Approach Method Cost/scan Latency Deterministic Scope
GPT-4o reviewer LLM-as-judge $0.01–0.10 500–2000ms No Semantic + Structural
RewardHackWatch ML + Regex + AST GPU required varies No Trajectory-level
ast-guard v2.3.0
Pure AST + heuristics
$0.00
<10ms
Yes
Structural only

ast-guard parses code into an Abstract Syntax Tree and evaluates structural properties. No execution, no sampling, no probabilistic inference.

Hardcoding Detectionβ€” if-counts, literal counts, long-string growth vs. baseline. Guard-clauses excluded.** Complexity Collapse**β€” per-function McCabe complexity drop >60% without a recognized legitimate optimization.** Forbidden Calls & Obfuscation**β€” diff-based detection of neweval

/exec

/subprocess

/ctypes

/SystemExit

calls, alias resolution,chr()

-obfuscation, builtins subscript.Import Driftβ€” new imports against blocklist (CRITICAL) and safelist (CLEAN). Unknown imports β†’ WARNING.** Extensional Enumeration**β€” a Python analogue of the RLVR-shortcut concept from Helff et al.: flat if/elif or match/case chains covering β‰₯70% of branches with no loops. Helff studied the concept in inductive-logic tasks (Prolog-style rule induction); the if/elif detector here is ast-guard's own operationalization, not a pattern Helff measured directly.Behavioral Risk Scoring(standalone only) β€” additive YARA/Semgrep-style score from AST patterns. CLEAN <30, WARNING 30–69, CRITICAL β‰₯70.** Literal Hijack**(pair mode, Python only) β€” generated function returns only literals regardless of inputs, while the original had non-trivial logic. Covers pure-constant-return, single-branch ternary/if gates (return X if param == literal else None

), and compare-return variants. Added in response to observed reward-hacking in RL training.New Constant Bypass(pair mode, Python only) β€” if-branch compares a parameter against a new specific constant (absent from the original) and returns input-independently. Targets targeted-case hardcoding:if n == 75000: return expected

.

CLEAN    β†’ score below threshold, no blocklist triggers
WARNING  β†’ suspicious patterns, manual review recommended
CRITICAL β†’ high-confidence structural hack, block execution

Modes: strict

blocks CRITICAL; standard

logs everything; audit

silent collection only.

Requirements: Python 3.11+. Zero external dependencies for Python analysis.

git clone https://github.com/Nick-is-building/ast-guard.git
cd ast-guard
python -m pytest tests/ -q
python
from ast_guard import scan, scan_standalone

result = scan(original_code, generated_code, mode="strict")

if result["verdict"] == "CRITICAL":
    print("Blocked: structural hack detected.")
    print(result["checks"])
elif result["verdict"] == "WARNING":
    print("Suspicious. Review recommended.")

result = scan_standalone(agent_code)
print(result["verdict"], result["checks"]["check_6_behavioral"]["score"])
python -m ast_guard.cli check original.py generated.py            # standard
python -m ast_guard.cli check original.py generated.py --mode strict
python -m ast_guard.cli check original.py generated.py --json     # for pipelines

python -m ast_guard.cli check original.sh generated.sh --language bash
python -m ast_guard.cli check original.js generated.js --language javascript
python -m ast_guard.cli check original.py generated.py --language auto   # default
python -m ast_guard.cli check original.py generated.py --no-multilang    # Python-only

Exit code 0 on CLEAN/WARNING, exit code 1 on CRITICAL β€” drop-in for CI gates.

Python is native (zero deps). Bash and JavaScript are available via tree-sitter as an optional extra.

pip install ast-guard[multilang]
Language Backend Checks active
Python Native ast
1, 2, 3, 4, 5, 6, 7, 8
Bash tree-sitter-bash 1, 2, 3, 4, 5, 6
JavaScript tree-sitter-javascript 1, 2, 3, 4, 5, 6
TypeScript tree-sitter-typescript 1, 2, 3, 4, 5, 6

All four languages run the same 6-check pipeline. Check 2 (Complexity Collapse) requires a pair-mode baseline and is inactive in standalone mode for all languages. Checks 7 and 8 are Python-only. Language is auto-detected from the generated file (shebang-first, then keyword scoring) or can be set explicitly with --language

.

Check 5 (Extensional Enumeration) for Bash: detects case/esac

statements with literal branch values and if/elif

with [[ $x == "y" ]]

-style comparisons.

Check 6 (Behavioral Risk Scoring) for Bash: eval_dynamic, pipe_to_shell, process_termination, subprocess_shell, network_fetch, test_file_write, environ_mutation, startup_persistence, destructive_call.

Check 5 for JavaScript / TypeScript: detects switch/case

with string/number literals and if/else-if

with ===

/==

comparisons. Also detects dispatch-table memorisation: return TABLE[param]

or TABLE.get(param)

where TABLE

is an all-literal object or Map

(β‰₯5 entries). Fires in pair mode when the table is new; suppressed when a pre-existing table of the same size was already in the original.

Check 6 for JavaScript / TypeScript: eval_dynamic (including Function()

constructor), process_termination, subprocess_shell, dangerous_import (child_process), test_file_write, environ_mutation, module_cache_manipulation (require.cache).

ast-guard includes a built-in Model Context Protocol server.

pip install ast-guard[mcp]
{
  "mcpServers": {
    "ast-guard": {
      "command": "ast-guard-mcp",
      "type": "stdio"
    }
  }
}

Tools: ast_guard_scan

(compare original vs. generated), ast_guard_feedback

(submit triage feedback).

- uses: ./.github/actions/ast-guard
  with:
    original: path/to/original.py
    generated: path/to/generated.py
    mode: strict
    upload-sarif: "true"

SARIF output is compatible with the GitHub Security Tab.

Thresholds, blocklists, and allowlists are configurable via TOML. Hierarchy: CLI args > .ast-guard.toml

~/.ast-guard/config.toml

defaults.

[thresholds]
if_count_rel_increase = 0.50
literal_count_rel_increase = 2.0
literal_count_abs_min = 10
long_string_len = 200
complexity_rel_decrease = 0.60
complexity_abs_min = 5
enumeration_ratio = 0.70
enumeration_min_ifs = 5

[imports]
blocklist = ["os", "sys", "subprocess", "pickle", "importlib"]
allowlist = ["functools", "itertools", "collections", "math"]

[settings]
mode = "standard"
telemetry = false
multilang = "auto"   # "auto" | true | false

Deterministic. Same input β†’ same verdict. No randomness, no network calls.Gradient-immune. The verdict is a step function over AST properties. A model optimizing against it cannot do so through backpropagation β€” it can only circumvent it structurally, and that circumvention is itself observable.Model-free. No LLM, no ML model, no embeddings. Only Python'sast

module (plus tree-sitter for Bash/JS/TS). Cannot share failure modes with the generator it monitors.Zero dependencies in core. Python standard library only.Real-time.<10ms per scan, no degradation to the execution loop.** Complementary by design.**Structural analysis is one layer; semantic review is another. ast-guard is explicitly scoped to what static AST analysis can settle β€” not as a limitation, but as a design decision that lets it make strong guarantees within that scope.Privacy-by-design. Telemetry stores anonymized AST metrics only β€” never code, filenames, paths, or timestamps. Always opt-in.Three modes(audit

β†’standard

β†’strict

) enable risk-free evaluation before commitment.

These are design decisions, not defects. ast-guard is explicitly scoped to structural analysis; the boundaries below are where structural analysis ends and semantic analysis begins.

Principled scope boundary β€” semantic hacks (Class B):

Code that passes visible tests through logical coincidence or shallow heuristics β€” without a detectable structural signature β€” is outside scope. In the RL experiment, once structural hacks were suppressed by the penalty, the model's residual hacking (hack_rate ~0.40 at step 150) was entirely Class B: return n & (n-1)

, return -1 if s[0] == '0' else 1

, return 2 if text else 1

. These pass one visible assert by accident or trivial heuristic, fail held-out tests, and produce no AST signal distinguishable from correct code. Catching them requires held-out test execution or semantic understanding β€” neither of which static analysis can provide. This is by design: ast-guard's purpose is to close the structural space and force escalation to a semantic reviewer, not to replace one.

The escalation cascade in practice:

The RL experiment showed that a trained model under structural penalty pressure will systematically exit the structural space. This is the intended behavior β€” ast-guard forces the model's optimization into territory that requires a different analysis tool (LLM judge, held-out tests), rather than competing with those tools on their own ground.

Standalone mode is less precise than pair mode:

Without a baseline, contextual disambiguation is harder. The MALT false-positive rate on normal

samples (5.0% at v2.2.0) is the current calibration point β€” intentionally conservative. Pair mode (original vs. generated) provides the strongest signal.

Bash and JavaScript have no independent recall corpus:

In MALT, 316/322 missed Bash bypass samples and 30/30 missed JS bypass samples show zero structural signal β€” they are semantically encoded, not structurally encoded. Checks 7 and 8 are Python-only. Pair mode provides additional signal via Check 1 and Check 2 for all languages.

Thresholds are empirically calibrated, not final:

Defaults are informed starting points. See METHODOLOGY.md and DEVELOPMENT_HISTORY.md for the iteration and hardening history.

benchmarks/RESULTS.mdβ€” precision, recall, F1, confusion matrices across all datasets.benchmarks/METHODOLOGY.mdβ€” the 13-iteration calibration history, including regressions.benchmarks/structural_benchmark/β€” 36 curated ground-truth pairs across 12 structural hack categories.

Reproduce:

python -m benchmarks.run_benchmark --benchmark structural
python -m benchmarks.run_benchmark --benchmark all
python -m benchmarks.run_benchmark --benchmark malt --mode strict

Companion experiment:

rl-rewardhacking(github.com/Nick-is-building/rl-rewardhacking) β€” the companion RL training repo that validates ast-guard empirically. Extendsariahw/rl-rewardhacking("Steering RL: Training Interventions to Mitigate Reward Hacking") with ast-guard as a live structural penalty in GRPO training. Adds: honeypot reward design, HoneypotAstGuardReward adapter (pair-mode scan against canonical solution), A/B hack classification, and the escalation-cascade experiment described above.

Datasets and taxonomies:

TRACE(Deshpande et al. 2026,arXiv:2601.20103) β€” 54-category reward-hacking taxonomy. ast-guard covers 15 structural categories at 95.7% F1; the remainder are semantic.MALT(METR 2025) β€” 10,919 manually reviewed agent transcripts, 81,515 extracted code blocks. The largest labeled dataset in the field.

Conceptual foundations:

Helff et al.(arXiv:2604.15149) β€” Frames extensional enumeration as a reward-hacking pattern in inductive logic-reasoning tasks (Prolog-style rule induction). Motivates theconceptbehind Check 5; the Python if/elif and match/case detector here is ast-guard's own analogue, not a pattern Helff measured directly.ZeroFalse(arXiv:2510.02534) β€” Calibrated confidence levels for static-analysis findings. Motivates ast-guard's confidence-score module (ast_guard/confidence.py

).

Complementary detectors (structural analysis is one layer):

RewardHackWatchβ€” Runtime detector combining ML + regex + AST. ast-guard is its deterministic structural complement.** EvilGenie**β€” Inference-time LLM reviewer. A scaffold is present in ast-guard (benchmarks/s/evilgenie.py

) but has not been validated against real data β€” EvilGenie is a live-harness benchmark with no static data release, so field names are guessed.

@software{ast_guard_2026,
  title  = {ast-guard: Pre-Execution Gate for AI-Generated Code},
  author = {Nick},
  year   = {2026},
  url    = {https://github.com/Nick-is-building/ast-guard},
  version = {2.3.0}
}

ast-guard is actively developed research software. See CHANGELOG.md for version history, DEVELOPMENT_HISTORY.md for the detector hardening timeline (how RL-observed evasions drove specific check improvements), and CONTRIBUTING.md for contribution guidelines.

── more in #ai-safety 4 stories Β· sorted by recency
── more on @ast-guard 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain β€” perfect for shipping the agent you just read about.

$git push zahid main
β†’ Live at https://your-agent.zahid.host βœ“
Get free account β†’ Pricing
from €0/mo Β· no card required
LIVE [news/show-hn-ast-guard-a-…] indexed:0 read:14min 2026-06-29 Β· β€”