# AI vs Rule-Based Secret Scanning in CI: Which Actually Works

> Source: <https://dev.to/oleksandr_kuryzhev_42873f/ai-vs-rule-based-secret-scanning-in-ci-which-actually-works-bp>
> Published: 2026-06-18 07:02:45+00:00

*Originally published on kuryzhev.cloud*

Your CI pipeline passed clean, the PR merged, and three hours later your AWS keys were rotating — because your regex scanner never saw the secret encoded inside a base64 config blob. AI secret scanning in CI pipelines promises to close that gap, but it comes with its own set of tradeoffs that most comparison posts quietly skip over. I've run both approaches in production environments ranging from a 5-person startup to a 200-developer platform org, and the answer is not as simple as "just use the AI tool."

The moment that forces this decision is always the same: a developer pushes code containing a hardcoded credential, and your post-incident review surfaces the question — did our scanner miss this, or did we not have one at all? Either answer is bad, but they lead to very different remediation paths.

There are two distinct camps in secret detection tooling right now. The first is traditional pattern and regex-based scanning: tools like [Gitleaks](https://github.com/gitleaks/gitleaks) and detect-secrets operate deterministically. They match known patterns against your code, flag high-entropy strings, and give you a binary pass/fail. No external calls, no inference latency, no monthly invoice from an API vendor.
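To make the rule-based approach concrete, here is a minimal sketch of its two core checks: a known-pattern match and a Shannon-entropy test. The pattern, the token regex, and the 4.0 threshold are illustrative assumptions, not Gitleaks' or detect-secrets' actual rules.

```python
import math
import re
from collections import Counter

# Illustrative pattern (assumption): AWS access key IDs look like "AKIA" + 16 uppercase alphanumerics.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
# Candidate tokens for the entropy check: long runs of key-like characters.
TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(s: str) -> float:
    """Bits of entropy per character, computed from character frequencies."""
    if not s:
        return 0.0
    total = len(s)
    return -sum((n / total) * math.log2(n / total) for n in Counter(s).values())


def scan_line(line: str, entropy_threshold: float = 4.0) -> list[str]:
    """Return the names of the checks that fired for this line."""
    findings = []
    if AWS_KEY_ID.search(line):
        findings.append("aws-access-key-id")
    if any(shannon_entropy(tok) >= entropy_threshold for tok in TOKEN.findall(line)):
        findings.append("high-entropy-string")
    return findings
```

Both checks are deterministic and run offline, which is the appeal. Neither knows anything about what the string is used for, which is where the false positives come from.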

The second camp is AI-assisted scanning: tools like Nightfall, GitGuardian with its ML classification models, and Semgrep AI use trained models to understand context. They can distinguish a real Stripe live key from a placeholder string, or catch a secret that's been obfuscated in a way that defeats a simple regex.
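The base64 case from the opening is worth seeing in code. The sketch below shows a plain regex missing an encoded key, and a decode-then-rescan pass catching it. This is a simplified illustration of one obfuscation case, not how any of the AI tools work internally; the pattern and helper names are assumptions.

```python
import base64
import binascii
import re

# Illustrative pattern (assumption): AWS access key ID shape.
SECRET = re.compile(r"AKIA[0-9A-Z]{16}")
# Runs that look like base64 and are long enough to hide a credential.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def find_secrets(text: str) -> list[str]:
    """Scan the raw text, then scan whatever decodes cleanly from base64-looking runs."""
    hits = SECRET.findall(text)
    for blob in B64_CANDIDATE.findall(text):
        try:
            decoded = base64.b64decode(blob, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 or not text: skip it
        hits.extend(SECRET.findall(decoded))
    return hits


# AKIAIOSFODNN7EXAMPLE is AWS's documented example key ID, not a live credential.
blob = base64.b64encode(b"aws_access_key_id=AKIAIOSFODNN7EXAMPLE").decode()
config_line = f'settings: "{blob}"'
```

Here `SECRET.search(config_line)` finds nothing, while `find_secrets(config_line)` recovers the key. One extra decoding layer (gzip, hex, a second base64 pass) defeats this sketch again, which is why hand-rolled decoders don't scale as a strategy.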

The stakes here are not abstract. The IBM Cost of a Data Breach 2023 report puts the average incident cost at $4.45M. Leaked secrets in CI logs and version history are consistently ranked as a top initial access vector by threat intelligence teams. Getting this wrong is expensive. Getting it right requires understanding what each tool category actually does — and doesn't — protect you against.

I've used Gitleaks on every project I've joined for the past three years. Version 8.18.4 is the current stable release at the time of writing. It ships as a single binary, and you can install it with `brew install gitleaks` or use the container image `ghcr.io/gitleaks/gitleaks:v8.18.4`. The default config detects 150+ secret types out of the box. That coverage is genuinely impressive for a free, offline tool.

**Pros:**

**Cons:**

- `.gitleaks.toml` files grow to 200+ lines over time and nobody owns them. I've inherited repos where the allowlist had grown so permissive it was suppressing entire file types.

⚠️ **Watch out for this:** A common mistake is adding the `--no-git` flag to Gitleaks in CI. This disables commit history scanning and only checks the working tree. You will miss secrets committed three sprints ago. Always use `fetch-depth: 0` in your checkout step and never pass `--no-git` unless you have a specific reason.

⚠️ **Another one:** The `allowlist` in `.gitleaks.toml` has a `regexTarget` field that accepts `"line"` or `"match"`. If you use `"line"`, it suppresses the entire line, including any other secrets that happen to be on the same line. Use `"match"` unless you explicitly want line-level suppression.
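The difference between the two targets is easy to model. This sketch is a deliberately simplified stand-in for the allowlist check, not Gitleaks' implementation; `is_suppressed` and the sample values are hypothetical.

```python
import re


def is_suppressed(line: str, match: str, allow_regex: str, regex_target: str) -> bool:
    """Simplified model: test the allowlist regex against the whole line or only the matched text."""
    target = line if regex_target == "line" else match
    return re.search(allow_regex, target) is not None


# One line carrying a harmless Stripe test key AND a real-looking internal token.
test_key = "sk_test_" + "0" * 24
real_token = "SVC-prod-" + "0123456789abcdef" * 2
line = f'STRIPE="{test_key}" SVC="{real_token}"'

allow = r"sk_test_[0-9a-zA-Z]{24}"

# With "line", the allowlisted test key on the same line hides the real token too.
hidden_by_line = is_suppressed(line, real_token, allow, "line")
# With "match", only the finding whose own text matches the allowlist is suppressed.
hidden_by_match = is_suppressed(line, real_token, allow, "match")
```

In this model `hidden_by_line` is true and `hidden_by_match` is false: the same allowlist entry either swallows the real finding or leaves it visible, depending on one config value.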

Here's the `.gitleaks.toml` configuration I use as a starting point. It extends the defaults, adds a custom rule for internal service tokens, and sets up an allowlist that suppresses known false positives without being dangerously broad:

```
# .gitleaks.toml
# Custom Gitleaks config — extends defaults, adds allowlist for known test fixtures
# Place at repo root; Gitleaks auto-discovers this file

title = "kuryzhev.cloud secret scan config"

[extend]
  # Inherit all 150+ default rules from upstream
  useDefault = true

# ── Custom rule: internal service tokens follow pattern SVC-[env]-[hex32] ────
[[rules]]
  id = "internal-service-token"
  description = "Internal service account token"
  regex = '''SVC-(prod|staging|dev)-[0-9a-f]{32}'''
  tags = ["internal", "service-account"]
  severity = "CRITICAL"

# ── Allowlist: suppress known false positives ─────────────────────────────────
[allowlist]
  description = "Global allowlist for test fixtures and docs"
  # regexTarget = "match" — only suppresses the matched string, not the full line
  regexTarget = "match"
  regexes = [
    '''EXAMPLE_API_KEY_REPLACE_ME''',       # placeholder in README templates
    '''sk_test_[0-9a-zA-Z]{24}''',          # Stripe test keys — not valid in prod
    '''ghp_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX''',  # GitHub docs example token
  ]
  paths = [
    '''tests/fixtures/.*''',                # test fixture directory
    '''docs/.*\.md''',                      # documentation files
    '''\.gitleaks\.toml''',                 # this file itself
    # Lock files: full of high-entropy hashes that are not secrets.
    # These belong in the global allowlist; a catch-all rule with regex '.+'
    # would instead flag every non-empty line outside the lock files.
    '''package-lock\.json''',
    '''yarn\.lock''',
    '''Gemfile\.lock''',
    '''poetry\.lock''',
  ]
  # Commits known to contain intentional secret-like strings (e.g. security test commits)
  commits = [
    "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2",
  ]
```
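Before committing a custom rule like `internal-service-token`, it helps to check the regex against strings that should and should not match. The sketch below uses Python's `re` as a stand-in for Gitleaks' Go regex engine, which is a safe assumption for a pattern this simple; the sample tokens are made up.

```python
import re

# Same pattern as the custom rule in .gitleaks.toml.
RULE = re.compile(r"SVC-(prod|staging|dev)-[0-9a-f]{32}")


def rule_matches(text: str) -> bool:
    """True if the custom rule would fire somewhere in this text."""
    return RULE.search(text) is not None


should_match = [
    "SVC-prod-" + "a" * 32,
    "token = 'SVC-staging-" + "0123456789abcdef" * 2 + "'",
]
should_not_match = [
    "SVC-qa-" + "a" * 32,       # unknown environment
    "SVC-dev-" + "A" * 32,      # uppercase hex is not covered by [0-9a-f]
    "SVC-prod-" + "a" * 31,     # one character short
]
```

The uppercase case is the one that tends to surprise people: if your token generator ever emits uppercase hex, this rule silently stops matching.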

AI-assisted secret scanning has matured significantly in the last two years. GitGuardian's ML model, Nightfall's detection API, and Semgrep's AI Assistant are all production-grade tools used by teams I respect. But they come with a very different set of tradeoffs that are easy to underestimate when you're reading a vendor's marketing page.

**Pros:**

- AI-assisted tools can distinguish a real Stripe live key (`sk_live_...`) from a test key (`sk_test_XXXX`) or a placeholder string. GitGuardian's model reports under 1% false-positive rate on public benchmark datasets, compared to 15–30% for pure regex approaches. That difference in signal quality is the entire argument for paying for an AI tool.
- `ggshield secret scan repo .` on a repo with 50,000 commits takes 4–8 minutes and will surface secrets buried in branches that have never been reviewed. Run this once on onboarding for every existing repository.

**Cons:**

⚠️ **Watch out for this one specifically:** The `ggshield secret scan ci` command returns exit code `1` on any finding and exit code `128` on authentication failure (confirm the exact codes against the documentation for the version you pin). Pipelines that treat every non-zero exit the same way, for example by appending `|| true` to keep the scan non-blocking, will silently pass when your service account token expires. I stopped trusting ggshield integrations I didn't write myself after seeing this exact failure mode in a client's pipeline: the scanner had been silently passing for six weeks because the API token had rotated.
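One way to avoid the silent pass is to wrap the scanner and treat "findings" and "scanner broke" as different outcomes. This is a minimal sketch, not an official ggshield integration. It relies only on the convention stated above (0 is clean, 1 is findings) and treats every other exit code as a scanner failure; the function names are hypothetical.

```python
import subprocess
from enum import Enum


class ScanOutcome(Enum):
    CLEAN = "clean"
    FINDINGS = "findings"
    SCANNER_ERROR = "scanner_error"


def classify_exit(code: int) -> ScanOutcome:
    """Map a scanner exit code to an outcome; anything other than 0 or 1 is a failure."""
    if code == 0:
        return ScanOutcome.CLEAN
    if code == 1:
        return ScanOutcome.FINDINGS
    return ScanOutcome.SCANNER_ERROR


def run_scan(cmd: list[str]) -> ScanOutcome:
    """Run the scanner command; a missing binary counts as a scanner failure too."""
    try:
        return classify_exit(subprocess.run(cmd, check=False).returncode)
    except OSError:
        return ScanOutcome.SCANNER_ERROR


def pipeline_exit_code(outcome: ScanOutcome, block_on_findings: bool) -> int:
    """Non-blocking scans may ignore findings, but must never ignore a broken scanner."""
    if outcome is ScanOutcome.SCANNER_ERROR:
        return 1
    if outcome is ScanOutcome.FINDINGS and block_on_findings:
        return 1
    return 0
```

A CI step would call something like `run_scan(["ggshield", "secret", "scan", "ci"])` and exit with `pipeline_exit_code(outcome, block_on_findings=False)`. An expired token then fails the job loudly even though findings alone do not.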

⚠️ **Nightfall-specific gotcha:** The `detectionRuleUUID` must be pre-created in the Nightfall dashboard. If you hardcode a deleted rule UUID in your pipeline config, the scan completes with no error and no findings. Silent pass. Always validate rule existence in your pipeline bootstrap step.
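A vendor-neutral way to catch this class of silent pass is a canary check in the bootstrap step: feed the scanner an input it must flag, and fail the pipeline if it reports nothing. The sketch below does not call the Nightfall API. The `scan` callable is a hypothetical adapter you would write around your scanner's client, and the canary string must be one your configured rule is known to detect.

```python
from typing import Callable, Sequence


def scanner_is_alive(scan: Callable[[str], Sequence[object]], canary: str) -> bool:
    """True if the scanner reports at least one finding for a known-bad input."""
    return len(scan(canary)) > 0


def bootstrap_check(scan: Callable[[str], Sequence[object]], canary: str) -> None:
    """Raise before the real scan runs if the scanner cannot see the canary."""
    if not scanner_is_alive(scan, canary):
        raise RuntimeError(
            "Scanner returned no findings for the canary input; "
            "check the detection rule ID and the API credentials."
        )


# Stand-ins for a real adapter, for illustration only.
def broken_scanner(text: str) -> list[str]:
    return []  # e.g. the configured rule no longer exists


def working_scanner(text: str) -> list[str]:
    return ["canary-finding"] if "CANARY" in text else []
```

The canary costs one extra API call per run and turns "no findings" from an ambiguous result into a meaningful one.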

Rather than a vague recommendation, here's the framework I use when advising teams on this choice. Score each factor for your actual situation, not your aspirational situation.

| Factor | Rule-Based (Gitleaks) | AI-Assisted (GitGuardian/Nightfall) | Hybrid |
|---|---|---|---|
| Data residency required | ✅ Fully offline | ❌ Hard blocker | ⚠️ Use local AI rules only |
| False-positive tolerance (low) | ❌ 15–30% FP rate | ✅ <1% FP rate | ✅ Rule-based gates, AI audits |
| Budget (<$200/month) | ✅ Free | ❌ Likely over budget at >8 devs | ⚠️ GitGuardian free tier + Gitleaks |
| Monorepo with 50k+ commits | ⚠️ Slow on full history | ✅ Historical scan designed for this | ✅ Gitleaks on delta, GG on schedule |
| Context-aware detection needed | ❌ Pattern-only | ✅ Core capability | ✅ AI handles deep audit |
| CI latency SLA (<5 min total) | ✅ <30s scan time | ⚠️ Adds 15–45s per job | ✅ AI runs async, off critical path |

The hard-blocker cells matter most. If data residency is required, AI SaaS is eliminated regardless of accuracy scores. If your false-positive tolerance is near zero because your security team has no capacity to triage noise, rule-based tools on their own will create more problems than they solve. Use this matrix honestly.
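The elimination logic can be written down as a small filter. This sketch encodes only the two hard blockers named above and ignores budget, latency, and repo size; the `TeamProfile` fields and option labels are my own hypothetical naming, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    data_residency_required: bool
    can_triage_false_positives: bool


def viable_options(profile: TeamProfile) -> list[str]:
    """Start from all three approaches and strike the ones a hard blocker rules out."""
    options = {"rule-based", "ai-saas", "hybrid"}
    if profile.data_residency_required:
        options.discard("ai-saas")      # code cannot leave your infrastructure
    if not profile.can_triage_false_positives:
        options.discard("rule-based")   # nobody available to absorb the noise
    return sorted(options)
```

For a team with data residency requirements and no triage capacity, `viable_options` returns only `["hybrid"]`, and per the matrix that hybrid has to use local AI rules rather than a SaaS scanner.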

For teams in regulated industries who need AI-level accuracy without data egress, the practical answer is Semgrep run locally (`semgrep scan --config auto` without a token; note that `--config auto` still downloads rules from the Semgrep registry, so point `--config` at a local rules directory if the runner has no outbound access) or a self-hosted Nightfall enterprise deployment. Neither is free, but both keep your code on your infrastructure. More on secure CI/CD patterns at [kuryzhev.cloud](https://kuryzhev.cloud/).

I prefer the hybrid approach, and I'm not hedging — this is the architecture I deploy on every new project I own.

**Gitleaks in pre-commit and as a blocking PR gate. GitGuardian as a scheduled async scan. Never both blocking the merge at the same time.**

Here's the reasoning. Rule-based tools should own the fast path — the developer feedback loop. A Gitleaks scan that completes in under 30 seconds and blocks a PR merge on a clear pattern match is genuinely useful. It catches the obvious mistakes before they land. It's free, it's fast, and it's offline.

AI tools should own the deep audit path. Run GitGuardian as a post-push async scan on main and develop branches, non-blocking, with P1 alert routing to your incident response channel. This catches the context-aware leaks — the encoded secrets, the test keys that are actually real, the internal service tokens — without adding latency to the developer loop or creating a blocking dependency on an external API.

The GitHub Actions workflow below implements exactly this pattern. Gitleaks runs on every PR and blocks on findings. GitGuardian runs async on pushes to main and develop, exits zero even on findings (alerting is handled by the GitGuardian dashboard), and uploads results as a pipeline artifact for audit purposes:

```
# .github/workflows/secret-scan.yml
# Hybrid secret scanning: Gitleaks (fast, blocking) + ggshield (AI-assisted, async)
# Requires: GITGUARDIAN_API_KEY stored as GitHub Actions secret (masked + protected)

name: Secret Detection

on:
  push:
    branches: ["**"]
  pull_request:
    branches: [main, develop]

jobs:
  # ── Job 1: Fast rule-based scan — blocks merge if triggered on PR ──────────
  gitleaks-scan:
    name: Gitleaks Rule-Based Scan
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: read
      security-events: write  # required by the upload-sarif step
    steps:
      - name: Checkout full history
        uses: actions/checkout@v4
        with:
          fetch-depth: 0  # REQUIRED: scan full commit history, not just HEAD

      - name: Run Gitleaks
        # gitleaks-action v2 is configured through environment variables, not `with:` inputs
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITLEAKS_LICENSE: ${{ secrets.GITLEAKS_LICENSE }}  # required for organization accounts
          GITLEAKS_CONFIG: .gitleaks.toml

      - name: Upload SARIF to GitHub Security tab
        if: always()  # upload even on failure so findings appear in Security tab
        uses: github/codeql-action/upload-sarif@v3
        with:
          # Assumption: the action writes its report to results.sarif; verify for the version you pin
          sarif_file: results.sarif

  # ── Job 2: AI-assisted scan — non-blocking, async, alerts via ggshield ─────
  gitguardian-scan:
    name: GitGuardian AI-Assisted Scan
    runs-on: ubuntu-latest
    # Run on push to main/develop only; do NOT block PR merge gate
    if: github.event_name == 'push' && (github.ref == 'refs/heads/main' || github.ref == 'refs/heads/develop')
    steps:
      - name: Checkout full history
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install ggshield
        run: pip install ggshield==1.29.0  # pin version for reproducibility

      # No separate login step: ggshield reads GITGUARDIAN_API_KEY from the environment,
      # and `ggshield auth login --method=token` would prompt for input in CI.
      - name: Run GitGuardian secret scan
        run: |
          ggshield secret scan ci \
            --output gg-results.json \
            --json \
            --exit-zero  # non-blocking: exit 0 even on findings; alerting handled by GitGuardian dashboard
        env:
          GITGUARDIAN_API_KEY: ${{ secrets.GITGUARDIAN_API_KEY }}

      - name: Upload scan results as artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gitguardian-results
          path: gg-results.json
          retention-days: 30
```

One concrete caveat: if you're in a regulated industry with strict data egress controls, replace the GitGuardian job with Semgrep using local AI rules or a self-hosted enterprise scanner. The architecture stays the same — fast rule-based gate on PR, deep AI audit async — but the AI component runs on your own infrastructure. Check the [GitHub secret scanning documentation](https://docs.github.com/en/code-security/secret-scanning/about-secret-scanning) if you're on GitHub Advanced Security and want to layer native scanning on top of this setup.

One final thing I want to flag: wherever you store your `GITGUARDIAN_API_KEY` or `NIGHTFALL_API_KEY`, make sure it's masked, protected, and injected via a secrets manager or CI secret store. Storing a secret scanner's own API key as a plain CI environment variable is the specific kind of irony that shows up in post-incident reviews. I've seen it happen. Don't let it happen to you.

The bottom line on AI secret scanning in CI pipelines: use rule-based tools to protect the developer loop, use AI tools to protect the audit trail. Neither one alone is sufficient. Both together, in the right places, is the architecture that actually holds up under real-world conditions.
