# I Revived a Broken MLOps Platform — Now It's Self-Service, Policy-Guarded, and Operationally Credible

> Source: <https://dev.to/sodiqjimoh/i-revived-a-broken-mlops-platform-now-its-self-service-policy-guarded-and-operationally-55nj>
> Published: 2026-05-23 00:42:39+00:00

*Submitted for the GitHub Copilot Challenge — deadline June 7, 2026. Built with GitHub Copilot as an active architectural and debugging partner.*

On April 4th, I abandoned this platform. Backstage crashed. ArgoCD was broken. KServe couldn't serve a single model. I walked away and left it for dead.

48 days later, I came back and rebuilt it into a self-service AI inference platform with GitOps, policy enforcement, and deterministic recovery.

**21 checks. 0 failures. Reproducible on any machine.**

**Repo:** [github.com/sodiq-code/neuroscale-platform](https://github.com/sodiq-code/neuroscale-platform)

## Watch It Run: 21 Checks, 0 Failures

This is not a claim. The video below shows every check running live against a real k3d cluster.

[▶ Watch the full smoke test demo — 21 checks, 0 failures](https://storage.googleapis.com/runable-templates/cli-uploads%2FKOX2Ek1YgxEESzcJKMx3OuH8Kfvn0qwn%2FZzLPz_kcSP_ofvCz1mZ77%2Fsmoke-test-demo_ppdIdW.mp4)

```
━━━ Milestone A — GitOps Spine (ArgoCD) ━━━
  [✓ PASS] All ArgoCD pods are Running
  [✓ PASS] ArgoCD Applications: 7 Healthy, 0 Progressing, 7 total
  [✓ PASS] ArgoCD sync visibility: no Unknown states (7/7 Synced)
  [✓ PASS] Drift self-heal: nginx-test recreated and Ready in ~20s

━━━ Milestone B — AI Serving Baseline (KServe) ━━━
  [✓ PASS] KServe controller-manager: 1 replica available
  [✓ PASS] InferenceServices: 2/2 Ready=True
  [✓ PASS] Inference request: demo-iris-2 returned predictions
           ↳ Response: {"predictions":[1,1]}

━━━ Milestone C — Golden Path (Backstage) ━━━
  [✓ PASS] Backstage deployment: 1 replica available
  [✓ PASS] demo-iris-2 InferenceService exists (scaffolder output)
  [✓ PASS] demo-iris-2 ArgoCD Application exists (ApplicationSet output)

━━━ Milestone D — Guardrails (Kyverno + CI) ━━━
  [✓ PASS] Kyverno pods running: 3
  [✓ PASS] Kyverno ClusterPolicies installed: 5 policies
  [✓ PASS] Admission block: non-compliant InferenceService correctly denied

━━━ Milestone F — Production Hardening ━━━
  [✓ PASS] ApplicationSet neuroscale-model-endpoints exists
  [✓ PASS] ArgoCD has 7 Applications (ApplicationSet + static)
  [✓ PASS] ResourceQuota exists in namespace default
  [✓ PASS] LimitRange exists in namespace default
  [✓ PASS] Non-root admission block: root-container Deployment denied
  [✓ PASS] OpenCost deployment healthy: 1 replica available

  PASS 21 / FAIL 0 / SKIP 1
  ✓ All checks passed. Platform is healthy and ready to demo.
```

The single SKIP is the drift self-heal pre-condition check — normal after a previous test run. The drift self-heal itself passed, visible in the output above and in the video.

**Reproducible on any machine:**

```
bash scripts/bootstrap.sh     # ~5 minutes — requires Docker + k3d
bash scripts/smoke-test.sh    # 21 checks, all green
```

## The Problem: A Platform That Was Abandoned and Dangerous

NeuroScale started in February 2026 as an AI inference platform on Kubernetes. By early April it was abandoned — Backstage crashing, ArgoCD broken, KServe unable to serve a single model. The last commit before this challenge was April 4th. Then 48 days of silence.

Here's what I found when I came back:

-
**Backstage:**`CrashLoopBackOff`

— 14 restarts. A Helm values nesting bug caused probe timings to be silently ignored. -
**ArgoCD repo-server:**`CrashLoopBackOff`

— every application showed`Unknown`

, meaning ArgoCD couldn't even evaluate their state. -
**KServe:**`READY=False`

— default config assumed Istio for ingress, but the cluster ran Kourier. Error:`"virtual service not found"`

. -
**Policy enforcement:** None. Root containers, no resource limits,`:latest`

tags — deployed freely. -
**Drift detection:** None. Manual`kubectl`

changes accumulated silently.

The deployment process was `vim`

→ `kubectl apply`

→ hope. Developers feared deploying models. The platform was technically worse than not having one.

## What I Built: Five Enforcement Layers

### Layer 1: Self-Service Golden Path

A developer fills in a Backstage form. The platform does everything else.

```
Backstage form → PR created → CI validates → Merge → ArgoCD syncs
  → ApplicationSet discovers → KServe endpoint live → Predictions working
```

No `kubectl`

. No YAML editing. No tribal knowledge. The [template.yaml](https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml) generates a compliant `InferenceService`

manifest, opens a PR, and the `neuroscale-model-endpoints`

ApplicationSet auto-discovers it.

*Five fields. No kubectl. No YAML. DNS pattern enforced client-side, cost center required. Click Next and Backstage does the rest.*

*Two steps, 9 seconds total. PR opened, ApplicationSet picks it up on next ArgoCD sync.*

### Layer 2: GitOps Drift Control

Git is the source of truth. Drift is auto-corrected.

``` bash
$ kubectl delete deploy nginx-test -n default
# 20 seconds later...
$ kubectl get deploy nginx-test -n default
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test   1/1     1            1           8s   # Auto-recreated by ArgoCD
```

`selfHeal: true`

and `prune: true`

. Manual cluster changes cannot persist.

### Layer 3: Policy Guardrails — Shift-Left + Shift-Down

**At PR time (CI):** kubeconform validates schemas. `kyverno-cli`

simulates all 5 policies against rendered manifests with a dual exit-code + stdout check to guard against false-greens. Full pipeline at [ .github/workflows/guardrails-checks.yaml](https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml).

**At admission time (cluster):** Kyverno blocks non-compliant resources before they reach the cluster.

Five enforced policies:

| Policy | What It Blocks |
|---|---|
`require-standard-labels-inferenceservice` |
Missing `owner` + `cost-center` labels |
`require-standard-labels-deployment` |
Missing ownership labels on Deployments |
`require-resource-requests-limits` |
No CPU/memory requests or limits |
`disallow-latest-image-tag` |
Floating `:latest` image tags |
`disallow-root-containers` |
Containers without `runAsNonRoot: true`
|

### Layer 4: Cost Attribution

Every workload carries `owner`

and `cost-center`

labels enforced by Kyverno — you can't deploy without them. OpenCost reads these via Prometheus for per-team cost breakdowns. The CI pipeline also [comments on PRs](https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml) with CPU/memory deltas and flags workloads exceeding thresholds.

### Layer 5: Operational Recovery

Documented runbooks for every failure mode encountered. 3-command, 2-minute recovery procedures. Full runbook at [ docs/runbook.md](https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md).

## Copilot Partnership: Three Moments That Mattered

Copilot didn't write this platform. It functioned as a senior infrastructure advisor at three exact moments where I could have stayed stuck for days.

### Moment 1: The Architectural Decision — Kourier vs Istio

**Problem:** KServe stuck at `READY=False`

. Error: `"virtual service not found"`

— an Istio concept on a cluster running Kourier.

Copilot searched the actual repo files, confirmed the non-Istio setup was already correct, and identified the root cause: stale cached config. Critical tradeoff it surfaced — Istio adds ~1GB memory overhead; Kourier is under 200MB. On a shared 8GB dev node, Istio would have killed reproducibility.

**Fix:** Reapply the serving-stack overlay, verify `disableIstioVirtualHost=true`

in ConfigMaps, restart control plane pods. Result: working inference, 800MB freed.

### Moment 2: The Silent Bug — CI Guardrails That Can't False-Green

**Problem:** `kyverno-cli apply`

looked green in CI. Then I tested with a deliberately non-compliant manifest. It still passed. The guardrail was checking nothing.

Two undocumented `kyverno-cli`

behaviors Copilot surfaced:

- A single
`--resource`

flag with multiple paths silently ignores every path after the first. - Exit code is
`0`

even when violations are printed to stdout.

The fix (live in [ .github/workflows/guardrails-checks.yaml](https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml)):

```
# Per-resource fan-out — one file per --resource flag
mapfile -t app_files < <(find apps -type f \( -name '*.yaml' -o -name '*.yml' \) | sort)
failed=0

for resource in "${app_files[@]}"; do
  log="$(mktemp)"
  if ! docker run --rm -v "$PWD:/work" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \
      apply infrastructure/kyverno/policies/*.yaml --resource "$resource" 2>&1 | tee "$log"; then
    failed=1
  fi
  # Dual check: exit code AND stdout — never trust one signal alone
  if grep -qiE 'denied|violat|fail|error' "$log"; then
    failed=1
  fi
done

exit "$failed"
```

**A guardrail that silently passes is worse than no guardrail.** Every team using `kyverno-cli`

in CI without this pattern has a potential false-green.

### Moment 3: The Recurring Crash — CrashLoopBackOff and the Runbook

**Problem:** Backstage in `CrashLoopBackOff`

— 7 restarts. Pod failing health checks it never properly configured.

I pasted the raw `kubectl get pods -n backstage -w`

output directly into Copilot:

Root cause: Backstage is a dependency chart — app settings must be nested under `backstage.backstage.*`

. Keys at the wrong level meant Helm silently used defaults, including `failureThreshold: 3`

with aggressive timings and no `startupProbe`

. The container kept failing before the plugin system finished initializing.

ArgoCD hit the same pattern: Kyverno installation disrupted the repo-server's gRPC channel, which doesn't auto-reconnect — causing all apps to show `Unknown`

. Copilot identified both as the same root pattern and helped write a deterministic recovery:

```
kubectl -n argocd rollout restart deploy/argocd-repo-server
kubectl -n argocd rollout status deploy/argocd-repo-server --timeout=120s
kubectl -n argocd get applications  # All Synced/Healthy
```

That became [ docs/runbook.md](https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md). The platform doesn't just work — it's recoverable. That's the difference between a demo and a real platform.

## Before vs After

### For Developers

| Before | After |
|---|---|
Edit YAML by hand, `kubectl apply` , hope |
Fill a Backstage form, review a PR, merge |
| No policy feedback until deployment fails | CI blocks non-compliant manifests before merge |
| No cost visibility |
|

### For Operators

| Before | After |
|---|---|
| Manual cluster inspection for drift | ArgoCD self-heals in ~20 seconds |
| No runbooks — "ask the person who built it" |
|

### For the Platform

| Before | After |
|---|---|
| Abandoned since April 4th | Finished, documented, reproducible |
| Collection of broken parts | 6 milestones, 21 verified checks, 0 failures |
| Manual and error-prone | Automated and policy-enforced end-to-end |

## What Made This Real: The Failures

This was not built on the happy path. Every milestone hit real failures:

| Milestone | Key Failure | What It Taught Me |
|---|---|---|
| A — GitOps Spine | ArgoCD `Unknown` ≠ `Error` — comparison engine couldn't run |
Don't confuse UI status with root cause |
| B — KServe Serving | Istio/Kourier mismatch — undocumented KServe default | Always verify infrastructure defaults on constrained clusters |
| C — Golden Path | Backstage CrashLoopBackOff from Helm mis-nesting — probes silently ignored | CI must validate rendered manifests, not just source YAML |
| D — Guardrails | Kyverno webhook disrupts all ArgoCD apps during install window | Admission controllers need deployment ordering |
| E — Cost & CI |
`kyverno-cli` false-green: exit 0 with actual violations |
Dual-check exit code AND stdout — never trust one signal |
| F — Hardening | ApplicationSet replaced per-app files — requires skeleton alignment | Scaffolder templates must match GitOps discovery patterns |

Full failure log and recovery steps in [ docs/runbook.md](https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md).

## Try It Yourself

```
# Clone and bootstrap (requires Docker + k3d + kubectl + helm)
git clone https://github.com/sodiq-code/neuroscale-platform.git
cd neuroscale-platform
bash scripts/bootstrap.sh     # ~5 minutes

# Verify everything works
bash scripts/smoke-test.sh    # 21 checks, 0 failures

# Open all UIs
bash scripts/port-forward-all.sh
```

After `port-forward-all.sh`

:

-
**Backstage** at`http://localhost:7010`

— developer portal -
**ArgoCD** at`http://localhost:8080`

— 7 synced applications -
**OpenCost** at`http://localhost:9090`

— per-workload cost attribution

5 minutes from `git clone`

to a fully working platform. The smoke test proves it all.

## The Bottom Line

Copilot helped at the exact points where strong engineering judgment mattered most: an architectural tradeoff that saved 800MB of memory, a silent CI bug that every `kyverno-cli`

user faces, and operational recovery that turns a 2-hour outage into a 2-minute runbook.

**21 checks. 0 failures. Reproducible on any machine.**

*What's one abandoned project you wish you had finished? Drop it in the comments.*
