I Revived a Broken MLOps Platform — Now It's Self-Service, Policy-Guarded, and Operationally Credible

The article describes the revival of NeuroScale, a broken MLOps platform that was abandoned in April 2026 after Backstage, ArgoCD, and KServe all failed. The author rebuilt it into a self-service AI inference platform using GitOps, policy enforcement, and deterministic recovery, passing all 21 smoke tests with zero failures. The platform now allows developers to deploy models via a simple Backstage form, with Git as the single source of truth and automatic drift correction.

Submitted for the GitHub Copilot Challenge (deadline June 7, 2026). Built with GitHub Copilot as an active architectural and debugging partner.

On April 4th, I abandoned this platform. Backstage crashed. ArgoCD was broken. KServe couldn't serve a single model. I walked away and left it for dead.

48 days later, I came back and rebuilt it into a self-service AI inference platform with GitOps, policy enforcement, and deterministic recovery.

21 checks. 0 failures. Reproducible on any machine.

Repo: https://github.com/sodiq-code/neuroscale-platform

Watch It Run: 21 Checks, 0 Failures

This is not a claim. The video below shows every check running live against a real k3d cluster.

▶ Watch the full smoke test demo (21 checks, 0 failures): https://storage.googleapis.com/runable-templates/cli-uploads%2FKOX2Ek1YgxEESzcJKMx3OuH8Kfvn0qwn%2FZzLPz kcSP ofvCz1mZ77%2Fsmoke-test-demo ppdIdW.mp4

    ━━━ Milestone A — GitOps Spine (ArgoCD) ━━━
    ✓ PASS All ArgoCD pods are Running
    ✓ PASS ArgoCD Applications: 7 Healthy, 0 Progressing, 7 total
    ✓ PASS ArgoCD sync visibility: no Unknown states (7/7 Synced)
    ✓ PASS Drift self-heal: nginx-test recreated and Ready in ~20s

    ━━━ Milestone B — AI Serving Baseline (KServe) ━━━
    ✓ PASS KServe controller-manager: 1 replica available
    ✓ PASS InferenceServices: 2/2 Ready=True
    ✓ PASS Inference request: demo-iris-2 returned predictions
      ↳ Response: {"predictions": [1,1]}

    ━━━ Milestone C — Golden Path (Backstage) ━━━
    ✓ PASS Backstage deployment: 1 replica available
    ✓ PASS demo-iris-2 InferenceService exists (scaffolder output)
    ✓ PASS demo-iris-2 ArgoCD Application exists (ApplicationSet output)

    ━━━ Milestone D — Guardrails (Kyverno + CI) ━━━
    ✓ PASS Kyverno pods running: 3
    ✓ PASS Kyverno ClusterPolicies installed: 5 policies
    ✓ PASS Admission block: non-compliant InferenceService correctly denied

    ━━━ Milestone F — Production Hardening ━━━
    ✓ PASS ApplicationSet neuroscale-model-endpoints exists
    ✓ PASS ArgoCD has 7 Applications (ApplicationSet + static)
    ✓ PASS ResourceQuota exists in namespace default
    ✓ PASS LimitRange exists in namespace default
    ✓ PASS Non-root admission block: root-container Deployment denied
    ✓ PASS OpenCost deployment healthy: 1 replica available

    PASS 21 / FAIL 0 / SKIP 1
    ✓ All checks passed. Platform is healthy and ready to demo.

The single SKIP is the drift self-heal pre-condition check, which is normal after a previous test run. The drift self-heal itself passed, visible in the output above and in the video.

Reproducible on any machine:

    bash scripts/bootstrap.sh    # ~5 minutes, requires Docker + k3d
    bash scripts/smoke-test.sh   # 21 checks, all green

The Problem: A Platform That Was Abandoned and Dangerous

NeuroScale started in February 2026 as an AI inference platform on Kubernetes. By early April it was abandoned: Backstage crashing, ArgoCD broken, KServe unable to serve a single model. The last commit before this challenge was April 4th. Then 48 days of silence.

Here's what I found when I came back:

- Backstage: CrashLoopBackOff, 14 restarts. A Helm values nesting bug caused probe timings to be silently ignored.
- ArgoCD repo-server: CrashLoopBackOff. Every application showed Unknown, meaning ArgoCD couldn't even evaluate their state.
- KServe: READY=False. The default config assumed Istio for ingress, but the cluster ran Kourier. Error: "virtual service not found".
- Policy enforcement: None. Root containers, no resource limits, :latest tags, all deployed freely.
- Drift detection: None. Manual kubectl changes accumulated silently.

The deployment process was vim → kubectl apply → hope. Developers feared deploying models. The platform was technically worse than not having one.

What I Built: Five Enforcement Layers

Layer 1: Self-Service Golden Path

A developer fills in a Backstage form. The platform does everything else.

    Backstage form → PR created → CI validates → Merge → ArgoCD syncs → ApplicationSet discovers → KServe endpoint live → Predictions working

No kubectl. No YAML editing.
No tribal knowledge. The template.yaml (https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml) generates a compliant InferenceService manifest, opens a PR, and the neuroscale-model-endpoints ApplicationSet auto-discovers it.

Five fields. No kubectl. No YAML. DNS pattern enforced client-side, cost center required. Click Next and Backstage does the rest.

Two steps, 9 seconds total. PR opened, ApplicationSet picks it up on next ArgoCD sync.

Layer 2: GitOps Drift Control

Git is the source of truth. Drift is auto-corrected.

    $ kubectl delete deploy nginx-test -n default
    # 20 seconds later...
    $ kubectl get deploy nginx-test -n default
    NAME         READY   UP-TO-DATE   AVAILABLE   AGE
    nginx-test   1/1     1            1           8s

The Deployment is auto-recreated by ArgoCD with selfHeal: true and prune: true. Manual cluster changes cannot persist.

Layer 3: Policy Guardrails (Shift-Left + Shift-Down)

At PR time (CI): kubeconform validates schemas. kyverno-cli simulates all 5 policies against rendered manifests with a dual exit-code + stdout check to guard against false-greens. Full pipeline at .github/workflows/guardrails-checks.yaml (https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml).

At admission time (cluster): Kyverno blocks non-compliant resources before they reach the cluster. Five enforced policies:

| Policy | What It Blocks |
|---|---|
| require-standard-labels-inferenceservice | Missing owner + cost-center labels |
| require-standard-labels-deployment | Missing ownership labels on Deployments |
| require-resource-requests-limits | No CPU/memory requests or limits |
| disallow-latest-image-tag | Floating :latest image tags |
| disallow-root-containers | Containers without runAsNonRoot: true |

Layer 4: Cost Attribution

Every workload carries owner and cost-center labels enforced by Kyverno, so you can't deploy without them. OpenCost reads these via Prometheus for per-team cost breakdowns.
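
As a rough illustration of the labeling rule, a local pre-flight can flag a manifest that is missing either label before it ever reaches CI. This is a minimal sketch, not the repo's policy: `check_cost_labels` is a hypothetical helper, it uses grep rather than a YAML parser, and it matches the keys `owner` and `cost-center` at any nesting level, so Kyverno remains the real enforcement point.

```shell
# Hypothetical pre-flight: report which of the two required label keys are
# absent from the manifest read on stdin. Crude by design: it greps for
# "key: value" lines and does not verify they sit under metadata.labels.
check_cost_labels() {
  manifest="$(cat)"
  rc=0
  for key in owner cost-center; do
    if ! printf '%s\n' "$manifest" |
        grep -qE "^[[:space:]]+${key}:[[:space:]]*[^[:space:]]"; then
      echo "missing label: ${key}"
      rc=1
    fi
  done
  return "$rc"
}

# Illustrative manifest fragment (names and values are made up).
sample_manifest='metadata:
  name: demo-model
  labels:
    owner: team-a'

printf '%s\n' "$sample_manifest" | check_cost_labels || echo "pre-flight failed"
```

A check like this only shortens the feedback loop; it says nothing about label values, which is where the per-team attribution actually comes from.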
The CI pipeline also comments on PRs (https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml) with CPU/memory deltas and flags workloads exceeding thresholds.

Layer 5: Operational Recovery

Documented runbooks for every failure mode encountered. 3-command, 2-minute recovery procedures. Full runbook at docs/runbook.md (https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md).

Copilot Partnership: Three Moments That Mattered

Copilot didn't write this platform. It functioned as a senior infrastructure advisor at three exact moments where I could have stayed stuck for days.

Moment 1: The Architectural Decision (Kourier vs Istio)

Problem: KServe stuck at READY=False. Error: "virtual service not found", an Istio concept on a cluster running Kourier.

Copilot searched the actual repo files, confirmed the non-Istio setup was already correct, and identified the root cause: stale cached config. The critical tradeoff it surfaced: Istio adds ~1GB memory overhead; Kourier is under 200MB. On a shared 8GB dev node, Istio would have killed reproducibility.

Fix: Reapply the serving-stack overlay, verify disableIstioVirtualHost=true in ConfigMaps, restart control plane pods.

Result: working inference, 800MB freed.

Moment 2: The Silent Bug (CI Guardrails That Can't False-Green)

Problem: kyverno-cli apply looked green in CI. Then I tested with a deliberately non-compliant manifest. It still passed. The guardrail was checking nothing.

Two undocumented kyverno-cli behaviors Copilot surfaced:

- A single --resource flag with multiple paths silently ignores every path after the first.
- Exit code is 0 even when violations are printed to stdout.
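
The second behavior can be reproduced without Kyverno at all. In the sketch below, `fake_scanner` is a hypothetical stand-in that prints a violation and still exits 0, and `guarded_run` is an illustrative wrapper (not part of kyverno-cli) that fails when either the exit code or the captured output signals a problem.

```shell
# Hypothetical stand-in for a scanner that reports a violation on stdout
# but still exits 0, mimicking the behavior described above.
fake_scanner() {
  echo "policy disallow-latest-image-tag -> resource demo: fail"
  return 0
}

# Run a command, capture its combined output, and return non-zero if
# EITHER the command failed OR the output mentions a violation.
guarded_run() {
  log="$(mktemp)"
  "$@" >"$log" 2>&1
  status=$?
  cat "$log"
  if grep -qiE 'denied|violat|fail|error' "$log"; then
    status=1
  fi
  rm -f "$log"
  return "$status"
}

# Checking the exit code alone would report a pass here.
if guarded_run fake_scanner; then
  echo "guardrail: pass"
else
  echo "guardrail: blocked"
fi
```

The keyword pattern is deliberately broad, so a wrapper like this can also trip on harmless output that happens to contain "error"; that trade-off favors false alarms over false-greens.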
The fix, live in .github/workflows/guardrails-checks.yaml (https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml):

    # pipefail lets the docker exit code survive the pipe into tee
    set -o pipefail

    # Per-resource fan-out: one file per --resource flag
    mapfile -t app_files < <(find apps -type f \( -name '*.yaml' -o -name '*.yml' \) | sort)

    failed=0
    for resource in "${app_files[@]}"; do
      log="$(mktemp)"
      if ! docker run --rm -v "$PWD:/work" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \
          apply infrastructure/kyverno/policies/*.yaml --resource "$resource" 2>&1 | tee "$log"; then
        failed=1
      fi
      # Dual check: exit code AND stdout; never trust one signal alone
      if grep -qiE 'denied|violat|fail|error' "$log"; then
        failed=1
      fi
    done
    exit "$failed"

A guardrail that silently passes is worse than no guardrail. Every team using kyverno-cli in CI without this pattern has a potential false-green.

Moment 3: The Recurring Crash (CrashLoopBackOff and the Runbook)

Problem: Backstage in CrashLoopBackOff, 7 restarts. The pod was failing health checks it never properly configured. I pasted the raw kubectl get pods -n backstage -w output directly into Copilot.

Root cause: Backstage is a dependency chart, so app settings must be nested under backstage.backstage.*. Keys at the wrong level meant Helm silently used defaults, including failureThreshold: 3 with aggressive timings and no startupProbe. The container kept failing before the plugin system finished initializing.

ArgoCD hit the same pattern: Kyverno installation disrupted the repo-server's gRPC channel, which doesn't auto-reconnect, causing all apps to show Unknown. Copilot identified both as the same root pattern and helped write a deterministic recovery:

    kubectl -n argocd rollout restart deploy/argocd-repo-server
    kubectl -n argocd rollout status deploy/argocd-repo-server --timeout=120s
    kubectl -n argocd get applications   # All Synced/Healthy

That became docs/runbook.md (https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md).
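
The last step of that recovery, eyeballing the application list, can also be checked mechanically. The sketch below is one way to do it, assuming the default `kubectl get applications` layout of NAME, SYNC STATUS and HEALTH STATUS columns; `unhealthy_apps` is a hypothetical helper and the sample rows are made up for illustration.

```shell
# Read `kubectl -n argocd get applications` style output on stdin and print
# the name of every application that is not both Synced and Healthy.
# Returns non-zero if at least one such application is found.
unhealthy_apps() {
  awk 'BEGIN { bad = 0 }
       NR > 1 && ($2 != "Synced" || $3 != "Healthy") { print $1; bad = 1 }
       END { exit bad }'
}

# Made-up sample in the assumed column layout (header row is skipped).
sample='NAME            SYNC STATUS   HEALTH STATUS
nginx-test      Synced        Healthy
demo-iris-2     Unknown       Healthy
backstage       Synced        Progressing'

if printf '%s\n' "$sample" | unhealthy_apps; then
  echo "all applications Synced/Healthy"
else
  echo "recovery incomplete"
fi

# Against a live cluster the same helper would be fed by:
#   kubectl -n argocd get applications | unhealthy_apps
```

Parsing columns by position is fragile if custom output flags are used, so a real runbook step might prefer `-o json` with a proper parser.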
The platform doesn't just work; it's recoverable. That's the difference between a demo and a real platform.

Before vs After

For Developers

| Before | After |
|---|---|
| Edit YAML by hand, kubectl apply, hope | Fill a Backstage form, review a PR, merge |
| No policy feedback until deployment fails | CI blocks non-compliant manifests before merge |
| No cost visibility | |

For Operators

| Before | After |
|---|---|
| Manual cluster inspection for drift | ArgoCD self-heals in ~20 seconds |
| No runbooks: "ask the person who built it" | |

For the Platform

| Before | After |
|---|---|
| Abandoned since April 4th | Finished, documented, reproducible |
| Collection of broken parts | 6 milestones, 21 verified checks, 0 failures |
| Manual and error-prone | Automated and policy-enforced end-to-end |

What Made This Real: The Failures

This was not built on the happy path. Every milestone hit real failures:

| Milestone | Key Failure | What It Taught Me |
|---|---|---|
| A: GitOps Spine | ArgoCD Unknown ≠ Error; the comparison engine couldn't run | Don't confuse UI status with root cause |
| B: KServe Serving | Istio/Kourier mismatch from an undocumented KServe default | Always verify infrastructure defaults on constrained clusters |
| C: Golden Path | Backstage CrashLoopBackOff from Helm mis-nesting; probes silently ignored | CI must validate rendered manifests, not just source YAML |
| D: Guardrails | Kyverno webhook disrupts all ArgoCD apps during install window | Admission controllers need deployment ordering |
| E: Cost & CI | kyverno-cli false-green: exit 0 with actual violations | Dual-check exit code AND stdout; never trust one signal |
| F: Hardening | ApplicationSet replaced per-app files, which requires skeleton alignment | Scaffolder templates must match GitOps discovery patterns |

Full failure log and recovery steps in docs/runbook.md (https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md).
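
The ordering lesson from Milestone D can be sketched as a small retry helper that blocks until a readiness probe succeeds before the next install step runs. `wait_for` is a hypothetical helper, not something taken from the repo's scripts, and the `kyverno` namespace in the usage comment is an assumption about where the admission controller runs.

```shell
# Retry a probe command until it succeeds or the attempt budget runs out.
# Usage: wait_for <attempts> <sleep_seconds> <command...>
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example ordering (assumes Kyverno runs in a namespace named "kyverno"):
#   wait_for 30 5 kubectl -n kyverno wait --for=condition=Ready pods --all --timeout=5s
#   kubectl -n argocd rollout restart deploy/argocd-repo-server

# Trivial local demonstration with a probe that always succeeds.
wait_for 3 0 true && echo "probe ready"
```

Gating the repo-server restart on the admission controller being Ready keeps the restart from landing inside the install window described in the table.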
Try It Yourself

    # Clone and bootstrap (requires Docker + k3d + kubectl + helm)
    git clone https://github.com/sodiq-code/neuroscale-platform.git
    cd neuroscale-platform
    bash scripts/bootstrap.sh          # ~5 minutes

    # Verify everything works
    bash scripts/smoke-test.sh         # 21 checks, 0 failures

    # Open all UIs
    bash scripts/port-forward-all.sh

After port-forward-all.sh:

- Backstage at http://localhost:7010 (developer portal)
- ArgoCD at http://localhost:8080 (7 synced applications)
- OpenCost at http://localhost:9090 (per-workload cost attribution)

5 minutes from git clone to a fully working platform. The smoke test proves it all.

The Bottom Line

Copilot helped at the exact points where strong engineering judgment mattered most: an architectural tradeoff that saved 800MB of memory, a silent CI bug that every kyverno-cli user faces, and operational recovery that turns a 2-hour outage into a 2-minute runbook.

21 checks. 0 failures. Reproducible on any machine.

What's one abandoned project you wish you had finished? Drop it in the comments.