{"slug": "i-revived-a-broken-mlops-platform-now-it-s-self-service-policy-guarded-and", "title": "I Revived a Broken MLOps Platform — Now It's Self-Service, Policy-Guarded, and Operationally Credible", "summary": "The article describes the revival of NeuroScale, a broken MLOps platform that was abandoned in April 2026 after Backstage, ArgoCD, and KServe all failed. The author rebuilt it into a self-service AI inference platform using GitOps, policy enforcement, and deterministic recovery, passing all 21 smoke tests with zero failures. The platform now allows developers to deploy models via a simple Backstage form, with Git as the single source of truth and automatic drift correction.", "body_md": "*Submitted for the GitHub Copilot Challenge — deadline June 7, 2026. Built with GitHub Copilot as an active architectural and debugging partner.*\n\nOn April 4th, I abandoned this platform. Backstage crashed. ArgoCD was broken. KServe couldn't serve a single model. I walked away and left it for dead.\n\n48 days later, I came back and rebuilt it into a self-service AI inference platform with GitOps, policy enforcement, and deterministic recovery.\n\n**21 checks. 0 failures. Reproducible on any machine.**\n\n**Repo:** [github.com/sodiq-code/neuroscale-platform](https://github.com/sodiq-code/neuroscale-platform)\n\n## Watch It Run: 21 Checks, 0 Failures\n\nThis is not a claim. 
The video below shows every check running live against a real k3d cluster.\n\n[▶ Watch the full smoke test demo — 21 checks, 0 failures](https://storage.googleapis.com/runable-templates/cli-uploads%2FKOX2Ek1YgxEESzcJKMx3OuH8Kfvn0qwn%2FZzLPz_kcSP_ofvCz1mZ77%2Fsmoke-test-demo_ppdIdW.mp4)\n\n```\n━━━ Milestone A — GitOps Spine (ArgoCD) ━━━\n  [✓ PASS] All ArgoCD pods are Running\n  [✓ PASS] ArgoCD Applications: 7 Healthy, 0 Progressing, 7 total\n  [✓ PASS] ArgoCD sync visibility: no Unknown states (7/7 Synced)\n  [✓ PASS] Drift self-heal: nginx-test recreated and Ready in ~20s\n\n━━━ Milestone B — AI Serving Baseline (KServe) ━━━\n  [✓ PASS] KServe controller-manager: 1 replica available\n  [✓ PASS] InferenceServices: 2/2 Ready=True\n  [✓ PASS] Inference request: demo-iris-2 returned predictions\n           ↳ Response: {\"predictions\":[1,1]}\n\n━━━ Milestone C — Golden Path (Backstage) ━━━\n  [✓ PASS] Backstage deployment: 1 replica available\n  [✓ PASS] demo-iris-2 InferenceService exists (scaffolder output)\n  [✓ PASS] demo-iris-2 ArgoCD Application exists (ApplicationSet output)\n\n━━━ Milestone D — Guardrails (Kyverno + CI) ━━━\n  [✓ PASS] Kyverno pods running: 3\n  [✓ PASS] Kyverno ClusterPolicies installed: 5 policies\n  [✓ PASS] Admission block: non-compliant InferenceService correctly denied\n\n━━━ Milestone F — Production Hardening ━━━\n  [✓ PASS] ApplicationSet neuroscale-model-endpoints exists\n  [✓ PASS] ArgoCD has 7 Applications (ApplicationSet + static)\n  [✓ PASS] ResourceQuota exists in namespace default\n  [✓ PASS] LimitRange exists in namespace default\n  [✓ PASS] Non-root admission block: root-container Deployment denied\n  [✓ PASS] OpenCost deployment healthy: 1 replica available\n\n  PASS 21 / FAIL 0 / SKIP 1\n  ✓ All checks passed. Platform is healthy and ready to demo.\n```\n\nThe single SKIP is the drift self-heal pre-condition check — normal after a previous test run. 
The drift self-heal itself passed, visible in the output above and in the video.\n\n**Reproducible on any machine:**\n\n```\nbash scripts/bootstrap.sh     # ~5 minutes — requires Docker + k3d\nbash scripts/smoke-test.sh    # 21 checks, all green\n```\n\n## The Problem: A Platform That Was Abandoned and Dangerous\n\nNeuroScale started in February 2026 as an AI inference platform on Kubernetes. By early April it was abandoned — Backstage crashing, ArgoCD broken, KServe unable to serve a single model. The last commit before this challenge was April 4th. Then 48 days of silence.\n\nHere's what I found when I came back:\n\n- **Backstage:** `CrashLoopBackOff` — 14 restarts. A Helm values nesting bug caused probe timings to be silently ignored.\n- **ArgoCD repo-server:** `CrashLoopBackOff` — every application showed `Unknown`, meaning ArgoCD couldn't even evaluate their state.\n- **KServe:** `READY=False` — default config assumed Istio for ingress, but the cluster ran Kourier. Error: `\"virtual service not found\"`.\n- **Policy enforcement:** None. Root containers, no resource limits, `:latest` tags — deployed freely.\n- **Drift detection:** None. Manual `kubectl` changes accumulated silently.\n\nThe deployment process was `vim` → `kubectl apply` → hope. Developers feared deploying models. The platform was technically worse than not having one.\n\n## What I Built: Five Enforcement Layers\n\n### Layer 1: Self-Service Golden Path\n\nA developer fills in a Backstage form. The platform does everything else.\n\n```\nBackstage form → PR created → CI validates → Merge → ArgoCD syncs\n  → ApplicationSet discovers → KServe endpoint live → Predictions working\n```\n\nNo `kubectl`. No YAML editing. No tribal knowledge. 
The [template.yaml](https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml) generates a compliant `InferenceService` manifest, opens a PR, and the `neuroscale-model-endpoints` ApplicationSet auto-discovers it.\n\n*Five fields. No kubectl. No YAML. DNS pattern enforced client-side, cost center required. Click Next and Backstage does the rest.*\n\n*Two steps, 9 seconds total. PR opened, ApplicationSet picks it up on next ArgoCD sync.*\n\n### Layer 2: GitOps Drift Control\n\nGit is the source of truth. Drift is auto-corrected.\n\n```bash\n$ kubectl delete deploy nginx-test -n default\n# 20 seconds later...\n$ kubectl get deploy nginx-test -n default\nNAME         READY   UP-TO-DATE   AVAILABLE   AGE\nnginx-test   1/1     1            1           8s   # Auto-recreated by ArgoCD\n```\n\n`selfHeal: true` and `prune: true`. Manual cluster changes cannot persist.\n\n### Layer 3: Policy Guardrails — Shift-Left + Shift-Down\n\n**At PR time (CI):** kubeconform validates schemas. `kyverno-cli` simulates all 5 policies against rendered manifests with a dual exit-code + stdout check to guard against false-greens. 
Full pipeline at [.github/workflows/guardrails-checks.yaml](https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml).\n\n**At admission time (cluster):** Kyverno blocks non-compliant resources before they reach the cluster.\n\nFive enforced policies:\n\n| Policy | What It Blocks |\n|---|---|\n| `require-standard-labels-inferenceservice` | Missing `owner` + `cost-center` labels |\n| `require-standard-labels-deployment` | Missing ownership labels on Deployments |\n| `require-resource-requests-limits` | No CPU/memory requests or limits |\n| `disallow-latest-image-tag` | Floating `:latest` image tags |\n| `disallow-root-containers` | Containers without `runAsNonRoot: true` |\n\n### Layer 4: Cost Attribution\n\nEvery workload carries `owner` and `cost-center` labels enforced by Kyverno — you can't deploy without them. OpenCost reads these via Prometheus for per-team cost breakdowns. The CI pipeline also [comments on PRs](https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml) with CPU/memory deltas and flags workloads exceeding thresholds.\n\n### Layer 5: Operational Recovery\n\nDocumented runbooks for every failure mode encountered. 3-command, 2-minute recovery procedures. Full runbook at [docs/runbook.md](https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md).\n\n## Copilot Partnership: Three Moments That Mattered\n\nCopilot didn't write this platform. It functioned as a senior infrastructure advisor at three exact moments where I could have stayed stuck for days.\n\n### Moment 1: The Architectural Decision — Kourier vs Istio\n\n**Problem:** KServe stuck at `READY=False`. Error: `\"virtual service not found\"` — an Istio concept on a cluster running Kourier.\n\nCopilot searched the actual repo files, confirmed the non-Istio setup was already correct, and identified the root cause: stale cached config. 
Critical tradeoff it surfaced — Istio adds ~1GB memory overhead; Kourier is under 200MB. On a shared 8GB dev node, Istio would have killed reproducibility.\n\n**Fix:** Reapply the serving-stack overlay, verify `disableIstioVirtualHost=true` in ConfigMaps, restart control plane pods. Result: working inference, 800MB freed.\n\n### Moment 2: The Silent Bug — CI Guardrails That Can't False-Green\n\n**Problem:** `kyverno-cli apply` looked green in CI. Then I tested with a deliberately non-compliant manifest. It still passed. The guardrail was checking nothing.\n\nTwo undocumented `kyverno-cli` behaviors Copilot surfaced:\n\n- A single `--resource` flag with multiple paths silently ignores every path after the first.\n- Exit code is `0` even when violations are printed to stdout.\n\nThe fix (live in [.github/workflows/guardrails-checks.yaml](https://github.com/sodiq-code/neuroscale-platform/blob/main/.github/workflows/guardrails-checks.yaml)):\n\n```\n# Per-resource fan-out — one file per --resource flag\n# pipefail keeps tee from masking a non-zero kyverno-cli exit status\nset -o pipefail\nmapfile -t app_files < <(find apps -type f \\( -name '*.yaml' -o -name '*.yml' \\) | sort)\nfailed=0\n\nfor resource in \"${app_files[@]}\"; do\n  log=\"$(mktemp)\"\n  if ! docker run --rm -v \"$PWD:/work\" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \\\n      apply infrastructure/kyverno/policies/*.yaml --resource \"$resource\" 2>&1 | tee \"$log\"; then\n    failed=1\n  fi\n  # Dual check: exit code AND stdout — never trust one signal alone\n  if grep -qiE 'denied|violat|fail|error' \"$log\"; then\n    failed=1\n  fi\ndone\n\nexit \"$failed\"\n```\n\n**A guardrail that silently passes is worse than no guardrail.** Every team using `kyverno-cli` in CI without this pattern has a potential false-green.\n\n### Moment 3: The Recurring Crash — CrashLoopBackOff and the Runbook\n\n**Problem:** Backstage in `CrashLoopBackOff` — 7 restarts. 
The pod kept failing health checks that were never properly configured.\n\nI pasted the raw `kubectl get pods -n backstage -w` output directly into Copilot.\n\nRoot cause: Backstage is a dependency chart — app settings must be nested under `backstage.backstage.*`. Keys at the wrong level meant Helm silently used defaults, including `failureThreshold: 3` with aggressive timings and no `startupProbe`. The container kept failing before the plugin system finished initializing.\n\nArgoCD hit the same pattern: Kyverno installation disrupted the repo-server's gRPC channel, which doesn't auto-reconnect — causing all apps to show `Unknown`. Copilot identified both as the same root pattern and helped write a deterministic recovery:\n\n```\nkubectl -n argocd rollout restart deploy/argocd-repo-server\nkubectl -n argocd rollout status deploy/argocd-repo-server --timeout=120s\nkubectl -n argocd get applications  # All Synced/Healthy\n```\n\nThat became [docs/runbook.md](https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md). The platform doesn't just work — it's recoverable. That's the difference between a demo and a real platform.\n\n## Before vs After\n\n### For Developers\n\n| Before | After |\n|---|---|\n| Edit YAML by hand, `kubectl apply`, hope | Fill a Backstage form, review a PR, merge |\n| No policy feedback until deployment fails | CI blocks non-compliant manifests before merge |\n| No cost visibility | |\n\n### For Operators\n\n| Before | After |\n|---|---|\n| Manual cluster inspection for drift | ArgoCD self-heals in ~20 seconds |\n| No runbooks — \"ask the person who built it\" | |\n\n### For the Platform\n\n| Before | After |\n|---|---|\n| Abandoned since April 4th | Finished, documented, reproducible |\n| Collection of broken parts | 6 milestones, 21 verified checks, 0 failures |\n| Manual and error-prone | Automated and policy-enforced end-to-end |\n\n## What Made This Real: The Failures\n\nThis was not built on the happy path. 
Every milestone hit real failures:\n\n| Milestone | Key Failure | What It Taught Me |\n|---|---|---|\n| A — GitOps Spine | ArgoCD `Unknown` ≠ `Error` — comparison engine couldn't run | Don't confuse UI status with root cause |\n| B — KServe Serving | Istio/Kourier mismatch — undocumented KServe default | Always verify infrastructure defaults on constrained clusters |\n| C — Golden Path | Backstage CrashLoopBackOff from Helm mis-nesting — probes silently ignored | CI must validate rendered manifests, not just source YAML |\n| D — Guardrails | Kyverno webhook disrupts all ArgoCD apps during install window | Admission controllers need deployment ordering |\n| E — Cost & CI | `kyverno-cli` false-green: exit 0 with actual violations | Dual-check exit code AND stdout — never trust one signal |\n| F — Hardening | ApplicationSet replaced per-app files — requires skeleton alignment | Scaffolder templates must match GitOps discovery patterns |\n\nFull failure log and recovery steps in [docs/runbook.md](https://github.com/sodiq-code/neuroscale-platform/blob/main/docs/runbook.md).\n\n## Try It Yourself\n\n```\n# Clone and bootstrap (requires Docker + k3d + kubectl + helm)\ngit clone https://github.com/sodiq-code/neuroscale-platform.git\ncd neuroscale-platform\nbash scripts/bootstrap.sh     # ~5 minutes\n\n# Verify everything works\nbash scripts/smoke-test.sh    # 21 checks, 0 failures\n\n# Open all UIs\nbash scripts/port-forward-all.sh\n```\n\nAfter `port-forward-all.sh`:\n\n- **Backstage** at `http://localhost:7010` — developer portal\n- **ArgoCD** at `http://localhost:8080` — 7 synced applications\n- **OpenCost** at `http://localhost:9090` — per-workload cost attribution\n\n5 minutes from `git clone` to a fully working platform. 
The smoke test proves it all.\n\n## The Bottom Line\n\nCopilot helped at the exact points where strong engineering judgment mattered most: an architectural tradeoff that saved 800MB of memory, a silent CI bug that every `kyverno-cli` user faces, and operational recovery that turns a 2-hour outage into a 2-minute runbook.\n\n**21 checks. 0 failures. Reproducible on any machine.**\n\n*What's one abandoned project you wish you had finished? Drop it in the comments.*", "url": "https://wpnews.pro/news/i-revived-a-broken-mlops-platform-now-it-s-self-service-policy-guarded-and", "canonical_source": "https://dev.to/sodiqjimoh/i-revived-a-broken-mlops-platform-now-its-self-service-policy-guarded-and-operationally-55nj", "published_at": "2026-05-23 00:42:39+00:00", "updated_at": "2026-05-23 01:01:08.580933+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "open-source", "developer-tools", "cloud-computing"], "entities": ["GitHub Copilot", "ArgoCD", "KServe", "Backstage", "Kyverno", "GitOps", "NeuroScale Platform", "Sodiq"], "alternates": {"html": "https://wpnews.pro/news/i-revived-a-broken-mlops-platform-now-it-s-self-service-policy-guarded-and", "markdown": "https://wpnews.pro/news/i-revived-a-broken-mlops-platform-now-it-s-self-service-policy-guarded-and.md", "text": "https://wpnews.pro/news/i-revived-a-broken-mlops-platform-now-it-s-self-service-policy-guarded-and.txt", "jsonld": "https://wpnews.pro/news/i-revived-a-broken-mlops-platform-now-it-s-self-service-policy-guarded-and.jsonld"}}