From a Simple Web App to a Production-Style Platform: My DevOps Learning Journey


When I started building SystemCraft, my goal wasn't to learn Kubernetes, GitOps, monitoring, or cloud-native architecture.

I just wanted to build a system design interview platform.

Fast forward a few months, and that simple web application evolved into something much bigger:

CI/CD Pipelines

Dockerized Deployments

Kubernetes

Helm Charts

ArgoCD GitOps

Prometheus Monitoring

Grafana Dashboards

AlertManager

Auto Scaling

Security Scanning

This article is the story of how that happened and what I learned along the way.

The Original Idea

SystemCraft was designed to solve a problem I noticed while preparing for system design interviews.

Most preparation resources are passive:

Reading blogs

Watching videos

Looking at architecture diagrams

But real system design interviews are interactive.

You need to make decisions, justify trade-offs, adapt to changing requirements, and explain your reasoning.

I wanted to create a platform where engineers could:

Design architectures visually

Receive AI-powered feedback

Simulate real interview scenarios

Learn through iteration

The first version was straightforward:

Next.js ↓ MongoDB ↓ Gemini API

The Docker Phase

My first step was containerization.

I created a Dockerfile and containerized the entire application.

At first, I thought Docker was the hard part.

I quickly learned it wasn't.

Building containers is easy.

Operating containers reliably is the real challenge.

Questions started appearing:

How do I deploy updates?

How do I manage multiple replicas?

How do I scale?

How do I monitor failures?

Docker solved packaging.

It didn't solve operations.

Building a Real CI Pipeline

The next step was automation.

I didn't want deployments to depend on manual commands.

I created a GitHub Actions pipeline that would automatically:

Lint & Typecheck ↓ Playwright E2E Tests ↓ Docker Build ↓ Trivy Security Scan ↓ Kubernetes Validation ↓ Deployment

One lesson became obvious:

Automation isn't about speed.

It's about consistency.

The pipeline catches mistakes long before they reach production.
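The consistency comes from ordering and fail-fast behavior: every commit walks the same stages, and a failure stops everything after it. Here is a minimal sketch of that idea in TypeScript. It is not the actual GitHub Actions workflow; the `Stage` type and the placeholder `run` functions are assumptions made for illustration, and only the stage names come from the pipeline above.

```typescript
// Hypothetical model of a fail-fast pipeline. Stage names mirror the article;
// the run functions are placeholders standing in for the real tools.
type Stage = { name: string; run: () => boolean };

type PipelineResult = { passed: string[]; failedAt: string | null };

function runPipeline(stages: Stage[]): PipelineResult {
  const passed: string[] = [];
  for (const stage of stages) {
    // Stop at the first failing stage so later stages (including
    // deployment) never run against a broken build.
    if (!stage.run()) {
      return { passed, failedAt: stage.name };
    }
    passed.push(stage.name);
  }
  return { passed, failedAt: null };
}

const result = runPipeline([
  { name: "Lint & Typecheck", run: () => true },
  { name: "Playwright E2E Tests", run: () => true },
  { name: "Docker Build", run: () => true },
  { name: "Trivy Security Scan", run: () => false }, // simulated failure
  { name: "Kubernetes Validation", run: () => true },
  { name: "Deployment", run: () => true },
]);

console.log(result);
```

In this simulated run the scan fails, so the validation and deployment stages are never reached.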

Security Wasn't Optional

One of the most valuable additions was Trivy.

Initially I wasn't thinking much about container security.

Then I started scanning images and realized how many vulnerabilities can exist inside dependencies you didn't even know you had.

Every build now goes through:

Docker Build ↓ Trivy Scan ↓ Deployment

This simple addition completely changed how I think about shipping software.
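To make the gate concrete, here is a small TypeScript sketch that reads a Trivy JSON report and collects the findings that should block a deployment. The field names (`Results`, `Vulnerabilities`, `Severity`, and so on) reflect my understanding of Trivy's JSON output and cover only a subset of it; the sample report, the image name, and the vulnerability IDs are invented for illustration, and the HIGH/CRITICAL threshold is an assumption rather than the project's actual policy.

```typescript
// Subset of a Trivy JSON report (assumed shape, not the full schema).
type TrivyVulnerability = {
  VulnerabilityID: string;
  PkgName: string;
  Severity: string;
};
type TrivyResult = { Target: string; Vulnerabilities?: TrivyVulnerability[] };
type TrivyReport = { Results?: TrivyResult[] };

// Return every finding whose severity is in the blocking list.
function findBlockingVulnerabilities(
  report: TrivyReport,
  blocking: string[] = ["HIGH", "CRITICAL"],
): TrivyVulnerability[] {
  const found: TrivyVulnerability[] = [];
  for (const result of report.Results ?? []) {
    // Targets with no findings may omit the Vulnerabilities field entirely.
    for (const vuln of result.Vulnerabilities ?? []) {
      if (blocking.includes(vuln.Severity)) {
        found.push(vuln);
      }
    }
  }
  return found;
}

// Invented sample data: IDs and package names are placeholders.
const sampleReport: TrivyReport = {
  Results: [
    {
      Target: "example-image (os packages)",
      Vulnerabilities: [
        { VulnerabilityID: "CVE-0000-0001", PkgName: "libexample", Severity: "CRITICAL" },
        { VulnerabilityID: "CVE-0000-0002", PkgName: "libother", Severity: "LOW" },
      ],
    },
    { Target: "package-lock.json" },
  ],
};

const blockers = findBlockingVulnerabilities(sampleReport);
console.log(`${blockers.length} blocking finding(s)`);
```

Trivy can also enforce this itself through its `--severity` and `--exit-code` options, which is usually simpler in CI than parsing the report by hand.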

Enter Kubernetes

Eventually a single container stopped being enough.

I wanted:

Multiple replicas

Self-healing workloads

Rolling updates

Horizontal scaling

Kubernetes provided all of that.

But Kubernetes introduced new challenges:

YAML management

Service discovery

Resource limits

Health checks

Configuration management

The complexity increased significantly.

At the same time, I started understanding why Kubernetes became the industry standard.

Helm Changed Everything

Managing raw Kubernetes manifests quickly became painful.

I introduced Helm charts to template deployments and environments.

Instead of maintaining multiple copies of manifests, I could parameterize everything:

Image versions

Replica counts

Resource limits

Environment variables

Deployment became much more manageable.
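The core idea is one set of defaults plus a small override per environment. The sketch below imitates that layering in TypeScript: nested maps are merged key by key, and anything else (including lists) is replaced by the override, which is how I understand Helm to combine values files. The keys and the image repository are hypothetical, not taken from the SystemCraft chart.

```typescript
type Values = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Values {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Merge override into base without mutating either input.
function mergeValues(base: Values, override: Values): Values {
  const merged: Values = { ...base };
  for (const [key, value] of Object.entries(override)) {
    const existing = merged[key];
    merged[key] =
      isPlainObject(existing) && isPlainObject(value)
        ? mergeValues(existing, value) // nested maps merge key by key
        : value; // scalars and lists are replaced outright
  }
  return merged;
}

// Hypothetical defaults, playing the role of a chart's values.yaml.
const defaults: Values = {
  replicaCount: 1,
  image: { repository: "ghcr.io/example/systemcraft", tag: "1.0.0" },
  resources: { limits: { cpu: "500m", memory: "512Mi" } },
};

// Hypothetical production override: only the keys that differ.
const production: Values = {
  replicaCount: 3,
  image: { tag: "1.1.0" },
};

const prodValues = mergeValues(defaults, production);
console.log(JSON.stringify(prodValues, null, 2));
```

The production file stays small because it only states what differs from the defaults; the repository and resource limits carry over untouched.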

Discovering GitOps with ArgoCD

This was probably the biggest mindset shift.

Originally deployment looked like:

GitHub Actions

↓

kubectl apply

After learning GitOps:

Git Commit

↓

Git Repository

↓

ArgoCD

↓

Kubernetes Cluster

The cluster state became fully declarative.

Git became the source of truth.

Rollback became dramatically easier.

Auditing changes became trivial.

I finally understood why so many engineering teams are adopting GitOps workflows.
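Underneath, a GitOps controller keeps comparing what Git says should exist with what the cluster is actually running, and plans changes to close the gap. The TypeScript sketch below shows that comparison in its simplest form. It is a conceptual model only, not ArgoCD's implementation; the `Manifest` shape, the workload names, and the image tags are all made up for illustration.

```typescript
// Deliberately tiny stand-in for a Kubernetes workload spec (hypothetical).
type Manifest = { replicas: number; image: string };
type ClusterState = Record<string, Manifest>;

type Action = { kind: "create" | "update" | "delete"; name: string };

// Compare desired state (from Git) with live state (from the cluster)
// and return the actions needed to make live match desired.
function planSync(desired: ClusterState, live: ClusterState): Action[] {
  const actions: Action[] = [];
  for (const name of Object.keys(desired)) {
    const want = desired[name];
    const have = live[name];
    if (have === undefined) {
      actions.push({ kind: "create", name });
    } else if (have.replicas !== want.replicas || have.image !== want.image) {
      actions.push({ kind: "update", name });
    }
  }
  for (const name of Object.keys(live)) {
    // Anything running that Git no longer declares is a removal candidate.
    if (!(name in desired)) {
      actions.push({ kind: "delete", name });
    }
  }
  return actions;
}

const desiredState: ClusterState = {
  web: { replicas: 3, image: "app:1.1.0" },
  worker: { replicas: 1, image: "worker:1.0.0" },
};
const liveState: ClusterState = {
  web: { replicas: 2, image: "app:1.0.0" },
  legacy: { replicas: 1, image: "legacy:0.9.0" },
};

const plan = planSync(desiredState, liveState);
console.log(plan);
```

This also shows why rollback gets easier: reverting a commit changes the desired state, and the same comparison produces the actions that undo the release. In ArgoCD, deleting resources that are no longer in Git (pruning) is something you opt into rather than the default.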

Monitoring: The Missing Piece

For a long time I only cared whether the application worked.

Then I realized: if something breaks in production, how would I know?

That question led me to Prometheus and Grafana.

I instrumented the application and started tracking:

API latency

Request volume

Error rates

Resource utilization

Application health

Suddenly I could see what the system was actually doing.

Monitoring transformed troubleshooting from guessing into observing.
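For anyone wondering what "instrumenting" looks like at the lowest level, here is a dependency-free TypeScript sketch of a request counter rendered in the Prometheus text exposition format, which is what a `/metrics` endpoint returns when Prometheus scrapes it. The metric name and labels are illustrative assumptions, and label-value escaping is left out; a real Next.js app would normally rely on a client library such as prom-client instead of hand-rolling this.

```typescript
// Minimal counter with labels (sketch only; no label-value escaping).
class Counter {
  private readonly name: string;
  private readonly help: string;
  private readonly values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Increment the series identified by this label set.
  inc(labels: Record<string, string>, amount: number = 1): void {
    const key = formatLabels(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + amount);
  }

  // Render all series in the text exposition format.
  render(): string {
    const lines: string[] = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    this.values.forEach((value, labels) => {
      lines.push(`${this.name}${labels} ${value}`);
    });
    return lines.join("\n") + "\n";
  }
}

// Sort label names so the same label set always maps to the same series.
function formatLabels(labels: Record<string, string>): string {
  const pairs = Object.keys(labels)
    .sort()
    .map((key) => `${key}="${labels[key]}"`);
  return pairs.length > 0 ? `{${pairs.join(",")}}` : "";
}

// Hypothetical metric name and labels.
const httpRequests = new Counter("http_requests_total", "Total HTTP requests handled.");
httpRequests.inc({ method: "GET", status: "200" });
httpRequests.inc({ status: "200", method: "GET" });
httpRequests.inc({ method: "POST", status: "500" });

console.log(httpRequests.render());
```

Request volume and error rates then come from querying how fast these counters grow, for example the share of requests whose `status` label is a 5xx code.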

Adding Alerting

Monitoring is useful.

Alerting is essential.

I integrated AlertManager so that operational issues could be detected automatically.

This forced me to think about:

Error thresholds

SLOs

Availability targets

Incident response

Topics I previously associated only with large companies.
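The arithmetic behind an availability target is simple enough to sketch. The TypeScript below turns an SLO into an error budget and expresses an observed error ratio as a burn rate, a common basis for alert thresholds. The 99.9% target, the 30-day window, and the 1% error ratio are example numbers I chose, not SystemCraft's actual targets.

```typescript
// Minutes of allowed unavailability for a given SLO over a rolling window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloTarget);
}

// How fast the budget is being consumed: 1 means exactly on budget,
// 10 means the budget would be gone in a tenth of the window.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

// Example numbers only.
const budget = errorBudgetMinutes(0.999, 30);
const rate = burnRate(0.01, 0.999);

console.log(`Error budget: ${budget.toFixed(1)} minutes per 30 days`);
console.log(`Burn rate at 1% errors: ${rate.toFixed(1)}x`);
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget, which makes "how many errors are too many" a concrete question rather than a gut feeling.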

Testing Scalability

Eventually I wanted to understand how the platform behaved under load.

I simulated 500 concurrent users.

The results were revealing.

Single Container

| Metric | Value |
| --- | --- |
| Requests | 23,381 |
| Throughput | ~155 req/s |
| P95 Latency | 3.33s |

The Node.js process became saturated.

Performance degraded rapidly.

Kubernetes with HPA

| Metric | Value |
| --- | --- |
| Requests | 61,026 |
| Throughput | ~351 req/s |
| P95 Latency | 861ms |

Distributing traffic across multiple pods cut P95 latency from 3.33s to 861ms, while throughput more than doubled, from ~155 to ~351 req/s.

This was the first time I could actually see the benefits of horizontal scaling in practice.
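The scaling decision itself follows a short formula documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum, with no change when the ratio is within a tolerance of 1 (0.1 by default, as far as I know). Here is a TypeScript sketch of that rule. The metric values and replica bounds in the example are hypothetical, since the HPA settings used in the test are not shown here.

```typescript
// Sketch of the HPA scaling rule; the real controller also accounts for
// pod readiness, missing metrics, and stabilization windows.
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number,
  targetMetric: number,
  minReplicas: number,
  maxReplicas: number,
  tolerance: number = 0.1,
): number {
  const ratio = currentMetric / targetMetric;
  // Close enough to the target: leave the replica count alone.
  if (Math.abs(ratio - 1) <= tolerance) {
    return currentReplicas;
  }
  const raw = Math.ceil(currentReplicas * ratio);
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

// Hypothetical example: 2 pods averaging 90% CPU against a 50% target.
const scaled = desiredReplicas(2, 90, 50, 1, 10);
console.log(`Scale to ${scaled} replicas`);
```

With these example numbers the ratio is 1.8, so two pods become four. The same formula scales back down once load drops and the ratio falls below 1.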

Current Architecture

Today the deployment flow looks like this:

Developer ↓ GitHub ↓ GitHub Actions ↓ Docker Build ↓ Trivy Scan ↓ GHCR ↓ ArgoCD ↓ Kubernetes ↓ Prometheus ↓ Grafana ↓ AlertManager

What started as a simple web application became a complete cloud-native platform.

What I Learned

A few lessons stood out throughout this journey.

What's Next

The next phase of my learning journey involves:

AWS

Terraform

Infrastructure as Code

Distributed Load Testing

Platform Engineering

**I'm currently building an open-source load testing tool called Loadster, inspired by the challenges I encountered while testing SystemCraft.**

**Check out the live site:** [https://system-craft-kohl.vercel.app/](https://system-craft-kohl.vercel.app/)

If you liked the article, drop a like, and maybe check out the GitHub repo and contribute to help make it even better.

Source: dev.to (original article)
