When I started building SystemCraft, my goal wasn't to learn Kubernetes, GitOps, monitoring, or cloud-native architecture.
I just wanted to build a system design interview platform.
Fast forward a few months, and that simple web application evolved into something much bigger:
CI/CD Pipelines
Dockerized Deployments
Kubernetes
Helm Charts
ArgoCD GitOps
Prometheus Monitoring
Grafana Dashboards
AlertManager
Auto Scaling
Security Scanning
This article is the story of how that happened and what I learned along the way.
The Original Idea
SystemCraft was designed to solve a problem I noticed while preparing for system design interviews.
Most preparation resources are passive:
Reading blogs
Watching videos
Looking at architecture diagrams
But real system design interviews are interactive.
You need to make decisions, justify trade-offs, adapt to changing requirements, and explain your reasoning.
I wanted to create a platform where engineers could:
Design architectures visually
Receive AI-powered feedback
Simulate real interview scenarios
Learn through iteration
The first version was straightforward:
Next.js → MongoDB → Gemini API
The Docker Phase
My first step was containerization.
I created a Dockerfile and containerized the entire application.
At first, I thought Docker was the hard part.
I quickly learned it wasn't.
Building containers is easy.
Operating containers reliably is the real challenge.
Questions started appearing:
How do I deploy updates?
How do I manage multiple replicas?
How do I scale?
How do I monitor failures?
Docker solved packaging.
It didn't solve operations.
Building a Real CI Pipeline
The next step was automation.
I didn't want deployments to depend on manual commands.
I created a GitHub Actions pipeline that would automatically:
Lint & Typecheck → Playwright E2E Tests → Docker Build → Trivy Security Scan → Kubernetes Validation → Deployment
One lesson became obvious:
Automation isn't about speed.
It's about consistency.
The pipeline catches mistakes long before they reach production.
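The consistency comes from the pipeline's fail-fast ordering: every change goes through the same stages, and a failure stops everything after it. Here is a minimal Python sketch of that behavior. The stage names come from the pipeline above, but `run_pipeline` and the stand-in lambdas are hypothetical and not taken from the actual GitHub Actions workflow.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure."""
    log: List[str] = []
    for name, step in stages:
        passed = step()
        log.append(f"{name}: {'passed' if passed else 'FAILED'}")
        if not passed:
            break  # later stages, including deployment, never run
    return log


# Stage names come from the article; the lambdas are stand-ins for real jobs.
stages: List[Stage] = [
    ("Lint & Typecheck", lambda: True),
    ("Playwright E2E Tests", lambda: True),
    ("Docker Build", lambda: True),
    ("Trivy Security Scan", lambda: False),  # simulate a failed scan
    ("Kubernetes Validation", lambda: True),
    ("Deployment", lambda: True),
]

for line in run_pipeline(stages):
    print(line)
```

In this simulated run the log ends at the failed scan, so the deployment stage is never reached.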
Security Wasn't Optional
One of the most valuable additions was Trivy.
Initially I wasn't thinking much about container security.
Then I started scanning images and realized how many vulnerabilities can exist inside dependencies you didn't even know you had.
Every build now goes through:
Docker Build → Trivy Scan → Deployment
This simple addition completely changed how I think about shipping software.
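A scan only protects you if its result can block a deployment. The sketch below shows one way such a gate could work in Python. It assumes a report shaped like Trivy's JSON output (a `Results` list whose entries carry a `Vulnerabilities` list with `Severity`, `PkgName`, and `VulnerabilityID` fields), so verify the field names against your Trivy version. The sample report, the identifiers in it, and the choice to block on HIGH and CRITICAL are made up for illustration.

```python
from typing import Dict, List

# Assumption: these severities should block a deployment.
BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report: Dict, blocking=BLOCKING) -> List[str]:
    """Collect 'package: id' strings for findings at a blocking severity."""
    found = []
    for result in report.get("Results", []):
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                found.append(f"{vuln.get('PkgName')}: {vuln.get('VulnerabilityID')}")
    return found


# Hand-written sample data; the IDs and package names are placeholders.
sample_report = {
    "Results": [
        {
            "Target": "app (hypothetical)",
            "Vulnerabilities": [
                {"VulnerabilityID": "EXAMPLE-0001", "PkgName": "libexample", "Severity": "CRITICAL"},
                {"VulnerabilityID": "EXAMPLE-0002", "PkgName": "libother", "Severity": "LOW"},
            ],
        },
        {"Target": "clean-layer (hypothetical)", "Vulnerabilities": None},
    ]
}

findings = blocking_findings(sample_report)
deploy_allowed = not findings  # any blocking finding stops the release
print(findings, deploy_allowed)
```

In practice a separate script like this may not be needed, since the Trivy CLI can filter by severity and return a non-zero exit code by itself.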
Enter Kubernetes
Eventually a single container stopped being enough.
I wanted:
Multiple replicas
Self-healing workloads
Rolling updates
Horizontal scaling
Kubernetes provided all of that.
But Kubernetes introduced new challenges:
YAML management
Service discovery
Resource limits
Health checks
Configuration management
The complexity increased significantly.
At the same time, I started understanding why Kubernetes became the industry standard.
Helm Changed Everything
Managing raw Kubernetes manifests quickly became painful.
I introduced Helm charts to template deployments and environments.
Instead of maintaining multiple copies of manifests, I could parameterize everything:
Image versions
Replica counts
Resource limits
Environment variables
Deployment became much more manageable.
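The core idea is one template plus a small set of values per environment. Helm does this with Go templates and `values.yaml` files. The Python sketch below is only an analogy using `string.Template`, not Helm itself. The manifest is trimmed (it is not a complete Deployment), and the image name, tag, and replica counts are hypothetical.

```python
from string import Template

# A trimmed manifest with placeholders where environments differ.
MANIFEST = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
spec:
  replicas: $replicas
  template:
    spec:
      containers:
        - name: $name
          image: $image:$tag
""")

# Hypothetical defaults, playing the role of a chart's values.yaml.
DEFAULTS = {
    "name": "systemcraft",
    "image": "ghcr.io/example/systemcraft",
    "tag": "latest",
    "replicas": 1,
}

# Hypothetical per-environment overrides.
PROD = {"tag": "1.4.2", "replicas": 3}


def render(overrides: dict) -> str:
    """Merge overrides onto the defaults and fill in the template."""
    values = {**DEFAULTS, **overrides}
    return MANIFEST.substitute(values)


print(render(PROD))
```

Changing an environment then means editing a handful of values instead of a copy of the whole manifest.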
Discovering GitOps with ArgoCD
This was probably the biggest mindset shift.
Originally deployment looked like:
GitHub Actions → kubectl apply
After learning GitOps:
Git Commit → Git Repository → ArgoCD → Kubernetes Cluster
The cluster state became fully declarative.
Git became the source of truth.
Rollback became dramatically easier.
Auditing changes became trivial.
I finally understood why so many engineering teams are adopting GitOps workflows.
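Conceptually, a GitOps controller keeps comparing the state declared in Git with the state running in the cluster and corrects any drift. The Python sketch below models that loop with plain dictionaries. It is a simplified illustration, not how ArgoCD is implemented, and the keys and values are invented.

```python
def diff_state(desired: dict, live: dict) -> dict:
    """Map each drifting key to (live_value, desired_value).

    A desired value of None means the key should be removed.
    """
    changes = {}
    for key in sorted(desired.keys() | live.keys()):
        want, have = desired.get(key), live.get(key)
        if want != have:
            changes[key] = (have, want)
    return changes


def reconcile(desired: dict, live: dict) -> dict:
    """Return a new live state with all drift corrected (the sync step)."""
    synced = dict(live)
    for key, (_, want) in diff_state(desired, live).items():
        if want is None:
            synced.pop(key, None)  # not declared in Git, so remove it
        else:
            synced[key] = want
    return synced


# Hypothetical states: Git declares a new image, the cluster has drifted.
desired = {"image": "app:1.1", "replicas": 3}
live = {"image": "app:1.0", "replicas": 3, "debug": True}

print(diff_state(desired, live))
live = reconcile(desired, live)
print(live == desired)
```

This is also why rollback gets easier: reverting the commit changes the desired state, and the same loop brings the cluster back to it.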
Monitoring: The Missing Piece
For a long time I only cared whether the application worked.
Then I realized: if something breaks in production, how would I know?
That question led me to Prometheus and Grafana.
I instrumented the application and started tracking:
API latency
Request volume
Error rates
Resource utilization
Application health
Suddenly I could see what the system was actually doing.
Monitoring transformed troubleshooting from guessing into observing.
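Two of those signals, tail latency and error rate, reduce to simple arithmetic over request records. The sketch below computes both from raw samples in Python, using the nearest-rank method for the percentile. This is a simplification: Prometheus typically estimates quantiles from histogram buckets rather than raw samples, and the sample data here is made up.

```python
import math
from typing import List, Tuple


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(statuses: List[int]) -> float:
    """Fraction of responses with a 5xx status code."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s >= 500) / len(statuses)


# Made-up samples: (latency in ms, HTTP status).
samples: List[Tuple[float, int]] = [(100 + i, 200) for i in range(95)] + [(2000, 500)] * 5

latencies = [latency for latency, _ in samples]
statuses = [status for _, status in samples]

print("p95 latency (ms):", percentile(latencies, 95))
print("error rate:", error_rate(statuses))
```

The P95 here still looks healthy even though 5% of requests were slow failures, which is one reason to track latency and error rate together.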
Adding Alerting
Monitoring is useful.
Alerting is essential.
I integrated AlertManager so that operational issues could be detected automatically.
This forced me to think about:
Error thresholds
SLOs
Availability targets
Incident response
Topics I previously associated only with large companies.
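An availability target turns into an alert threshold through an error budget: the amount of failure the target still allows. The Python sketch below works through that arithmetic. The 99.9% target, the 30-day window, and the request counts are example values, not SystemCraft's actual SLO.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO allows within the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_consumed(failed: int, total: int, slo_percent: float) -> float:
    """Fraction of the error budget used, based on request counts."""
    if total == 0:
        return 0.0
    observed_error_rate = failed / total
    allowed_error_rate = (100 - slo_percent) / 100
    return observed_error_rate / allowed_error_rate


# Example values only: a 99.9% target over 30 days allows about 43.2 minutes.
print(round(error_budget_minutes(99.9), 1))

# 50 failures in 100,000 requests uses about half of a 99.9% budget.
print(round(budget_consumed(50, 100_000, 99.9), 2))
```

An alert can then fire when the budget is being consumed faster than the window allows, rather than on every individual error.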
Testing Scalability
Eventually I wanted to understand how the platform behaved under load.
I simulated 500 concurrent users.
The results were revealing.
Single Container

| Metric | Value |
| --- | --- |
| Requests | 23,381 |
| Throughput | ~155 req/s |
| P95 Latency | 3.33s |

The Node.js process became saturated.
Performance degraded rapidly.
Kubernetes with HPA

| Metric | Value |
| --- | --- |
| Requests | 61,026 |
| Throughput | ~351 req/s |
| P95 Latency | 861ms |

By distributing traffic across multiple pods, P95 latency dropped from 3.33s to 861ms (roughly 3.9x lower) while throughput rose from ~155 to ~351 req/s (about 2.3x higher).
This was the first time I could actually see the benefits of horizontal scaling in practice.
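The scaling decision itself follows a simple rule documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1 and clamped to the configured minimum and maximum. The Python sketch below applies that rule. The metric values and replica bounds are invented, since the HPA configuration used in this test is not stated, and the real controller has more behavior (stabilization windows, handling of unready pods) than shown here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the documented HPA scaling rule to a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical numbers: 2 pods at 90% average CPU with a 60% target.
print(desired_replicas(2, 90, 60))  # 3

# Load drops to 30% on 4 pods, so the rule scales back down.
print(desired_replicas(4, 30, 60))  # 2
```

Under load the ratio stays above 1, so pods keep being added until the metric returns to the target or the maximum is reached.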
Current Architecture
Today the deployment flow looks like this:
Developer → GitHub → GitHub Actions → Docker Build → Trivy Scan → GHCR → ArgoCD → Kubernetes → Prometheus → Grafana → AlertManager
What started as a simple web application became a complete cloud-native platform.
What I Learned
A few lessons stood out throughout this journey.
What's Next
The next phase of my learning journey involves:
AWS
Terraform
Infrastructure as Code
Distributed Load Testing
Platform Engineering
**I'm currently building an open-source load testing tool called Loadster, inspired by the challenges I encountered while testing SystemCraft.**
**Check out the live site:** [https://system-craft-kohl.vercel.app/](https://system-craft-kohl.vercel.app/)
If you liked the article, drop a like, and maybe check out the GitHub repo and contribute to help make it even better.