Production LiteLLM on AWS EKS: High Availability with GitOps

LiteLLM, a unified proxy for 100+ LLM providers, has been deployed on AWS EKS with GitOps via ArgoCD to handle 500-1000 requests per second. The production-grade system addresses multi-provider chaos, budget control, user management, and cost visibility for organizations using multiple AI models. The deployment features high availability, automatic scaling, persistent logging, and enterprise-grade reliability.

In today’s multi-cloud AI landscape, organizations often find themselves juggling multiple LLM providers — OpenAI, Anthropic Claude via Azure AI Foundry, Cohere for embeddings, and more. Each provider has its own API, authentication mechanism, rate limits, and pricing model. Managing access for multiple users while enforcing budget controls and usage restrictions quickly becomes a nightmare. Enter LiteLLM : a unified proxy that acts as a single gateway to 100+ LLM providers. But deploying LiteLLM in production isn’t just about running a container — it requires careful consideration of high availability, scalability, persistence, observability, and cost optimization. In this article, I’ll walk you through our production-grade deployment of LiteLLM on AWS EKS, managed via ArgoCD for GitOps. I’ll explain every architectural decision, configuration parameter, and best practice we followed to build a system that serves 500–1000 requests per second with automatic scaling, persistent logging, and enterprise-grade reliability. Our data science team Lab Enablement LENA group faced several challenges: 1. Multi-Provider Chaos : We used Azure AI Foundry for Claude models and Cohere v3 for embeddings. Each required separate authentication, SDKs, and management. 2. No Central Budget Control : Azure AI Foundry doesn’t provide granular budget restrictions. Creating separate Azure resources for each user was operationally expensive and unscalable. 3. User Management Nightmare : Provisioning API keys, tracking usage, and revoking access required manual intervention. 4. Cost Visibility Gap : No unified view of spending across providers made it impossible to optimize costs or forecast budgets. LiteLLM solved all of these problems by providing: Our deployment consists of four core components orchestrated on AWS EKS: Kubernetes-Native : Leverages EKS for orchestration, scaling, and self-healing GitOps-Driven : ArgoCD ensures declarative state, audit trails, and rollback capability Highly Available : Pod anti-affinity, HPA, and multiple replicas prevent single points of failure Cost-Optimized : Redis caching and batch database writes reduce infrastructure load Secure : Internal ALB, VPC-only access, TLS termination, no public exposure Why ghcr.io/berriai/litellm-database:v1.86.2 ? We chose the database-enabled image because: Deployment Configuration apiVersion: apps/v1kind: Deploymentmetadata: name: litellm namespace: litellm-ns labels: app: litellmspec: replicas: 1 HPA will scale this dynamically revisionHistoryLimit: 10 selector: matchLabels: app: litellm strategy: type: RollingUpdate rollingUpdate: maxSurge: 1 maxUnavailable: 0 Zero-downtime deployments template: metadata: labels: app: litellm spec: affinity: podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 podAffinityTerm: labelSelector: matchLabels: app: litellm topologyKey: kubernetes.io/hostname containers: - name: litellm image: ghcr.io/berriai/litellm-database:v1.86.2 imagePullPolicy: IfNotPresent args: - "--config" - "/etc/litellm/config.yaml" - "--port" - "4000" - "--num workers" - "1" Single worker per pod for predictable resource usage - recomended ports: - name: http containerPort: 4000 env: - name: POSTGRES USER valueFrom: configMapKeyRef: name: litellm-pg-config key: POSTGRES USER - name: POSTGRES DB valueFrom: configMapKeyRef: name: litellm-pg-config key: POSTGRES DB - name: POSTGRES PASSWORD valueFrom: secretKeyRef: name: litellm-pg-secret key: POSTGRES PASSWORD - name: DATABASE URL value: "postgresql://$ POSTGRES USER :$ POSTGRES PASSWORD @litellm-pg-svc.litellm-ns.svc.cluster.local:5432/$ POSTGRES DB " - name: MASTER KEY valueFrom: secretKeyRef: name: litellm-secrets key: MASTER KEY - name: REDIS PASSWORD valueFrom: secretKeyRef: name: litellm-redis-secret key: REDIS PASSWORD - name: LITELLM MODE value: "PRODUCTION" - name: LITELLM LOG value: "ERROR" Log only errors to reduce noise - name: USE PRISMA MIGRATE value: "True" Auto-apply database migrations on startup - name: DOCS URL value: "/docs" - name: ROOT REDIRECT URL value: "/ui" Redirect root to admin UI volumeMounts: - name: config mountPath: /etc/litellm readOnly: true resources: requests: cpu: "1500m" memory: "3Gi" limits: cpu: "1500m" memory: "3Gi" livenessProbe: httpGet: path: /health/liveliness port: 4000 initialDelaySeconds: 60 Allow time for Prisma migration periodSeconds: 30 failureThreshold: 5 Tolerates 2.5 minutes of failures readinessProbe: httpGet: path: /health/readiness port: 4000 initialDelaySeconds: 30 periodSeconds: 10 failureThreshold: 5 volumes: - name: config configMap: name: litellm-config items: - key: config.yaml path: config.yaml Key Design Decisions affinity: podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - weight: 100 podAffinityTerm: labelSelector: matchLabels: app: litellm topologyKey: kubernetes.io/hostname Why? If one EKS node fails, we want LiteLLM pods on other nodes to keep serving traffic. The preferred soft constraint means Kubernetes will try to spread pods but won’t block scheduling if impossible. Alternative Considered : ‘requiredDuringSchedulingIgnoredDuringExecution hard constraint was too strict for our small cluster — it would prevent scaling if only one node had capacity. 2. Resource Limits = Requests No Burstability resources: requests: cpu: "1500m" memory: "3Gi" limits: cpu: "1500m" memory: "3Gi" Why? 3. Health Check Tuning livenessProbe: httpGet: path: /health/liveliness port: 4000 initialDelaySeconds: 60 Critical: Allow Prisma migration to complete periodSeconds: 30 failureThreshold: 5readinessProbe: httpGet: path: /health/readiness port: 4000 initialDelaySeconds: 30 periodSeconds: 10 failureThreshold: 5 Why 60s initial delay for liveness? Why 5 failure thresholds? Lesson Learned: Initially, we set “initialDelaySeconds: 30”, which caused 40% of deployments to fail during migrations. Increasing to 60s achieved 100% success rate. apiVersion: autoscaling/v2kind: HorizontalPodAutoscalermetadata: name: litellm-hpa namespace: litellm-nsspec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: litellm minReplicas: 1 maxReplicas: 5 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 60 - type: Resource resource: name: memory target: type: Utilization averageUtilization: 80 behavior: scaleUp: stabilizationWindowSeconds: 120 policies: - type: Pods value: 1 periodSeconds: 120 selectPolicy: Max scaleDown: stabilizationWindowSeconds: 120 policies: - type: Pods value: 1 periodSeconds: 120 selectPolicy: Max Why These Thresholds? At 60% CPU, we observed response latencies starting to creep above 200ms. Scaling at 60% keeps p95 latency under 150ms. 2. Memory: 80% Utilization Memory usage is more predictable: Scaling at 80% 2.4 Gi provides comfortable headroom. 3. Stabilization Windows: 120 Seconds Scale-Up : Prevents “flapping” during brief traffic spikes e.g., batch inference jobs . HPA waits 2 minutes before adding a pod, filtering out transient load. Scale-Down : Avoids premature pod removal. If traffic drops temporarily, HPA waits 2 minutes before scaling down, preventing a scale-up/down cycle. Real-World Impact : During a traffic spike from 20 → 80 rps: apiVersion: v1kind: ConfigMapmetadata: name: litellm-config namespace: litellm-nsdata: config.yaml: | general settings: master key: os.environ/MASTER KEY database url: os.environ/DATABASE URL store model in db: true use redis transaction buffer: true block robots: true Prevent search engine indexing background health checks: true health check interval: 300 Check model endpoints every 5 minutes maximum spend logs retention period: "60d" maximum spend logs retention interval: "1d" maximum spend logs cleanup cron: "0 4 " Daily cleanup at 4 AM user api key cache ttl: 600 Cache API key lookups for 10 minutes database connection pool limit: 20 database connection timeout: 60 proxy batch write at: 30 Batch database writes every 30 seconds litellm settings: json logs: true Structured logging for Grafana/CloudWatch cache: true set verbose: false Disable verbose logs reduces noise request timeout: 600 10 minutes for long-running model requests cache params: type: redis host: redis-service port: 6379 password: os.environ/REDIS PASSWORD ttl: 600 Cache responses for 10 minutes namespace: "litellm.caching" router settings: routing strategy: simple-shuffle redis host: redis-service redis password: os.environ/REDIS PASSWORD redis port: 6379 cooldown time: 10 Retry failed models after 10 seconds 1. Database Connection Pooling: 20 Connections database connection pool limit: 20 PostgreSQL’s default max connections is 100. With 5 HPA replicas max: 5 pods × 20 connections = 100 total . This prevents connection exhaustion while avoiding the memory overhead of increasing PostgreSQL’s connection limit. 2. Batch Writing: 30-Second Intervals proxy batch write at: 30 Groups 30 seconds of request logs into a single transaction, reducing PostgreSQL CPU by 60% from 3,000 transactions/min to 2 transactions/min . Trade-off: up to 30 seconds of logs lost if a pod crashes, which is acceptable for analytics workloads. 3. Caching Strategy: 10-Minute TTL user api key cache ttl: 600 API key validationcache params.ttl: 600 Response caching API Key : Reduces database load from 50 QPS cold cache to ~0.008 QPS warm cache for 5 unique users. 4. Spend Log Retention: 60 Days maximum spend logs retention period: "60d"maximum spend logs cleanup cron: "0 4 " Retains 2 months of spend data for financial auditing. Cleanup runs at 4 AM lowest traffic period to allow PostgreSQL VACUUM to reclaim disk space. At 50 rps, 60 days of logs consume ~129 GB 80% of our 100 Gi EBS volume . 5. Routing Strategy: Simple Shuffle routing strategy: simple-shuffle Randomly distributes requests across model deployments. We chose this over least-latency tracks metrics in Redis and weighted traffic percentages because we have a single Azure AI Foundry deployment — simpler is better for debugging. Switch to least-latency when adding multi-region redundancy. apiVersion: apps/v1kind: StatefulSetmetadata: name: litellm-pg-statefulset namespace: litellm-nsspec: serviceName: litellm-pg-headless-svc replicas: 1 selector: matchLabels: app: litellm-pg template: metadata: labels: app: litellm-pg spec: containers: - name: postgres image: postgres:17 imagePullPolicy: IfNotPresent ports: - name: postgres containerPort: 5432 volumeMounts: - name: litellm-pg-data mountPath: /var/lib/postgresql/data env: - name: POSTGRES USER valueFrom: configMapKeyRef: name: litellm-pg-config key: POSTGRES USER - name: POSTGRES DB valueFrom: configMapKeyRef: name: litellm-pg-config key: POSTGRES DB - name: POSTGRES PASSWORD valueFrom: secretKeyRef: name: litellm-pg-secret key: POSTGRES PASSWORD - name: PGDATA value: /var/lib/postgresql/data/pgdata resources: requests: cpu: "250m" memory: "512Mi" limits: cpu: "2000m" memory: "8Gi" readinessProbe: exec: command: - /bin/sh - -c - pg isready -U "$POSTGRES USER" -d "$POSTGRES DB" initialDelaySeconds: 10 periodSeconds: 5 livenessProbe: exec: command: - /bin/sh - -c - pg isready -U "$POSTGRES USER" -d "$POSTGRES DB" initialDelaySeconds: 30 periodSeconds: 10 volumeClaimTemplates: - metadata: name: litellm-pg-data spec: accessModes: - ReadWriteOnce storageClassName: ebs-sc resources: requests: storage: 100Gi PostgreSQL 17 delivers 20% faster bulk INSERT’s critical for batch writes , better JSONB indexing for metadata storage, and incremental VACUUM that reduces write amplification by 30%. RDS vs Self-Managed : RDS db.t3.medium costs $50/month vs. $12/month self-managed on EKS. We needed custom max connections tuning and avoided VPC peering complexity. Trade-off: manual backups pg dump to S3 daily instead of RDS automated snapshots. StatefulSets provide stable DNS names, ordered deployment critical for single-replica databases , and persistent volume binding that survives pod restarts. We use a headless service ClusterIP: None for direct pod connections, eliminating load balancer overhead. Resource Allocation: Wide Gap Between Requests and Limits resources: requests: cpu: "250m" memory: "512Mi" limits: cpu: "2000m" memory: "8Gi" PostgreSQL is idle 90% of the time due to 30-second batch writes , justifying low requests 250m CPU fits on smaller nodes . High limits accommodate burst capacity during batch writes 1.5 CPU, 3 Gi and daily VACUUM operations up to 6 Gi memory . Measured performance: P50 CPU at 150m, P95 at 1,200m during writes. Setting requests = limits would waste EKS node capacity during idle periods. apiVersion: apps/v1kind: Deploymentmetadata: name: litellm-redis namespace: litellm-nsspec: replicas: 1 selector: matchLabels: app: litellm-redis template: metadata: labels: app: litellm-redis spec: containers: - name: redis image: redis:8-alpine imagePullPolicy: IfNotPresent command: "sh", "-c" args: - 'exec redis-server --requirepass "$REDIS PASSWORD" --appendonly yes' ports: - name: redis containerPort: 6379 env: - name: REDIS PASSWORD valueFrom: secretKeyRef: name: litellm-redis-secret key: REDIS PASSWORD - name: REDISCLI AUTH valueFrom: secretKeyRef: name: litellm-redis-secret key: REDIS PASSWORD volumeMounts: - name: litellm-redis-pvc mountPath: /data resources: requests: cpu: "200m" memory: "512Mi" limits: cpu: "1000m" memory: "2Gi" readinessProbe: exec: command: "redis-cli", "ping" initialDelaySeconds: 5 periodSeconds: 10 timeoutSeconds: 3 livenessProbe: exec: command: "redis-cli", "ping" initialDelaySeconds: 15 periodSeconds: 20 timeoutSeconds: 3 volumes: - name: litellm-redis-pvc persistentVolumeClaim: claimName: litellm-redis-pvc LiteLLM makes Redis essential for three critical functions: 1. Response Caching Identical prompts same model, same input return cached responses instead of hitting the upstream LLM provider. This is crucial for: Our 10-minute TTL achieved a 40% cache hit rate. 2. API Key Validation Caching Every LiteLLM request validates the user’s API key against PostgreSQL. Without caching, this means 50 database queries per second. Redis caches API key lookups for 10 minutes, reducing database load by 99.98% from 50 QPS to ~0.01 QPS . 3. Router State Management LiteLLM tracks model endpoint health which deployments are responding, which are failing in Redis. This enables: apiVersion: networking.k8s.io/v1kind: Ingressmetadata: name: litellm-ingress namespace: litellm-ns annotations: alb.ingress.kubernetes.io/scheme: internal alb.ingress.kubernetes.io/target-type: ip alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:01234567890:certificate/a1639c-6c2c-4e60-9211-59cd0a alb.ingress.kubernetes.io/listen-ports: ' {"HTTP":80}, {"HTTPS":443} ' alb.ingress.kubernetes.io/ssl-redirect: "443" alb.ingress.kubernetes.io/backend-protocol: HTTP alb.ingress.kubernetes.io/healthcheck-path: /health/readiness alb.ingress.kubernetes.io/healthcheck-interval-seconds: "30" alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "10" alb.ingress.kubernetes.io/healthy-threshold-count: "2" alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"spec: ingressClassName: alb rules: - host: litellm.data-lab.io http: paths: - path: / pathType: Prefix backend: service: name: litellm-svc port: number: 4000 Creates a VPC-internal ALB no public IP . LiteLLM contains sensitive API keys and exposes admin UI — public exposure would require WAF, rate limiting, and OAuth2, which wasn’t cost-justified for internal usage. Target Type: IP alb.ingress.kubernetes.io/target-type: ip Routes traffic directly to pod IPs, eliminating kube-proxy overhead ~5ms latency reduction and enabling faster health checks. IP targets are free; instance targets charge per registration. Health Check Configuration alb.ingress.kubernetes.io/healthcheck-path: /health/readinessalb.ingress.kubernetes.io/healthcheck-interval-seconds: "30" Uses /health/readiness pod can serve traffic vs. /health/liveliness pod process running . Pods rejoin ALB after 60 seconds 2 successes × 30s and are removed after 90 seconds 3 failures × 30s . Lesson Learned: 10-second intervals caused false negatives during brief database blips. 30-second intervals tolerate transient issues. apiVersion: argoproj.io/v1alpha1kind: Applicationmetadata: name: litellm namespace: argocdspec: project: litellm source: repoURL: https://gitlab.com/devops1952445/argocd-playground.git targetRevision: main Production branch path: src/apps/litellm destination: server: https://kubernetes.default.svc namespace: litellm-ns syncPolicy: automated: prune: true selfHeal: true syncOptions: - CreateNamespace=true Why ArgoCD for LiteLLM? 1. Declarative State : Git is source of truth, not kubectl apply commands 2. Audit Trail : Every change tracked via Git commits who, what, when, why 3. Rollback : git revert instantly rolls back deployments 4. Multi-Environment : Same manifests for dev/staging/prod with Kustomize overlays 5. Self-Healing : If someone kubectl delete a pod, ArgoCD recreates it within 3 minutes Auto-Sync with Prune and Self-Heal syncPolicy:automated:prune: true Delete resources removed from GitselfHeal: true Revert manual kubectl changes Prune : If we delete litellm redis deployment.yml from Git, ArgoCD deletes the Redis deployment from the cluster. Self-Heal : If an engineer runs kubectl scale deployment/litellm — replicas=10 , ArgoCD reverts to replicas: 1 within 3 minutes. Is This Too Aggressive? Concern : Self-heal might revert emergency hotfixes during incidents. Mitigation : 1. Emergency Override : Pause ArgoCD sync during incidents: kubectl patch app litellm -n argocd -p ‘{“spec”:{“syncPolicy”:{“automated”:null}}}’ 2. Sync Wave : Critical resources database, Redis have argocd.argoproj.io/sync-wave: “0” , LiteLLM pods have wave ”1" . Database won’t self-heal during LiteLLM rollouts. 3. Notification : ArgoCD webhooks to Slack alert on any manual changes, prompting engineers to commit fixes to Git. Real-World Incident : An engineer increased PostgreSQL memory limits to debug OOM. Self-heal reverted the change, OOM recurred. We now: Deploying a production-grade LiteLLM proxy on AWS EKS requires more than just running a container. It demands thoughtful decisions about: Our deployment serves 500–1000 requests/second across 5 HPA replicas, with p95 latency under 150ms and 99.9% uptime over 1months. The architecture scales to 200 rps without infrastructure changes. Key Takeaways : 1. Test HPA behavior under load before production 2. Enable Prisma migrations to avoid schema mismatch errors 3. Monitor database connections to prevent exhaustion 4. Use structured JSON logs for easier debugging 5. Cache aggressively if workloads are repetitive If you’re deploying LiteLLM in your organization, start with this architecture as a foundation. Adjust resource limits based on your load testing, and remember: GitOps with ArgoCD is your safety net — every change is reversible, audited, and reproducible. You can also get the complete source code from my public gitlab project,https://gitlab.com/KnowledgeSharing1/lite-llm-hpa.git Production LiteLLM on AWS EKS: High Availability with GitOps https://pub.towardsai.net/production-litellm-on-aws-eks-high-availability-with-gitops-996eda87e14d was originally published in Towards AI https://pub.towardsai.net on Medium, where people are continuing the conversation by highlighting and responding to this story.