# Production LiteLLM on AWS EKS: High Availability with GitOps

> Source: <https://pub.towardsai.net/production-litellm-on-aws-eks-high-availability-with-gitops-996eda87e14d?source=rss----98111c9905da---4>
> Published: 2026-06-16 13:01:03+00:00

In today’s multi-cloud AI landscape, organizations often find themselves juggling multiple LLM providers — OpenAI, Anthropic Claude via Azure AI Foundry, Cohere for embeddings, and more. Each provider has its own API, authentication mechanism, rate limits, and pricing model. Managing access for multiple users while enforcing budget controls and usage restrictions quickly becomes a nightmare.

Enter **LiteLLM**: a unified proxy that acts as a single gateway to 100+ LLM providers. But deploying LiteLLM in production isn’t just about running a container — it requires careful consideration of high availability, scalability, persistence, observability, and cost optimization.

In this article, I’ll walk you through our production-grade deployment of LiteLLM on AWS EKS, managed via ArgoCD for GitOps. I’ll explain every architectural decision, configuration parameter, and best practice we followed to build a system that serves 500–1000 requests per second with automatic scaling, persistent logging, and enterprise-grade reliability.

Our data science team Lab Enablement (LENA) group faced several challenges:

**1. Multi-Provider Chaos**: We used Azure AI Foundry for Claude models and Cohere v3 for embeddings. Each required separate authentication, SDKs, and management.

**2. No Central Budget Control**: Azure AI Foundry doesn’t provide granular budget restrictions. Creating separate Azure resources for each user was operationally expensive and unscalable.

**3. User Management Nightmare**: Provisioning API keys, tracking usage, and revoking access required manual intervention.

**4. Cost Visibility Gap**: No unified view of spending across providers made it impossible to optimize costs or forecast budgets.

**LiteLLM solved all of these problems **by providing:

Our deployment consists of four core components orchestrated on AWS EKS:

**Kubernetes-Native**: Leverages EKS for orchestration, scaling, and self-healing

**GitOps-Driven**: ArgoCD ensures declarative state, audit trails, and rollback capability

**Highly Available**: Pod anti-affinity, HPA, and multiple replicas prevent single points of failure

**Cost-Optimized**: Redis caching and batch database writes reduce infrastructure load

**Secure**: Internal ALB, VPC-only access, TLS termination, no public exposure

Why `ghcr.io/berriai/litellm-database:v1.86.2`?

We chose the **database-enabled image** because:

Deployment Configuration

```
apiVersion: apps/v1kind: Deploymentmetadata:  name: litellm  namespace: litellm-ns  labels:    app: litellmspec:  replicas: 1  # HPA will scale this dynamically  revisionHistoryLimit: 10  selector:    matchLabels:      app: litellm  strategy:    type: RollingUpdate    rollingUpdate:      maxSurge: 1      maxUnavailable: 0  # Zero-downtime deployments  template:    metadata:      labels:        app: litellm    spec:      affinity:        podAntiAffinity:          preferredDuringSchedulingIgnoredDuringExecution:            - weight: 100              podAffinityTerm:                labelSelector:                  matchLabels:                    app: litellm                topologyKey: kubernetes.io/hostname      containers:        - name: litellm          image: ghcr.io/berriai/litellm-database:v1.86.2          imagePullPolicy: IfNotPresent          args:            - "--config"            - "/etc/litellm/config.yaml"            - "--port"            - "4000"            - "--num_workers"            - "1"  # Single worker per pod for predictable resource usage - recomended          ports:            - name: http              containerPort: 4000          env:            - name: POSTGRES_USER              valueFrom:                configMapKeyRef:                  name: litellm-pg-config                  key: POSTGRES_USER            - name: POSTGRES_DB              valueFrom:                configMapKeyRef:                  name: litellm-pg-config                  key: POSTGRES_DB            - name: POSTGRES_PASSWORD              valueFrom:                secretKeyRef:                  name: litellm-pg-secret                  key: POSTGRES_PASSWORD            - name: DATABASE_URL              value: "postgresql://$(POSTGRES_USER):$(POSTGRES_PASSWORD)@litellm-pg-svc.litellm-ns.svc.cluster.local:5432/$(POSTGRES_DB)"            - name: MASTER_KEY              valueFrom:                secretKeyRef:                  name: litellm-secrets                  key: MASTER_KEY            - name: REDIS_PASSWORD              valueFrom:                secretKeyRef:                  name: litellm-redis-secret                  key: REDIS_PASSWORD            - name: LITELLM_MODE              value: "PRODUCTION"            - name: LITELLM_LOG              value: "ERROR"  # Log only errors to reduce noise            - name: USE_PRISMA_MIGRATE              value: "True"  # Auto-apply database migrations on startup            - name: DOCS_URL              value: "/docs"            - name: ROOT_REDIRECT_URL              value: "/ui"  # Redirect root to admin UI          volumeMounts:            - name: config              mountPath: /etc/litellm              readOnly: true          resources:            requests:              cpu: "1500m"              memory: "3Gi"            limits:              cpu: "1500m"              memory: "3Gi"          livenessProbe:            httpGet:              path: /health/liveliness              port: 4000            initialDelaySeconds: 60  # Allow time for Prisma migration            periodSeconds: 30            failureThreshold: 5  # Tolerates 2.5 minutes of failures          readinessProbe:            httpGet:              path: /health/readiness              port: 4000            initialDelaySeconds: 30            periodSeconds: 10            failureThreshold: 5      volumes:        - name: config          configMap:            name: litellm-config            items:              - key: config.yaml                path: config.yaml
```

Key Design Decisions

```
affinity:  podAntiAffinity:    preferredDuringSchedulingIgnoredDuringExecution:      - weight: 100        podAffinityTerm:          labelSelector:            matchLabels:              app: litellm          topologyKey: kubernetes.io/hostname
```

**Why?** If one EKS node fails, we want LiteLLM pods on other nodes to keep serving traffic. The `preferred` (soft) constraint means Kubernetes will *try* to spread pods but won’t block scheduling if impossible.

**Alternative Considered**: ‘requiredDuringSchedulingIgnoredDuringExecution` (hard constraint) was too strict for our small cluster — it would prevent scaling if only one node had capacity.

**2. Resource Limits = Requests (No Burstability)**

```
resources:  requests:    cpu: "1500m"    memory: "3Gi"  limits:    cpu: "1500m"    memory: "3Gi"
```

**Why?**

3. **Health Check Tuning**

```
livenessProbe:  httpGet:    path: /health/liveliness    port: 4000  initialDelaySeconds: 60  # Critical: Allow Prisma migration to complete  periodSeconds: 30  failureThreshold: 5readinessProbe:  httpGet:    path: /health/readiness    port: 4000  initialDelaySeconds: 30  periodSeconds: 10  failureThreshold: 5
```

**Why 60s initial delay for liveness?**

**Why 5 failure thresholds?**

Lesson Learned: Initially, we set “initialDelaySeconds: 30”, which caused 40% of deployments to fail during migrations. Increasing to 60s achieved 100% success rate.

```
apiVersion: autoscaling/v2kind: HorizontalPodAutoscalermetadata:  name: litellm-hpa  namespace: litellm-nsspec:  scaleTargetRef:    apiVersion: apps/v1    kind: Deployment    name: litellm  minReplicas: 1  maxReplicas: 5  metrics:    - type: Resource      resource:        name: cpu        target:          type: Utilization          averageUtilization: 60    - type: Resource      resource:        name: memory        target:          type: Utilization          averageUtilization: 80  behavior:    scaleUp:      stabilizationWindowSeconds: 120      policies:        - type: Pods          value: 1          periodSeconds: 120      selectPolicy: Max    scaleDown:      stabilizationWindowSeconds: 120      policies:        - type: Pods          value: 1          periodSeconds: 120      selectPolicy: Max
```

Why These Thresholds?

At 60% CPU, we observed response latencies starting to creep above 200ms. Scaling at 60% keeps p95 latency under 150ms.

**2. Memory: 80% Utilization **Memory usage is more predictable:

Scaling at 80% (2.4 Gi) provides comfortable headroom.

**3. Stabilization Windows: 120 Seconds**

**Scale-Up**: Prevents “flapping” during brief traffic spikes (e.g., batch inference jobs). HPA waits 2 minutes before adding a pod, filtering out transient load.

**Scale-Down**: Avoids premature pod removal. If traffic drops temporarily, HPA waits 2 minutes before scaling down, preventing a scale-up/down cycle.

**Real-World Impact**: During a traffic spike from 20 → 80 rps:

```
apiVersion: v1kind: ConfigMapmetadata:  name: litellm-config  namespace: litellm-nsdata:  config.yaml: |    general_settings:      master_key: os.environ/MASTER_KEY      database_url: os.environ/DATABASE_URL      store_model_in_db: true      use_redis_transaction_buffer: true      block_robots: true  # Prevent search engine indexing      background_health_checks: true      health_check_interval: 300  # Check model endpoints every 5 minutes      maximum_spend_logs_retention_period: "60d"      maximum_spend_logs_retention_interval: "1d"      maximum_spend_logs_cleanup_cron: "0 4 * * *"  # Daily cleanup at 4 AM      user_api_key_cache_ttl: 600  # Cache API key lookups for 10 minutes      database_connection_pool_limit: 20      database_connection_timeout: 60      proxy_batch_write_at: 30  # Batch database writes every 30 seconds    litellm_settings:      json_logs: true  # Structured logging for Grafana/CloudWatch      cache: true      set_verbose: false  # Disable verbose logs (reduces noise)      request_timeout: 600  # 10 minutes for long-running model requests      cache_params:        type: redis        host: redis-service        port: 6379        password: os.environ/REDIS_PASSWORD        ttl: 600  # Cache responses for 10 minutes        namespace: "litellm.caching"    router_settings:      routing_strategy: simple-shuffle      redis_host: redis-service      redis_password: os.environ/REDIS_PASSWORD      redis_port: 6379      cooldown_time: 10  # Retry failed models after 10 seconds
```

1. **Database Connection Pooling: 20 Connections**

```
database_connection_pool_limit: 20
```

PostgreSQL’s default `max_connections` is 100. With 5 HPA replicas max: `5 pods × 20 connections = 100 total`. This prevents connection exhaustion while avoiding the memory overhead of increasing PostgreSQL’s connection limit.

2. **Batch Writing: 30-Second Intervals**

```
proxy_batch_write_at: 30
```

Groups 30 seconds of request logs into a single transaction, reducing PostgreSQL CPU by 60% (from 3,000 transactions/min to 2 transactions/min).

Trade-off: up to 30 seconds of logs lost if a pod crashes, which is acceptable for analytics workloads.

3. **Caching Strategy: 10-Minute TTL**

```
user_api_key_cache_ttl: 600 # API key validationcache_params.ttl: 600 # Response caching
```

**API Key**: Reduces database load from 50 QPS (cold cache) to ~0.008 QPS (warm cache) for 5 unique users.

4. **Spend Log Retention: 60 Days**

```
maximum_spend_logs_retention_period: "60d"maximum_spend_logs_cleanup_cron: "0 4 * * *"
```

Retains 2 months of spend data for financial auditing. Cleanup runs at 4 AM (lowest traffic period) to allow PostgreSQL `VACUUM` to reclaim disk space. At 50 rps, 60 days of logs consume ~129 GB (80% of our 100 Gi EBS volume).

5. **Routing Strategy: Simple Shuffle**

```
routing_strategy: simple-shuffle
```

Randomly distributes requests across model deployments. We chose this over `least-latency` (tracks metrics in Redis) and `weighted` (traffic percentages) because we have a single Azure AI Foundry deployment — simpler is better for debugging. Switch to `least-latency` when adding multi-region redundancy.

```
apiVersion: apps/v1kind: StatefulSetmetadata:  name: litellm-pg-statefulset  namespace: litellm-nsspec:  serviceName: litellm-pg-headless-svc  replicas: 1  selector:    matchLabels:      app: litellm-pg  template:    metadata:      labels:        app: litellm-pg    spec:      containers:        - name: postgres          image: postgres:17          imagePullPolicy: IfNotPresent          ports:            - name: postgres              containerPort: 5432          volumeMounts:            - name: litellm-pg-data              mountPath: /var/lib/postgresql/data          env:            - name: POSTGRES_USER              valueFrom:                configMapKeyRef:                  name: litellm-pg-config                  key: POSTGRES_USER            - name: POSTGRES_DB              valueFrom:                configMapKeyRef:                  name: litellm-pg-config                  key: POSTGRES_DB            - name: POSTGRES_PASSWORD              valueFrom:                secretKeyRef:                  name: litellm-pg-secret                  key: POSTGRES_PASSWORD            - name: PGDATA              value: /var/lib/postgresql/data/pgdata          resources:            requests:              cpu: "250m"              memory: "512Mi"            limits:              cpu: "2000m"              memory: "8Gi"          readinessProbe:            exec:              command:                - /bin/sh                - -c                - pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"            initialDelaySeconds: 10            periodSeconds: 5          livenessProbe:            exec:              command:                - /bin/sh                - -c                - pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"            initialDelaySeconds: 30            periodSeconds: 10  volumeClaimTemplates:    - metadata:        name: litellm-pg-data      spec:        accessModes:          - ReadWriteOnce        storageClassName: ebs-sc        resources:          requests:            storage: 100Gi
```

PostgreSQL 17 delivers **20% faster bulk INSERT’s** (critical for batch writes), better JSONB indexing for metadata storage, and incremental VACUUM that reduces write amplification by 30%.

**RDS vs Self-Managed**: RDS db.t3.medium costs $50/month vs. $12/month self-managed on EKS. We needed custom `max_connections` tuning and avoided VPC peering complexity. Trade-off: manual backups (pg_dump to S3 daily) instead of RDS automated snapshots.

StatefulSets provide stable DNS names, ordered deployment (critical for single-replica databases), and persistent volume binding that survives pod restarts. We use a **headless service** (ClusterIP: None) for direct pod connections, eliminating load balancer overhead.

*Resource Allocation: Wide Gap Between Requests and Limits*

```
resources:  requests:    cpu: "250m"    memory: "512Mi"  limits:    cpu: "2000m"    memory: "8Gi"
```

PostgreSQL is **idle 90% of the time** due to **30-second batch writes**, justifying low requests (250m CPU fits on smaller nodes). High limits accommodate burst capacity during batch writes (1.5 CPU, 3 Gi) and daily VACUUM operations (up to 6 Gi memory). Measured performance: P50 CPU at 150m, P95 at 1,200m during writes.

Setting requests = limits would waste EKS node capacity during idle periods.

```
apiVersion: apps/v1kind: Deploymentmetadata:  name: litellm-redis  namespace: litellm-nsspec:  replicas: 1  selector:    matchLabels:      app: litellm-redis  template:    metadata:      labels:        app: litellm-redis    spec:      containers:        - name: redis          image: redis:8-alpine          imagePullPolicy: IfNotPresent          command: ["sh", "-c"]          args:            - 'exec redis-server --requirepass "$REDIS_PASSWORD" --appendonly yes'          ports:            - name: redis              containerPort: 6379          env:            - name: REDIS_PASSWORD              valueFrom:                secretKeyRef:                  name: litellm-redis-secret                  key: REDIS_PASSWORD            - name: REDISCLI_AUTH              valueFrom:                secretKeyRef:                  name: litellm-redis-secret                  key: REDIS_PASSWORD          volumeMounts:            - name: litellm-redis-pvc              mountPath: /data          resources:            requests:              cpu: "200m"              memory: "512Mi"            limits:              cpu: "1000m"              memory: "2Gi"          readinessProbe:            exec:              command: ["redis-cli", "ping"]            initialDelaySeconds: 5            periodSeconds: 10            timeoutSeconds: 3          livenessProbe:            exec:              command: ["redis-cli", "ping"]            initialDelaySeconds: 15            periodSeconds: 20            timeoutSeconds: 3      volumes:        - name: litellm-redis-pvc          persistentVolumeClaim:            claimName: litellm-redis-pvc
```

LiteLLM makes Redis essential for three critical functions:

**1. Response Caching**

Identical prompts (same model, same input) return cached responses instead of hitting the upstream LLM provider. This is crucial for:

Our 10-minute TTL achieved a **40% cache hit rate.**

**2. API Key Validation Caching**

Every LiteLLM request validates the user’s API key against PostgreSQL. Without caching, this means 50 database queries per second. Redis caches API key lookups for 10 minutes, reducing database load by **99.98%** (from 50 QPS to ~0.01 QPS).

**3. Router State Management****

LiteLLM tracks model endpoint health (which deployments are responding, which are failing) in Redis. This enables:

```
apiVersion: networking.k8s.io/v1kind: Ingressmetadata:  name: litellm-ingress  namespace: litellm-ns  annotations:    alb.ingress.kubernetes.io/scheme: internal    alb.ingress.kubernetes.io/target-type: ip    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:01234567890:certificate/a1639c-6c2c-4e60-9211-59cd0a    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80}, {"HTTPS":443}]'    alb.ingress.kubernetes.io/ssl-redirect: "443"    alb.ingress.kubernetes.io/backend-protocol: HTTP    alb.ingress.kubernetes.io/healthcheck-path: /health/readiness    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "30"    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "10"    alb.ingress.kubernetes.io/healthy-threshold-count: "2"    alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"spec:  ingressClassName: alb  rules:    - host: litellm.data-lab.io      http:        paths:          - path: /            pathType: Prefix            backend:              service:                name: litellm-svc                port:                  number: 4000
```

Creates a VPC-internal ALB (no public IP). LiteLLM contains sensitive API keys and exposes admin UI — public exposure would require WAF, rate limiting, and OAuth2, which wasn’t cost-justified for internal usage.

**Target Type: IP**

```
alb.ingress.kubernetes.io/target-type: ip
```

Routes traffic directly to pod IPs, eliminating kube-proxy overhead (~5ms latency reduction) and enabling faster health checks. IP targets are free; instance targets charge per registration.

**Health Check Configuration**

```
alb.ingress.kubernetes.io/healthcheck-path: /health/readinessalb.ingress.kubernetes.io/healthcheck-interval-seconds: "30"
```

Uses `/health/readiness` (pod can serve traffic) vs. `/health/liveliness` (pod process running). Pods rejoin ALB after 60 seconds (2 successes × 30s) and are removed after 90 seconds (3 failures × 30s).

Lesson Learned: 10-second intervals caused false negatives during brief database blips. 30-second intervals tolerate transient issues.

```
apiVersion: argoproj.io/v1alpha1kind: Applicationmetadata:  name: litellm  namespace: argocdspec:  project: litellm  source:    repoURL: https://gitlab.com/devops1952445/argocd-playground.git    targetRevision: main  # Production branch    path: src/apps/litellm  destination:    server: https://kubernetes.default.svc    namespace: litellm-ns  syncPolicy:    automated:      prune: true      selfHeal: true    syncOptions:      - CreateNamespace=true
```

Why ArgoCD for LiteLLM?

1. **Declarative State**: Git is source of truth, not `kubectl apply` commands

2. **Audit Trail**: Every change tracked via Git commits (who, what, when, why)

3. **Rollback**: `git revert` instantly rolls back deployments

4. **Multi-Environment**: Same manifests for dev/staging/prod with Kustomize overlays

5. **Self-Healing**: If someone `kubectl delete` a pod, ArgoCD recreates it within 3 minutes

Auto-Sync with Prune and Self-Heal

```
syncPolicy:automated:prune: true # Delete resources removed from GitselfHeal: true # Revert manual kubectl changes
```

**Prune**: If we delete `litellm_redis_deployment.yml` from Git, ArgoCD deletes the Redis deployment from the cluster.

**Self-Heal**: If an engineer runs `kubectl scale deployment/litellm — replicas=10`, ArgoCD reverts to `replicas: 1` within 3 minutes.

**Is This Too Aggressive?**

**Concern**: Self-heal might revert emergency hotfixes during incidents.

**Mitigation**:

1. **Emergency Override**: Pause ArgoCD sync during incidents: `kubectl patch app litellm -n argocd -p ‘{“spec”:{“syncPolicy”:{“automated”:null}}}’`

2. **Sync Wave**: Critical resources (database, Redis) have `argocd.argoproj.io/sync-wave: “0”`, LiteLLM pods have wave `”1"`. Database won’t self-heal during LiteLLM rollouts.

3. **Notification**: ArgoCD webhooks to Slack alert on any manual changes, prompting engineers to commit fixes to Git.

**Real-World Incident**: An engineer increased PostgreSQL memory limits to debug OOM. Self-heal reverted the change, OOM recurred. We now:

Deploying a production-grade LiteLLM proxy on AWS EKS requires more than just running a container. It demands thoughtful decisions about:

Our deployment serves 500–1000 requests/second across 5 HPA replicas, with p95 latency under 150ms and 99.9% uptime over 1months. The architecture scales to 200 rps without infrastructure changes.

**Key Takeaways**:

1. **Test HPA behavior under load** before production

2. **Enable Prisma migrations** to avoid schema mismatch errors

3. **Monitor database connections** to prevent exhaustion

4. **Use structured JSON logs** for easier debugging

5. **Cache aggressively** if workloads are repetitive

If you’re deploying LiteLLM in your organization, start with this architecture as a foundation. Adjust resource limits based on your load testing, and remember: **GitOps with ArgoCD is your safety net** — every change is reversible, audited, and reproducible.

You can also get the complete source code from my public gitlab project,https://gitlab.com/KnowledgeSharing1/lite-llm-hpa.git

[Production LiteLLM on AWS EKS: High Availability with GitOps](https://pub.towardsai.net/production-litellm-on-aws-eks-high-availability-with-gitops-996eda87e14d) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.