{"slug": "beyond-the-hype-my-production-playbook-for-docker-swarm", "title": "Beyond the Hype: My Production Playbook for Docker Swarm", "summary": "The article argues that Docker Swarm remains a viable and pragmatic choice for production container orchestration, particularly for teams prioritizing simplicity and low operational complexity over the extensive features of Kubernetes. It presents Swarm as a practical solution for running backend services on small to medium clusters, emphasizing that the best infrastructure is one a team can confidently operate under pressure. The author provides a senior engineering perspective on Swarm's architecture, security, and deployment, comparing its trade-offs directly with Kubernetes.", "body_md": "Every time container orchestration comes up, the conversation almost immediately turns into Kubernetes.\n\nAnd I understand why.\n\nKubernetes is powerful. It has a huge ecosystem, strong abstractions, custom resources, operators, service meshes, admission controllers, and almost unlimited extensibility. For large organizations running complex multi-tenant platforms, Kubernetes often makes sense.\n\nBut after years of working with Linux infrastructure, backend systems, private and public cloud environments, CI/CD pipelines, monitoring stacks, and production deployments, I learned something that is easy to forget:\n\nThe best infrastructure is not always the most powerful one. It is the one your team can operate safely under pressure.\n\nThis is where Docker Swarm still deserves respect.\n\nI do not see Docker Swarm as a toy, a legacy fallback, or something only suitable for small demos. 
I see it as a pragmatic orchestration layer for teams that want production-grade container deployment without turning the orchestrator itself into the main project.\n\nIn this article, I want to share how I think about Docker Swarm from a senior engineering perspective: not as a beginner tutorial, but as a practical playbook for architecture, security, deployment, monitoring, and real production trade-offs.\n\nNo `hello-world` examples.\n\nNo hype.\n\nJust the things that matter when systems are running at 3 AM and someone has to debug them.\n\n## Why I Still Take Docker Swarm Seriously\n\nThe biggest advantage of Docker Swarm is not that it has more features than Kubernetes.\n\nIt does not.\n\nThe advantage is that it gives you enough orchestration with much less operational complexity.\n\nIf your team already understands Docker and `docker-compose`, Swarm feels natural. You can move from a single-machine Compose setup to a multi-node cluster without completely changing the mental model.\n\nThat matters.\n\nIn production, cognitive load is a real cost. Every abstraction you introduce must be learned, documented, monitored, upgraded, secured, and debugged. A more powerful platform can easily become a liability if the team cannot operate it confidently.\n\nFor many real-world backend systems, the requirements are very clear:\n\n- run multiple replicas of an API\n- deploy without downtime\n- roll back automatically when something fails\n- isolate internal services from public traffic\n- keep secrets out of images and environment files\n- monitor node and container health\n- scale horizontally when needed\n- keep the architecture understandable\n\nDocker Swarm can handle these requirements well.\n\nThe important point is this: Swarm is not a replacement for Kubernetes in every scenario. 
But it can be the better engineering choice when simplicity, speed, and operational clarity matter more than infinite extensibility.\n\n## Docker Swarm vs Kubernetes: My Practical Comparison\n\nI do not like religious technology debates. Most tools are good or bad depending on the context.\n\nSo instead of asking, “Which one is better?”, I prefer to ask:\n\nWhat operational cost am I accepting, and what business or technical capability am I getting in return?\n\nHere is how I usually compare Docker Swarm and Kubernetes.\n\n| Area | Docker Swarm | Kubernetes |\n|---|---|---|\n| Operational complexity | Low | High |\n| Learning curve | Friendly if you know Docker Compose | Steep, with many new abstractions |\n| Control plane | Built into Docker Engine | Multiple components and etcd |\n| Service discovery | Built-in DNS and VIP | Built-in, but with more moving parts |\n| Networking | Overlay networks and routing mesh | CNI-based networking |\n| Extensibility | Limited | Very high |\n| Ecosystem | Smaller | Massive |\n| Best fit | Simple to medium production systems | Large platforms and complex orchestration needs |\n\nFor example, if you need custom operators, advanced autoscaling, complex RBAC policies, admission controllers, multi-tenant platform engineering, or service mesh integration, Kubernetes is usually the stronger choice.\n\nBut if your goal is to run backend services, workers, reverse proxies, queues, internal APIs, and scheduled workloads across a small or medium cluster, Swarm may give you a cleaner path with fewer operational surprises.\n\nIn my experience, many teams do not fail because their orchestrator lacks features. 
They fail because their infrastructure becomes too complex for the team operating it.\n\n## The Mental Model: Managers, Workers, Services, and Tasks\n\nBefore talking about production architecture, it is important to understand the Swarm model.\n\nA Swarm cluster has two main node roles:\n\n- **Manager nodes** maintain cluster state and make scheduling decisions.\n- **Worker nodes** run the actual containers.\n\nIn Swarm, you do not usually think in terms of individual containers. You think in terms of **services**.\n\nA service defines the desired state:\n\n- which image to run\n- how many replicas should exist\n- which networks it should join\n- which secrets it needs\n- which constraints control placement\n- how updates and rollbacks should happen\n\nSwarm then creates **tasks** to satisfy that desired state. A task is basically one running instance of a service.\n\nThis desired-state model is one of the most important ideas in orchestration. You are no longer manually saying, “Run this container here.” You are saying, “I want five replicas of this service, and I want them to follow these rules.”\n\nThe orchestrator continuously tries to make reality match that desired state.\n\nThat sounds simple, but it changes how you design deployments.\n\n## Raft, Quorum, and Why Manager Count Matters\n\nManager nodes in Docker Swarm use the Raft consensus algorithm to maintain cluster state.\n\nThis is one of the most important production details.\n\nRaft needs quorum. In simple terms, the cluster must have a majority of managers available to make decisions.\n\nThe formula is:\n\n```\nquorum = floor(number_of_managers / 2) + 1\n```\n\nThe division rounds down, so three managers need 2 votes and four managers need 3. A fourth manager therefore adds no extra fault tolerance.\n\nThis leads to one very practical rule:\n\nDo not run an even number of manager nodes.\n\nIf you run two managers and lose one, the cluster cannot maintain quorum. 
If a network partition happens, you can end up with an unavailable control plane.\n\nFor most production Swarm clusters, I prefer:\n\n- **1 manager** for small non-critical environments\n- **3 managers** for serious production\n- **5 managers** for larger or more resilient setups\n\nI rarely see a good reason to go beyond five managers. More managers increase coordination overhead and do not automatically make your system better.\n\nAlso, managers should be treated carefully. They are not just “normal servers.” They hold the cluster state. If they are overloaded, unstable, or poorly secured, the whole cluster becomes fragile.\n\n## My Rule: Managers Manage, Workers Work\n\nOne of the first production mistakes I try to avoid is running application workloads on manager nodes.\n\nYes, Docker Swarm technically allows it.\n\nBut in a serious environment, I prefer to drain manager nodes:\n\n```\ndocker node update --availability drain <manager-node-name>\n```\n\nThis prevents normal application tasks from being scheduled on that manager.\n\nWhy?\n\nBecause manager nodes are responsible for orchestration, scheduling, cluster state, and Raft communication. If your application has a traffic spike, a memory leak, noisy logs, or heavy CPU usage, you do not want that to compete with the control plane.\n\nA clean separation is easier to reason about:\n\n- managers handle cluster decisions\n- workers run application workloads\n- edge nodes handle public traffic when needed\n- monitoring nodes handle metrics and dashboards when possible\n\nThis separation may look boring, but boring infrastructure is usually what you want in production.\n\n## Networking: The Part You Must Understand Before Production\n\nDocker Swarm networking is powerful, but you need to understand what it is doing.\n\nSwarm gives you overlay networks. 
Services attached to the same overlay network can communicate with each other by service name.\n\nFor example, your API can connect to Redis using:\n\n```\nredis:6379\n```\n\ninstead of hardcoding an IP address.\n\nThat is useful because tasks can move between nodes, containers can restart, and IP addresses can change. Service discovery hides that complexity.\n\nBut there are two important networking concepts you must understand: **VIP mode** and **DNSRR mode**.\n\n## VIP Mode and the Routing Mesh\n\nBy default, Swarm gives a service a Virtual IP, or VIP.\n\nWhen another service connects to that VIP, Docker uses internal load balancing to route traffic to one of the active tasks.\n\nThis is convenient. It gives you service-level load balancing without installing an external load balancer for internal traffic.\n\nSwarm also has the routing mesh. When you publish a port, traffic can arrive on any node in the cluster and still be routed to a container running somewhere else.\n\nFor example:\n\n```\nports:\n  - target: 8080\n    published: 8080\n    protocol: tcp\n```\n\nWith the routing mesh, a request to port `8080` on any node can reach the service.\n\nThis is useful, but it also has trade-offs.\n\nThe biggest one: preserving the real client IP can become difficult. 
If you need accurate client IPs for rate limiting, security logs, fraud checks, or WAF rules, you need to be careful.\n\nIn those cases, I often prefer publishing ports in host mode at the edge:\n\n```\nports:\n  - target: 443\n    published: 443\n    protocol: tcp\n    mode: host\n```\n\nThis bypasses the routing mesh for that published port and gives you more predictable network behavior at the cost of less automatic distribution.\n\nIn production, I usually prefer a clear edge layer using Nginx, Traefik, or HAProxy instead of exposing many services directly through the routing mesh.\n\n## DNSRR Mode: When You Want More Control\n\nVIP mode is simple, but sometimes you want the client or an upstream proxy to see all task IPs directly.\n\nThat is where DNS round-robin mode can help.\n\nYou can configure endpoint mode like this:\n\n```\ndeploy:\n  endpoint_mode: dnsrr\n```\n\nIn this mode, DNS returns multiple task IPs instead of a single VIP.\n\nThis can be useful when you have an external load balancer, a custom proxy layer, or a system that needs direct awareness of backend instances.\n\nBut I would not use DNSRR everywhere by default. VIP mode is usually simpler. DNSRR is useful when you have a specific reason for it.\n\nThat is a general production rule I follow:\n\nDo not choose a more advanced mode just because it exists. Choose it when it solves a real operational problem.\n\n## Security: Swarm Gives You Good Primitives, But You Must Use Them Correctly\n\nSecurity in container orchestration is not one feature. 
It is a posture.\n\nDocker Swarm gives you useful security primitives:\n\n- mutual TLS between nodes\n- encrypted manager communication\n- overlay networks\n- secrets\n- service constraints\n- least-exposure networking\n\nBut these primitives only help if the architecture uses them properly.\n\nA weak Swarm setup usually has the same problems I see in many weak Docker setups:\n\n- public services attached to internal networks unnecessarily\n- secrets stored in `.env` files\n- manager nodes running random workloads\n- no resource limits\n- no clear firewall rules\n- too many published ports\n- no monitoring for abnormal behavior\n- no separation between frontend and backend networks\n\nThe goal is not to make containers magically secure. The goal is to reduce blast radius.\n\n## Use Docker Secrets Instead of Environment Variables\n\nOne of the easiest security improvements is to stop passing sensitive values through environment variables.\n\nEnvironment variables are convenient, but they are not ideal for secrets. They can appear in process inspection, debugging output, crash reports, or accidental logs. 
They can also be exposed through careless use of `docker inspect` or application error reporting.\n\nDocker Secrets are a better default.\n\nA secret is mounted inside the container as a file, usually under:\n\n```\n/run/secrets/<secret_name>\n```\n\nFor example:\n\n```\nservices:\n  api:\n    image: my-registry.example.com/backend-api:1.0.0\n    secrets:\n      - db_password\n    environment:\n      - DB_PASSWORD_FILE=/run/secrets/db_password\n\nsecrets:\n  db_password:\n    external: true\n```\n\nThen your application reads the password from the file path instead of directly from an environment variable.\n\nThis pattern works especially well for Go, Node.js, Python, and PHP services because it is easy to implement a small helper that reads from `*_FILE` variables.\n\nA simple Node.js example:\n\n```javascript\nimport fs from \"fs\";\n\nfunction readSecret(path) {\n  return fs.readFileSync(path, \"utf8\").trim();\n}\n\nconst dbPassword = readSecret(process.env.DB_PASSWORD_FILE);\n```\n\nThis is not perfect security, but it is a much cleaner baseline.\n\n## Encrypt Overlay Networks When Needed\n\nBy default, Docker Swarm encrypts control-plane communication between nodes.\n\nBut application traffic on overlay networks is not encrypted by default. If your services communicate across untrusted networks, multiple data centers, or environments where internal traffic should be treated as sensitive, you should consider encrypted overlay networks.\n\nExample:\n\n```\nnetworks:\n  backend_net:\n    driver: overlay\n    driver_opts:\n      encrypted: \"true\"\n```\n\nThis enables IPsec encryption for traffic on that overlay network.\n\nThere is a performance cost, so I do not enable it blindly for every network in every environment. 
But for sensitive backend communication, I prefer the overhead over silently passing internal traffic in plaintext.\n\nThe senior-level decision is not “encrypt everything always.”\n\nThe decision is:\n\nWhich network paths carry sensitive data, and what is the cost of exposure compared to the cost of encryption?\n\n## Network Segmentation: Do Not Put Everything on One Overlay Network\n\nA common mistake is creating one large overlay network and attaching every service to it.\n\nThat is easy, but it creates unnecessary reachability.\n\nI prefer separating networks by trust boundary.\n\nExample:\n\n```\nnetworks:\n  edge_net:\n    driver: overlay\n\n  app_net:\n    driver: overlay\n\n  data_net:\n    driver: overlay\n    driver_opts:\n      encrypted: \"true\"\n```\n\nThen services are attached only where needed:\n\n- reverse proxy joins `edge_net` and `app_net`\n- API joins `app_net` and `data_net`\n- database joins only `data_net`\n- monitoring joins only monitoring-related networks\n\nThis is simple, but it makes lateral movement harder and reduces accidental exposure.\n\nIn production, “can this service reach that service?” should have a deliberate answer.\n\n## Resource Limits Are Not Optional\n\nContainers without resource limits are a production risk.\n\nA memory leak, CPU-heavy request, bad loop, or unexpected traffic pattern can damage the entire node.\n\nIn Swarm stacks, I like to define both reservations and limits:\n\n```\ndeploy:\n  resources:\n    reservations:\n      cpus: \"0.25\"\n      memory: 256M\n    limits:\n      cpus: \"1.0\"\n      memory: 768M\n```\n\nReservations help the scheduler make better placement decisions.\n\nLimits protect the node from a noisy container.\n\nYou need to tune these numbers based on real metrics, not guesses. 
But having no limits at all is usually worse than starting with conservative values and adjusting later.\n\n## Placement Constraints: Treat Nodes as Different Classes\n\nNot every node should run every workload.\n\nIn real infrastructure, nodes often have different purposes:\n\n- public edge nodes\n- backend worker nodes\n- storage-heavy nodes\n- GPU nodes\n- monitoring nodes\n- nodes in different availability zones\n\nDocker Swarm supports node labels and placement constraints.\n\nExample:\n\n```\ndocker node update --label-add zone=dmz worker-01\ndocker node update --label-add workload=backend worker-02\ndocker node update --label-add storage=fast worker-03\n```\n\nThen in your stack:\n\n```\ndeploy:\n  placement:\n    constraints:\n      - node.role == worker\n      - node.labels.workload == backend\n```\n\nThis gives you a clean way to express infrastructure intent.\n\nFor example, I may want:\n\n- Nginx or Traefik only on edge nodes\n- APIs only on backend workers\n- databases only on storage nodes\n- internal tools never on public-facing nodes\n\nThis is one of those features that looks small but becomes very important as the cluster grows.\n\n## Zero-Downtime Deployment Strategy\n\nA production deployment should not be “stop old containers, start new containers, hope everything works.”\n\nSwarm gives you rolling update controls.\n\nA good baseline looks like this:\n\n```\ndeploy:\n  update_config:\n    parallelism: 1\n    delay: 10s\n    order: start-first\n    failure_action: rollback\n  rollback_config:\n    parallelism: 1\n    delay: 5s\n    order: stop-first\n```\n\nThe most important setting here is often:\n\n```\norder: start-first\n```\n\nThis tells Swarm to start the new task before stopping the old one. 
For APIs and web services, this helps reduce downtime during deployments.\n\nBut this only works well if your application is ready for it.\n\nThat means:\n\n- the app must start reliably\n- health checks must be meaningful\n- migrations must be backward-compatible\n- old and new versions may run at the same time briefly\n- the service must handle graceful shutdown\n\nThe orchestrator can help with deployment, but it cannot fix bad application lifecycle design.\n\n## Health Checks: Do Not Fake Them\n\nA weak health check is worse than no health check because it creates false confidence.\n\nI do not like health checks that only confirm the process is alive.\n\nFor example, this is weak:\n\n```\nGET /health\n```\n\nreturning `200 OK` no matter what.\n\nA better health check should answer:\n\nCan this instance actually serve traffic?\n\nDepending on the service, that may include checking:\n\n- HTTP server readiness\n- database connection\n- Redis connection\n- required config loaded\n- dependency availability\n- migration state\n\nBut be careful: a health check should not become too expensive. If every health check runs a heavy database query every few seconds, the health check becomes part of the problem.\n\nA practical pattern is separating liveness and readiness:\n\n- **liveness**: is the process alive?\n- **readiness**: is the service ready to receive traffic?\n\nIn many Swarm setups, you implement this split at the application and reverse-proxy layer, because Swarm has a single container health check rather than the separate liveness and readiness probes of Kubernetes.\n\nExample:\n\n```\nhealthcheck:\n  test: [\"CMD\", \"wget\", \"-qO-\", \"http://localhost:8080/health\"]\n  interval: 10s\n  timeout: 3s\n  retries: 3\n  start_period: 20s\n```\n\nThe real work is not adding this YAML. 
The real work is making `/health` meaningful.\n\n## A Production-Ready Stack Example\n\nHere is a practical Swarm stack example with edge routing, backend isolation, secrets, placement constraints, update strategy, and resource controls.\n\n```\nversion: \"3.8\"\n\nservices:\n  reverse_proxy:\n    image: nginx:1.25-alpine\n    ports:\n      - target: 443\n        published: 443\n        protocol: tcp\n        mode: host\n    networks:\n      - edge_net\n      - app_net\n    configs:\n      - source: nginx_conf\n        target: /etc/nginx/nginx.conf\n    deploy:\n      mode: global\n      placement:\n        constraints:\n          - node.role == worker\n          - node.labels.zone == dmz\n      resources:\n        reservations:\n          cpus: \"0.10\"\n          memory: 128M\n        limits:\n          cpus: \"0.50\"\n          memory: 256M\n      update_config:\n        parallelism: 1\n        delay: 5s\n        # stop-first: a host-mode port can be bound by only one task per node\n        order: stop-first\n        failure_action: rollback\n\n  api:\n    image: registry.example.com/company/api:v2.4.1\n    networks:\n      - app_net\n      - data_net\n    secrets:\n      - db_password\n      - jwt_private_key\n    environment:\n      - NODE_ENV=production\n      - PORT=8080\n      - DB_HOST=postgres\n      - DB_PASSWORD_FILE=/run/secrets/db_password\n      - JWT_PRIVATE_KEY_FILE=/run/secrets/jwt_private_key\n    healthcheck:\n      test: [\"CMD\", \"wget\", \"-qO-\", \"http://localhost:8080/health\"]\n      interval: 10s\n      timeout: 3s\n      retries: 3\n      start_period: 20s\n    deploy:\n      replicas: 6\n      placement:\n        constraints:\n          - node.role == worker\n          - node.labels.workload == backend\n      resources:\n        reservations:\n          cpus: \"0.25\"\n          memory: 256M\n        limits:\n          cpus: \"1.00\"\n          memory: 768M\n      update_config:\n        parallelism: 2\n        delay: 10s\n        order: start-first\n        failure_action: rollback\n      rollback_config:\n 
       parallelism: 1\n        delay: 5s\n\n  worker:\n    image: registry.example.com/company/worker:v2.4.1\n    networks:\n      - data_net\n    secrets:\n      - db_password\n    environment:\n      - QUEUE_NAME=default\n      - DB_PASSWORD_FILE=/run/secrets/db_password\n    deploy:\n      replicas: 3\n      placement:\n        constraints:\n          - node.role == worker\n          - node.labels.workload == backend\n      resources:\n        reservations:\n          cpus: \"0.25\"\n          memory: 256M\n        limits:\n          cpus: \"1.50\"\n          memory: 1024M\n      update_config:\n        parallelism: 1\n        delay: 15s\n        order: stop-first\n        failure_action: rollback\n\nnetworks:\n  edge_net:\n    driver: overlay\n\n  app_net:\n    driver: overlay\n\n  data_net:\n    driver: overlay\n    driver_opts:\n      encrypted: \"true\"\n\nsecrets:\n  db_password:\n    external: true\n  jwt_private_key:\n    external: true\n\nconfigs:\n  nginx_conf:\n    external: true\n```\n\nThis is not a copy-paste solution for every company. 
But it shows the mindset:\n\n- edge traffic is separated\n- backend traffic is internal\n- sensitive network traffic is encrypted\n- secrets are mounted as files\n- managers are avoided for workloads\n- resources are controlled\n- deployments have rollback behavior\n- public ports are minimized\n\nThat is the difference between “it runs” and “I can operate this under stress.”\n\n## Observability: What I Monitor in Swarm\n\nYou cannot run production infrastructure with hope as your monitoring strategy.\n\nFor Docker Swarm, I usually think about observability at four levels:\n\n- **Node-level metrics**\n- **Container-level metrics**\n- **Service-level behavior**\n- **Application-level signals**\n\nA common stack is:\n\n- Prometheus\n- Grafana\n- Node Exporter\n- cAdvisor\n- Loki or another log aggregation system\n- Alertmanager\n\nNode Exporter gives host metrics:\n\n- CPU\n- memory\n- disk usage\n- disk I/O\n- filesystem pressure\n- network traffic\n\ncAdvisor gives container metrics:\n\n- container CPU usage\n- memory usage\n- restart patterns\n- throttling\n- network traffic per container\n\nPrometheus scrapes metrics, Grafana visualizes them, and Alertmanager handles notifications.\n\nA minimal global cAdvisor service can look like this:\n\n```\nservices:\n  cadvisor:\n    image: gcr.io/cadvisor/cadvisor:v0.49.1\n    volumes:\n      - /:/rootfs:ro\n      - /var/run:/var/run:rw\n      - /sys:/sys:ro\n      - /var/lib/docker/:/var/lib/docker:ro\n    networks:\n      - monitoring\n    deploy:\n      mode: global\n      resources:\n        reservations:\n          memory: 64M\n        limits:\n          memory: 256M\n```\n\nAnd Node Exporter:\n\n```\nservices:\n  node_exporter:\n    image: prom/node-exporter:v1.7.0\n    command:\n      - \"--path.rootfs=/host\"\n    volumes:\n      - \"/:/host:ro,rslave\"\n    networks:\n      - monitoring\n    deploy:\n      mode: global\n```\n\nFor service discovery inside Swarm, I like using the `tasks.<service_name>` DNS pattern.\n\nFor 
example, Prometheus can resolve:\n\n```\ntasks.cadvisor\n```\n\nand discover all running cAdvisor tasks.\n\nThis is simple and works well without introducing another service discovery system.\n\n## The Alerts I Actually Care About\n\nDashboards are useful, but alerts are what wake people up.\n\nI try to avoid noisy alerts. A noisy alerting system becomes background noise and people stop trusting it.\n\nFor Swarm, I care about alerts like:\n\n- manager node down\n- quorum risk\n- worker node down\n- high memory pressure\n- disk usage above threshold\n- container restart loop\n- service replica count below desired state\n- high 5xx rate from the reverse proxy\n- high latency on critical APIs\n- certificate expiration approaching\n- unusual increase in blocked or suspicious requests\n\nThe important thing is to connect infrastructure signals with application signals.\n\nFor example, high CPU alone may not be urgent. But high CPU plus increased latency plus replica restarts is a real incident.\n\nThis is where senior engineering judgment matters. Monitoring is not about collecting everything. It is about knowing which signals explain user impact.\n\n## Logging: Centralize It Early\n\nOn a single server, `docker logs` is fine.\n\nIn a cluster, it is not enough.\n\nOnce a service has replicas across several nodes, manual log inspection becomes painful. You need centralized logs.\n\nI prefer to standardize logs as structured JSON where possible.\n\nA good application log should include:\n\n- timestamp\n- request ID or correlation ID\n- service name\n- environment\n- log level\n- route or operation name\n- latency\n- user or tenant ID when appropriate\n- error code\n- upstream dependency name\n\nFor the infrastructure side, reverse proxy logs are extremely valuable. 
They show:\n\n- real client IP\n- request path\n- status code\n- upstream response time\n- user agent\n- TLS details\n- blocked paths\n- suspicious probing attempts\n\nThis is especially important for public APIs because a lot of malicious traffic starts with boring-looking HTTP requests.\n\n## Backup and Disaster Recovery: Do Not Ignore the Managers\n\nPeople often think about database backups but forget about orchestrator state.\n\nIn Docker Swarm, manager state matters.\n\nYou should have a recovery plan for:\n\n- losing a worker node\n- losing a manager node\n- losing quorum\n- rotating join tokens\n- restoring service definitions\n- restoring secrets/configs\n- rebuilding the cluster from infrastructure code\n\nI do not like relying only on manual server state. I prefer keeping stack files, Nginx configs, Prometheus configs, deployment scripts, and infrastructure notes in Git.\n\nThat way, if the cluster has to be rebuilt, the knowledge is not trapped inside one engineer's terminal history.\n\nFor manager backups, you need to understand your Docker version and operational procedure carefully. The key point is not to improvise disaster recovery during the disaster.\n\nTest the recovery path before you need it.\n\n## CI/CD: Keep Deployments Predictable\n\nA simple Swarm deployment pipeline can be very effective.\n\nA common flow:\n\n- Build the Docker image.\n- Tag it with a commit SHA or version.\n- Push it to a registry.\n- SSH into a manager or use a controlled deployment runner.\n- Run `docker stack deploy` with the updated image.\n
- Verify service state.\n- Check health and logs.\n- Roll back if needed.\n\nExample:\n\n```\ndocker stack deploy \\\n  --with-registry-auth \\\n  -c docker-compose.prod.yml \\\n  my_app\n```\n\nI strongly prefer immutable image tags for production deployments.\n\nAvoid this:\n\n```\nmy-api:latest\n```\n\nPrefer this:\n\n```\nmy-api:2026-05-23-a1b2c3d\n```\n\nor:\n\n```\nmy-api:git-sha-a1b2c3d\n```\n\nWhen something breaks, you need to know exactly what is running.\n\n`latest` is convenient until you are debugging an incident and cannot prove which code is actually deployed.\n\n## Database Migrations: The Dangerous Part of Zero-Downtime Deployments\n\nOrchestrators make it easy to roll out new containers.\n\nThey do not automatically make database changes safe.\n\nThis is where many “zero-downtime” strategies fail.\n\nThe safe pattern is usually expand-and-contract:\n\n- Add new schema changes in a backward-compatible way.\n- Deploy application code that can work with both old and new schema.\n- Backfill data if needed.\n- Switch reads/writes to the new structure.\n- Remove old columns or tables later.\n\nAvoid deployments where the new application requires a schema change that immediately breaks the old version.\n\nDuring rolling updates, old and new containers may run at the same time. Your database design must tolerate that.\n\nThis is not a Docker Swarm problem. 
This is a distributed systems deployment problem.\n\n## When I Would Not Use Docker Swarm\n\nI like Docker Swarm, but I would not use it everywhere.\n\nI would probably choose Kubernetes if I needed:\n\n- custom operators\n- advanced autoscaling patterns\n- strong multi-tenancy\n- complex RBAC and policy enforcement\n- service mesh as a core requirement\n- a large platform team\n- advanced ecosystem integrations\n- cloud-native managed control plane support\n\nI would also avoid Swarm if the organization already has a mature Kubernetes platform and the team knows how to operate it well.\n\nTechnology choice should respect organizational reality.\n\nA simple tool in the wrong organization can still fail. A complex tool in a mature organization can work beautifully.\n\n## When I Would Choose Docker Swarm\n\nI would choose Docker Swarm when the system needs:\n\n- simple multi-node orchestration\n- predictable deployments\n- low operational overhead\n- easy onboarding for Docker-experienced engineers\n- internal APIs and workers\n- small to medium clusters\n- strong enough orchestration without platform engineering complexity\n- fast recovery and simple mental models\n\nFor many backend teams, this is enough.\n\nAnd “enough” is underrated.\n\nIn engineering, the goal is not to maximize architecture complexity. The goal is to deliver reliable systems that the team can understand, secure, monitor, and improve.\n\n## My Final Production Checklist\n\nIf I were reviewing a Docker Swarm setup before production, I would ask these questions:\n\n- Do we have 3 or 5 managers for production?\n- Are manager nodes drained from application workloads?\n- Are public ports minimized?\n- Are edge, app, data, and monitoring networks separated?\n- Are sensitive overlay networks encrypted where needed?\n- Are secrets stored with Docker Secrets instead of `.env` files?\n
- Do services have CPU and memory limits?\n- Do critical services have health checks?\n- Do deployments use rolling updates and rollback behavior?\n- Are image tags immutable?\n- Are logs centralized?\n- Are metrics collected from nodes and containers?\n- Are alerts meaningful and not noisy?\n- Is there a documented recovery plan?\n- Can a new engineer understand the architecture without reading 50 pages of hidden tribal knowledge?\n\nThat last question matters more than people think.\n\nIf the infrastructure only works because one person remembers every command, it is not production-ready. It is a personal project running on production servers.\n\n## Final Thoughts\n\nDocker Swarm is not the loudest technology in the room anymore.\n\nBut that does not make it useless.\n\nFor the right team and the right workload, it is still a clean, stable, and practical way to run containers in production. It gives you orchestration, service discovery, rolling updates, secrets, overlay networks, and scheduling without forcing you to adopt the full complexity of Kubernetes.\n\nMy general philosophy is simple:\n\nKeep infrastructure as simple as possible, but never simpler than production requires.\n\nDocker Swarm fits that philosophy well.\n\nIt is not about avoiding Kubernetes because Kubernetes is bad. 
It is about choosing the smallest reliable system that can do the job.\n\nAnd in many real production environments, that is exactly the kind of decision that keeps systems stable, teams productive, and pagers quiet.", "url": "https://wpnews.pro/news/beyond-the-hype-my-production-playbook-for-docker-swarm", "canonical_source": "https://dev.to/amirsefati/beyond-the-hype-my-production-playbook-for-docker-swarm-54d8", "published_at": "2026-05-23 14:22:13+00:00", "updated_at": "2026-05-23 14:31:29.136233+00:00", "lang": "en", "topics": ["cloud-computing", "developer-tools", "open-source", "enterprise-software"], "entities": ["Docker Swarm", "Kubernetes"], "alternates": {"html": "https://wpnews.pro/news/beyond-the-hype-my-production-playbook-for-docker-swarm", "markdown": "https://wpnews.pro/news/beyond-the-hype-my-production-playbook-for-docker-swarm.md", "text": "https://wpnews.pro/news/beyond-the-hype-my-production-playbook-for-docker-swarm.txt", "jsonld": "https://wpnews.pro/news/beyond-the-hype-my-production-playbook-for-docker-swarm.jsonld"}}