{"slug": "show-hn-zerogate-api-gateway-to-scale-cloud-gpus-to-zero-when-idle", "title": "Show HN: ZeroGate – API gateway to scale cloud GPUs to zero when idle", "summary": "ZeroGate, an open-source event-driven cross-cloud GPU orchestration fabric, eliminates idle hardware costs in multi-tenant inference pipelines by scaling dedicated infrastructure pools to zero when demand ceases. The tool features automated scale-to-zero daemons, dynamic market arbitrage, and a mock mode for local testing without GPU dependencies.", "body_md": "ZeroGate is an open-source, event-driven cross-cloud GPU orchestration fabric. It eliminates unmanaged hardware idle costs in multi-tenant inference (vLLM) pipelines. You no longer have to suffer brutal 5-minute bare-metal cold starts.\n\nSitting directly between your application gateway and underlying hardware providers, ZeroGate implements a reactive architecture. It securely scales dedicated infrastructure pools to absolute zero the moment tenant demand dries up.\n\n- **Automated Scale-to-Zero Daemon**: Continuously evaluates distributed tenant idle-tick registries via background event loops, executing immediate infrastructure erasure to flatten compute bills.\n- **Thread-Safe Concurrency Lock Guard**: Implements non-blocking distributed lock coordination over incoming telemetry surges via Redis, cleanly parking requests while underlying hardware scales up.\n- **Dynamic Market Arbitrage**: Gracefully intercepts provider spot instance exhaustion events, automatically falling back across priority lanes to standard bare-metal configurations without breaking runtime inference streams.\n- **Immutable Relational Billing Ledger**: Features an integrated, real-time relational logging pipeline to calculate token-level utilization metrics and track infrastructure cost savings.\n\nEvaluate ZeroGate's queuing, state boundaries, and automated scaling primitives entirely on your local machine.\n\nBy default, the engine boots with an isolated **Mock Mode** turned 
on (`ZEROGATE_MOCK=True`). This allows you to stress-test the complete orchestration fabric on any hardware (including Apple Silicon or non-GPU laptops) with **zero infrastructure costs, zero provider accounts, and zero local CUDA/NVIDIA driver dependencies**.\n\n```\ngit clone https://github.com/noah-garner/zerogate\ncd zerogate\n```\n\nCopy the pre-sanitized environment template. The default settings are pre-configured to launch the engine cleanly in an offline mock layer:\n\n```\ncp .env.example .env\n```\n\nLaunch the ultra-lightweight Alpine service container stack (API Gateway, Kafka event broker streams, Redis state cache, and PostgreSQL billing database):\n\n```\ndocker compose up --build -d\n```\n\n*(Verify all services are up and healthy by running `docker compose ps`.)*\n\nFire an automated batch of concurrent prompt streams directly inside the private container network mesh:\n\n```\ndocker compose run --build --rm simulator\n```\n\nWhile the simulator container floods the network, watch the event loops handle the infrastructure expansion, rate-limiting, and scale-down lifecycle phases in real time:\n\n```\n# Watch the gateway ingest prompts and handle backpressure limits\ndocker compose logs -f gateway\n\n# Watch the worker lock states, mock compute boot-ups, and SQL ledger commits\ndocker compose logs -f worker\n```\n\nTo watch ZeroGate handle live cluster expansion and scale-to-zero loops entirely inside the local deployment, you need to overwhelm the default baseline thread pools. 
Instead of editing source code, you can trigger this directly via your environment configuration:\n\n- Open your local `.env` file and increase your workload density to breach the burst threshold of 15:\n\n```\nSIM_TOTAL_REQUESTS=20\nSIM_BATCH_SIZE=20\n```\n\n- Trigger the automated over-capacity surge container:\n\n```\ndocker compose run --build --rm simulator\n```\n\n- Open a second terminal tab and monitor the background worker daemon (`docker compose logs -f worker`). You will watch the engine detect `Global Pipeline Load: 20`, spin up the simulation cluster drivers, and execute the infrastructure cleanup erasure loop exactly 10 seconds after the batch clears!\n\nSystem Resilience Note:\n\nIf you fire a secondary workload surge while a scale-to-zero teardown loop is actively running, the engine will instantly intercept the new traffic metrics, cancel the erasure cycle, and spin up fresh compute pools to process the payload without dropping a single packet.\n\nZeroGate provides high-performance, non-blocking telemetry and state queries over your active workspaces and transaction queues.\n\nQuery the real-time lifecycle phase of a specific inference job cached across your distributed state layer.\n\n- **Path**: `GET /v1/status/{request_id}`\n- **Authentication**: None (designed for safe, frictionless frontend/client-side polling without leaking master admin tokens). 
\n- **Verification Command**:\n\n```\ncurl -X GET http://localhost:8000/v1/status/<request_id_from_logs>\n```\n\nPull real-time relational aggregation directly from the PostgreSQL ledger to track token velocity, queue overhead latencies, and total accumulated dollar-value savings.\n\n- **Path**: `GET /v1/analytics`\n- **Required Header**: `X-ZeroGate-Key: {your_workspace_token}`\n- **Verification Command**:\n\n```\ncurl -X GET http://localhost:8000/v1/analytics \\\n    -H \"X-ZeroGate-Key: zerogate-alpha-demo\"\n```\n\nSystems Engineering Note: Telemetry Aggregation\n\nThe `aggregated_idle_tax_saved_usd` metric updates strictly upon the completion of an infrastructure erasure cycle. If your workload testing batch does not breach the `BURST_THRESHOLD` parameter (default: 15), the system processes your tasks entirely on the warm baseline buffer layer without provisioning extra burst nodes. Consequently, no cloud waste occurs, and the ledger will accurately report `0.0` until an over-capacity surge actively triggers a spin-up, idle tracking sequence, and a subsequent teardown loop.\n\n- **Processing / Scaling Phase**:\n\n```\n{\n  \"request_id\": \"73909cd9-3af7-4f20-a209-c5857619c680\",\n  \"status\": \"processing\",\n  \"infrastructure\": { \"cluster_slice\": \"allocated_pool\", \"vllm_state\": \"hot_path_stream_engaged\" }\n}\n```\n\n- **Completed Phase**:\n\n```\n{\n  \"request_id\": \"73909cd9-3af7-4f20-a209-c5857619c680\",\n  \"status\": \"completed\",\n  \"metrics\": { \"execution_duration_seconds\": 4.5, \"estimated_savings_usd\": 0.00022 },\n  \"prompt\": \"...\",\n  \"result\": \"...\",\n  \"message\": \"Inference lifecycle finished. 
Idle footprint flattened.\"\n}\n```\n\n- **Analytics Phase**:\n\n```\n{\n  \"workspace_key\": \"zerogate-alpha-demo\",\n  \"ledger\": {\n    \"total_inferences_processed\": 20,\n    \"total_tokens_generated\": 1460,\n    \"aggregated_idle_tax_saved_usd\": 0.00442,\n    \"average_queue_overhead_ms\": 400.1\n  },\n  \"status\": \"Healthy. Workspace data plane separation verified.\"\n}\n```\n\nWhen you are ready to transition past local simulation loops and orchestrate real hardware layers inside your own private cloud setups, toggle `ZEROGATE_MOCK=False` inside your local `.env`.\n\nThe Hyperstack tier targets high-performance, non-virtualized production workloads. To bypass heavy runtime package installation delays and secure low cold-start latency, this tier utilizes optimized hardware snapshots.\n\n- Save your pre-configured vLLM execution environment as a **Custom Snapshot Image** inside your Hyperstack console.\n- Update your credentials inside your `.env` configuration file:\n\n```\n# ==============================================================================\n# ZEROGATE SYSTEM CONFIGS\n# ==============================================================================\nZEROGATE_MOCK=False\nZEROGATE_API_KEY=your_zerogate_key\nZEROGATE_BASE_URL=your_zerogate_endpoint\n\n# ==============================================================================\n# HYPERSTACK CONFIGS\n# ==============================================================================\nHYPERSTACK_BASE_URL=https://nexgencloud.com\nHYPERSTACK_API_KEY=your_secret_api_key\nHYPERSTACK_MAIN_NODE_IP=your_prewarmed_hyperstack_node_ip\nHYPERSTACK_SSH_KEY_NAME=your_ssh_key_name\nHYPERSTACK_ENVIRONMENT_NAME=your_environment_name\nHYPERSTACK_REGION=your_region_name\n```\n\nOur active engineering sprint is focused on launching native RunPod container driver hooks. 
This will allow deploying public vLLM images straight from a standard API request with **zero custom snapshot configurations required**, cutting hypervisor cold-start latency from 5 minutes to under 90 seconds (**<40 seconds** on cached container nodes). Follow along with our active development tracking inside Issue #1!\n\nDeep-tech infrastructure is built iteratively. We publish our engineering milestones openly to cultivate transparent collaboration with our core alpha developer network.\n\n- **v0.1.0-alpha (Current)**: Full event-driven Kafka consumer gateway, distributed locks, automated scale-to-zero background daemons, and local evaluation engine.\n- **v0.2.0 (Active Sprint)**: Implement fluid cross-cloud pod drivers (RunPod) to leverage container-based GPU scaling, dropping cold starts under 90 seconds (and sub-40 seconds on cached container nodes).\n- **v0.3.0 (Production Enterprise Milestone)**: Transition from a push-based proxy router to a pull-based late-binding work-stealing consumer mesh to optimize multi-node execution throughput.\n\nWe are selecting **5-10 Core Alpha Developers** building production-grade AI platforms who need to optimize infrastructure utilization, secure multi-cloud fault tolerance, and eliminate unmanaged GPU idle tax.\n\n**File an Issue**: Found an edge case in our async locking primitives? 
Open a detailed GitHub issue with your simulator log output.\n\n**Get Early Enterprise Access**: Reach out directly if you require custom private-cloud deployment scripts or dedicated queue isolation.\n\nLicensed under the [Apache 2.0 License](https://github.com/noah-garner/zerogate/blob/main/LICENSE).", "url": "https://wpnews.pro/news/show-hn-zerogate-api-gateway-to-scale-cloud-gpus-to-zero-when-idle", "canonical_source": "https://github.com/noah-garner/zerogate", "published_at": "2026-06-26 14:23:42+00:00", "updated_at": "2026-06-26 14:35:25.631408+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-tools", "developer-tools"], "entities": ["ZeroGate", "vLLM", "Redis", "Kafka", "PostgreSQL", "Alpine", "Docker"], "alternates": {"html": "https://wpnews.pro/news/show-hn-zerogate-api-gateway-to-scale-cloud-gpus-to-zero-when-idle", "markdown": "https://wpnews.pro/news/show-hn-zerogate-api-gateway-to-scale-cloud-gpus-to-zero-when-idle.md", "text": "https://wpnews.pro/news/show-hn-zerogate-api-gateway-to-scale-cloud-gpus-to-zero-when-idle.txt", "jsonld": "https://wpnews.pro/news/show-hn-zerogate-api-gateway-to-scale-cloud-gpus-to-zero-when-idle.jsonld"}}