Show HN: ZeroGate – API gateway to scale cloud GPUs to zero when idle

ZeroGate is an open-source, event-driven, cross-cloud GPU orchestration fabric. It eliminates idle hardware costs in multi-tenant vLLM inference pipelines by scaling dedicated infrastructure pools to zero when demand ceases. It features automated scale-to-zero daemons, dynamic market arbitrage, and a mock mode for local testing without GPU dependencies.

You no longer have to suffer brutal 5-minute bare-metal cold starts. Sitting directly between your application gateway and the underlying hardware providers, ZeroGate implements a reactive architecture: it securely scales dedicated infrastructure pools to zero the moment tenant demand dries up.

- Automated Scale-to-Zero Daemon: Continuously evaluates distributed tenant idle-tick registries in background event loops and tears down idle infrastructure immediately to flatten compute bills.
- Thread-Safe Concurrency Lock Guard: Coordinates non-blocking distributed locks over incoming telemetry surges via Redis, parking requests while the underlying hardware scales up.
- Dynamic Market Arbitrage: Intercepts provider spot-instance exhaustion events and automatically falls back across priority lanes to standard bare-metal configurations without breaking running inference streams.
- Immutable Relational Billing Ledger: Provides an integrated, real-time relational logging pipeline that calculates token-level utilization metrics and tracks infrastructure cost savings.

Evaluate ZeroGate's queuing, state boundaries, and automated scaling primitives entirely on your local machine. By default, the engine boots with an isolated Mock Mode turned on (ZEROGATE_MOCK=True).

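
To make the Mock Mode default and the idle-tick idea more concrete, here is a minimal Python sketch. It is not ZeroGate's actual code: the names `mock_enabled`, `MockProvider`, `IdleRegistry`, and `idle_limit` are hypothetical, and the variable name `ZEROGATE_MOCK` assumes the setting is written with an underscore. The sketch shows a flag that defaults to mock mode, plus an in-memory registry that tears a tenant's pool down after a fixed number of idle ticks and resets the count whenever new traffic arrives.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List, Mapping, Set


def mock_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True unless ZEROGATE_MOCK is set to a falsy value."""
    # ASSUMPTION: an unset variable means mock mode is on ("by default").
    return env.get("ZEROGATE_MOCK", "True").strip().lower() in {"1", "true", "yes"}


class MockProvider:
    """Hypothetical stand-in for a GPU provider; it only tracks names in memory."""

    def __init__(self) -> None:
        self.running: Set[str] = set()

    def boot(self, tenant: str) -> None:
        self.running.add(tenant)

    def teardown(self, tenant: str) -> None:
        self.running.discard(tenant)


@dataclass
class IdleRegistry:
    """Counts idle ticks per tenant; idle_limit is an illustrative value."""

    idle_limit: int = 3
    ticks: Dict[str, int] = field(default_factory=dict)

    def record_request(self, tenant: str, provider: MockProvider) -> None:
        # New traffic boots the pool if needed and resets the idle counter,
        # which also cancels any teardown that was about to happen.
        if tenant not in provider.running:
            provider.boot(tenant)
        self.ticks[tenant] = 0

    def tick(self, provider: MockProvider) -> List[str]:
        """Advance one idle tick and return the tenants scaled to zero."""
        torn_down = []
        for tenant in list(self.ticks):
            self.ticks[tenant] += 1
            if self.ticks[tenant] >= self.idle_limit:
                provider.teardown(tenant)
                del self.ticks[tenant]
                torn_down.append(tenant)
        return torn_down
```

A background loop would call `tick` on a timer and `record_request` on every incoming prompt. A real deployment would presumably keep the counters in a shared store rather than in process memory.
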
Mock Mode lets you stress-test the complete orchestration fabric on any hardware, including Apple Silicon and non-GPU laptops, with zero infrastructure costs, zero provider accounts, and zero local CUDA/NVIDIA driver dependencies.

Clone the repository:

    git clone https://github.com/noah-garner/zerogate
    cd zerogate

Copy the pre-sanitized environment template. The default settings are pre-configured to launch the engine in the offline mock layer:

    cp .env.example .env

Launch the lightweight Alpine service container stack (API Gateway, Kafka event broker streams, Redis state cache, and PostgreSQL billing database):

    docker compose up --build -d

Verify that all services are up and healthy:

    docker compose ps

Fire an automated batch of concurrent prompt streams directly inside the private container network:

    docker compose run --build --rm simulator

While the simulator container floods the network, watch the event loops handle the infrastructure expansion, rate-limiting, and scale-down lifecycle phases in real time.

Watch the gateway ingest prompts and handle backpressure limits:

    docker compose logs -f gateway

Watch the worker lock states, mock compute boot-ups, and SQL ledger commits:

    docker compose logs -f worker

To watch ZeroGate handle live cluster expansion and scale-to-zero loops entirely inside the local deployment, you need to overwhelm the default baseline thread pools. Instead of editing source code, you can trigger this directly through your environment configuration:

- Open your local .env file and raise the workload density above the burst threshold of 15:

      SIM_TOTAL_REQUESTS=20
      SIM_BATCH_SIZE=20

- Trigger the automated over-capacity surge container:

      docker compose run --build --rm simulator

- Open a second terminal tab and monitor the background worker daemon:

      docker compose logs -f worker

You will see the engine detect Global Pipeline Load: 20, spin up the simulation cluster drivers, and run the infrastructure cleanup loop exactly 10 seconds after the batch clears.

System Resilience Note: If you fire a second workload surge while a scale-to-zero teardown loop is running, the engine intercepts the new traffic metrics, cancels the teardown, and spins up fresh compute pools to process the payload without dropping a single packet.

ZeroGate provides high-performance, non-blocking telemetry and state queries over your active workspaces and transaction queues.

Query the real-time lifecycle phase of a specific inference job cached in the distributed state layer.

- Path: GET /v1/status/{request_id}
- Authentication: None. Designed for safe, frictionless frontend/client-side polling without leaking master admin tokens.
- Verification Command: curl -X GET http://localhost:8000/v1/status/
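
Because the status endpoint above needs no authentication, a client can poll it directly. The following is a minimal Python sketch of such a poller, not an official client. It assumes the response body is JSON with a `status` field, and the terminal phase names `completed` and `failed` are hypothetical, since the source does not document the response schema. The fetch and sleep functions are injectable so the loop can be exercised without a running gateway.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # local gateway address from the curl example

# ASSUMPTION: these phase names are illustrative, not taken from ZeroGate.
TERMINAL_PHASES = {"completed", "failed"}


def fetch_status(request_id: str) -> dict:
    """GET /v1/status/{request_id} and decode the body as JSON (assumed format)."""
    url = f"{BASE_URL}/v1/status/{request_id}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def poll_status(
    request_id: str,
    fetch: Callable[[str], dict] = fetch_status,
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal phase or attempts run out."""
    for attempt in range(max_attempts):
        body = fetch(request_id)
        # ASSUMPTION: the lifecycle phase lives under a "status" key.
        if body.get("status") in TERMINAL_PHASES:
            return body
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"job {request_id} not finished after {max_attempts} polls")
```

With the local stack running, `poll_status("<your request id>")` would block until the job finishes or the attempt budget is spent. A fixed interval keeps the sketch short; a production client might back off between polls.
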