Show HN: ZeroGate – API gateway to scale cloud GPUs to zero when idle

ZeroGate is an open-source, event-driven cross-cloud GPU orchestration fabric. It eliminates unmanaged hardware idle costs in multi-tenant inference (vLLM) pipelines. You no longer have to suffer brutal 5-minute bare-metal cold starts.

Sitting directly between your application gateway and underlying hardware providers, ZeroGate implements a reactive architecture. It securely scales dedicated infrastructure pools to absolute zero the moment tenant demand dries up.

- Automated Scale-to-Zero Daemon: Continuously evaluates distributed tenant idle-tick registries via background event loops, executing immediate infrastructure erasure to flatten compute bills.
- Thread-Safe Concurrency Lock Guard: Implements non-blocking distributed lock coordination over incoming telemetry surges via Redis, cleanly parking requests while underlying hardware scales up.
- Dynamic Market Arbitrage: Gracefully intercepts provider spot instance exhaustion events, automatically falling back across priority lanes to standard bare-metal configurations without breaking runtime inference streams.
- Immutable Relational Billing Ledger: Features an integrated, real-time relational logging pipeline to calculate token-level utilization metrics and track infrastructure cost savings.
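To make the scale-to-zero idea concrete, here is a minimal sketch of an idle-tick registry. It is not taken from the ZeroGate codebase: `IdleRegistry`, `touch`, `expired`, and `reap` are hypothetical names, the state lives in a local dict rather than a distributed store, and the 10-second default simply mirrors the teardown delay mentioned in the local walkthrough.

```python
import time
from dataclasses import dataclass, field


@dataclass
class IdleRegistry:
    """Hypothetical in-memory stand-in for a distributed idle-tick registry."""

    idle_limit_s: float = 10.0
    last_seen: dict = field(default_factory=dict)

    def touch(self, tenant, now=None):
        """Record activity for a tenant (called whenever a request arrives)."""
        self.last_seen[tenant] = time.monotonic() if now is None else now

    def expired(self, now=None):
        """Return tenants idle for at least idle_limit_s, sorted by name."""
        now = time.monotonic() if now is None else now
        return sorted(
            tenant
            for tenant, seen in self.last_seen.items()
            if now - seen >= self.idle_limit_s
        )

    def reap(self, now=None):
        """Forget expired tenants and return them; a real daemon would
        release each tenant's compute pool at this point."""
        gone = self.expired(now)
        for tenant in gone:
            del self.last_seen[tenant]
        return gone
```

A background loop would call `reap()` on a timer and tear down whatever it returns. The optional `now` argument exists only so the logic can be exercised without waiting on a real clock.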

Evaluate ZeroGate's queuing, state boundaries, and automated scaling primitives entirely on your local machine.

By default, the engine boots with an isolated Mock Mode turned on (ZEROGATE_MOCK=True). This allows you to stress-test the complete orchestration fabric on any hardware (including Apple Silicon or non-GPU laptops) with zero infrastructure costs, zero provider accounts, and zero local CUDA/NVIDIA driver dependencies.

git clone https://github.com/noah-garner/zerogate
cd zerogate

Copy the pre-sanitized environment template. The default settings are pre-configured to launch the engine cleanly in an offline mock layer:

cp .env.example .env

Launch the ultra-lightweight Alpine service container stack (API Gateway, Kafka event broker streams, Redis state cache, and PostgreSQL billing database):

docker compose up --build -d

(Verify all services are up and healthy by running docker compose ps)

Fire an automated batch of concurrent prompt streams directly inside the private container network mesh:

docker compose run --build --rm simulator

While the simulator container floods the network, watch the event loops handle the infrastructure expansion, rate-limiting, and scale-down lifecycle phases in real time:

docker compose logs -f gateway

docker compose logs -f worker

To watch ZeroGate handle live cluster expansion and scale-to-zero loops entirely inside the local deployment, you need to overwhelm the default baseline thread pools. Instead of editing source code, you can trigger this directly via your environment configurations:

Open your local .env file and increase your workload density to breach your burst threshold limit of 15:

SIM_TOTAL_REQUESTS=20
SIM_BATCH_SIZE=20

Trigger the automated over-capacity surge container:

docker compose run --build --rm simulator

Open your second terminal tab and monitor your background worker daemon (docker compose logs -f worker). You will watch the engine detect the Global Pipeline Load: 20, spin up the simulation cluster drivers, and execute the infrastructure cleanup erasure loop exactly 10 seconds after the batch clears!

System Resilience Note:

If you fire a secondary workload surge while a scale-to-zero teardown loop is actively running, the engine will instantly intercept the new traffic metrics, cancel the erasure cycle, and spin up fresh compute pools to process the payload without dropping a single packet.
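One way to picture this behaviour is a delayed teardown task that new traffic can cancel. The sketch below is an assumption about the general technique, not ZeroGate's actual implementation: `TeardownGuard` and its methods are hypothetical, and it only covers cancelling a pending countdown, not unwinding a teardown that has already started releasing hardware.

```python
import asyncio


class TeardownGuard:
    """Hypothetical helper: a delayed teardown that new traffic can cancel."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.torn_down = False
        self._task = None

    async def _teardown(self):
        await asyncio.sleep(self.delay_s)
        self.torn_down = True  # real code would release the compute pool here

    def schedule(self):
        """Start (or restart) the countdown; must run inside an event loop."""
        self.cancel()
        self._task = asyncio.create_task(self._teardown())

    def cancel(self):
        """Abort a pending teardown, e.g. because a new request arrived."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None


async def demo():
    guard = TeardownGuard(delay_s=0.05)
    guard.schedule()              # batch cleared, countdown starts
    await asyncio.sleep(0.01)
    guard.cancel()                # a second surge arrives mid-countdown
    await asyncio.sleep(0.08)
    survived = not guard.torn_down
    guard.schedule()              # traffic clears again
    await asyncio.sleep(0.1)
    return survived, guard.torn_down
```

Running `asyncio.run(demo())` shows the pool surviving the interrupted countdown and being torn down only after the second, uninterrupted one.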

ZeroGate provides high-performance, non-blocking telemetry and state queries over your active workspaces and transaction queues.

Query the real-time lifecycle phase of a specific inference job cached across your distributed state layer.

- Path: GET /v1/status/{request_id}
- Authentication: None (designed for safe, frictionless frontend/client-side polling without leaking master admin tokens).
- Verification Command:

curl -X GET http://localhost:8000/v1/status/<request_id_from_logs>
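For client-side polling, a small wrapper around the same endpoint might look like the following. This is a sketch under stated assumptions: the function names, the one-second interval, and the poll limit are illustrative, and only the "processing" and "completed" status values that appear in the sample payloads in this README are assumed to exist.

```python
import json
import time
import urllib.request


def fetch_status(base_url, request_id):
    """GET /v1/status/{request_id} and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/v1/status/{request_id}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_completed(request_id, fetch, interval_s=1.0, max_polls=30):
    """Poll until the payload reports status 'completed'.

    `fetch` is any callable taking a request id and returning a dict,
    which keeps the loop testable without a running gateway.
    Returns the final payload, or None if max_polls is exhausted.
    """
    for _ in range(max_polls):
        payload = fetch(request_id)
        if payload.get("status") == "completed":
            return payload
        time.sleep(interval_s)
    return None
```

Against the local stack this could be called as `wait_until_completed(rid, lambda r: fetch_status("http://localhost:8000", r))`, where `rid` is a request id copied from the gateway logs.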

Pull real-time relational aggregation directly from the PostgreSQL ledger to track token velocity, queue overhead latencies, and total accumulated dollar-value savings.

- Path: GET /v1/analytics
- Required Header: X-ZeroGate-Key: {your_workspace_token}
- Verification Command:

curl -X GET http://localhost:8000/v1/analytics \
    -H "X-ZeroGate-Key: zerogate-alpha-demo"

Systems Engineering Note: Telemetry Aggregation

The aggregated_idle_tax_saved_usd metric updates strictly upon the completion of an infrastructure erasure cycle. If your workload testing batch does not breach the BURST_THRESHOLD parameter (default: 15), the system processes your tasks entirely on the warm baseline buffer layer without provisioning extra burst nodes. Consequently, no cloud waste occurs, and the ledger will accurately report 0.0 until an over-capacity surge actively triggers a spin-up, idle tracking sequence, and a subsequent teardown loop.
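The rule in this note can be expressed as a short sketch. Two things here are assumptions rather than documented behaviour: that "breaching" the threshold means strictly exceeding it, and that each completed teardown credits a fixed amount (`saved_per_teardown_usd` is a made-up parameter for illustration).

```python
BURST_THRESHOLD = 15  # default value quoted in the note above


def needs_burst(pipeline_load, threshold=BURST_THRESHOLD):
    """Assumption: 'breaching' the threshold means strictly exceeding it."""
    return pipeline_load > threshold


def ledger_after(batches, saved_per_teardown_usd, threshold=BURST_THRESHOLD):
    """Accumulate savings only for batches that trigger a spin-up and teardown.

    Batches that stay on the warm baseline never provision burst nodes,
    so they contribute nothing to the running total.
    """
    total = 0.0
    for load in batches:
        if needs_burst(load, threshold):
            total += saved_per_teardown_usd  # credited once teardown completes
    return total
```

With these assumptions, three batches of 5, 10, and 15 requests leave the ledger at 0.0, while a single batch of 20 produces the first non-zero entry.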

Processing / Scaling Phase:

{
    "request_id": "73909cd9-3af7-4f20-a209-c5857619c680",
    "status": "processing",
    "infrastructure": { "cluster_slice": "allocated_pool", "vllm_state": "hot_path_stream_engaged" }
}

Completed Phase:

{
  "request_id": "73909cd9-3af7-4f20-a209-c5857619c680",
  "status": "completed",
  "metrics": { "execution_duration_seconds": 4.5, "estimated_savings_usd": 0.00022 },
  "prompt": "...",
  "result": "...",
  "message": "Inference lifecycle finished. Idle footprint flattened."
}

Analytics Phase:

{
  "workspace_key": "zerogate-alpha-demo",
  "ledger": {
    "total_inferences_processed": 20,
    "total_tokens_generated": 1460,
    "aggregated_idle_tax_saved_usd": 0.00442,
    "average_queue_overhead_ms": 400.1
  },
  "status": "Healthy. Workspace data plane separation verified."
}
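As a worked example of reading the ledger, the sketch below derives per-inference averages from the sample analytics payload above. The helper name `per_inference_averages` is hypothetical; the field names and values are copied from the sample response.

```python
import json

# Sample analytics payload copied from the response shown above.
SAMPLE = """
{
  "workspace_key": "zerogate-alpha-demo",
  "ledger": {
    "total_inferences_processed": 20,
    "total_tokens_generated": 1460,
    "aggregated_idle_tax_saved_usd": 0.00442,
    "average_queue_overhead_ms": 400.1
  },
  "status": "Healthy. Workspace data plane separation verified."
}
"""


def per_inference_averages(payload):
    """Divide the ledger totals by the number of processed inferences."""
    ledger = payload["ledger"]
    count = ledger["total_inferences_processed"]
    if count == 0:
        return {"tokens": 0.0, "saved_usd": 0.0}  # avoid dividing by zero
    return {
        "tokens": ledger["total_tokens_generated"] / count,
        "saved_usd": ledger["aggregated_idle_tax_saved_usd"] / count,
    }


averages = per_inference_averages(json.loads(SAMPLE))
```

For the sample data this works out to 73 tokens and roughly 0.000221 USD saved per inference, which appears consistent with the estimated_savings_usd of 0.00022 shown in the completed-phase payload.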

When you are ready to transition past local simulation loops and orchestrate real hardware layers inside your own private cloud setups, toggle ZEROGATE_MOCK=False inside your local .env.

This tier targets high-performance, non-virtualized production workloads. To avoid heavy runtime package installation delays and keep cold-start latency low, it uses optimized hardware snapshots.

- Save your pre-configured vLLM execution environment as a Custom Snapshot Image inside your Hyperstack console.
- Update your credentials inside your .env configuration file:

ZEROGATE_MOCK=False
ZEROGATE_API_KEY=your_zerogate_key
ZEROGATE_BASE_URL=your_zerogate_endpoint

HYPERSTACK_BASE_URL=https://nexgencloud.com
HYPERSTACK_API_KEY=your_secret_api_key
HYPERSTACK_MAIN_NODE_IP=your_prewarmed_hyperstack_node_ip
HYPERSTACK_SSH_KEY_NAME=your_ssh_key_name
HYPERSTACK_ENVIRONMENT_NAME=your_environment_name
HYPERSTACK_REGION=your_region_name
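Before switching off Mock Mode, it can help to check that none of these values are missing. The following is a sketch of such a pre-flight check, not part of ZeroGate itself: `parse_env` and `missing_live_settings` are hypothetical helpers, and treating every key listed above as mandatory in live mode is an assumption.

```python
# Keys taken from the .env listing above; whether the engine requires
# all of them in live mode is an assumption made for this sketch.
REQUIRED_WHEN_LIVE = (
    "ZEROGATE_API_KEY",
    "ZEROGATE_BASE_URL",
    "HYPERSTACK_BASE_URL",
    "HYPERSTACK_API_KEY",
    "HYPERSTACK_MAIN_NODE_IP",
    "HYPERSTACK_SSH_KEY_NAME",
    "HYPERSTACK_ENVIRONMENT_NAME",
    "HYPERSTACK_REGION",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def missing_live_settings(env):
    """Return required keys that are absent or empty when Mock Mode is off.

    Mock Mode is treated as on unless ZEROGATE_MOCK is explicitly 'False',
    matching the default described earlier in this README.
    """
    if env.get("ZEROGATE_MOCK", "True").lower() != "false":
        return []
    return [key for key in REQUIRED_WHEN_LIVE if not env.get(key)]
```

A typical use would be `missing_live_settings(parse_env(open(".env").read()))`, which returns an empty list when the file is ready for live mode or still in mock mode.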

Our active engineering sprint is focused on launching native RunPod container driver hooks. This will allow deploying public vLLM images straight from a standard API request with zero custom snapshot configurations required, slashing hypervisor cold-start latency down from 5 minutes to <40 seconds. Follow along with our active development tracking inside Issue #1!

Deep-tech infrastructure is built iteratively. We publish our engineering milestones openly to cultivate transparent collaboration with our core alpha developer network.

- v0.1.0-alpha (Current): Full event-driven Kafka consumer gateway, distributed locks, automated scale-to-zero background daemons, and local evaluation engine.
- v0.2.0 (Active Sprint): Implement fluid cross-cloud pod drivers (RunPod) to leverage container-based GPU scaling, dropping cold starts under 90 seconds (and sub-40 seconds on cached container nodes).
- v0.3.0 (Production Enterprise Milestone): Transition from a push-based proxy router to a pull-based late-binding work-stealing consumer mesh to optimize multi-node execution throughput.

We are selecting 5-10 Core Alpha Developers building production-grade AI platforms who need to optimize infrastructure utilization, secure multi-cloud fault tolerance, and eliminate unmanaged GPU idle tax.

- File an Issue: Found an edge case in our async locking primitives? Open a detailed GitHub issue with your simulator log output.
- Get Early Enterprise Access: Reach out directly if you require custom private-cloud deployment scripts or dedicated queue isolation.

Licensed under the Apache 2.0 License.
