{"slug": "local-ai-needs-data-plane-health-checks", "title": "Local AI Needs Data-Plane Health Checks", "summary": "A developer debugging a local AI mesh network discovered that control-plane health checks can report green while the data plane is dead. The issue was caused by privacy VPN killswitch rules silently dropping traffic on the mesh interface, requiring narrow nftables exceptions to restore connectivity.", "body_md": "The worst network bugs are the ones where every dashboard says green and the packet still dies.\n\nThat was my Sunday.\n\nI have a Mac that I use as my daily machine and a Linux box called `newtorob` with a 2080 Ti in it. Potluck runs a local AI sidecar on each machine. The Mac can use its own model locally, or route a request to another machine in my household over a WireGuard mesh.\n\nThe product shape is simple:\n\nMac app -> local sidecar -> WireGuard mesh -> Linux sidecar -> model runtime -> streamed tokens back to the Mac.\n\nThis is the \"my machines\" path. No model API. No cloud inference. The coordinator handles roster and signaling metadata, but the prompt itself should go directly over the private mesh to my own hardware.\n\nEverything looked connected. The Linux peer was enrolled. The coordinator knew about it. The mesh sidecar was running. The UI showed a peer. The machine had a model loaded.\n\nThen I sent a prompt and got: no reachable peer.\n\nThe control plane was green. 
The data plane was dead.\n\nMost peer health checks answer the wrong question.\n\nA coordinator heartbeat proves the peer can talk to the coordinator.\n\nA WebSocket connection proves the peer can keep one control connection open.\n\nA WireGuard handshake proves two tunnel endpoints exchanged packets recently.\n\nA capabilities response proves a process can report what it thinks it can serve.\n\nNone of those prove that an inference request can cross the exact path the product needs right now.\n\nFor a local AI mesh, the real question is not \"is this peer online?\"\n\nThe real question is:\n\nCan this prompt reach that model and stream a token back right now?\n\nThat distinction matters because the failure modes sit between the layers. A peer can be present in the roster while the tunnel is broken. A tunnel can have a fresh handshake while HTTP over the tunnel fails. A model can be loaded while the process is unreachable from the other machine. A privacy VPN can silently drop traffic on an interface it does not recognize while every higher-level control check looks fine.\n\nThat last one was my bug.\n\nMy first theory was MTU.\n\nThat was not random. WireGuard-over-WireGuard paths are good at producing partial success. A small handshake packet can pass while larger data packets disappear. If path MTU discovery is broken, the tunnel looks alive and the application path dies. This is exactly the kind of problem where \"connected\" and \"usable\" diverge.\n\nTailscale and NetBird both default to conservative MTUs around 1280 for a reason. WireGuard adds overhead. Relays add overhead. Residential networks add weirdness. If you run a local mesh on top of another VPN, a 1420-byte default can turn into a packet shredder.\n\nSo I checked the mesh MTU.\n\nIt was already 1280.\n\nThat was a useful dead end. It ruled out the cleanest explanation and left the uglier one: the packet was not too large. It was not allowed.\n\nThe Linux box runs Mullvad. 
The Mac also has Tailscale. Potluck uses a `potluck0` WireGuard interface and mesh IPs in the `100.64.0.0/10` range.\n\nThat combination has two separate traps.\n\nTailscale treats `100.64.0.0/10` as its space. Its nftables rules can drop packets from that range when they arrive on a non-`tailscale0` interface.\n\nMullvad's killswitch is stricter. It installs nftables chains with default-drop policy and allows traffic only through interfaces it trusts. `potluck0` is not one of them.\n\nFrom Mullvad's perspective, this is correct behavior. A privacy VPN killswitch should not let random interfaces become escape hatches.\n\nFrom Potluck's perspective, this means my own mesh interface is blocked unless I add a narrow exception.\n\nThe fix was three scoped accept rules:\n\n```\nnft insert rule ip filter ts-input iifname potluck0 accept\nnft insert rule inet mullvad input iifname potluck0 accept\nnft insert rule inet mullvad output oifname potluck0 accept\n```\n\nNot a flush. Not a policy change. Not disabling the VPN firewall. Just a hole for the Potluck mesh interface.\n\nAfter that, the Mac could hit:\n\n```\ncurl http://100.64.0.7:8321/health\ncurl http://100.64.0.7:8321/peer/capabilities\n```\n\nBoth returned 200. The prompt routed to the Linux box. The footer showed it ran on `newtorob-a16`.\n\nThat fixed the immediate problem.\n\nThen it broke again.\n\nVPN clients rebuild firewall rules.\n\nThat sentence is obvious after you have been bitten by it once. It is not obvious when you are staring at a mesh that worked five minutes ago.\n\nMullvad, Proton, Nord, and similar clients do not treat nftables as a stable place where your hand-inserted rule gets to live forever. Reconnect the VPN, switch servers, wake from sleep, change networks, and the client may recreate its ruleset. Your narrow exception disappears. The killswitch keeps doing its job. Your mesh goes dark again.\n\nMy first fix was a boot-time one-shot. 
It installed the three accept rules when the machine started. That survives reboots. It does not survive VPN reconnects.\n\nThe better fix was a watcher.\n\nEvery few seconds it checks whether the accept rules that should exist still exist. If Tailscale or Mullvad is not present, it owes nothing. If they are present and any of the three `potluck0` rules are missing, it reruns the same idempotent insert path.\n\nThe loop is boring by design:\n\n```\nwhile true; do\n    sleep \"${POTLUCK_FW_WATCH_INTERVAL:-5}\"\n    if ! rules_intact; then\n        log \"accept rule(s) missing; reapplying\"\n        do_install || log \"reapply hit an error; will retry on next tick\"\n    fi\ndone\n```\n\nThe systemd unit is also boring:\n\n```\nType=exec\nExecStart=/usr/local/lib/potluck/install-firewall-rules.sh --watch\nExecStop=/usr/local/lib/potluck/install-firewall-rules.sh --uninstall\nRestart=always\n```\n\nI tested it the blunt way. Run `--uninstall`, confirm all three rules are missing, wait seven seconds, confirm they are back. The journal logged the reapply event. The mesh stayed usable after that.\n\nThat is not the whole product fix. It is only the repair for this Linux VPN coexistence case.\n\nThe product fix is diagnostics.\n\nThe lesson is not \"add firewall rules.\"\n\nThe lesson is that local AI needs data-plane health checks.\n\nIf a system routes inference across machines, it should have a check that uses the same route as inference. Not just the same peer. Not just the same coordinator. 
The same path.\n\nFor my setup, a real health check should answer separate questions:\n\nThose are different failures with different owners and different fixes.\n\nIf the coordinator is down, restarting Mullvad will not help.\n\nIf the peer is powered off, reapplying nftables rules will not help.\n\nIf the WireGuard key in the coordinator is stale, reloading the model will not help.\n\nIf the model runtime is missing CUDA libraries, the tunnel can be perfect and inference will still fail.\n\nIf Mullvad dropped `potluck0`, the peer can look enrolled and still be unusable.\n\nThe UI should not compress all of that into \"offline.\"\n\nIt should say \"coordinator unreachable,\" \"peer not present,\" \"no tunnel,\" \"relayed,\" \"firewall likely blocking mesh traffic,\" or \"model runtime not ready.\" The exact labels matter less than the principle: name the layer that failed.\n\nIn a normal SaaS product, most of the network path is owned by the operator. The user opens a browser. Your load balancer works or it does not. Your app servers work or they do not. There are still ugly edge cases, but the core path is under one operational umbrella.\n\nLocal AI is different.\n\nThe path crosses the user's laptop, their OS firewall, their VPN, their home router, a mesh tunnel, another machine's firewall, a model sidecar, a Python runtime, a GPU driver, and a model file on disk.\n\nThe product does not get to pretend that is one boolean.\n\nThis is especially true for \"my machines\" routing. The whole point is to make a user's idle hardware useful: Mac for the app, Linux box for GPU inference, Windows desktop for another model, maybe a mini PC in a closet. That is a better architecture for ownership and cost. It is also a worse architecture for lazy health checks.\n\nThe user should not need to know nftables to understand why their peer is unavailable.\n\nThe software should know enough to say: \"Your peer is visible, but data-plane traffic over `potluck0` is blocked. 
Reapply the scoped firewall rules or disable the VPN killswitch exception.\"\n\nEven better, with consent, it should offer the fix.\n\nThe immediate change was operational:\n\n- `potluck0` accepts. Do not flush rulesets. Do not weaken the broader VPN policy.\n- `curl` to `/health` and `/peer/capabilities`, then a prompt that actually runs on the remote machine.\n\nThe next change is product:\n\nReplace the single peer-status badge with a small diagnostics model. Local host, coordinator, relay, tunnel, peer data plane, model runtime. Each layer gets a named failure and a concrete fix.\n\nThat is less elegant than a green dot.\n\nIt is also more honest.\n\nThe check I want is not expensive.\n\nSend a tiny HTTP probe to the peer over the mesh. Sometimes send a larger one to catch MTU and fragmentation problems. If the app is about to route inference, ask the peer for capabilities over the same path. If that passes, optionally send a tiny model-path probe before marking the peer usable for a real prompt.\n\nCache the answer briefly. Debounce flaps. Suppress downstream errors when an upstream layer is already broken.\n\nBut do not call the peer reachable just because a control-plane heartbeat exists.\n\nThat is how I lost an afternoon.\n\nIf you are building local-first AI across machines, do not start with \"peer online.\"\n\nStart with the path:\n\nCan the request leave this process?\n\nCan it cross the mesh?\n\nCan it reach the peer process?\n\nCan the peer reach the model runtime?\n\nCan one token come back?\n\nEverything else is metadata.\n\nThe metadata is still useful. Heartbeats, handshakes, rosters, relay status, and capabilities all help narrow the search. 
But they are not proof that the system can do the work.\n\nA local AI mesh should not ask \"is the peer online?\"\n\nIt should ask \"can this prompt reach that model and stream a token back right now?\"\n\nThat is the health check that matters.\n\nRob writes the *Local AI Engineering Notes* series on strake.dev. He's building [Potluck AI](https://trypotluck.ai), a local-first AI system that routes inference across your own machines and trusted peers, and [Strake](https://strake.dev), a GitHub Action deploy gate.", "url": "https://wpnews.pro/news/local-ai-needs-data-plane-health-checks", "canonical_source": "https://dev.to/newtorob/local-ai-needs-data-plane-health-checks-2ene", "published_at": "2026-06-14 20:59:02+00:00", "updated_at": "2026-06-14 21:10:51.107159+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-infrastructure", "developer-tools"], "entities": ["Potluck", "WireGuard", "Tailscale", "Mullvad", "nftables", "Mac", "Linux", "2080 Ti"], "alternates": {"html": "https://wpnews.pro/news/local-ai-needs-data-plane-health-checks", "markdown": "https://wpnews.pro/news/local-ai-needs-data-plane-health-checks.md", "text": "https://wpnews.pro/news/local-ai-needs-data-plane-health-checks.txt", "jsonld": "https://wpnews.pro/news/local-ai-needs-data-plane-health-checks.jsonld"}}