{"slug": "how-to-point-your-ide-and-apps-at-a-local-ai-model-private-on-device", "title": "How to Point Your IDE and Apps at a Local AI Model (Private, On-Device)", "summary": "Off Grid AI Desktop is a free, open-source app that provides an OpenAI-compatible endpoint on local machines, enabling developers to run AI tools against on-device models without an internet connection. The app supports text, vision, embeddings, speech-to-text, text-to-speech, and image generation, and works with any tool that accepts an OpenAI base URL.", "body_md": "Your editor, your terminal scripts, and half the AI tools you installed last month all speak the same protocol: the OpenAI HTTP API. They all assume that protocol points at a server you pay for. It does not have to. Off Grid AI Desktop is a free, open-source app that puts an OpenAI-compatible endpoint on your own Mac or PC, so every one of those tools can run against on-device models instead.\n\nFree, open-source (AGPL-3.0), runs offline. No account, no telemetry, no API key.\n\nThere is one endpoint to remember:\n\n```\nhttp://127.0.0.1:7878/v1\n```\n\nAnything that takes an OpenAI base URL takes this one. IDE extensions, CLI tools, a Python script, a browser extension, a shell alias. You give them this address and a placeholder key, and they get a private inference backend that works on a plane.\n\nIt is bound to loopback, so it answers only from your own machine. Nothing on your network or the internet can reach it. That is the point. Your code, your prompts, and your files go to a process you control, not to a vendor.\n\n| Tier | macOS | Windows |\n|---|---|---|\n| Minimum | Apple Silicon (M1), 16 GB unified memory, macOS 13+, ~12 GB free disk | NVIDIA or recent CPU, 16 GB RAM, Windows 11, ~12 GB free disk |\n| Recommended | M2/M3/M4, 24 GB+ unified memory | NVIDIA GPU (CUDA) or Vulkan GPU, 32 GB RAM |\n\nCPU fallback works on Windows when there is no GPU. 
It runs slower but it runs.\n\nThe gateway is OpenAI-SDK compatible, so the list of things you can point at it is long. A few that developers reach for first:\n\n- A `curl` one-liner or a shell function for quick prompts from the terminal.\n- `openai-python` or `openai-node`, where you change two arguments and the script now runs offline.\n\nOne endpoint covers more than text. You get vision, embeddings, speech-to-text, text-to-speech, and image generation behind the same OpenAI routes, so the tools you point at it are not limited to chat.\n\nStart with the smallest possible test. This confirms the endpoint answers.\n\n```\ncurl http://127.0.0.1:7878/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"local\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Reply with just: it works\"}]\n  }'\n```\n\nNow the same in Python with the official SDK. The two lines that matter are `base_url` and `api_key`.\n\n``` python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http://127.0.0.1:7878/v1\",\n    api_key=\"local\",  # any placeholder, the gateway ignores it\n)\n\nresp = client.chat.completions.create(\n    model=\"local\",\n    messages=[{\"role\": \"user\", \"content\": \"Summarize this commit message: fix off-by-one in pager\"}],\n)\nprint(resp.choices[0].message.content)\n```\n\nIn Node, with the `openai` package:\n\n``` javascript\nimport OpenAI from \"openai\";\n\nconst client = new OpenAI({\n  baseURL: \"http://127.0.0.1:7878/v1\",\n  apiKey: \"local\",\n});\n\nconst resp = await client.chat.completions.create({\n  model: \"local\",\n  messages: [{ role: \"user\", content: \"Rename this variable to be clearer: tmp2\" }],\n});\nconsole.log(resp.choices[0].message.content);\n```\n\nMost IDE assistants and editor AI extensions expose two settings: a base URL and an API key. 
Set them like this:\n\n- Base URL: `http://127.0.0.1:7878/v1`\n- API key: `local`\n- Model: `local`, or whatever id `GET /v1/models` reports\n\nIf the extension also asks for a provider type, choose OpenAI-compatible or custom. From there the extension's chat, inline completion, and edit features run against the model on your disk. To check which models are active and their `kind` (chat, vision, image, speech, transcription), call:\n\n```\ncurl http://127.0.0.1:7878/v1/models\n```\n\nBecause the same endpoint serves every modality, you can build small tools that would normally need three vendors.\n\nTranscribe an audio file with whisper.cpp, sent as multipart:\n\n```\ncurl http://127.0.0.1:7878/v1/audio/transcriptions \\\n  -F \"file=@meeting.m4a\" \\\n  -F \"model=local\"\n```\n\nGenerate embeddings for a local search script, using `all-MiniLM-L6-v2`:\n\n```\ncurl http://127.0.0.1:7878/v1/embeddings \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"local\", \"input\": \"the cat sat on the mat\"}'\n```\n\nThere is also text-to-speech at `/v1/audio/speech` (Kokoro, WAV output, with voice ids from `/v1/audio/voices`) and image generation at `/v1/images/generations`. Same base URL, same placeholder key.\n\nSome calls take a while. The first request to a modality downloads its model, and multi-step image generation runs for seconds to minutes. Rather than risk a client timeout in your script, opt into async with `?async=true`, a body field `\"async\": true`, or the header `Prefer: respond-async`. You get a `202` with a `poll_url`, then poll `GET /v1/requests/{id}` until it finishes. For an IDE assistant doing short chat turns you will not need this, but a batch script will.\n\nModels load on demand per modality and offload when the call ends, so a chat model and an image model never sit in RAM together. 
Your peak memory is set by the largest single job, not the sum of all of them.\n\nThe models themselves are quantized GGUF files at levels like q8_0 and Q4_K, which shrinks a model that wanted tens of gigabytes down to a handful. On macOS the GPU runs them on Metal over unified memory. On Windows it is CUDA for NVIDIA cards or Vulkan for others, with a CPU path as backup. That combination is why a consumer machine handles models that needed a rented server not long ago.\n\nWhen your IDE talks to a hosted AI service, your source code goes to that service. It is logged, billed per token, and tied to an account.\n\nWhen your IDE talks to `127.0.0.1:7878`, the code goes to a process on your own machine and stops there. The gateway makes no outbound calls for inference. There is no telemetry and no account. The whole app is AGPL-3.0, so you can read what it does before you trust it with your repository. Disconnect from the network and every example above keeps working.\n\n- `http://127.0.0.1:7878/v1`. Browse `GET /docs` for the Scalar playground or `/openapi.json` for the spec.\n- `POST /v1/.../mcp` over Streamable HTTP, so MCP clients can call them. A separate article goes into that.\n\nIf the extension lets you set a custom OpenAI base URL and key, yes. Set the URL to `http://127.0.0.1:7878/v1` and the key to any placeholder.\n\nYes. AGPL-3.0, open source, no metered API. You run models on your own hardware, so there is no token bill.\n\nYes. After each modality downloads its model once, every endpoint runs with no internet.\n\nAnything that speaks the OpenAI HTTP API, including `openai-python` and `openai-node`. The gateway also mirrors an Ollama-style models array for tools that expect that.\n\n16 GB works on macOS and Windows. 24 GB or more helps with bigger models and image generation. Models load one at a time, so size for the heaviest single job.\n\nThe endpoint is bound to `127.0.0.1` and makes no outbound inference calls. 
No telemetry, no account, open source. Your repository stays on your disk.\n\nGive your whole machine a private inference backend.", "url": "https://wpnews.pro/news/how-to-point-your-ide-and-apps-at-a-local-ai-model-private-on-device", "canonical_source": "https://dev.to/alichherawalla/how-to-point-your-ide-and-apps-at-a-local-ai-model-private-on-device-1643", "published_at": "2026-06-25 05:21:01+00:00", "updated_at": "2026-06-25 05:43:55.575751+00:00", "lang": "en", "topics": ["developer-tools", "large-language-models", "ai-infrastructure"], "entities": ["Off Grid AI Desktop", "OpenAI", "AGPL-3.0", "whisper.cpp", "all-MiniLM-L6-v2"], "alternates": {"html": "https://wpnews.pro/news/how-to-point-your-ide-and-apps-at-a-local-ai-model-private-on-device", "markdown": "https://wpnews.pro/news/how-to-point-your-ide-and-apps-at-a-local-ai-model-private-on-device.md", "text": "https://wpnews.pro/news/how-to-point-your-ide-and-apps-at-a-local-ai-model-private-on-device.txt", "jsonld": "https://wpnews.pro/news/how-to-point-your-ide-and-apps-at-a-local-ai-model-private-on-device.jsonld"}}