{"slug": "building-stuff-that-doesn-t-leak-everyone-s-data", "title": "Building Stuff That Doesn't Leak Everyone's Data", "summary": "Developer Maneshwar is building git-lrc, a free and source-available Micro AI code reviewer that runs on every commit. The project highlights the critical need for data privacy in AI systems, warning that models can memorize and leak sensitive information from training data, as demonstrated by research on GPT-2 and subsequent studies. Maneshwar emphasizes that anonymization is insufficient and that developers must treat data protection as a hard engineering problem.", "body_md": "*Hello, I'm Maneshwar. I'm building git-lrc, a Micro AI code reviewer that runs on every commit. It is free and source-available on GitHub. Star git-lrc to help devs discover the project. Do give it a try and share your feedback.*\n\nPeople talk to chatbots like they are a diary, a therapist, and a lawyer rolled into one.\n\nThey paste in medical histories, half their codebase, and the occasional 2 a.m. 
confession.\n\nThen one day that \"private\" conversation turns up in a Google search result, and everyone acts surprised.\n\nIf you build with AI, you are the person standing between all that trust and a very bad headline.\n\nThe uncomfortable truth is that an AI system is basically a giant memory sponge with an API in front of it, and sponges leak.\n\nLet's talk about where the leaks come from and how to stop being the cautionary tale in someone else's blog post.\n\nModels do not learn from vibes.\n\nThey learn from data, and modern systems are hungry in a way that older software never was.\n\nA normal CRUD app touches the fields you ask it to touch.\n\nAn AI pipeline slurps up structured records, unstructured text, images, voice notes, clickstreams, and whatever else it can reach, then transforms all of it into something it can train on or retrieve from.\n\nEvery stop on that journey is a place where data can escape.\n\nHere is the rough shape of it.\n\nNotice that most of those red boxes are not exotic AI magic.\n\nA public storage bucket and a sloppy logging setup have been ruining people's weekends since long before transformers showed up.\n\nAI just raises the stakes, because now the thing leaking is rich, personal, and often impossible to claw back once it is out.\n\nHere is the part that catches teams off guard.\n\nA model does not only learn general patterns.\n\nIt also memorizes chunks of its training data, word for word, especially the rare and unusual stuff.\n\nThings like an email signature, a phone number, an API key someone committed by accident.\n\nThe juicy outliers are exactly what models tend to remember.\n\nResearchers proved this is not theoretical.\n\nIn 2021 a team led by Nicholas Carlini showed you could [extract verbatim training examples out of GPT-2](https://arxiv.org/abs/2012.07805), including real names, phone numbers, and email addresses.\n\nA [follow up in 2023](https://arxiv.org/abs/2311.17035) was even nastier.\n\nThey found that 
getting a production chatbot to repeat a single token over and over could knock it out of its polite assistant persona and make it dump memorized training data at roughly 150 times the normal rate.\n\nThe lesson for builders is blunt.\n\nIf you fine-tune on raw user data, internal docs, or support tickets, assume some of it can be coaxed back out later.\n\nThe model is not malicious.\n\nIt is just a very confident parrot with a photographic memory and zero discretion.\n\nSensitive information disclosure climbed all the way to number two on the [OWASP Top 10 for LLM Applications](https://genai.owasp.org/llm-top-10/) for exactly this reason.\n\nA lot of privacy plans boil down to deleting the name column and calling it a day.\n\nThen the model, or a curious analyst, stitches the remaining breadcrumbs back into a specific human.\n\nLocation plus timestamp plus a couple of behavioral quirks is often more than enough to re-identify someone, even when the obvious identifiers are gone.\n\nThis is what privacy folks call inference risk.\n\nA model can predict things you never handed it, like health status or political leaning, from data that looked totally boring on its own.\n\nStripping the name field does not make that go away.\n\nTreat anonymization as a hard engineering problem with real techniques behind it, not a checkbox you tick before the demo.\n\nNone of this is hypothetical, and the examples keep getting better, by which I mean worse.\n\nIn 2023 Samsung engineers pasted confidential source code into a public chatbot to debug it.\n\nFast, convenient, and instantly outside the company's control forever.\n\nThe fix was a corporate ban, which is the security equivalent of unplugging the router.\n\nIn 2025, users discovered that a chatbot \"share\" feature was quietly making conversations crawlable, so private chats started showing up in plain Google searches.\n\nThe feature was killed, but search engines do not have an undo button.\n\nAnd in early 2026, a popular AI 
app [exposed around 300 million private messages from 25 million users](https://www.malwarebytes.com/blog/news/2026/02/ai-chat-app-leak-exposes-300-million-messages-tied-to-25-million-users) thanks to a misconfigured backend.\n\nNo clever hacker required.\n\nJust a database left open like a fridge with the door ajar.\n\nSpot the pattern. Almost none of these were sophisticated model attacks.\n\nThey were boring infrastructure mistakes attached to extremely not boring data.\n\nPrivacy is only half the story. The other half is that your model makes decisions, and those decisions can quietly discriminate.\n\nFeed a system historical data full of human bias and it will learn that bias, then apply it at scale with a straight face.\n\nNow imagine that running a hiring filter or a credit check.\n\nThe trap is that a neural net cannot explain itself in a way a regulator, or an angry user, will accept.\n\n\"The model said no\" is not a reason.\n\nIf your AI influences anything that affects people's lives, you need a way to show how it got there, what it was trained on, and where it tends to go wrong.\n\nTransparency is not a nice to have here.\n\nIt is the difference between a defensible system and a lawsuit with extra steps.\n\nGood news: the defenses are mostly things you already know how to do, just applied with more paranoia.\n\nStart with the simplest and most underrated move of all.\n\nThat is data minimization, and it is genuinely the best privacy control you have.\n\nData you never collected cannot leak, cannot be subpoenaed, and cannot be memorized by a model.\n\nBefore you log a field or feed it to training, ask whether you actually need it. 
The answer is no more often than you think.\n\nA few more patterns that pull their weight:\n\nEven if you do not care about any of this on principle, the law increasingly cares for you.\n\nThe [GDPR](https://gdpr-info.eu/) in the EU and the [CCPA](https://oag.ca.gov/privacy/ccpa) in California already give people rights over their data, including consent, access, and deletion.\n\nThe [EU AI Act](https://artificialintelligenceact.eu/) goes further and sorts AI uses into risk tiers, with the spicy stuff like social scoring outright banned and high risk systems facing real obligations.\n\nExposing user data can count as a reportable incident even when the user technically shared it themselves through some confusing toggle.\n\n\"But they clicked the button\" is not the airtight defense people hope it is.\n\nBefore you ship anything that touches user data with a model, run through this:\n\nIf you can answer those honestly, you are already ahead of most of the apps in the breach roundups.\n\nDisclaimer: This article was written by me; AI was used to fix grammar and improve readability.\n\nAI agents write code fast. They also silently remove logic, change behavior, and introduce bugs — without telling you. You often find out in production.\n\ngit-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.\n\nAny feedback or contributors are welcome! 
It's online, source-available, and ready for anyone to use.\n\n⭐ Star it on GitHub:\n\n| [🇩🇰 Dansk](https://github.com/HexmosTech/git-lrc/readme/README.da.md) | [🇪🇸 Español](https://github.com/HexmosTech/git-lrc/readme/README.es.md) | [🇮🇷 Farsi](https://github.com/HexmosTech/git-lrc/readme/README.fa.md) | [🇫🇮 Suomi](https://github.com/HexmosTech/git-lrc/readme/README.fi.md) | [🇯🇵 日本語](https://github.com/HexmosTech/git-lrc/readme/README.ja.md) | [🇳🇴 Norsk](https://github.com/HexmosTech/git-lrc/readme/README.nn.md) | [🇵🇹 Português](https://github.com/HexmosTech/git-lrc/readme/README.pt.md) | [🇷🇺 Русский](https://github.com/HexmosTech/git-lrc/readme/README.ru.md) | [🇦🇱 Shqip](https://github.com/HexmosTech/git-lrc/readme/README.sq.md) | [🇨🇳 中文](https://github.com/HexmosTech/git-lrc/readme/README.zh.md) | [🇮🇳 हिन्दी](https://github.com/HexmosTech/git-lrc/readme/README.hi.md) |\n\nGenAI today is a **race car without brakes**. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents *silently break things*: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. 
You often find out in production.\n\n**git-lrc is your braking system.** It hooks into `git commit` and runs an AI review on every diff. In short, git-lrc helps **Prevent Outages, Breaches, and Technical Debt Before They Happen**\n\n**At a glance:** [10 risk categories](https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for) · [100+ failure patterns tracked](https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for) · every commit…", "url": "https://wpnews.pro/news/building-stuff-that-doesn-t-leak-everyone-s-data", "canonical_source": "https://dev.to/lovestaco/building-stuff-that-doesnt-leak-everyones-data-7kn", "published_at": "2026-06-29 12:09:33+00:00", "updated_at": "2026-06-29 12:21:15.869372+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-safety", "ai-ethics", "developer-tools"], "entities": ["Maneshwar", "git-lrc", "GitHub", "GPT-2", "Nicholas Carlini", "OWASP"], "alternates": {"html": "https://wpnews.pro/news/building-stuff-that-doesn-t-leak-everyone-s-data", "markdown": "https://wpnews.pro/news/building-stuff-that-doesn-t-leak-everyone-s-data.md", "text": "https://wpnews.pro/news/building-stuff-that-doesn-t-leak-everyone-s-data.txt", "jsonld": "https://wpnews.pro/news/building-stuff-that-doesn-t-leak-everyone-s-data.jsonld"}}