# Building Stuff That Doesn't Leak Everyone's Data

> Source: <https://dev.to/lovestaco/building-stuff-that-doesnt-leak-everyones-data-7kn>
> Published: 2026-06-29 12:09:33+00:00

*Hello, I'm Maneshwar. I'm building git-lrc, a Micro AI code reviewer that runs on every commit. It is free and source-available on GitHub. Star git-lrc to help devs discover the project. Do give it a try and share your feedback.*

People talk to chatbots like they are a diary, a therapist, and a lawyer rolled into one.

They paste in medical histories, half their codebase, and the occasional 2 a.m. confession.

Then one day that "private" conversation turns up in a Google search result, and everyone acts surprised.

If you build with AI, you are the person standing between all that trust and a very bad headline.

The uncomfortable truth is that an AI system is basically a giant memory sponge with an API in front of it, and sponges leak.

Let's talk about where the leaks come from and how to stop being the cautionary tale in someone else's blog post.

Models do not learn from vibes.

They learn from data, and modern systems are hungry in a way that older software never was.

A normal CRUD app touches the fields you ask it to touch.

An AI pipeline slurps up structured records, unstructured text, images, voice notes, clickstreams, and whatever else it can reach, then transforms all of it into something it can train on or retrieve from.

Every stop on that journey is a place where data can escape.

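Here is the rough shape of it.

*[Diagram not reproduced here: the data pipeline from collection to model, with the leak points drawn as red boxes.]*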

Notice that most of those red boxes are not exotic AI magic.

A public storage bucket and a sloppy logging setup have been ruining people's weekends since long before transformers showed up.

AI just raises the stakes, because now the thing leaking is rich, personal, and often impossible to claw back once it is out.

Here is the part that catches teams off guard.

A model does not only learn general patterns.

It also memorizes chunks of its training data, word for word, especially the rare and unusual stuff.

Things like an email signature, a phone number, an API key someone committed by accident.

The juicy outliers are exactly what models tend to remember.

Researchers proved this is not theoretical.

In 2021 a team led by Nicholas Carlini showed you could [extract verbatim training examples out of GPT-2](https://arxiv.org/abs/2012.07805), including real names, phone numbers, and email addresses.

A [follow-up in 2023](https://arxiv.org/abs/2311.17035) was even nastier.

They found that getting a production chatbot to repeat a single token over and over could knock it out of its polite assistant persona and make it dump memorized training data at roughly 150 times the normal rate.

The lesson for builders is blunt.

If you fine-tune on raw user data, internal docs, or support tickets, assume some of it can be coaxed back out later.

The model is not malicious.

It is just a very confident parrot with a photographic memory and zero discretion.

Sensitive information disclosure climbed all the way to number two on the [OWASP Top 10 for LLM Applications](https://genai.owasp.org/llm-top-10/) for exactly this reason.
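One way to act on that is to scrub the obvious outliers before any text reaches a fine-tuning set. The sketch below is a minimal illustration, not a complete PII detector: the three regex patterns and the `redact` helper are my own assumptions, and a real pipeline would need much broader coverage (names, addresses, IDs, and so on).

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive PII list).
# Secrets are scrubbed before phone numbers so a key containing a long
# digit run is not partially rewritten as a phone number first.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SECRET": re.compile(r"\b(?:sk|pk|api|key)[-_][A-Za-z0-9_-]{16,}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    ticket = "Mail jane@example.com or call +1 415-555-0100, key sk-abcdefghijklmnop1234"
    print(redact(ticket))
```

Regexes catch the easy cases and miss plenty, so treat this as a first filter in front of training data, not proof that the data is clean.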

A lot of privacy plans boil down to deleting the name column and calling it a day.

Then the model, or a curious analyst, stitches the remaining breadcrumbs back into a specific human.

Location plus timestamp plus a couple of behavioral quirks is often more than enough to re-identify someone, even when the obvious identifiers are gone.

This is what privacy folks call inference risk.

A model can predict things you never handed it, like health status or political leaning, from data that looked totally boring on its own.

Stripping the name field does not make that go away.

Treat anonymization as a hard engineering problem with real techniques behind it, not a checkbox you tick before the demo.
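To make the breadcrumb problem concrete, here is a small sketch of a k-anonymity check. That technique is my pick for illustration; the article does not prescribe one. It counts how many rows share the same combination of "boring" fields and flags anyone who sits in a group smaller than `k`. The column names and sample rows are hypothetical.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count rows per unique combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def risky_rows(rows, quasi_identifiers, k):
    """Return rows whose quasi-identifier combination is shared by fewer than k rows."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] < k
    ]


if __name__ == "__main__":
    # Hypothetical "anonymized" export: no name column anywhere.
    rows = [
        {"zip": "94110", "hour": 9, "device": "ios"},
        {"zip": "94110", "hour": 9, "device": "ios"},
        {"zip": "94110", "hour": 9, "device": "ios"},
        {"zip": "10001", "hour": 3, "device": "linux"},
    ]
    print(risky_rows(rows, ["zip", "hour", "device"], k=3))
```

The last row has no name attached, yet it is the only one with that zip, hour, and device, so it still points at one person. A check like this does nothing about inference risk, but it does show quickly that deleting the name column was not enough.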

None of this is hypothetical, and the examples keep getting better, by which I mean worse.

In 2023 Samsung engineers pasted confidential source code into a public chatbot to debug it.

Fast, convenient, and instantly outside the company's control forever.

The fix was a corporate ban, which is the security equivalent of unplugging the router.

In 2025, users discovered that a chatbot "share" feature was quietly making conversations crawlable, so private chats started showing up in plain Google searches.

The feature was killed, but search engines do not have an undo button.

And in early 2026, a popular AI app [exposed around 300 million private messages from 25 million users](https://www.malwarebytes.com/blog/news/2026/02/ai-chat-app-leak-exposes-300-million-messages-tied-to-25-million-users) thanks to a misconfigured backend.

No clever hacker required.

Just a database left open like a fridge with the door ajar.

Spot the pattern. Almost none of these were sophisticated model attacks.

They were boring infrastructure mistakes attached to extremely not boring data.

Privacy is only half the story. The other half is that your model makes decisions, and those decisions can quietly discriminate.

Feed a system historical data full of human bias and it will learn that bias, then apply it at scale with a straight face.

Now imagine that running a hiring filter or a credit check.

The trap is that a neural net cannot explain itself in a way a regulator, or an angry user, will accept.

"The model said no" is not a reason.

If your AI influences anything that affects people's lives, you need a way to show how it got there, what it was trained on, and where it tends to go wrong.

Transparency is not a nice-to-have here.

It is the difference between a defensible system and a lawsuit with extra steps.
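A cheap first step toward showing "where it tends to go wrong" is to compare outcomes across groups. The sketch below is an assumption on my part, not a method from any particular regulation: it computes the approval rate per group and the ratio between the lowest and highest rates. The group labels and decisions are made up.

```python
from collections import defaultdict


def selection_rates(decisions):
    """decisions: iterable of (group, approved) pairs -> approval rate per group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}


def impact_ratio(rates):
    """Lowest approval rate divided by the highest (1.0 means parity)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved in any group, so there is no gap to report
    return min(rates.values()) / highest


if __name__ == "__main__":
    # Hypothetical hiring-filter output: group "a" passes 8 of 10, group "b" 4 of 10.
    decisions = (
        [("a", True)] * 8 + [("a", False)] * 2
        + [("b", True)] * 4 + [("b", False)] * 6
    )
    rates = selection_rates(decisions)
    print(rates, impact_ratio(rates))
```

A low ratio does not prove discrimination and a high one does not rule it out, but it is a number you can track per release and put in front of a reviewer, which beats "the model said no".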

Good news: the defenses are mostly things you already know how to do, just applied with more paranoia.

Start with the simplest and most underrated move of all.

That is data minimization, and it is genuinely the best privacy control you have.

Data you never collected cannot leak, cannot be subpoenaed, and cannot be memorized by a model.

Before you log a field or feed it to training, ask whether you actually need it. The answer is no more often than you think.
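In code, minimization can be as dull as an allowlist sitting in front of your logger or training export. This is a minimal sketch under assumptions: the field names and the `minimize` helper are hypothetical, and the point is only that new fields stay out unless someone adds them on purpose.

```python
# Hypothetical allowlist: only fields someone has justified keeping.
ALLOWED_FIELDS = frozenset({"event", "plan", "latency_ms"})


def minimize(record: dict, allowed=ALLOWED_FIELDS) -> dict:
    """Keep only allowlisted keys; everything else is dropped by default."""
    return {key: value for key, value in record.items() if key in allowed}


if __name__ == "__main__":
    raw = {
        "event": "chat_completed",
        "email": "jane@example.com",
        "prompt": "my lab results say...",
        "latency_ms": 120,
    }
    print(minimize(raw))
```

An allowlist fails closed: a field nobody thought about gets dropped. A blocklist fails open, and that is how a `prompt` column ends up in your analytics warehouse.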

A few more patterns that pull their weight:

Even if you do not care about any of this on principle, the law increasingly cares for you.

The [GDPR](https://gdpr-info.eu/) in the EU and the [CCPA](https://oag.ca.gov/privacy/ccpa) in California already give people rights over their data, including consent, access, and deletion.

The [EU AI Act](https://artificialintelligenceact.eu/) goes further and sorts AI uses into risk tiers, with the spicy stuff like social scoring outright banned and high-risk systems facing real obligations.

Exposing user data can count as a reportable incident even when the user technically shared it themselves through some confusing toggle.

"But they clicked the button" is not the airtight defense people hope it is.

Before you ship anything that touches user data with a model, run through this:

If you can answer those honestly, you are already ahead of most of the apps in the breach roundups.

Disclaimer: This article was written by me; AI was used to fix grammar and improve readability.
