What Is Microsoft Presidio and Why You Need It (Setup + First Detection)

If you're building anything that touches user data and sends it to an LLM, you have a PII problem. Names, emails, phone numbers, credit card numbers, social security numbers sitting in support tickets, chat logs, documents, and database fields. Every time you pipe that data into a prompt, you're sending someone's personal information to a third-party model endpoint. Maybe that's fine for your use case. Maybe it's not. Either way, you should know what's in your data before you make that call.

Microsoft Presidio is an open-source framework that detects and anonymizes PII in text, images, and structured data. It's been around since 2019, it's actively maintained, and it's what I reach for when I need to scrub data before it hits an LLM. This series walks through the entire framework from installation to production deployment. No toy examples. Real workloads.

Presidio has two core modules that handle the detection and anonymization pipeline separately.

The Analyzer finds PII. It combines named entity recognition (NER) from spaCy or Hugging Face transformers with regex pattern matching and contextual scoring. When you feed it text, it returns a list of detected entities with types, confidence scores, and character positions. It doesn't modify the text. It just tells you what it found.
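To make the shape of that output concrete, here's a minimal standard-library sketch of the regex-plus-context idea. This is not Presidio's implementation. The Detection class, the detect_phone helper, the pattern, the context words, and the 0.4 / 0.35 scores are all assumptions I picked for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


# Illustrative values only (assumptions, not Presidio's recognizer settings).
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
CONTEXT_WORDS = ("phone", "call", "mobile")
BASE_SCORE = 0.4
CONTEXT_BOOST = 0.35
CONTEXT_WINDOW = 25  # how many characters before the match to inspect


def detect_phone(text: str) -> list[Detection]:
    """Find phone-like spans and raise the score when a context word is nearby."""
    detections = []
    for match in PHONE_PATTERN.finditer(text):
        window_start = max(0, match.start() - CONTEXT_WINDOW)
        window = text[window_start:match.start()].lower()
        score = BASE_SCORE
        if any(word in window for word in CONTEXT_WORDS):
            score = min(1.0, BASE_SCORE + CONTEXT_BOOST)
        detections.append(
            Detection("PHONE_NUMBER", match.start(), match.end(), score)
        )
    return detections


sample = "My phone number is 206-555-0147. Ticket ref 415-555-0199."
for detection in detect_phone(sample):
    print(detection)
```

Both numbers match the same pattern, but only the first has a context word in front of it, so only that one gets the higher score. The text itself is never modified.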

The Anonymizer takes the analyzer's output and does something with it. Replace detected names with <PERSON>. Redact phone numbers entirely. Mask credit card numbers with asterisks. Hash emails. Encrypt values you need to reverse later. The anonymizer is where you decide how to handle each entity type.
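As a rough mental model, each of those choices is just a function from the detected value to its replacement. The sketch below uses only the standard library and does not call Presidio. The function names and the POLICY mapping are hypothetical, and encryption is left out because it needs a key and a crypto library.

```python
import hashlib


def replace(value: str, new_value: str) -> str:
    """Swap the detected value for a fixed placeholder."""
    return new_value


def redact(value: str) -> str:
    """Remove the detected value entirely."""
    return ""


def mask(value: str, keep_last: int = 4, masking_char: str = "*") -> str:
    """Hide every character except the last `keep_last` ones."""
    hidden = max(0, len(value) - keep_last)
    return masking_char * hidden + value[hidden:]


def sha256_hash(value: str) -> str:
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical per-entity policy: one operator per entity type.
POLICY = {
    "PERSON": lambda value: replace(value, "<PERSON>"),
    "PHONE_NUMBER": redact,
    "CREDIT_CARD": mask,
    "EMAIL_ADDRESS": sha256_hash,
}


def apply_policy(entity_type: str, value: str) -> str:
    """Apply the configured operator, falling back to a <TYPE> placeholder."""
    operator = POLICY.get(entity_type, lambda _value: f"<{entity_type}>")
    return operator(value)


print(apply_policy("PERSON", "John Smith"))
print(apply_policy("CREDIT_CARD", "4111-1111-1111-1111"))
print(apply_policy("LOCATION", "Seattle"))
```

The design point is the lookup table: the decision about what to do lives in one mapping keyed by entity type, separate from the code that finds the entities.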

Beyond those two, Presidio has additional modules for specific use cases. presidio-image-redactor handles OCR on images and redacts PII from screenshots and scanned documents. presidio-structured processes tabular data in DataFrames and JSON. We'll get to those in later parts of this series.

You have two paths: Python packages via pip or Docker containers. I'll cover both because you'll want pip for development and experimentation, and Docker for anything that needs to serve an API.

Set up a virtual environment first. Presidio pulls in spaCy and NLP models that you don't want colliding with other projects.

python -m venv presidio-env
source presidio-env/bin/activate  # Linux/Mac

pip install presidio-analyzer presidio-anonymizer

python -m spacy download en_core_web_lg

The en_core_web_lg model is about 560MB. If you're tight on space or just experimenting, en_core_web_sm works but you'll see lower accuracy on name and location detection. For anything beyond a quick test, use the large model.
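If you want a script to confirm which model is actually present before it runs, a small check like this can help. It relies on the fact that models fetched with spacy download are installed as regular Python packages. The pick_model helper and the preference order are my own assumptions, and the sketch only reports what is installed. Telling Presidio to use a non-default model is a separate configuration step that isn't shown here.

```python
import importlib.util

# Preference order is an assumption: largest model first, small one as fallback.
PREFERRED_MODELS = ("en_core_web_lg", "en_core_web_sm")


def _is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def pick_model(candidates=PREFERRED_MODELS, is_installed=_is_installed):
    """Return the first installed model name, or None if none are installed."""
    for name in candidates:
        if is_installed(name):
            return name
    return None


model = pick_model()
if model is None:
    print("No spaCy model found. Run: python -m spacy download en_core_web_lg")
else:
    print(f"Found spaCy model: {model}")
```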

Presidio publishes official images to Microsoft Container Registry. Each module runs as its own REST API.

docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer

docker run -d -p 5001:3000 mcr.microsoft.com/presidio-analyzer:latest

docker run -d -p 5002:3000 mcr.microsoft.com/presidio-anonymizer:latest

Both containers expose REST APIs on port 3000 internally. Map them to whatever ports you want on the host. Once they're running, you can hit them with curl or any HTTP client.

To verify they're up:

curl http://localhost:5001/health
curl http://localhost:5002/health

Let's feed the analyzer some text and see what comes back. I'll show both the Python API and the REST API so you can pick whichever fits your workflow.

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

text = """
Hi, my name is John Smith and I live in Seattle. 
My email is john.smith@example.com and my phone 
number is 206-555-0147. My SSN is 123-45-6789 
and my credit card is 4111-1111-1111-1111.
"""

results = analyzer.analyze(text=text, language="en")

for result in results:
    print(f"{result.entity_type}: '{text[result.start:result.end].strip()}' "
          f"(score: {result.score:.2f}, position: {result.start}-{result.end})")

Output (exact entities, scores, and offsets depend on your Presidio and spaCy model versions):

PERSON: 'John Smith' (score: 0.85, position: 18-28)
LOCATION: 'Seattle' (score: 0.85, position: 42-49)
EMAIL_ADDRESS: 'john.smith@example.com' (score: 1.00, position: 64-86)
PHONE_NUMBER: '206-555-0147' (score: 0.75, position: 110-122)
US_SSN: '123-45-6789' (score: 0.85, position: 134-145)
CREDIT_CARD: '4111-1111-1111-1111' (score: 1.00, position: 169-188)

The same kind of request against the REST API, using the analyzer container mapped to port 5001:

curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "My name is John Smith and my email is john.smith@example.com",
    "language": "en"
  }'

The response is a JSON array of detected entities with the same fields: entity type, start position, end position, and confidence score.
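If you'd rather call the endpoint from Python than from curl, here's a standard-library sketch. It assumes the analyzer container is mapped to port 5001 as in the docker run command earlier. The helper names are hypothetical, and the sample response is hand-written to match the fields described above, not captured from a real call.

```python
import json
import urllib.request

# Assumes the host port mapping used in the docker run command (5001 -> 3000).
ANALYZER_URL = "http://localhost:5001/analyze"


def build_analyze_request(text, language="en", url=ANALYZER_URL):
    """Build the same POST request the curl command sends."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, language="en", url=ANALYZER_URL, timeout=10.0):
    """Send the request. Needs a running analyzer container."""
    request = build_analyze_request(text, language, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def describe(text, entities):
    """Turn the JSON array of entities into readable lines."""
    return [
        f"{e['entity_type']}: {text[e['start']:e['end']]!r} (score {e['score']:.2f})"
        for e in entities
    ]


# Hand-written sample in the shape described above (not real output).
sample_text = "My name is John Smith and my email is john.smith@example.com"
sample_response = json.loads(
    '[{"entity_type": "PERSON", "start": 11, "end": 21, "score": 0.85},'
    ' {"entity_type": "EMAIL_ADDRESS", "start": 38, "end": 60, "score": 1.0}]'
)
for line in describe(sample_text, sample_response):
    print(line)
```

With the container running, describe(sample_text, analyze(sample_text)) does the same thing against live results.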

Every detection result contains five fields that matter:

entity_type is what Presidio thinks it found: PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, US_SSN, LOCATION, and dozens more.

start and end are character positions in the original text. This is how you know exactly which substring triggered the detection. It's also how the anonymizer knows what to replace.
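Here's a small sketch of why those offsets are enough to rewrite the text. It's plain Python, not Presidio's anonymizer. The replace_spans helper is hypothetical, the offsets are hand-computed for this one string, and it assumes the spans don't overlap.

```python
def replace_spans(text, detections):
    """Replace each (entity_type, start, end) span with a <TYPE> label.

    Assumes the spans do not overlap.
    """
    redacted = text
    # Work from the end of the string so earlier offsets stay valid
    # even though each replacement changes the string's length.
    for entity_type, start, end in sorted(
        detections, key=lambda d: d[1], reverse=True
    ):
        redacted = redacted[:start] + f"<{entity_type}>" + redacted[end:]
    return redacted


text = "My name is John Smith and my email is john.smith@example.com"
detections = [("PERSON", 11, 21), ("EMAIL_ADDRESS", 38, 60)]

print(text[11:21])  # the substring that the PERSON span points at
print(replace_spans(text, detections))
```

Replacing right to left is the important detail. If you go left to right, the first replacement shifts every later offset and you end up cutting the wrong characters.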

score is a confidence value between 0 and 1. A credit card match returns 1.0 when the number also passes checksum validation, because at that point the match is about as certain as a pattern can get. A name detected by NER might return 0.85 because the model is making a probabilistic judgment. You can pass a score_threshold to filter out low-confidence detections. The default is 0, which keeps everything.
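To show how a checksum can turn a pattern match into a confident one, and what a threshold does afterwards, here's a standard-library sketch. It mirrors the idea only. The Detection class, the helper names, and the example scores are assumptions, not Presidio code.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    score: float


def passes_luhn(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def filter_by_threshold(detections, threshold):
    """Keep only detections whose score is at or above the threshold."""
    return [d for d in detections if d.score >= threshold]


candidate = "4111-1111-1111-1111"
detections = [
    # Illustrative scoring: full confidence only if the checksum passes.
    Detection("CREDIT_CARD", 1.0 if passes_luhn(candidate) else 0.0),
    Detection("PERSON", 0.85),
    Detection("PHONE_NUMBER", 0.4),
]

for detection in filter_by_threshold(detections, threshold=0.5):
    print(detection)
```

Raising the threshold trades recall for precision. For scrubbing data before it reaches an LLM, I'd start low and only raise it once false positives become a real problem.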

analysis_explanation tells you which recognizer fired and why. It's only populated when you ask for it by passing return_decision_process=True to analyze. Useful for debugging false positives.

Presidio ships with recognizers for a wide range of entity types across multiple categories.

Global entities (work across languages): credit card numbers, crypto wallet addresses, email addresses, IBAN codes, IP addresses, phone numbers, URLs, domain names, dates.

US-specific: Social Security numbers, bank account numbers, driver's license numbers, ITIN, passport numbers.

UK-specific: NHS numbers.

Other regions: Singapore financial numbers, Australian business numbers, and more through community recognizers.

The full list is in the Presidio supported entities documentation. If your entity type isn't covered, you can build custom recognizers. That's Part 3 of this series.

Detection is only half the job. Let's anonymize the results.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "My name is John Smith and my email is john.smith@example.com"

results = analyzer.analyze(text=text, language="en")

anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized.text)

The default behavior replaces each detected entity with its type label wrapped in angle brackets. In Part 4 we'll dig into all the anonymization operators (replace, redact, mask, hash, encrypt) and when to use each one. For now, the point is that detection and anonymization are separate steps. You can detect without anonymizing, anonymize differently per entity type, or build a pipeline that does both in one shot.

That's the foundation. Presidio installed, first detection running, and you understand what the output looks like. In Part 2, we'll go deeper on the analyzer: how the NER models, regex patterns, and context scoring work together, how to process different text types (emails, support tickets, chat logs), batch processing with presidio-structured, and image redaction with presidio-image-redactor.

This is Part 1 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.

source & further reading

dev.to (original article)
