What Is Microsoft Presidio and Why You Need It (Setup + First Detection)

Microsoft Presidio is an open-source framework for detecting and anonymizing personally identifiable information (PII) in text, images, and structured data. Its two core modules, the Analyzer and the Anonymizer, handle detection and anonymization separately: the Analyzer uses named entity recognition and regex to identify PII without modifying the text, while the Anonymizer replaces, redacts, masks, hashes, or encrypts the detected entities. The framework can be installed via Python packages for development or run as Docker containers for production API deployment.

If you're building anything that touches user data and sends it to an LLM, you have a PII problem. Names, emails, phone numbers, credit card numbers, social security numbers sitting in support tickets, chat logs, documents, and database fields. Every time you pipe that data into a prompt, you're sending someone's personal information to a third-party model endpoint. Maybe that's fine for your use case. Maybe it's not. Either way, you should know what's in your data before you make that call.

Microsoft Presidio is an open-source framework that detects and anonymizes PII in text, images, and structured data. It's been around since 2019, it's actively maintained, and it's what I reach for when I need to scrub data before it hits an LLM. This series walks through the entire framework from installation to production deployment. No toy examples. Real workloads.

Presidio has two core modules that handle the detection and anonymization pipeline separately.

The Analyzer finds PII. It combines named entity recognition (NER) from spaCy or Hugging Face transformers with regex pattern matching and contextual scoring. When you feed it text, it returns a list of detected entities with types, confidence scores, and character positions. It doesn't modify the text. It just tells you what it found.

The Anonymizer takes the analyzer's output and does something with it. Replace detected names with `<PERSON>`. Redact phone numbers entirely. Mask credit card numbers with asterisks. Hash emails. Encrypt values you need to reverse later. The anonymizer is where you decide how to handle each entity type.

Beyond those two, Presidio has additional modules for specific use cases. `presidio-image-redactor` handles OCR on images and redacts PII from screenshots and scanned documents. `presidio-structured` processes tabular data in DataFrames and JSON. We'll get to those in later parts of this series.

You have two paths: Python packages via pip or Docker containers.
I'll cover both because you'll want pip for development and experimentation, and Docker for anything that needs to serve an API.

Set up a virtual environment first. Presidio pulls in spaCy and NLP models that you don't want colliding with other projects.

```bash
# Create and activate a virtual environment
python -m venv presidio-env
source presidio-env/bin/activate      # Linux/Mac
# presidio-env\Scripts\activate       # Windows (use this instead of the line above)

# Install the core packages
pip install presidio-analyzer presidio-anonymizer

# Download a spaCy language model (the large model is more accurate)
python -m spacy download en_core_web_lg
```

The `en_core_web_lg` model is about 560MB. If you're tight on space or just experimenting, `en_core_web_sm` works but you'll see lower accuracy on name and location detection. For anything beyond a quick test, use the large model.

Presidio publishes official images to Microsoft Container Registry. Each module runs as its own REST API.

```bash
# Pull the images
docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer

# Run the analyzer on port 5001
docker run -d -p 5001:3000 mcr.microsoft.com/presidio-analyzer:latest

# Run the anonymizer on port 5002
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-anonymizer:latest
```

Both containers expose REST APIs on port 3000 internally. Map them to whatever ports you want on the host. Once they're running, you can hit them with curl or any HTTP client.

To verify they're up:

```bash
curl http://localhost:5001/health
curl http://localhost:5002/health
```

Let's feed the analyzer some text and see what comes back. I'll show both the Python API and the REST API so you can pick whichever fits your workflow.

```python
from presidio_analyzer import AnalyzerEngine

# Initialize the analyzer (loads the default spaCy model)
analyzer = AnalyzerEngine()

# Sample text with multiple PII types
text = """
Hi, my name is John Smith and I live in Seattle.
My email is john.smith@example.com and my phone number is 206-555-0147.
My SSN is 123-45-6789 and my credit card is 4111-1111-1111-1111.
"""

# Analyze the text
results = analyzer.analyze(text=text, language="en")

# Print what we found: entity type, matched substring, score, and offsets
for result in results:
    print(
        f"{result.entity_type}: '{text[result.start:result.end].strip()}' "
        f"(score: {result.score:.2f}, position: {result.start}-{result.end})"
    )
```

Output (exact scores and offsets vary with the Presidio version and spaCy model, so treat the numbers as illustrative):

```
PERSON: 'John Smith' (score: 0.85, position: 18-28)
LOCATION: 'Seattle' (score: 0.85, position: 42-49)
EMAIL_ADDRESS: 'john.smith@example.com' (score: 1.00, position: 64-86)
PHONE_NUMBER: '206-555-0147' (score: 0.75, position: 110-122)
US_SSN: '123-45-6789' (score: 0.85, position: 134-145)
CREDIT_CARD: '4111-1111-1111-1111' (score: 1.00, position: 169-188)
```

The same detection through the REST API:

```bash
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "My name is John Smith and my email is john.smith@example.com",
    "language": "en"
  }'
```

The response is a JSON array of detected entities with the same fields: entity type, start position, end position, and confidence score.

Every detection result contains five fields that matter:

- `entity_type` is what Presidio thinks it found. `PERSON`, `EMAIL_ADDRESS`, `PHONE_NUMBER`, `CREDIT_CARD`, `US_SSN`, `LOCATION`, and dozens more.
- `start` and `end` are character positions in the original text. This is how you know exactly which substring triggered the detection. It's also how the anonymizer knows what to replace.
- `score` is a confidence value between 0 and 1. A credit card number that matches the pattern and passes checksum validation returns 1.0 because that check is deterministic. A name detected by NER might return 0.85 because the model is making a probabilistic judgment. You can set a threshold (the `score_threshold` argument to `analyze`) to filter out low-confidence detections. The default is 0.
- `analysis_explanation` is available in the detailed results (pass `return_decision_process=True` to `analyze`) and tells you which recognizer fired and why. Useful for debugging false positives.

Presidio ships with recognizers for a wide range of entity types across multiple categories.
- Global entities (work across languages): credit card numbers, crypto wallet addresses, email addresses, IBAN codes, IP addresses, phone numbers, URLs, domain names, dates.
- US-specific: Social Security numbers, bank account numbers, driver's license numbers, ITIN, passport numbers.
- UK-specific: NHS numbers.
- Other regions: Singapore financial numbers, Australian business numbers, and more through community recognizers.

The full list is in the [Presidio supported entities documentation](https://microsoft.github.io/presidio/supported_entities/). If your entity type isn't covered, you can build custom recognizers. That's Part 3 of this series.

Detection is only half the job. Let's anonymize the results.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "My name is John Smith and my email is john.smith@example.com"

# Detect PII
results = analyzer.analyze(text=text, language="en")

# Anonymize with default settings (replaces with entity type labels)
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized.text)
```

Output:

```
My name is <PERSON> and my email is <EMAIL_ADDRESS>
```

The default behavior replaces each detected entity with its type label wrapped in angle brackets. In Part 4 we'll dig into all the anonymization operators (replace, redact, mask, hash, encrypt) and when to use each one.

For now, the point is that detection and anonymization are separate steps. You can detect without anonymizing, anonymize differently per entity type, or build a pipeline that does both in one shot.

That's the foundation. Presidio is installed, your first detection has run, and you understand what the output looks like.
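
One way to build intuition for why the `start` and `end` offsets matter is to do the default replacement by hand. The sketch below is plain Python and does not use Presidio at all: `Detection`, `span`, and `replace_with_labels` are hypothetical names made up for illustration, the scores are invented, and the real anonymizer also handles things this ignores (overlapping spans, per-entity operators). It only shows the core idea of filtering by score and substituting spans from right to left so earlier offsets stay valid.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one analyzer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


def replace_with_labels(text, detections, min_score=0.0):
    """Replace each detected span with <ENTITY_TYPE>, skipping low scores."""
    kept = [d for d in detections if d.score >= min_score]
    # Work from the end of the string backwards so that replacing one span
    # does not shift the offsets of the spans that come before it.
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


text = "My name is John Smith and my email is john.smith@example.com"


def span(entity_type, value, score):
    """Build a Detection by locating `value` in `text` (illustration only)."""
    start = text.find(value)
    return Detection(entity_type, start, start + len(value), score)


# Scores here are made up; a real analyzer run would supply them.
detections = [
    span("PERSON", "John Smith", 0.85),
    span("EMAIL_ADDRESS", "john.smith@example.com", 1.0),
]

print(replace_with_labels(text, detections))
# My name is <PERSON> and my email is <EMAIL_ADDRESS>

print(replace_with_labels(text, detections, min_score=0.9))
# My name is John Smith and my email is <EMAIL_ADDRESS>
```

Raising `min_score` in the second call drops the lower-scored name while keeping the email, which is the same trade-off you make when you set a score threshold on the analyzer.
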
In Part 2, we'll go deeper on the analyzer: how the NER models, regex patterns, and context scoring work together, how to process different text types (emails, support tickets, chat logs), batch processing with `presidio-structured`, and image redaction with `presidio-image-redactor`.

This is Part 1 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.