{"slug": "what-is-microsoft-presidio-and-why-you-need-it-setup-first-detection", "title": "What Is Microsoft Presidio and Why You Need It (Setup + First Detection)", "summary": "Microsoft Presidio, an open-source framework for detecting and anonymizing personally identifiable information (PII) in text, images, and structured data, offers two core modules—the Analyzer and the Anonymizer—that handle detection and anonymization separately. The framework can be installed via Python packages for development or Docker containers for production API deployment, with the Analyzer using named entity recognition and regex to identify PII without modifying text, while the Anonymizer replaces, redacts, masks, hashes, or encrypts detected entities.", "body_md": "If you're building anything that touches user data and sends it to an LLM, you have a PII problem. Names, emails, phone numbers, credit card numbers, social security numbers sitting in support tickets, chat logs, documents, and database fields. Every time you pipe that data into a prompt, you're sending someone's personal information to a third-party model endpoint. Maybe that's fine for your use case. Maybe it's not. Either way, you should know what's in your data before you make that call.\n\nMicrosoft Presidio is an open-source framework that detects and anonymizes PII in text, images, and structured data. It's been around since 2019, it's actively maintained, and it's what I reach for when I need to scrub data before it hits an LLM. This series walks through the entire framework from installation to production deployment. No toy examples. Real workloads.\n\nPresidio has two core modules that handle the detection and anonymization pipeline separately.\n\nThe **Analyzer** finds PII. It combines named entity recognition (NER) from spaCy or Hugging Face transformers with regex pattern matching and contextual scoring. 
When you feed it text, it returns a list of detected entities with types, confidence scores, and character positions. It doesn't modify the text. It just tells you what it found.\n\nThe **Anonymizer** takes the analyzer's output and does something with it. Replace detected names with `<PERSON>`. Redact phone numbers entirely. Mask credit card numbers with asterisks. Hash emails. Encrypt values you need to reverse later. The anonymizer is where you decide how to handle each entity type.\n\nBeyond those two, Presidio has additional modules for specific use cases. **presidio-image-redactor** handles OCR on images and redacts PII from screenshots and scanned documents. **presidio-structured** processes tabular data in DataFrames and JSON. We'll get to those in later parts of this series.\n\nYou have two paths: Python packages via pip or Docker containers. I'll cover both because you'll want pip for development and experimentation, and Docker for anything that needs to serve an API.\n\nSet up a virtual environment first. Presidio pulls in spaCy and NLP models that you don't want colliding with other projects.\n\n```\n# Create and activate a virtual environment\npython -m venv presidio-env\nsource presidio-env/bin/activate  # Linux/Mac\n# presidio-env\\Scripts\\activate   # Windows\n\n# Install the core packages\npip install presidio-analyzer presidio-anonymizer\n\n# Download a spaCy language model (the large model is more accurate)\npython -m spacy download en_core_web_lg\n```\n\nThe `en_core_web_lg` model is about 560MB. If you're tight on space or just experimenting, `en_core_web_sm` works, but you'll see lower accuracy on name and location detection. For anything beyond a quick test, use the large model.\n\nPresidio publishes official images to Microsoft Container Registry. 
Each module runs as its own REST API.\n\n```\n# Pull the images\ndocker pull mcr.microsoft.com/presidio-analyzer\ndocker pull mcr.microsoft.com/presidio-anonymizer\n\n# Run the analyzer on port 5001\ndocker run -d -p 5001:3000 mcr.microsoft.com/presidio-analyzer:latest\n\n# Run the anonymizer on port 5002\ndocker run -d -p 5002:3000 mcr.microsoft.com/presidio-anonymizer:latest\n```\n\nBoth containers expose REST APIs on port 3000 internally. Map them to whatever ports you want on the host. Once they're running, you can hit them with curl or any HTTP client.\n\nTo verify they're up:\n\n```\ncurl http://localhost:5001/health\ncurl http://localhost:5002/health\n```\n\nLet's feed the analyzer some text and see what comes back. I'll show both the Python API and the REST API so you can pick whichever fits your workflow.\n\n``` python\nfrom presidio_analyzer import AnalyzerEngine\n\n# Initialize the analyzer\nanalyzer = AnalyzerEngine()\n\n# Sample text with multiple PII types\ntext = \"\"\"\nHi, my name is John Smith and I live in Seattle. \nMy email is john.smith@example.com and my phone \nnumber is 206-555-0147. 
My SSN is 123-45-6789 \nand my credit card is 4111-1111-1111-1111.\n\"\"\"\n\n# Analyze the text\nresults = analyzer.analyze(text=text, language=\"en\")\n\n# Print what we found\nfor result in results:\n    print(f\"{result.entity_type}: '{text[result.start:result.end].strip()}' \"\n          f\"(score: {result.score:.2f}, position: {result.start}-{result.end})\")\n```\n\nOutput (exact scores and offsets can vary with the Presidio and spaCy model versions):\n\n```\nPERSON: 'John Smith' (score: 0.85, position: 18-28)\nLOCATION: 'Seattle' (score: 0.85, position: 42-49)\nEMAIL_ADDRESS: 'john.smith@example.com' (score: 1.00, position: 64-86)\nPHONE_NUMBER: '206-555-0147' (score: 0.75, position: 110-122)\nUS_SSN: '123-45-6789' (score: 0.85, position: 134-145)\nCREDIT_CARD: '4111-1111-1111-1111' (score: 1.00, position: 169-188)\n```\n\nThe same kind of request against the REST API, using the analyzer container mapped to port 5001:\n\n```\ncurl -X POST http://localhost:5001/analyze \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"text\": \"My name is John Smith and my email is john.smith@example.com\",\n    \"language\": \"en\"\n  }'\n```\n\nThe response is a JSON array of detected entities with the same fields: entity type, start position, end position, and confidence score.\n\nEvery detection result contains five fields that matter:\n\n**entity_type** is what Presidio thinks it found: `PERSON`, `EMAIL_ADDRESS`, `PHONE_NUMBER`, `CREDIT_CARD`, `US_SSN`, `LOCATION`, and dozens more.\n\n**start** and **end** are character positions in the original text. This is how you know exactly which substring triggered the detection. It's also how the anonymizer knows what to replace.\n\n**score** is a confidence value between 0 and 1. A credit card number that matches the pattern and passes checksum validation returns 1.0 because the evidence is about as strong as it gets. A name detected by NER might return 0.85 because the model is making a probabilistic judgment. You can set a threshold to filter out low-confidence detections. The default is 0, which keeps every detection.\n\n**analysis_explanation** is populated when you call `analyze()` with `return_decision_process=True`, and it tells you which recognizer fired and why. 
Useful for debugging false positives.\n\nPresidio ships with recognizers for a wide range of entity types across multiple categories.\n\n**Global entities** (work across languages): credit card numbers, crypto wallet addresses, email addresses, IBAN codes, IP addresses, phone numbers, URLs, domain names, dates.\n\n**US-specific**: Social Security numbers, bank account numbers, driver's license numbers, ITIN, passport numbers.\n\n**UK-specific**: NHS numbers.\n\n**Other regions**: Singapore financial numbers, Australian business numbers, and more through community recognizers.\n\nThe full list is in the [Presidio supported entities documentation](https://microsoft.github.io/presidio/supported_entities/). If your entity type isn't covered, you can build custom recognizers. That's Part 3 of this series.\n\nDetection is only half the job. Let's anonymize the results.\n\n``` python\nfrom presidio_analyzer import AnalyzerEngine\nfrom presidio_anonymizer import AnonymizerEngine\n\nanalyzer = AnalyzerEngine()\nanonymizer = AnonymizerEngine()\n\ntext = \"My name is John Smith and my email is john.smith@example.com\"\n\n# Detect PII\nresults = analyzer.analyze(text=text, language=\"en\")\n\n# Anonymize with default settings (replaces with entity type labels)\nanonymized = anonymizer.anonymize(text=text, analyzer_results=results)\n\nprint(anonymized.text)\n# Output: My name is <PERSON> and my email is <EMAIL_ADDRESS>\n```\n\nThe default behavior replaces each detected entity with its type label wrapped in angle brackets. In Part 4 we'll dig into all the anonymization operators (replace, redact, mask, hash, encrypt) and when to use each one. For now, the point is that detection and anonymization are separate steps. You can detect without anonymizing, anonymize differently per entity type, or build a pipeline that does both in one shot.\n\nThat's the foundation. Presidio installed, first detection running, and you understand what the output looks like. 
In Part 2, we'll go deeper on the analyzer: how the NER models, regex patterns, and context scoring work together, how to process different text types (emails, support tickets, chat logs), batch processing with presidio-structured, and image redaction with presidio-image-redactor.\n\n*This is Part 1 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.*", "url": "https://wpnews.pro/news/what-is-microsoft-presidio-and-why-you-need-it-setup-first-detection", "canonical_source": "https://dev.to/bspann/what-is-microsoft-presidio-and-why-you-need-it-setup-first-detection-6mh", "published_at": "2026-06-05 12:24:35+00:00", "updated_at": "2026-06-05 12:42:35.510283+00:00", "lang": "en", "topics": ["ai-tools", "ai-safety", "natural-language-processing", "large-language-models", "ai-products"], "entities": ["Microsoft Presidio", "spaCy", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/what-is-microsoft-presidio-and-why-you-need-it-setup-first-detection", "markdown": "https://wpnews.pro/news/what-is-microsoft-presidio-and-why-you-need-it-setup-first-detection.md", "text": "https://wpnews.pro/news/what-is-microsoft-presidio-and-why-you-need-it-setup-first-detection.txt", "jsonld": "https://wpnews.pro/news/what-is-microsoft-presidio-and-why-you-need-it-setup-first-detection.jsonld"}}