# What Is Microsoft Presidio and Why You Need It (Setup + First Detection)

> Source: <https://dev.to/bspann/what-is-microsoft-presidio-and-why-you-need-it-setup-first-detection-6mh>
> Published: 2026-06-05 12:24:35+00:00

If you're building anything that touches user data and sends it to an LLM, you have a PII problem. Names, emails, phone numbers, credit card numbers, and Social Security numbers sit in support tickets, chat logs, documents, and database fields. Every time you pipe that data into a prompt, you're sending someone's personal information to a third-party model endpoint. Maybe that's fine for your use case. Maybe it's not. Either way, you should know what's in your data before you make that call.

Microsoft Presidio is an open-source framework that detects and anonymizes PII in text, images, and structured data. It's been around since 2019, it's actively maintained, and it's what I reach for when I need to scrub data before it hits an LLM. This series walks through the entire framework from installation to production deployment. No toy examples. Real workloads.

Presidio has two core modules that handle the detection and anonymization pipeline separately.

The **Analyzer** finds PII. It combines named entity recognition (NER) from spaCy or Hugging Face transformers with regex pattern matching and contextual scoring. When you feed it text, it returns a list of detected entities with types, confidence scores, and character positions. It doesn't modify the text. It just tells you what it found.

The **Anonymizer** takes the analyzer's output and does something with it. Replace detected names with `<PERSON>`. Redact phone numbers entirely. Mask credit card numbers with asterisks. Hash emails. Encrypt values you need to reverse later. The anonymizer is where you decide how to handle each entity type.

Beyond those two, Presidio has additional modules for specific use cases. **presidio-image-redactor** handles OCR on images and redacts PII from screenshots and scanned documents. **presidio-structured** processes tabular data in DataFrames and JSON. We'll get to those in later parts of this series.

You have two paths: Python packages via pip or Docker containers. I'll cover both because you'll want pip for development and experimentation, and Docker for anything that needs to serve an API.

For the pip route, set up a virtual environment first. Presidio pulls in spaCy and NLP models that you don't want colliding with other projects.

```
# Create and activate a virtual environment
python -m venv presidio-env
source presidio-env/bin/activate  # Linux/Mac
# presidio-env\Scripts\activate   # Windows

# Install the core packages
pip install presidio-analyzer presidio-anonymizer

# Download a spaCy language model (the large model is more accurate)
python -m spacy download en_core_web_lg
```

The `en_core_web_lg` model is about 560MB. If you're tight on space or just experimenting, `en_core_web_sm` works, but you'll see lower accuracy on name and location detection. For anything beyond a quick test, use the large model.

For the Docker route, Presidio publishes official images to the Microsoft Container Registry. Each module runs as its own REST API.

```
# Pull the images
docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer

# Run the analyzer on port 5001
docker run -d -p 5001:3000 mcr.microsoft.com/presidio-analyzer:latest

# Run the anonymizer on port 5002
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-anonymizer:latest
```

Both containers expose REST APIs on port 3000 internally. Map them to whatever ports you want on the host. Once they're running, you can hit them with curl or any HTTP client.

To verify they're up:

```
curl http://localhost:5001/health
curl http://localhost:5002/health
```

Let's feed the analyzer some text and see what comes back. I'll show both the Python API and the REST API so you can pick whichever fits your workflow.

``` python
from presidio_analyzer import AnalyzerEngine

# Initialize the analyzer
analyzer = AnalyzerEngine()

# Sample text with multiple PII types
text = """
Hi, my name is John Smith and I live in Seattle. 
My email is john.smith@example.com and my phone 
number is 206-555-0147. My SSN is 123-45-6789 
and my credit card is 4111-1111-1111-1111.
"""

# Analyze the text
results = analyzer.analyze(text=text, language="en")

# Print what we found
for result in results:
    print(f"{result.entity_type}: '{text[result.start:result.end].strip()}' "
          f"(score: {result.score:.2f}, position: {result.start}-{result.end})")
```

Output:

```
PERSON: 'John Smith' (score: 0.85, position: 18-28)
LOCATION: 'Seattle' (score: 0.85, position: 42-49)
EMAIL_ADDRESS: 'john.smith@example.com' (score: 1.00, position: 64-86)
PHONE_NUMBER: '206-555-0147' (score: 0.75, position: 110-122)
US_SSN: '123-45-6789' (score: 0.85, position: 134-145)
CREDIT_CARD: '4111-1111-1111-1111' (score: 1.00, position: 169-188)
```

The same kind of detection through the REST API:

```
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "My name is John Smith and my email is john.smith@example.com",
    "language": "en"
  }'
```

The response is a JSON array of detected entities with the same fields: entity type, start position, end position, and confidence score.
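If you'd rather call the REST API from Python than from curl, here is a minimal sketch using only the standard library. It assumes the analyzer container is mapped to host port 5001 as in the Docker commands earlier; `build_analyze_request` and `analyze` are helper names made up for this example, not part of Presidio.

``` python
import json
import urllib.request

# Assumption: the analyzer container is reachable on host port 5001.
ANALYZER_URL = "http://localhost:5001/analyze"


def build_analyze_request(text, language="en", url=ANALYZER_URL):
    """Build the same POST request as the curl command, without sending it."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, language="en", url=ANALYZER_URL):
    """Send the request and return the decoded JSON array of detections."""
    request = build_analyze_request(text, language, url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `analyze("My name is John Smith")` should return a list of dicts you can loop over and read `entity_type`, `start`, `end`, and `score` from. Splitting request construction from sending keeps the payload easy to inspect without a live endpoint.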

Every detection result contains five fields that matter:

**entity_type** is what Presidio thinks it found: `PERSON`, `EMAIL_ADDRESS`, `PHONE_NUMBER`, `CREDIT_CARD`, `US_SSN`, `LOCATION`, and dozens more.

**start** and **end** are character positions in the original text. This is how you know exactly which substring triggered the detection. It's also how the anonymizer knows what to replace.
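To make the role of the offsets concrete, here is a hand-rolled sketch of span replacement. This is an illustration only, not how the anonymizer is implemented: `Span` is a hypothetical stand-in for a detection result, and the sketch assumes spans don't overlap.

``` python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one detection (type plus character offsets)."""
    entity_type: str
    start: int
    end: int


def replace_spans(text, spans):
    """Replace each text[start:end] slice with <ENTITY_TYPE>."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "My name is John Smith and my email is john.smith@example.com"
name_start = text.index("John Smith")
spans = [
    Span("PERSON", name_start, name_start + len("John Smith")),
    Span("EMAIL_ADDRESS", text.index("john.smith@"), len(text)),
]
print(replace_spans(text, spans))
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
```

Replacing from the end of the string backwards matters because every replacement changes the string length; going left to right would shift the offsets of everything after it.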

**score** is a confidence value between 0 and 1. A credit card match returns 1.0 when the number also passes checksum validation, so there's little room for doubt. A name detected by NER might return 0.85 because the model is making a probabilistic judgment. You can set a threshold to filter out low-confidence detections. The default is 0, which keeps everything.
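In the Python API the threshold is the `score_threshold` argument to `analyze()`. If you're working with results you've already collected (for example, the JSON from the REST API), the same filtering is a one-liner. A small sketch with made-up detections; `filter_by_score` is a hypothetical helper, and keeping scores equal to the threshold is this sketch's choice:

``` python
# Hypothetical detections, shaped like entries in the analyzer's JSON response.
detections = [
    {"entity_type": "CREDIT_CARD", "start": 0, "end": 19, "score": 1.0},
    {"entity_type": "PERSON", "start": 30, "end": 40, "score": 0.85},
    {"entity_type": "PHONE_NUMBER", "start": 50, "end": 62, "score": 0.4},
]


def filter_by_score(detections, threshold):
    """Keep detections whose score is at or above the threshold."""
    return [d for d in detections if d["score"] >= threshold]


for d in filter_by_score(detections, 0.5):
    print(d["entity_type"], d["score"])
# CREDIT_CARD 1.0
# PERSON 0.85
```

A higher threshold trades recall for precision: you get fewer false positives, but you may let real PII through. For scrubbing data before it reaches an LLM, starting low and tightening after reviewing false positives is usually the safer direction.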

**analysis_explanation** is populated when you ask for the decision process (`return_decision_process=True` in the Python API) and tells you which recognizer fired and why. Useful for debugging false positives.

Presidio ships with recognizers for a wide range of entity types across multiple categories.

**Global entities** (work across languages): credit card numbers, crypto wallet addresses, email addresses, IBAN codes, IP addresses, phone numbers, URLs, domain names, dates.

**US-specific**: Social Security numbers, bank account numbers, driver's license numbers, ITIN, passport numbers.

**UK-specific**: NHS numbers.

**Other regions**: Singapore financial numbers, Australian business numbers, and more through community recognizers.

The full list is in the [Presidio supported entities documentation](https://microsoft.github.io/presidio/supported_entities/). If your entity type isn't covered, you can build custom recognizers. That's Part 3 of this series.

Detection is only half the job. Let's anonymize the results.

``` python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "My name is John Smith and my email is john.smith@example.com"

# Detect PII
results = analyzer.analyze(text=text, language="en")

# Anonymize with default settings (replaces with entity type labels)
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized.text)
# Output: My name is <PERSON> and my email is <EMAIL_ADDRESS>
```

The default behavior replaces each detected entity with its type label wrapped in angle brackets. In Part 4 we'll dig into all the anonymization operators (replace, redact, mask, hash, encrypt) and when to use each one. For now, the point is that detection and anonymization are separate steps. You can detect without anonymizing, anonymize differently per entity type, or build a pipeline that does both in one shot.

That's the foundation. Presidio installed, first detection running, and you understand what the output looks like. In Part 2, we'll go deeper on the analyzer: how the NER models, regex patterns, and context scoring work together, how to process different text types (emails, support tickets, chat logs), batch processing with presidio-structured, and image redaction with presidio-image-redactor.

*This is Part 1 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.*
