Data privacy regulations such as GDPR, LGPD, and HIPAA demand that organizations know exactly where Personally Identifiable Information (PII) lives inside their databases. Yet in practice, most teams rely on manual inventories, tribal knowledge, or external scanning tools that require data to leave the database engine β a process that itself creates privacy and security risks.
This article presents an MVP that takes a different approach: it runs PII detection inside InterSystems IRIS using Embedded Python, analyzing data where it lives and never exporting it to an external process. The result is a lightweight, non-intrusive utility that scans your tables, identifies PII using AI, and produces a structured CSV report β all without data ever leaving the IRIS process.
Organizations today face a painful blind spot. A typical IRIS instance may contain hundreds of tables across dozens of schemas, some holding decades of accumulated data. Columns named ContactInfo
, Notes
, or Description
might silently contain social security numbers, email addresses, or government IDs β sometimes intentionally, sometimes as a side effect of free-text fields that capture whatever users type in.
Traditional approaches to PII discovery share a common flaw: they require data extraction. You export samples, send them to an external service, or pipe them through a standalone tool. Every step in that pipeline is an additional attack surface and a potential compliance violation.
The principle of data sovereignty β keeping data within its jurisdiction and under controlled access β suggests a better path: bring the analysis to the data, not the data to the analysis.
This is not just a technical preference; it is a governance requirement:
Running the scan inside the database engine eliminates the transmission step entirely, simplifying compliance and reducing risk.
The utility follows a simple but deliberate separation of concerns. Three independent components cooperate in a pipeline:
PIIScanner β PIIIdentifier β PIIReporter
(database) (AI detection) (reporting)
PIIIdentifier β Wraps the AI detection library. It has zero knowledge of IRIS, SQL, or database schemas. Its single method, identify(text)
, takes a string and returns the highest-confidence PII entity type (e.g., "EMAIL_ADDRESS"
, "PERSON"
, "CPF"
) or None
. This isolation means the detection logic can be tested, swapped, or upgraded without touching the database layer.
PIIScanner β The only component that interacts with IRIS. It queries INFORMATION_SCHEMA.TABLES
to discover user tables, samples up to N rows per table via SELECT TOP N *
, feeds each column's values to the identifier, and collects findings. It respects schema exclusion patterns (exact match and wildcard prefix like "Ens*"
) and lets the caller configure the sample size.
PIIReporter β Deduplicates findings and writes a CSV with five columns: schema_name, table_name, column_name, pii_type, confidence
. The confidence score (0.0β1.0) helps reviewers prioritize findings and identify likely false positives.
This separation is not accidental. It means the identifier could be replaced with a more powerful model tomorrow without changing a single line of scanner or reporter code.
The PIIIdentifier is powered by Microsoft Presidio, an open-source data protection and de-identification framework. Presidio is the current detection engine, but the architecture is deliberately engine-agnostic β the PIIIdentifier
wrapper fully isolates the detection library from the scanner and reporter. Swapping to a different detection approach would only require changes to that one module, leaving the rest of the pipeline untouched. Presidio's analyzer combines two detection strategies:
The utility configures Presidio with two spaCy models:
en_core_web_sm
β English small model (~12 MB)pt_core_news_sm
β Portuguese small model (~13 MB)Each row of data is analyzed against both languages, and the highest-confidence result wins. Multi-language support is essential for this kind of tool to be useful for users around the world β databases rarely contain data in a single language, and PII detection that only understands English would miss critical findings in Portuguese, Spanish, German, or any other language. The current MVP supports English and Portuguese as a starting point, but the architecture makes it straightforward to add more spaCy models for additional languages.
For every text input, the identify()
method iterates through both language analyzers, collects all results, and returns the entity type with the highest confidence score:
def identify(self, text):
best_entity = None
best_score = 0.0
for lang in self.languages:
results = self._analyzer.analyze(text=text, language=lang)
for result in results:
if result.score > best_score:
best_score = result.score
best_entity = result.entity_type
return best_entity
This design means a Brazilian CPF mentioned in an English sentence will still be caught by the PT analyzer's pattern recognizer, even though the surrounding text is English.
The entire utility runs as a Python module inside the IRIS process via irispython
. No external API calls, no data exports, no network transfers. The scanner uses iris.sql.exec()
β IRIS's native Python SQL interface β to query metadata and sample data directly within the engine.
irispython -m irisapp.pii_discovery
A single command starts the scan. The output is a CSV file written to the mounted volume, immediately available on the host machine.
The utility also integrates with IRIS's built-in Task Scheduler. A %SYS.Task.Definition
subclass (PIIScannerTask
) exposes configurable OutputPath
and SampleSize
properties in the Admin Portal, and its OnTask()
method invokes the Python module via %SYS.Python.Import()
. The task is registered automatically during Docker build and can be scheduled to run periodically β for instance, a weekly PII inventory scan that appends results to a central compliance report.
docker compose exec iris irispython -m irisapp.pii_discovery
docker compose exec iris irispython -m irisapp.pii_discovery -n USER -s 50
docker compose exec iris irispython -m irisapp.pii_discovery --populate
To make the utility immediately testable, the project includes a sample database in the PIISample
schema with three tables that cover the main PII patterns:
PIISample.Patients β Structured single-field PII. Each column holds one type of personal data: full names, email addresses, phone numbers, SSNs/CPFs, and street addresses. The table deliberately mixes US and Brazilian records to exercise both NLP models. Non-PII columns (Diagnosis, AdmissionDate) serve as internal controls.
PIISample.CustomerFeedback β Free-text PII. Narrative paragraphs contain PII embedded in natural language β the hardest detection pattern. Examples include "My SSN is 111-22-3333 for insurance verification" and "Meu CPF Γ© 345.678.901-22". Two rows contain no PII at all, acting as negative controls within the table.
PIISample.Products β No PII. A control table with product names, categories, prices, and stock quantities. Ideally the scanner should produce zero findings here β in practice, the small NLP model produces false positives, which we will examine in the results section.
The sample data is populated by a Python function (populate()
) that runs during Docker build and can be re-invoked at any time. It uses DROP TABLE IF EXISTS
before each CREATE TABLE
, making it idempotent and safe to call repeatedly.
Running the scanner against the sample database produces something like the following report:
schema_name,table_name,column_name,pii_type,confidence
PIISample,CustomerFeedback,CustomerName,PERSON,0.85
PIISample,CustomerFeedback,FeedbackText,EMAIL_ADDRESS,1.0
PIISample,CustomerFeedback,CreatedAt,DATE_TIME,0.85
PIISample,Patients,FullName,PERSON,0.85
PIISample,Patients,Email,EMAIL_ADDRESS,1.0
PIISample,Patients,Phone,PHONE_NUMBER,0.4
PIISample,Patients,SSN,PHONE_NUMBER,0.4
PIISample,Patients,DateOfBirth,DATE_TIME,0.85
PIISample,Patients,Address,LOCATION,0.85
PIISample,Patients,Diagnosis,LOCATION,0.85
PIISample,Patients,AdmissionDate,DATE_TIME,0.85
PIISample,Products,ProductName,PERSON,0.85
PIISample,Products,Category,LOCATION,0.85
The true positives are clear: names detected as PERSON, emails as EMAIL_ADDRESS, phone numbers as PHONE_NUMBER, addresses as LOCATION. Confidence scores help reviewers prioritize β well-structured PII like emails consistently scores 0.85, while borderline cases like false positives on the Products table score below 0.5.
But the results also reveal the limitations of the current approach β and they are not limited to edge cases:
Products β not a clean pass. The Products table was designed as a no-PII control, containing only product names, categories, prices, and stock quantities. Yet the scanner reports PERSON
in ProductName and LOCATION
in Category. Product names like "Wireless Mouse" and categories like "Sports" are misidentified by the NLP model because the small spaCy model lacks the contextual understanding to distinguish generic nouns from personal names or place names. This is the most striking false positive in the results: a table with zero PII produces two findings, demonstrating exactly where the small model trade-off hurts.
Diagnosis flagged as LOCATION. Medical diagnoses like "Hypertension" and "Diabetes Type 2" are misclassified as LOCATION. This is another NLP false positive β the small model confuses medical terminology with geographic references.
SSN detected as PHONE_NUMBER. The Patients.SSN column contains values like 123-45-6789
(US SSN) and 123.456.789-00
(Brazilian CPF). Presidio has dedicated recognizers for both US_SSN
and CPF
, but the small spaCy models sometimes assign a higher confidence score to the PHONE_NUMBER recognizer for these digit-heavy patterns. The scanner reports the highest-scoring entity β which in this case is the wrong one.
Date columns flagged as DATE_TIME. Values like 1985-03-15
trigger the DATE_TIME recognizer. Whether dates of birth and admission dates constitute PII is context-dependent: under HIPAA they are, under some interpretations of GDPR they might not be (on their own). The scanner makes no policy judgment β it reports what it finds.
One PII type per column. The scanner's scan_column()
method returns the first PII type found in a column. If a column contains both email addresses and phone numbers (as FeedbackText does), only the first type detected gets reported. This is by design for the MVP β a full inventory might list all detected types per column.
The false positives and misclassifications stem from a deliberate architectural choice: using spaCy's small models (_sm
suffix) rather than medium (_md
) or large (_lg
) variants.
| Variant | Size (EN) | Accuracy | Memory | Load Time |
|---|---|---|---|---|
en_core_web_sm |
||||
| ~12 MB | Lower | ~100 MB | Fast | |
en_core_web_md |
||||
| ~40 MB | Higher | ~300 MB | Moderate | |
en_core_web_lg |
||||
| ~560 MB | Highest | ~1 GB | Slow |
The small models were chosen for the MVP because they keep the Docker image lean, startup fast, and run comfortably within the memory constraints of a containerized IRIS instance. For a proof-of-concept that needs to demonstrate feasibility, this is the right trade-off.
But the trade-off is real. Small models have less training data, fewer word vectors, and coarser entity boundaries. In practice, this means:
PERSON
in ProductName and LOCATION
in Category). Common nouns like "Wireless Mouse" or "Sports" are misidentified because the small model lacks the word vectors to distinguish them from personal names or place names. Similarly, medical diagnoses like "Hypertension" are misclassified as LOCATION.Upgrading to medium or large models would improve accuracy significantly, but at a cost:
An alternative path is replacing spaCy with transformer-based models (e.g., HuggingFace BERT or RoBERTa fine-tuned for NER), which offer state-of-the-art accuracy. Presidio supports this via its NlpEngineProvider
β you can configure a Transformers-backed engine instead of spaCy. But transformer models carry even heavier resource requirements: GPU inference for acceptable latency, multiple gigabytes of memory, and significantly longer processing times per text.
The architecture of this MVP β with the PIIIdentifier fully isolated from the scanner β makes this upgrade path straightforward. Swap the NLP engine configuration, and the rest of the pipeline continues to work unchanged.
SELECT TOP N
) rather than full table scans. Configurable sample size and schema exclusions let you control scope and impact.notes
column that is known to contain PII might be intentionally excluded from the report to avoid noise.SELECT TOP 100
sample will miss it. Random sampling (e.g., TABLESAMPLE
) would be more robust but is not yet implemented.The project runs on InterSystems IRIS Community Edition in Docker. Clone the repository, build the image, and start the container:
docker compose build
docker compose up -d
The sample database is populated automatically during the build. To run your first scan:
docker compose exec iris irispython -m irisapp.pii_discovery
The report will be written to pii_report.csv
in the project root. Open it, review the findings, and compare them against the sample data to understand what the scanner catches β and what it doesn't.
You can check the sample database here, then choosing the PIISample
schema. Use default IRIS Community Version credentials (_system/SYS).
From there, try the --populate
flag to reset the sample data, change the sample size with -s
, or point the scanner at a different namespace with -n
. The --populate
flag is particularly useful: it resets the sample tables and runs the scan in one step, making iteration fast.
This is an MVP β a proof of concept that demonstrates the compute-to-data approach for PII discovery inside InterSystems IRIS. The small NLP models are a starting point, not a ceiling. The architecture is built to grow.
This article was developed with the assistance of Artificial Intelligence tools for drafting and language refinement. All technical validation and final review were performed by the author.