Discovering PII Inside InterSystems IRIS

A developer built a lightweight PII detection utility that runs inside InterSystems IRIS using Embedded Python, eliminating the need to export data for analysis. The tool scans tables, identifies PII via Microsoft Presidio and spaCy models, and outputs a CSV report without data leaving the database engine. This approach addresses data sovereignty and compliance requirements under regulations like GDPR, LGPD, and HIPAA.

Data privacy regulations such as GDPR, LGPD, and HIPAA demand that organizations know exactly where Personally Identifiable Information PII lives inside their databases. Yet in practice, most teams rely on manual inventories, tribal knowledge, or external scanning tools that require data to leave the database engine — a process that itself creates privacy and security risks. This article presents an MVP that takes a different approach: it runs PII detection inside InterSystems IRIS using Embedded Python, analyzing data where it lives and never exporting it to an external process. The result is a lightweight, non-intrusive utility that scans your tables, identifies PII using AI, and produces a structured CSV report — all without data ever leaving the IRIS process. Organizations today face a painful blind spot. A typical IRIS instance may contain hundreds of tables across dozens of schemas, some holding decades of accumulated data. Columns named ContactInfo , Notes , or Description might silently contain social security numbers, email addresses, or government IDs — sometimes intentionally, sometimes as a side effect of free-text fields that capture whatever users type in. Traditional approaches to PII discovery share a common flaw: they require data extraction. You export samples, send them to an external service, or pipe them through a standalone tool. Every step in that pipeline is an additional attack surface and a potential compliance violation. The principle of data sovereignty — keeping data within its jurisdiction and under controlled access — suggests a better path: bring the analysis to the data, not the data to the analysis. This is not just a technical preference; it is a governance requirement: Running the scan inside the database engine eliminates the transmission step entirely, simplifying compliance and reducing risk. The utility follows a simple but deliberate separation of concerns. Three independent components cooperate in a pipeline: PIIScanner → PIIIdentifier → PIIReporter database AI detection reporting PIIIdentifier — Wraps the AI detection library. It has zero knowledge of IRIS, SQL, or database schemas. Its single method, identify text , takes a string and returns the highest-confidence PII entity type e.g., "EMAIL ADDRESS" , "PERSON" , "CPF" or None . This isolation means the detection logic can be tested, swapped, or upgraded without touching the database layer. PIIScanner — The only component that interacts with IRIS. It queries INFORMATION SCHEMA.TABLES to discover user tables, samples up to N rows per table via SELECT TOP N , feeds each column's values to the identifier, and collects findings. It respects schema exclusion patterns exact match and wildcard prefix like "Ens " and lets the caller configure the sample size. PIIReporter — Deduplicates findings and writes a CSV with five columns: schema name, table name, column name, pii type, confidence . The confidence score 0.0–1.0 helps reviewers prioritize findings and identify likely false positives. This separation is not accidental. It means the identifier could be replaced with a more powerful model tomorrow without changing a single line of scanner or reporter code. The PIIIdentifier is powered by Microsoft Presidio https://microsoft.github.io/presidio/ , an open-source data protection and de-identification framework. Presidio is the current detection engine, but the architecture is deliberately engine-agnostic — the PIIIdentifier wrapper fully isolates the detection library from the scanner and reporter. Swapping to a different detection approach would only require changes to that one module, leaving the rest of the pipeline untouched. Presidio's analyzer combines two detection strategies: The utility configures Presidio with two spaCy models: en core web sm — English small model ~12 MB pt core news sm — Portuguese small model ~13 MB Each row of data is analyzed against both languages, and the highest-confidence result wins. Multi-language support is essential for this kind of tool to be useful for users around the world — databases rarely contain data in a single language, and PII detection that only understands English would miss critical findings in Portuguese, Spanish, German, or any other language. The current MVP supports English and Portuguese as a starting point, but the architecture makes it straightforward to add more spaCy models for additional languages. For every text input, the identify method iterates through both language analyzers, collects all results, and returns the entity type with the highest confidence score: python def identify self, text : best entity = None best score = 0.0 for lang in self.languages: results = self. analyzer.analyze text=text, language=lang for result in results: if result.score best score: best score = result.score best entity = result.entity type return best entity This design means a Brazilian CPF mentioned in an English sentence will still be caught by the PT analyzer's pattern recognizer, even though the surrounding text is English. The entire utility runs as a Python module inside the IRIS process via irispython . No external API calls, no data exports, no network transfers. The scanner uses iris.sql.exec — IRIS's native Python SQL interface — to query metadata and sample data directly within the engine. irispython -m irisapp.pii discovery A single command starts the scan. The output is a CSV file written to the mounted volume, immediately available on the host machine. The utility also integrates with IRIS's built-in Task Scheduler. A %SYS.Task.Definition subclass PIIScannerTask exposes configurable OutputPath and SampleSize properties in the Admin Portal, and its OnTask method invokes the Python module via %SYS.Python.Import . The task is registered automatically during Docker build and can be scheduled to run periodically — for instance, a weekly PII inventory scan that appends results to a central compliance report. One-shot scan from the command line docker compose exec iris irispython -m irisapp.pii discovery Scan with custom namespace and sample size docker compose exec iris irispython -m irisapp.pii discovery -n USER -s 50 Populate sample data + scan in one command docker compose exec iris irispython -m irisapp.pii discovery --populate To make the utility immediately testable, the project includes a sample database in the PIISample schema with three tables that cover the main PII patterns: PIISample.Patients — Structured single-field PII. Each column holds one type of personal data: full names, email addresses, phone numbers, SSNs/CPFs, and street addresses. The table deliberately mixes US and Brazilian records to exercise both NLP models. Non-PII columns Diagnosis, AdmissionDate serve as internal controls. PIISample.CustomerFeedback — Free-text PII. Narrative paragraphs contain PII embedded in natural language — the hardest detection pattern. Examples include "My SSN is 111-22-3333 for insurance verification" and "Meu CPF é 345.678.901-22" . Two rows contain no PII at all, acting as negative controls within the table. PIISample.Products — No PII. A control table with product names, categories, prices, and stock quantities. Ideally the scanner should produce zero findings here — in practice, the small NLP model produces false positives, which we will examine in the results section. The sample data is populated by a Python function populate that runs during Docker build and can be re-invoked at any time. It uses DROP TABLE IF EXISTS before each CREATE TABLE , making it idempotent and safe to call repeatedly. Running the scanner against the sample database produces something like the following report: schema name,table name,column name,pii type,confidence PIISample,CustomerFeedback,CustomerName,PERSON,0.85 PIISample,CustomerFeedback,FeedbackText,EMAIL ADDRESS,1.0 PIISample,CustomerFeedback,CreatedAt,DATE TIME,0.85 PIISample,Patients,FullName,PERSON,0.85 PIISample,Patients,Email,EMAIL ADDRESS,1.0 PIISample,Patients,Phone,PHONE NUMBER,0.4 PIISample,Patients,SSN,PHONE NUMBER,0.4 PIISample,Patients,DateOfBirth,DATE TIME,0.85 PIISample,Patients,Address,LOCATION,0.85 PIISample,Patients,Diagnosis,LOCATION,0.85 PIISample,Patients,AdmissionDate,DATE TIME,0.85 PIISample,Products,ProductName,PERSON,0.85 PIISample,Products,Category,LOCATION,0.85 The true positives are clear: names detected as PERSON, emails as EMAIL ADDRESS, phone numbers as PHONE NUMBER, addresses as LOCATION. Confidence scores help reviewers prioritize — well-structured PII like emails consistently scores 0.85, while borderline cases like false positives on the Products table score below 0.5. But the results also reveal the limitations of the current approach — and they are not limited to edge cases: Products — not a clean pass. The Products table was designed as a no-PII control, containing only product names, categories, prices, and stock quantities. Yet the scanner reports PERSON in ProductName and LOCATION in Category. Product names like "Wireless Mouse" and categories like "Sports" are misidentified by the NLP model because the small spaCy model lacks the contextual understanding to distinguish generic nouns from personal names or place names. This is the most striking false positive in the results: a table with zero PII produces two findings, demonstrating exactly where the small model trade-off hurts. Diagnosis flagged as LOCATION. Medical diagnoses like "Hypertension" and "Diabetes Type 2" are misclassified as LOCATION. This is another NLP false positive — the small model confuses medical terminology with geographic references. SSN detected as PHONE NUMBER. The Patients.SSN column contains values like 123-45-6789 US SSN and 123.456.789-00 Brazilian CPF . Presidio has dedicated recognizers for both US SSN and CPF , but the small spaCy models sometimes assign a higher confidence score to the PHONE NUMBER recognizer for these digit-heavy patterns. The scanner reports the highest-scoring entity — which in this case is the wrong one. Date columns flagged as DATE TIME. Values like 1985-03-15 trigger the DATE TIME recognizer. Whether dates of birth and admission dates constitute PII is context-dependent: under HIPAA they are, under some interpretations of GDPR they might not be on their own . The scanner makes no policy judgment — it reports what it finds. One PII type per column. The scanner's scan column method returns the first PII type found in a column. If a column contains both email addresses and phone numbers as FeedbackText does , only the first type detected gets reported. This is by design for the MVP — a full inventory might list all detected types per column. The false positives and misclassifications stem from a deliberate architectural choice: using spaCy's small models sm suffix rather than medium md or large lg variants. | Variant | Size EN | Accuracy | Memory | Load Time | |---|---|---|---|---| en core web sm | ~12 MB | Lower | ~100 MB | Fast | en core web md | ~40 MB | Higher | ~300 MB | Moderate | en core web lg | ~560 MB | Highest | ~1 GB | Slow | The small models were chosen for the MVP because they keep the Docker image lean, startup fast, and run comfortably within the memory constraints of a containerized IRIS instance. For a proof-of-concept that needs to demonstrate feasibility, this is the right trade-off. But the trade-off is real. Small models have less training data, fewer word vectors, and coarser entity boundaries. In practice, this means: PERSON in ProductName and LOCATION in Category . Common nouns like "Wireless Mouse" or "Sports" are misidentified because the small model lacks the word vectors to distinguish them from personal names or place names. Similarly, medical diagnoses like "Hypertension" are misclassified as LOCATION.Upgrading to medium or large models would improve accuracy significantly, but at a cost: An alternative path is replacing spaCy with transformer-based models e.g., HuggingFace BERT or RoBERTa fine-tuned for NER , which offer state-of-the-art accuracy. Presidio supports this via its NlpEngineProvider — you can configure a Transformers-backed engine instead of spaCy. But transformer models carry even heavier resource requirements: GPU inference for acceptable latency, multiple gigabytes of memory, and significantly longer processing times per text. The architecture of this MVP — with the PIIIdentifier fully isolated from the scanner — makes this upgrade path straightforward. Swap the NLP engine configuration, and the rest of the pipeline continues to work unchanged. SELECT TOP N rather than full table scans. Configurable sample size and schema exclusions let you control scope and impact. notes column that is known to contain PII might be intentionally excluded from the report to avoid noise. SELECT TOP 100 sample will miss it. Random sampling e.g., TABLESAMPLE would be more robust but is not yet implemented.The project runs on InterSystems IRIS Community Edition in Docker. Clone the repository, build the image, and start the container: docker compose build docker compose up -d The sample database is populated automatically during the build. To run your first scan: docker compose exec iris irispython -m irisapp.pii discovery The report will be written to pii report.csv in the project root. Open it, review the findings, and compare them against the sample data to understand what the scanner catches — and what it doesn't. You can check the sample database here http://localhost:55038/csp/sys/exp/%25CSP.UI.Portal.SQL.Home.zen?$NAMESPACE=IRISAPP , then choosing the PIISample schema. Use default IRIS Community Version credentials system/SYS . From there, try the --populate flag to reset the sample data, change the sample size with -s , or point the scanner at a different namespace with -n . The --populate flag is particularly useful: it resets the sample tables and runs the scan in one step, making iteration fast. This is an MVP — a proof of concept that demonstrates the compute-to-data approach for PII discovery inside InterSystems IRIS. The small NLP models are a starting point, not a ceiling. The architecture is built to grow. This article was developed with the assistance of Artificial Intelligence tools for drafting and language refinement. All technical validation and final review were performed by the author.