{"slug": "discovering-pii-inside-intersystems-iris", "title": "Discovering PII Inside InterSystems IRIS", "summary": "A developer built a lightweight PII detection utility that runs inside InterSystems IRIS using Embedded Python, eliminating the need to export data for analysis. The tool scans tables, identifies PII via Microsoft Presidio and spaCy models, and outputs a CSV report without data leaving the database engine. This approach addresses data sovereignty and compliance requirements under regulations like GDPR, LGPD, and HIPAA.", "body_md": "Data privacy regulations such as GDPR, LGPD, and HIPAA demand that organizations know exactly where Personally Identifiable Information (PII) lives inside their databases. Yet in practice, most teams rely on manual inventories, tribal knowledge, or external scanning tools that require data to leave the database engine, a process that itself creates privacy and security risks.\n\nThis article presents an MVP that takes a different approach: it runs PII detection **inside** InterSystems IRIS using Embedded Python, analyzing data where it lives and never exporting it to an external process. The result is a lightweight, non-intrusive utility that scans your tables, identifies PII using AI, and produces a structured CSV report, all without data ever leaving the IRIS process.\n\nOrganizations today face a painful blind spot. A typical IRIS instance may contain hundreds of tables across dozens of schemas, some holding decades of accumulated data. Columns named `ContactInfo`, `Notes`, or `Description` might silently contain social security numbers, email addresses, or government IDs, sometimes intentionally, sometimes as a side effect of free-text fields that capture whatever users type in.\n\nTraditional approaches to PII discovery share a common flaw: they require data extraction. You export samples, send them to an external service, or pipe them through a standalone tool. 
Every step in that pipeline is an additional attack surface and a potential compliance violation.\n\nThe principle of **data sovereignty** (keeping data within its jurisdiction and under controlled access) suggests a better path: bring the analysis to the data, not the data to the analysis.\n\nThis is not just a technical preference; it is a governance requirement.\n\nRunning the scan inside the database engine eliminates the transmission step entirely, simplifying compliance and reducing risk.\n\nThe utility follows a simple but deliberate separation of concerns. Three independent components cooperate in a pipeline:\n\n```\nPIIScanner  →  PIIIdentifier  →  PIIReporter\n(database)     (AI detection)     (reporting)\n```\n\n**PIIIdentifier**: Wraps the AI detection library. It has zero knowledge of IRIS, SQL, or database schemas. Its single method, `identify(text)`, takes a string and returns the highest-confidence PII entity type (e.g., `\"EMAIL_ADDRESS\"`, `\"PERSON\"`, `\"CPF\"`) or `None`. This isolation means the detection logic can be tested, swapped, or upgraded without touching the database layer.\n\n**PIIScanner**: The only component that interacts with IRIS. It queries `INFORMATION_SCHEMA.TABLES` to discover user tables, samples up to N rows per table via `SELECT TOP N *`, feeds each column's values to the identifier, and collects findings. It respects schema exclusion patterns (exact match and wildcard prefix like `\"Ens*\"`) and lets the caller configure the sample size.\n\n**PIIReporter**: Deduplicates findings and writes a CSV with five columns: `schema_name, table_name, column_name, pii_type, confidence`. The confidence score (0.0–1.0) helps reviewers prioritize findings and identify likely false positives.\n\nThis separation is not accidental. 
It means the identifier could be replaced with a more powerful model tomorrow without changing a single line of scanner or reporter code.\n\nThe PIIIdentifier is powered by [Microsoft Presidio](https://microsoft.github.io/presidio/), an open-source data protection and de-identification framework. Presidio is the current detection engine, but the architecture is deliberately engine-agnostic: the `PIIIdentifier` wrapper fully isolates the detection library from the scanner and reporter. Swapping to a different detection approach would only require changes to that one module, leaving the rest of the pipeline untouched. Presidio's analyzer combines two detection strategies: pattern-based recognizers (regular expressions, optionally backed by checksums and context words) and named entity recognition (NER) through an NLP engine, here spaCy.\n\nThe utility configures Presidio with two spaCy models:\n\n- `en_core_web_sm`: English small model (~12 MB)\n- `pt_core_news_sm`: Portuguese small model (~13 MB)\n\nEach row of data is analyzed against both languages, and the highest-confidence result wins. Multi-language support is essential for this kind of tool to be useful for users around the world: databases rarely contain data in a single language, and PII detection that only understands English would miss critical findings in Portuguese, Spanish, German, or any other language. 
The current MVP supports English and Portuguese as a starting point, but the architecture makes it straightforward to add more spaCy models for additional languages.\n\nFor every text input, the `identify()` method iterates through both language analyzers, collects all results, and returns the entity type with the highest confidence score:\n\n```python\ndef identify(self, text):\n    # Track the best-scoring entity across all configured languages.\n    best_entity = None\n    best_score = 0.0\n    for lang in self.languages:\n        # Run the Presidio analyzer once per language.\n        results = self._analyzer.analyze(text=text, language=lang)\n        for result in results:\n            # Keep only the single highest-confidence finding.\n            if result.score > best_score:\n                best_score = result.score\n                best_entity = result.entity_type\n    return best_entity\n```\n\nThis design means a Brazilian CPF mentioned in an English sentence will still be caught by the PT analyzer's pattern recognizer, even though the surrounding text is English.\n\nThe entire utility runs as a Python module inside the IRIS process via `irispython`. There are no external API calls, no data exports, and no network transfers. The scanner uses `iris.sql.exec()`, IRIS's native Python SQL interface, to query metadata and sample data directly within the engine.\n\n```\nirispython -m irisapp.pii_discovery\n```\n\nA single command starts the scan. The output is a CSV file written to the mounted volume, immediately available on the host machine.\n\nThe utility also integrates with IRIS's built-in Task Manager. A `%SYS.Task.Definition` subclass (`PIIScannerTask`) exposes configurable `OutputPath` and `SampleSize` properties in the Management Portal, and its `OnTask()` method invokes the Python module via `%SYS.Python.Import()`. 
The task is registered automatically during Docker build and can be scheduled to run periodically, for instance as a weekly PII inventory scan that appends results to a central compliance report.\n\n```\n# One-shot scan from the command line\ndocker compose exec iris irispython -m irisapp.pii_discovery\n\n# Scan with custom namespace and sample size\ndocker compose exec iris irispython -m irisapp.pii_discovery -n USER -s 50\n\n# Populate sample data + scan in one command\ndocker compose exec iris irispython -m irisapp.pii_discovery --populate\n```\n\nTo make the utility immediately testable, the project includes a sample database in the `PIISample` schema with three tables that cover the main PII patterns:\n\n**PIISample.Patients**: Structured single-field PII. Each column holds one type of personal data: full names, email addresses, phone numbers, SSNs/CPFs, and street addresses. The table deliberately mixes US and Brazilian records to exercise both NLP models. Non-PII columns (Diagnosis, AdmissionDate) serve as internal controls.\n\n**PIISample.CustomerFeedback**: Free-text PII. Narrative paragraphs contain PII embedded in natural language, the hardest detection pattern. Examples include *\"My SSN is 111-22-3333 for insurance verification\"* and *\"Meu CPF é 345.678.901-22\"*. Two rows contain no PII at all, acting as negative controls within the table.\n\n**PIISample.Products**: No PII. A control table with product names, categories, prices, and stock quantities. Ideally the scanner should produce zero findings here; in practice, the small NLP model produces false positives, which we will examine in the results section.\n\nThe sample data is populated by a Python function (`populate()`) that runs during Docker build and can be re-invoked at any time. 
It uses `DROP TABLE IF EXISTS` before each `CREATE TABLE`, making it idempotent and safe to call repeatedly.\n\nRunning the scanner against the sample database produces something like the following report:\n\n```\nschema_name,table_name,column_name,pii_type,confidence\nPIISample,CustomerFeedback,CustomerName,PERSON,0.85\nPIISample,CustomerFeedback,FeedbackText,EMAIL_ADDRESS,1.0\nPIISample,CustomerFeedback,CreatedAt,DATE_TIME,0.85\nPIISample,Patients,FullName,PERSON,0.85\nPIISample,Patients,Email,EMAIL_ADDRESS,1.0\nPIISample,Patients,Phone,PHONE_NUMBER,0.4\nPIISample,Patients,SSN,PHONE_NUMBER,0.4\nPIISample,Patients,DateOfBirth,DATE_TIME,0.85\nPIISample,Patients,Address,LOCATION,0.85\nPIISample,Patients,Diagnosis,LOCATION,0.85\nPIISample,Patients,AdmissionDate,DATE_TIME,0.85\nPIISample,Products,ProductName,PERSON,0.85\nPIISample,Products,Category,LOCATION,0.85\n```\n\nThe true positives are clear: names detected as PERSON, emails as EMAIL_ADDRESS, phone numbers as PHONE_NUMBER, addresses as LOCATION. Confidence scores help reviewers prioritize: in this report, well-structured PII like emails scores 1.0, while weaker pattern matches such as the PHONE_NUMBER findings score 0.4. Confidence alone does not separate true from false positives, though: the false positives on the Products table score 0.85, the same as the correctly detected names.\n\nBut the results also reveal the limitations of the current approach, and they are not limited to edge cases:\n\n**Products: not a clean pass.** The Products table was designed as a no-PII control, containing only product names, categories, prices, and stock quantities. Yet the scanner reports `PERSON` in ProductName and `LOCATION` in Category. Product names like \"Wireless Mouse\" and categories like \"Sports\" are misidentified by the NLP model because the small spaCy model lacks the contextual understanding to distinguish generic nouns from personal names or place names. 
This is the most striking false positive in the results: a table with zero PII produces two findings, demonstrating exactly where the small model trade-off hurts.\n\n**Diagnosis flagged as LOCATION.** Medical diagnoses like \"Hypertension\" and \"Diabetes Type 2\" are misclassified as LOCATION. This is another NLP false positive: the small model confuses medical terminology with geographic references.\n\n**SSN detected as PHONE_NUMBER.** The Patients.SSN column contains values like `123-45-6789` (US SSN) and `123.456.789-00` (Brazilian CPF). Presidio has dedicated recognizers for both `US_SSN` and `CPF`, but for these digit-heavy patterns the PHONE_NUMBER recognizer sometimes ends up with the higher confidence score. The scanner reports the highest-scoring entity, which in this case is the wrong one.\n\n**Date columns flagged as DATE_TIME.** Values like `1985-03-15` trigger the DATE_TIME recognizer. Whether dates of birth and admission dates constitute PII is context-dependent: under HIPAA they are, under some interpretations of GDPR they might not be (on their own). The scanner makes no policy judgment; it reports what it finds.\n\n**One PII type per column.** The scanner's `scan_column()` method returns the first PII type found in a column. If a column contains both email addresses and phone numbers (as FeedbackText does), only the first type detected gets reported. 
This is by design for the MVP; a full inventory might list all detected types per column.\n\nThe false positives and misclassifications stem from a deliberate architectural choice: using spaCy's **small** models (`_sm` suffix) rather than medium (`_md`) or large (`_lg`) variants.\n\n| Variant | Size (EN) | Accuracy | Memory | Load Time |\n|---|---|---|---|---|\n| `en_core_web_sm` | ~12 MB | Lower | ~100 MB | Fast |\n| `en_core_web_md` | ~40 MB | Higher | ~300 MB | Moderate |\n| `en_core_web_lg` | ~560 MB | Highest | ~1 GB | Slow |\n\nThe small models were chosen for the MVP because they keep the Docker image lean, startup fast, and run comfortably within the memory constraints of a containerized IRIS instance. For a proof-of-concept that needs to demonstrate feasibility, this is the right trade-off.\n\nBut the trade-off is real. Small models ship without static word vectors and have less capacity than their larger siblings, which leads to coarser entity boundaries. In practice, this means false positives such as those on the Products table (`PERSON` in ProductName and `LOCATION` in Category). Common nouns like \"Wireless Mouse\" or \"Sports\" are misidentified because the small model lacks the word vectors to distinguish them from personal names or place names. Similarly, medical diagnoses like \"Hypertension\" are misclassified as LOCATION.\n\nUpgrading to medium or large models would likely improve accuracy, but at a cost: larger downloads, more memory, and slower load times, as the table above shows.\n\nAn alternative path is replacing spaCy with transformer-based models (e.g., Hugging Face BERT or RoBERTa fine-tuned for NER), which offer state-of-the-art accuracy. Presidio supports this via its `NlpEngineProvider`: you can configure a Transformers-backed engine instead of spaCy. But transformer models carry even heavier resource requirements: GPU inference for acceptable latency, multiple gigabytes of memory, and significantly longer processing times per text.\n\nThe architecture of this MVP, with the PIIIdentifier fully isolated from the scanner, makes this upgrade path straightforward. 
Swap the NLP engine configuration, and the rest of the pipeline continues to work unchanged.\n\n- The scanner samples rows (`SELECT TOP N`) rather than running full table scans. Configurable sample size and schema exclusions let you control scope and impact.\n- A `notes` column that is known to contain PII might be intentionally excluded from the report to avoid noise.\n- If PII appears only beyond the first rows of a table, a `SELECT TOP 100` sample will miss it. Random sampling (e.g., `TABLESAMPLE`) would be more robust but is not yet implemented.\n\nThe project runs on InterSystems IRIS Community Edition in Docker. Clone the repository, build the image, and start the container:\n\n```\ndocker compose build\ndocker compose up -d\n```\n\nThe sample database is populated automatically during the build. To run your first scan:\n\n```\ndocker compose exec iris irispython -m irisapp.pii_discovery\n```\n\nThe report will be written to `pii_report.csv` in the project root. Open it, review the findings, and compare them against the sample data to understand what the scanner catches and what it doesn't.\n\nYou can browse the sample database [here](http://localhost:55038/csp/sys/exp/%25CSP.UI.Portal.SQL.Home.zen?$NAMESPACE=IRISAPP) and then choose the `PIISample` schema. Use the default IRIS Community Edition credentials (_system/SYS).\n\nFrom there, try the `--populate` flag to reset the sample data, change the sample size with `-s`, or point the scanner at a different namespace with `-n`. The `--populate` flag is particularly useful: it resets the sample tables and runs the scan in one step, making iteration fast.\n\n*This is an MVP, a proof of concept that demonstrates the compute-to-data approach for PII discovery inside InterSystems IRIS. The small NLP models are a starting point, not a ceiling. The architecture is built to grow.*\n\n*This article was developed with the assistance of Artificial Intelligence tools for drafting and language refinement. 
All technical validation and final review were performed by the author.*", "url": "https://wpnews.pro/news/discovering-pii-inside-intersystems-iris", "canonical_source": "https://dev.to/intersystems/discovering-pii-inside-intersystems-iris-1i2l", "published_at": "2026-06-16 15:34:39+00:00", "updated_at": "2026-06-16 15:47:22.919791+00:00", "lang": "en", "topics": ["artificial-intelligence", "natural-language-processing", "developer-tools", "ai-safety", "ai-policy"], "entities": ["InterSystems IRIS", "Microsoft Presidio", "spaCy", "GDPR", "LGPD", "HIPAA", "Embedded Python"], "alternates": {"html": "https://wpnews.pro/news/discovering-pii-inside-intersystems-iris", "markdown": "https://wpnews.pro/news/discovering-pii-inside-intersystems-iris.md", "text": "https://wpnews.pro/news/discovering-pii-inside-intersystems-iris.txt", "jsonld": "https://wpnews.pro/news/discovering-pii-inside-intersystems-iris.jsonld"}}