{"slug": "stratagems-4-p-walked-into-an-ai-monitoring-poc-p-didn-t-run-a-single-test", "title": "Stratagems #4: P Walked Into an AI Monitoring POC. P Didn't Run a Single Test.", "summary": "An independent evaluator, known only as P, was brought in to assess two AI monitoring platforms, MonitorAI and SentryWave, during a proof-of-concept at industrial IoT firm FirmCore. Instead of asking technical questions or running tests, P set up a read-only data pipeline to silently observe the vendors' performance over three months, refusing to share evaluation metrics in advance to avoid biasing the results.", "body_md": "Exhaust the enemy's strength without fighting. Weaken the strong by nurturing the soft.\n\n— The 36 Stratagems, \"[Wait at Leisure While the Enemy Labors]\"\n\nP flipped the business card over and wrote one letter on the back: **P**.\n\nThen P walked into the conference room.\n\nP didn't do opening lines.\n\nP doesn't have a name — not yet, not in this series anyway. But if you've read the earlier stories, you'd recognize the signature.\n\n**The first story** — P's own article got flagged as \"low quality\" by the company's AI moderation system. P dug into the internal API, pulled 347 flagged records — effective accuracy came out to 38%. More false positives than correct identifications.\n\n**The second story** — an AI payment gateway processing $2.8 billion. The CTO backed it with formal verification, claimed it was \"mathematically bulletproof.\" P spent eight months quietly building an adversarial testing pipeline, and proved the gateway would approve illegal transactions.\n\nP won both times. P left zero fingerprints both times.\n\nAfter those two jobs, P stopped working for other people.\n\nThis time, P got brought in as an independent evaluator.\n\nThe customer was a mid-sized industrial IoT firm called **FirmCore**. Their production-line gear had been running for almost a decade. 
The monitoring system was going down once a month, and management had finally had enough. They decided to bring in an AI monitoring platform.\n\nA good call — right up until they decided to run two vendors through POC at the same time and pick a winner.\n\n\"We want to see who can actually cover our failure modes,\" the VP said in the meeting. \"We've also brought in an independent evaluator.\"\n\nP was that evaluator.\n\nThe two AI monitoring companies were **MonitorAI** and **SentryWave**. MonitorAI's pre-sales team went first, slides blazing with **\"99.3% fault coverage, validated across 3 manufacturing customers.\"** SentryWave followed right behind: **\"99.7% coverage, 7-day deployment\"** — bigger numbers, bolder font.\n\nP sat in the corner, laptop open. P didn't write a single word.\n\nThe meeting wrapped. The client team gathered around P. \"So, what do you think?\"\n\n**\"Nothing worth saying right now. Give it three months.\"**\n\nP didn't ask either company a single technical question. Didn't pull their data, didn't check their architecture, didn't ask to see prior POC records.\n\n\"Three months?\" the VP pressed. \"The POC contract is three months. What if you can't deliver a conclusion by then?\"\n\nP didn't look up. **\"Three months from now, I'll give you data. Right now I can't — the data hasn't spoken yet.\"**\n\n\"At least give me a monthly update.\"\n\nP didn't say yes. Didn't say no either.\n\nThe VP was quiet for a few seconds, then nodded.\n\nP added one more thing: **\"I'll set up an independent data pipeline for the evaluation. I won't tell you in advance which metrics I'm tracking — if I do, you'll just stare at those numbers, and then I won't be measuring you anymore.\"**\n\nThe VP gave P another look, then nodded.\n\nP spent a week going through FirmCore's security review. Read-only replica access, config registry access, model metadata read access — every layer required a sign-off. 
The security team triple-checked \"read-only, no impact on production, no change requests\" and cleared it. Since it was just read replicas, fitting into an existing data audit framework, the approval followed a standard template and took exactly one week.\n\nEnd of week one, P had a **read-only replica pipeline** running alongside FirmCore's production environment. Connected to more than just the real-time event stream — it was also hooked into the config registry and model metadata. No writes, no config changes, no production impact. Through the config tables, P could see exactly which data sources each vendor had connected to, without waiting for every failure type to trigger naturally. FirmCore's own ops team was drowning in production-line alerts every day — nobody was going to go digging through vendor config registries. That wasn't their job, and they didn't have the time.\n\nMonitorAI's dashboard went live. Three numbers on the big screen: real-time fault detection rate **97.8%**, false positive rate **2.1%**, average response time **1.2 seconds**.\n\nTheir slide deck had said 99.3%. P noticed the gap — and said nothing.\n\nThe VP posted a screenshot on Slack: **\"Now this is AI monitoring.\"**\n\nSentryWave didn't flinch. Their dashboard showed **98.2%** detection rate, **1.8%** false positives, **0.9 seconds** average response — every number just a hair better than MonitorAI's.\n\nThe client project manager sent out a weekly progress email. Two tables, numbers polished to a high shine.\n\nP read those emails. Then P looked at the data in the read-only pipeline.\n\nBehind MonitorAI's **97.8%** coverage number: out of **61 known failure modes, only 29 were actually covered**. The remaining 32 — MonitorAI hadn't connected to the corresponding data sources at all. But the dashboard didn't say \"uncovered,\" because their AI model counted false positives from similar patterns as \"detected.\"\n\nSentryWave's **98.2%** was subtler. 
They'd connected to 34 data sources, but on 5 of them, the detection thresholds were set absurdly low — generating **200+ false positives per day**. Those false alerts got auto-sorted into \"low priority\" and never appeared on the VP's dashboard, while FirmCore's ops team quietly dealt with them in the background. P knew: if SentryWave didn't fix the root cause, those false positives would keep coming for the full three months. Another 5 sources had thresholds set conservatively. P judged that if SentryWave later raised those thresholds further to suppress the false positive rate, detection on those categories would almost certainly drop below 50%.\n\nBy the end of the first week, the pipeline data already told the whole story. P closed the laptop. The rest of the time was just waiting — three months in which P would tell no one that the answer was already known. P had gotten used to this long ago: If you know too early, no one believes you.\n\n**Not time to speak yet.**\n\nMonitorAI started bleeding.\n\nTheir model hit **data drift** in production. One of the production lines had sensor frequencies that didn't match the POC phase — MonitorAI's training data hadn't covered that frequency band, and the model started flagging normal fluctuations as faults.\n\nWeek one: false positive rate climbed from **2.1%** to **7.4%**.\n\nWeek two: **15.2%**.\n\nWeek three: FirmCore's ops team muted MonitorAI's alert notifications.\n\nMonitorAI's engineers pulled all-nighters retraining models, shipping patches. But every time they fixed one batch of false positives, a new batch surfaced — their training set simply didn't have data from that production line. Patches couldn't fix a fundamental coverage gap.\n\nSentryWave seized the moment. They fired off an email to the VP with the subject line: **\"Stable, Reliable AI Monitoring — SentryWave's 30-Day Zero False Positive Upgrade.\"**\n\nP saw the email. P's reaction was nothing. 
The read-only pipeline showed exactly how SentryWave got to zero false positives: by **raising detection thresholds**. Fault detection rate dropped from **98.2%** to **79.4%**, but no one noticed, because no one was checking for the faults that weren't being detected.\n\nP wrote a single line in the notebook:\n\n**\"79.4%. It's safe now. And useless.\"**\n\nThen P closed the notebook and went to get a coffee.\n\nBoth companies were struggling.\n\nMonitorAI's production-line data drift hadn't been solved. Their second model update pushed false positives back down to **6.8%**, but the cost was detection sensitivity — the detection rate fell from 97.8% to 83.1%, and genuinely new failure modes started slipping through. A patch nudged it back up to 86.4%, but new modes were still being missed. A bearing fault on one of the lines went unnoticed for 48 hours — until it triggered a cascading shutdown.\n\nThe outage lasted 4 hours. Estimated loss: roughly **$47,000**.\n\nSentryWave didn't have data drift — because they'd chosen the most conservative integration approach, connecting only to the most stable data sources. The problem was, those stable sources happened to not cover FirmCore's most critical production lines. And those 5 failure categories P had flagged in week one as likely to drop below 50% — in month three, their detection rates landed between 34% and 48%, every one of them under the 50% line P had called three months earlier.\n\nFinal week of month three, a SentryWave config change triggered an **alert storm**. 3 AM. **1,400+ alerts** in 20 minutes. 1,380 of them were false positives. FirmCore's on-call engineer got woken up seven times. PagerDuty went nuclear.\n\nThe next morning, SentryWave's explanation was \"a one-time event during config migration.\" But the ops lead said one thing in the postmortem, and it made its way to the VP:\n\n**\"I'd rather have no AI monitoring. 
At least I wouldn't be waking up at 3 AM to confirm a fake alert.\"**\n\nThe VP didn't say anything. He looked at the increasingly ugly numbers in the weekly progress emails, then turned his gaze toward the corner where P had been sitting, silent, the entire time.\n\n**\"Your evaluation. Where is it?\"**\n\nP pulled a USB drive out of the bag. Plugged it in. Opened a folder.\n\nInside: four tables. That was it.\n\n**Table One — MonitorAI's real coverage:**\n\n```\nClaimed coverage: 99.3% (slides), 97.8% (dashboard)\nActual failure modes covered: 29/61\nTrue coverage: 47.5%\nThe 32 uncovered modes include: high-frequency sensor fluctuation, cross-line cascading failures, multi-sensor joint anomalies\n```\n\n**Table Two — SentryWave's real coverage:**\n\n```\nClaimed coverage: 99.7% (slides), 98.2% (dashboard)\nData sources connected: 34/61 (5 with thresholds too low → effectively 0, another 5 with detection below 50% after threshold increases)\nEffectively covered: 24/61\nTrue coverage: 39.3%\nReason for gaps: connected only to the most stable data sources, avoiding critical production lines; sacrificed detection sensitivity to suppress false positive rate\n```\n\n**Table Three — Three-month trend comparison:**\n\n```\nMonitorAI: 97.8% → data drift → 83.1% → patch → 86.4% → missed detections\nSentryWave: 98.2% → raised thresholds → 79.4% → alert storm → muted\n\nP's parallel data: 47.5% / 39.3% (true failure mode coverage — not the fluctuating detection rate on the dashboards. Static. Never changed from day one.)\n```\n\n**Table Four — Complete failure mode inventory:**\n\nAll 61 known failure modes, each annotated with which company covered it, coverage quality, and — **which ones neither company covered at all.**\n\n\"There are **12** failure modes that neither company covers,\" P said, voice flat. **\"Those 12 modes appeared in 9 of your P0 incidents over the last two years. 
So even if both POCs had been successful, you'd still crash in exactly the same places.\"**\n\nNobody in the room said anything.\n\nThe VP read through all four tables twice. Then he asked a question that, in three whole months, no one had thought to ask:\n\n**\"What's our real fault coverage? Do we have data on that?\"**\n\nP flipped to the last page of the notebook and slid it across the table.\n\nIt was a table P had drawn up three months earlier — **89 production incidents** from FirmCore's last two years, categorized by failure mode, annotated by severity, marked for AI monitoring coverage.\n\n**Coverage: 0/89.** Not because AI monitoring didn't exist — but because the 12 failure modes that actually took FirmCore down were exactly the 12 modes neither vendor covered. The other 49 covered modes? Hadn't triggered once in two years. Between the industry-standard fault taxonomy and the real failure distribution on the production floor, the gap was as wide as the factory floor itself. Those 12 high-frequency modes were specific to FirmCore's production environment — their sensor layout, cascading logic, ops workflows were completely different from any manufacturing customer the two vendors had ever touched.\n\n\"I could have given you these four tables three months ago,\" P said. **\"Would you have believed me then?\"**\n\nThe VP didn't answer. He didn't need to.\n\nMonitorAI and SentryWave's POCs were terminated the same week. Not because of P's report — but because both companies' own data had already proven they couldn't deliver.\n\nP was packing up to leave when the VP caught up.\n\n\"I've saved your report. One question — the real coverage numbers you cited. Where did you get the data? FirmCore never gave you production access.\"\n\nP paused.\n\n\"I never asked for production write access. **All I applied for was read-only replica access — no production impact, no change requests.**\"\n\n\"When did you set it up?\"\n\n\"First week. 
Security review took a week.\"\n\n\"First week? You didn't tell anyone?\"\n\nP looked at him the way you'd look at a student who just realized someone copied his homework.\n\n**\"I'm an independent evaluator. If I tell you in advance what I'm measuring, you'll just stare at those numbers. After three months, I wouldn't be getting your real performance — I'd be getting your exam scores.\"**\n\nThe VP's mouth opened, then closed.\n\n**\"The people who understand data are too busy writing slides, and the people who understand slides can't read data. From day one of this POC, no one thought to ask about coverage. I just sat on the answer no one wanted to hear for three months.\"**\n\nP tucked the USB drive into the bag and turned to leave. At the door, P looked back one last time:\n\n**\"Next time a vendor tells you they have 99% coverage — wait three months. See if their numbers still say that.\"**\n\nP walked toward the building's main entrance. Passing the front desk, P's eyes caught a stack of POC archive files — the kind of process documents nobody ever reads, sandwiched between signature pages and disclaimers. P wouldn't have noticed it at all, except the edge of a card was sticking out from the last page.\n\nP flipped to the back.\n\nA business card was tucked inside.\n\nNot MonitorAI. Not SentryWave.\n\n**ACL**\n\n**Automated Compliance Lab**\n\n*\"Compliance is not a cost — compliance is competitive advantage.\"*\n\nA name was handwritten on the back. No context. Like someone had placed it there deliberately.\n\nP looked at it. Pocketed the card.\n\nP didn't know what ACL was. But P knew one thing — **someone put that card there so P would find it.**\n\nP stepped outside. The sun was bright.\n\nP took a different route back to the station. Didn't pass \"The Third Cup\" café. But P still glanced in that direction.\n\n**This is \"Wait at leisure while the enemy labors\" — when everyone else is desperate to prove themselves, you do nothing. 
Let the numbers do the talking.**\n\n```\n[36 Stratagems Tactical Database v3.1] Loaded\n[Target Match] Wait at Leisure While the Enemy Labors\n[Analysis Mode] Full-situation scan\n━━━━━━━━━━━━━━━━━━━━\nTactical match: 94.2%\nProtagonist: P\nAction: During three months of POC between two AI monitoring platforms, P did not evaluate, interrupt, or question — just set up a read-only data pipeline to record real data in parallel. Let both platforms expose their own flaws, then produced conclusions P had already reached three months earlier.\nGoal: Evaluate the true failure-mode coverage of two AI monitoring platforms\nResult: Achieved — both POCs terminated, customer cancelled procurement process, P's report archived\n\nCounter-Detection:\n  - Strategic assumption risk: If MonitorAI had fixed data drift through online learning, or SentryWave had recalibrated thresholds without disclosing it — P's three months of silence would have failed. The \"wait at leisure\" strategy rests on the assumption that the opponent will make mistakes.\n  - Information asymmetry mirror: P hid evaluation metrics from the client, while SentryWave hid false positives in \"low priority\" queues — same strategy, different targets, identical mechanism.\n  - Read-only pipeline limitation: P's pipeline only observed without intervening. Could not actively inject test inputs to verify counterfactual scenarios after threshold adjustments. P's judgment that 5 SentryWave categories would drop below 50% after threshold increases was inference, not empirical proof.\n  - External monitoring signal: The ACL business card appearing in POC archive files suggests P may have been under observation during this evaluation. P noticed the card but did not investigate further. ACL's internal audit records show that in similar AI monitoring POC evaluations, cases with both data drift and threshold manipulation coexisting account for approximately 67% — closely matching this evaluation's dual-vendor failure pattern. 
But this is ACL's own data. P had no reason to know.\n\nCore Insights:\n  - P performed zero traditional \"evaluation\" actions in three months — P just waited.\n  - Neither company's failure was accidental — MonitorAI's data drift and SentryWave's threshold manipulation were structural product-design issues that P had already identified from the read-only pipeline by the end of week one.\n  - P chose not to speak because \"speaking would have changed nothing.\" The VP and the client, operating under information asymmetry, would not have trusted a negative conclusion delivered on day one.\n  - **The core of \"wait at leisure\" is not doing nothing — it's waiting for the opponent to expose their own weaknesses. And human nature, under competitive pressure, rushes to prove itself — which is the fastest way to expose weakness.**\n```\n\n*Next stratagem: Loot a Burning House*\n\n*P.S. English isn't my first language. I use AI to polish the writing and smooth out the rough edges. Thanks for reading.*", "url": "https://wpnews.pro/news/stratagems-4-p-walked-into-an-ai-monitoring-poc-p-didn-t-run-a-single-test", "canonical_source": "https://dev.to/xulingfeng/stratagems-4-p-walked-into-an-ai-monitoring-poc-p-didnt-run-a-single-test-1ejk", "published_at": "2026-07-01 12:44:35+00:00", "updated_at": "2026-07-01 12:48:53.296214+00:00", "lang": "en", "topics": ["ai-products", "ai-tools", "ai-infrastructure", "ai-safety", "developer-tools"], "entities": ["FirmCore", "MonitorAI", "SentryWave", "P"], "alternates": {"html": "https://wpnews.pro/news/stratagems-4-p-walked-into-an-ai-monitoring-poc-p-didn-t-run-a-single-test", "markdown": "https://wpnews.pro/news/stratagems-4-p-walked-into-an-ai-monitoring-poc-p-didn-t-run-a-single-test.md", "text": "https://wpnews.pro/news/stratagems-4-p-walked-into-an-ai-monitoring-poc-p-didn-t-run-a-single-test.txt", "jsonld": "https://wpnews.pro/news/stratagems-4-p-walked-into-an-ai-monitoring-poc-p-didn-t-run-a-single-test.jsonld"}}