Stratagems #4: P Walked Into an AI Monitoring POC. P Didn't Run a Single Test.

Exhaust the enemy's strength without fighting. Weaken the strong by nurturing the soft.

From The 36 Stratagems, "Wait at Leisure While the Enemy Labors"

P flipped the business card over and wrote one letter on the back: P.

Then P walked into the conference room.

P didn't do opening lines.

P doesn't have a name. Not yet, not in this series anyway. But if you've read the earlier stories, you'll recognize the signature.

**The first story**: P's own article got flagged as "low quality" by the company's AI moderation system. P dug into the internal API and pulled 347 flagged records. Effective accuracy came out to 38%: more false positives than correct identifications.

**The second story**: an AI payment gateway processing $2.8 billion. The CTO backed it with formal verification and claimed it was "mathematically bulletproof." P spent eight months quietly building an adversarial testing pipeline and proved the gateway would approve illegal transactions.

P won both times. P left zero fingerprints both times.

After those two jobs, P stopped working for other people.

This time, P got brought in as an independent evaluator.

The customer was a mid-sized industrial IoT firm called FirmCore. Their production-line gear had been running for almost a decade. The monitoring system was going down once a month, and management had finally had enough. They decided to bring in an AI monitoring platform.

A good call, right up until they decided to run two vendors through a POC at the same time and pick a winner.

"We want to see who can actually cover our failure modes," the VP said in the meeting. "We've also brought in an independent evaluator."

P was that evaluator.

The two AI monitoring companies were MonitorAI and SentryWave. MonitorAI's pre-sales team went first, slides blazing with "99.3% fault coverage, validated across 3 manufacturing customers." SentryWave followed right behind: "99.7% coverage, 7-day deployment" — bigger numbers, bolder font.

P sat in the corner, laptop open. P didn't write a single word.

The meeting wrapped. The client team gathered around P. "So, what do you think?"

"Nothing worth saying right now. Give it three months."

P didn't ask either company a single technical question. Didn't pull their data, didn't check their architecture, didn't ask to see prior POC records.

"Three months?" the VP pressed. "The POC contract is three months. What if you can't deliver a conclusion by then?"

P didn't look up. "Three months from now, I'll give you data. Right now I can't — the data hasn't spoken yet."

"At least give me a monthly update."

P didn't say yes. Didn't say no either.

The VP was quiet for a few seconds, then nodded.

P added one more thing: "I'll set up an independent data pipeline for the evaluation. I won't tell you in advance which metrics I'm tracking — if I do, you'll just stare at those numbers, and then I won't be measuring you anymore."

The VP gave P another look, then nodded.

P spent a week going through FirmCore's security review. Read-only replica access, config registry access, model metadata read access — every layer required a sign-off. The security team triple-checked "read-only, no impact on production, no change requests" and cleared it. Since it was just read replicas, fitting into an existing data audit framework, the approval followed a standard template and took exactly one week.

End of week one, P had a read-only replica pipeline running alongside FirmCore's production environment. Connected to more than just the real-time event stream — it was also hooked into the config registry and model metadata. No writes, no config changes, no production impact. Through the config tables, P could see exactly which data sources each vendor had connected to, without waiting for every failure type to trigger naturally. FirmCore's own ops team was drowning in production-line alerts every day — nobody was going to go digging through vendor config registries. That wasn't their job, and they didn't have the time.

MonitorAI's dashboard went live. Three numbers on the big screen: real-time fault detection rate 97.8%, false positive rate 2.1%, average response time 1.2 seconds.

Their slide deck had said 99.3%. P noticed the gap — and said nothing.

The VP posted a screenshot on Slack: "Now this is AI monitoring."

SentryWave didn't flinch. Their dashboard showed 98.2% detection rate, 1.8% false positives, 0.9 seconds average response — every number just a hair better than MonitorAI's.

The client project manager sent out a weekly progress email. Two tables, numbers polished to a high shine.

P read those emails. Then P looked at the data in the read-only pipeline.

Behind MonitorAI's 97.8% coverage number: out of 61 known failure modes, only 29 were actually covered. The remaining 32 — MonitorAI hadn't connected to the corresponding data sources at all. But the dashboard didn't say "uncovered," because their AI model counted false positives from similar patterns as "detected."

SentryWave's 98.2% was subtler. They'd connected to 34 data sources, but on 5 of them, the detection thresholds were set absurdly low — generating 200+ false positives per day. Those false alerts got auto-sorted into "low priority" and never appeared on the VP's dashboard, while FirmCore's ops team quietly dealt with them in the background. P knew: if SentryWave didn't fix the root cause, those false positives would keep coming for the full three months. Another 5 sources had thresholds set conservatively. P judged that if SentryWave later raised those thresholds further to suppress the false positive rate, detection on those categories would almost certainly drop below 50%.

By the end of the first week, the pipeline data already told the whole story. P closed the laptop. The rest of the time was just waiting — three months in which P would tell no one that the answer was already known. P had gotten used to this long ago: If you know too early, no one believes you.

Not time to speak yet.

MonitorAI started bleeding.

Their model hit data drift in production. One of the production lines had sensor frequencies that didn't match the POC phase — MonitorAI's training data hadn't covered that frequency band, and the model started flagging normal fluctuations as faults.

Week one: false positive rate climbed from 2.1% to 7.4%.

Week two: 15.2%.

Week three: FirmCore's ops team muted MonitorAI's alert notifications.

MonitorAI's engineers pulled all-nighters retraining models, shipping patches. But every time they fixed one batch of false positives, a new batch surfaced — their training set simply didn't have data from that production line. Patches couldn't fix a fundamental coverage gap.

SentryWave seized the moment. They fired off an email to the VP with the subject line: "Stable, Reliable AI Monitoring — SentryWave's 30-Day Zero False Positive Upgrade."

P saw the email and didn't react. The read-only pipeline showed exactly how SentryWave got to zero false positives: by raising detection thresholds. Fault detection rate dropped from 98.2% to 79.4%, but no one noticed, because no one was checking for the faults that weren't being detected.

P wrote a single line in the notebook:

"79.4%. It's safe now. And useless."

Then P closed the notebook and went to get a coffee.
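
For readers who want the mechanics behind SentryWave's "zero false positives," here is a minimal Python sketch. Everything in it is an illustrative assumption: the `events` list, the anomaly scores, and both thresholds are made up and are not data from the story. It only shows the general trade-off, that raising an alert threshold removes false alarms and real detections together.

```python
def rates(events, threshold):
    """Return (detection_rate, false_positive_rate) at a given alert threshold.

    events: list of (anomaly_score, is_real_fault) pairs (hypothetical data).
    An alert fires when anomaly_score >= threshold.
    """
    fault_scores = [score for score, is_fault in events if is_fault]
    normal_scores = [score for score, is_fault in events if not is_fault]
    detected = sum(score >= threshold for score in fault_scores)
    false_alarms = sum(score >= threshold for score in normal_scores)
    return detected / len(fault_scores), false_alarms / len(normal_scores)


# Made-up scores: five real faults, five normal fluctuations.
events = [
    (0.95, True), (0.80, True), (0.65, True), (0.55, True), (0.45, True),
    (0.60, False), (0.50, False), (0.30, False), (0.20, False), (0.10, False),
]

low = rates(events, threshold=0.40)   # sensitive setting
high = rates(events, threshold=0.70)  # "zero false positive" setting

print(low)   # every fault caught, but some normal readings also alert
print(high)  # no false alarms, and most real faults now go unreported
```

With these invented numbers, the higher threshold reaches a false positive rate of zero only by letting three of the five real faults through. A dashboard that shows the false positive rate alone would report that as an upgrade.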

Both companies were struggling.

MonitorAI's production-line data drift hadn't been solved. Their second model update pushed false positives back down to 6.8%, but the cost was detection sensitivity — the detection rate fell from 97.8% to 83.1%, and genuinely new failure modes started slipping through. A patch nudged it back up to 86.4%, but new modes were still being missed. A bearing fault on one of the lines went unnoticed for 48 hours — until it triggered a cascading shutdown.

The outage lasted 4 hours. Estimated loss: roughly $47,000.

SentryWave didn't have data drift, because they'd chosen the most conservative integration approach, connecting only to the most stable data sources. The problem was that those stable sources happened not to cover FirmCore's most critical production lines. And the 5 failure categories P had flagged in week one as likely to drop below 50%? In month three, their detection rates landed between 34% and 48%, every one of them under the line P had drawn three months earlier.

Final week of month three, a SentryWave config change triggered an alert storm. 3 AM. 1,400+ alerts in 20 minutes. 1,380 of them were false positives. FirmCore's on-call engineer got woken up seven times. PagerDuty went nuclear.

The next morning, SentryWave's explanation was "a one-time event during config migration." But the ops lead said one thing in the postmortem, and it made its way to the VP:

"I'd rather have no AI monitoring. At least I wouldn't be waking up at 3 AM to confirm a fake alert."

The VP didn't say anything. He looked at the increasingly ugly numbers in the weekly progress emails, then turned his gaze toward the corner where P had been sitting, silent, the entire time.

"Your evaluation. Where is it?"

P pulled a USB drive out of the bag. Plugged it in. Opened a folder.

Inside: four tables. That was it.

Table One — MonitorAI's real coverage:

Claimed coverage: 99.3% on the slides, 97.8% on the dashboard
Actual failure modes covered: 29/61
True coverage: 47.5%
The 32 uncovered modes include: high-frequency sensor fluctuation, cross-line cascading failures, multi-sensor joint anomalies

Table Two — SentryWave's real coverage:

Claimed coverage: 99.7% on the slides, 98.2% on the dashboard
Data sources connected: 34/61 (5 with thresholds too low → effectively 0, another 5 with detection below 50% after threshold increases)
Effectively covered: 24/61
True coverage: **39.3%**
Reason for gaps: connected only to the most stable data sources, avoiding critical production lines; sacrificed detection sensitivity to suppress false positive rate
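
The arithmetic behind the two "true coverage" figures is simple enough to check by hand. As a minimal sketch in Python: the counts come from P's tables, while the function name `mode_coverage` is just an illustrative choice, and the code assumes (as the tables appear to) that one connected data source corresponds to one failure mode.

```python
KNOWN_FAILURE_MODES = 61  # total known failure modes in P's inventory


def mode_coverage(effectively_covered: int, known_modes: int) -> float:
    """Percentage of known failure modes that are effectively covered."""
    return round(100 * effectively_covered / known_modes, 1)


# MonitorAI: 29 of 61 modes had a connected data source.
monitorai = mode_coverage(29, KNOWN_FAILURE_MODES)

# SentryWave: 34 connected, minus 5 drowned in false positives,
# minus 5 whose detection fell below 50% after threshold increases.
sentrywave_effective = 34 - 5 - 5
sentrywave = mode_coverage(sentrywave_effective, KNOWN_FAILURE_MODES)

print(f"MonitorAI:  {monitorai}%")
print(f"SentryWave: {sentrywave}% ({sentrywave_effective}/{KNOWN_FAILURE_MODES})")
```

Neither number depends on anything the dashboards measured. Both are counts of what was connected and usable, which is why they could be known in week one.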

Table Three — Three-month trend comparison:

MonitorAI: 97.8% → data drift → 83.1% → patch → 86.4% → missed detections
SentryWave: 98.2% → raised thresholds → 79.4% → alert storm → muted

P's parallel data: 47.5% / 39.3% (true failure mode coverage — not the fluctuating detection rate on the dashboards. Static. Never changed from day one.)

Table Four — Complete failure mode inventory:

All 61 known failure modes, each annotated with which company covered it, coverage quality, and — which ones neither company covered at all.

"There are 12 failure modes that neither company covers," P said, voice flat. "Those 12 modes appeared in 9 of your P0 incidents over the last two years. So even if both POCs had been successful, you'd still crash in exactly the same places."

Nobody in the room said anything.

The VP read through all four tables twice. Then he asked a question that, in three whole months, no one had thought to ask:

"What's our real fault coverage? Do we have data on that?"

P flipped to the last page of the notebook and slid it across the table.

It was a table P had drawn up three months earlier — 89 production incidents from FirmCore's last two years, categorized by failure mode, annotated by severity, marked for AI monitoring coverage.

Coverage: 0/89. Not because AI monitoring didn't exist — but because the 12 failure modes that actually took FirmCore down were exactly the 12 modes neither vendor covered. The other 49 covered modes? Hadn't triggered once in two years. Between the industry-standard fault taxonomy and the real failure distribution on the production floor, the gap was as wide as the factory floor itself. Those 12 high-frequency modes were specific to FirmCore's production environment — their sensor layout, cascading logic, ops workflows were completely different from any manufacturing customer the two vendors had ever touched.
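
The gap between "modes covered" and "incidents covered" can be sketched in a few lines of Python. The totals (61 modes, 12 uncovered by both vendors, 89 incidents) come from the story. The numeric mode IDs and the way incidents are spread across the 12 modes are placeholder assumptions, since the story gives no per-mode breakdown.

```python
# Hypothetical mode IDs 1..61; IDs 1..12 stand in for the modes
# that neither vendor covered.
known_modes = set(range(1, 62))
uncovered_modes = set(range(1, 13))
covered_modes = known_modes - uncovered_modes

# 89 historical incidents, each tagged with the failure mode that caused it.
# The even spread over the 12 uncovered modes is an illustrative assumption.
incidents = [(i % 12) + 1 for i in range(89)]

# Coverage counted by failure mode (what a vendor comparison would show).
by_mode = round(100 * len(covered_modes) / len(known_modes), 1)

# Coverage counted by incident (what the production floor experiences).
covered_incidents = sum(1 for mode in incidents if mode in covered_modes)

print(f"Modes covered by at least one vendor: {len(covered_modes)}/{len(known_modes)} ({by_mode}%)")
print(f"Incidents that monitoring would have caught: {covered_incidents}/{len(incidents)}")
```

Counting by mode makes the two vendors together look respectable. Weighting by what actually failed gives the 0/89 in P's notebook.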

"I could have given you these four tables three months ago," P said. "Would you have believed me then?"

The VP didn't answer. He didn't need to.

MonitorAI and SentryWave's POCs were terminated the same week. Not because of P's report — but because both companies' own data had already proven they couldn't deliver.

P was packing up to leave when the VP caught up.

"I've saved your report. One question — the real coverage numbers you cited. Where did you get the data? FirmCore never gave you production access."

P paused.

"I never asked for production write access. All I applied for was read-only replica access — no production impact, no change requests."

"When did you set it up?"

"First week. Security review took a week."

"First week? You didn't tell anyone?"

P looked at him the way you'd look at a student who just realized someone copied his homework.

"I'm an independent evaluator. If I tell you in advance what I'm measuring, you'll just stare at those numbers. After three months, I wouldn't be getting your real performance — I'd be getting your exam scores."

The VP's mouth opened, then closed.

"The people who understand data are too busy writing slides, and the people who understand slides can't read data. From day one of this POC, no one thought to ask about coverage. I just sat on the answer no one wanted to hear for three months."

P tucked the USB drive into the bag and turned to leave. At the door, P looked back one last time:

"Next time a vendor tells you they have 99% coverage — wait three months. See if their numbers still say that."

P walked toward the building's main entrance. Passing the front desk, P's eyes caught a stack of POC archive files — the kind of process documents nobody ever reads, sandwiched between signature pages and disclaimers. P wouldn't have noticed it at all, except the edge of a card was sticking out from the last page.

P flipped to the back.

A business card was tucked inside.

Not MonitorAI. Not SentryWave.

ACL

Automated Compliance Lab

"Compliance is not a cost — compliance is competitive advantage."

A name was handwritten on the back. No context. Like someone had placed it there deliberately.

P looked at it. Pocketed the card.

P didn't know what ACL was. But P knew one thing — someone put that card there so P would find it.

P stepped outside. The sun was bright.

P took a different route back to the station. Didn't pass "The Third Cup" café. But P still glanced in that direction.

This is "Wait at leisure while the enemy labors" — when everyone else is desperate to prove themselves, you do nothing. Let the numbers do the talking.

[36 Stratagems Tactical Database v3.1] Loaded
[Target Match] Wait at Leisure While the Enemy Labors
[Analysis Mode] Full-situation scan
━━━━━━━━━━━━━━━━━━━━
Tactical match: 94.2%
Protagonist: P
Action: During three months of POC between two AI monitoring platforms, P did not evaluate, interrupt, or question — just set up a read-only data pipeline to record real data in parallel. Let both platforms expose their own flaws, then produced conclusions P had already reached three months earlier.
Goal: Evaluate the true failure-mode coverage of two AI monitoring platforms
Result: Achieved — both POCs terminated, customer cancelled procurement process, P's report archived

Counter-Detection:
  - Strategic assumption risk: If MonitorAI had fixed data drift through online learning, or SentryWave had recalibrated thresholds without disclosing it — P's three months of silence would have failed. The "wait at leisure" strategy rests on the assumption that the opponent will make mistakes.
  - Information asymmetry mirror: P hid evaluation metrics from the client, while SentryWave hid false positives in "low priority" queues — same strategy, different targets, identical mechanism.
  - Read-only pipeline limitation: P's pipeline only observed without intervening. Could not actively inject test inputs to verify counterfactual scenarios after threshold adjustments. P's judgment that 5 SentryWave categories would drop below 50% after threshold increases was inference, not empirical proof.
  - External monitoring signal: The ACL business card appearing in POC archive files suggests P may have been under observation during this evaluation. P noticed the card but did not investigate further. ACL's internal audit records show that in similar AI monitoring POC evaluations, cases with both data drift and threshold manipulation coexisting account for approximately 67% — closely matching this evaluation's dual-vendor failure pattern. But this is ACL's own data. P had no reason to know.

Core Insights:
  - P performed zero traditional "evaluation" actions in three months — P just waited.
  - Neither company's failure was accidental — MonitorAI's data drift and SentryWave's threshold manipulation were structural product-design issues that P had already identified from the read-only pipeline by the end of week one.
  - P chose not to speak because "speaking would have changed nothing." The VP and the client, operating under information asymmetry, would not have trusted a negative conclusion delivered on day one.
  - **The core of "wait at leisure" is not doing nothing — it's waiting for the opponent to expose their own weaknesses. And human nature, under competitive pressure, rushes to prove itself — which is the fastest way to expose weakness.**

Next stratagem: Loot a Burning House

P.S. English isn't my first language. I use AI to polish the writing and smooth out the rough edges. Thanks for reading.

source & further reading

dev.to: original article