Can AI Do Intelligence Analysis? Apparently Not

A cybersecurity professional's weekend project to build an AI-powered intelligence analysis team failed, demonstrating that large language models cannot effectively apply Structured Analytical Techniques (SAT) to tasks like attribution analysis and geopolitical forecasting. The system, designed with six specialized AI agents mimicking the intelligence cycle, could not replicate the collaborative, bias-minimizing reasoning required for accurate analysis. The failure highlights fundamental limitations of current AI in performing rigorous, structured analytical work that depends on human group dynamics and trained methodological expertise.

I built my own intel analyst team using AI agents as a weekend project

Structured Analytical Techniques (SAT) are a set of methods used in intelligence analysis to minimize cognitive shortcuts and biases. For cases like attribution analysis, geopolitical CTI, or forecasting, applying them is a must for accurate analysis. But as is also well known, these techniques require people who are trained in SAT and who work through the methods as a group. That makes analysis done this way slow and costly, and in the private sector there is rarely enough time to devote to it.

To solve this problem, I wondered: could I build an analyst team made up of AI agents? If it worked, I would have an automation that could apply these time-consuming, costly structured techniques on its own and produce analysis rather quickly.

This post is about the path I took, the things I tried, and my thoughts on why it didn’t work.

The architecture

The idea was a pipeline of specialized agents, each owning a distinct role in the intelligence cycle, from intake all the way to finished reporting. Here's what I designed:

01 - Stakeholder interviewer

The entry point. This agent chats with the customer to capture the intelligence requirement in an unstructured, conversational format. It asks follow-up questions to understand not just what the customer wants to know, but why they need it and what decisions the intelligence will inform. From that conversation, it distills a primary intelligence question and breaks it down into a set of focused sub-questions that can actually drive collection.

02 - Collection planner

Has access to a curated database of trusted information sources, each carrying an assigned trust score that reflects its reliability and accuracy.
This agent takes the sub-questions from the interviewer, figures out what kind of data is needed to answer each one, and maps them to the best-fit sources available. It deliberately avoids over-relying on any single source; diversity of sourcing is a hard requirement. The trust scores it assigns here follow the information all the way through to the final analysis. Additionally, the collection planner does not pass the intelligence question itself to the collector. It only specifies what to look for and where. To avoid confirmation bias, the collector must not know what conclusion the analysis is working toward.

03 - Intelligence collector

Executes the collection plan by looking up the specified sources and retrieving anything potentially relevant. Critically, this agent does not interpret or summarize what it finds. It surfaces information verbatim, exactly as it appears in the source. Its job is to bring raw material to the table, not to answer the question. When the plan's sources come up empty, it can fall back to open-source internet search, but must explicitly flag those results as coming from an unverified source.

04 - Analysis coordinator

The orchestrator of the analytical layer. This agent looks at the intelligence question and the collected material, then decides which analytical techniques are most appropriate for the problem at hand. Each technique is its own independent agent. Think of them as individual analysts with different methodological specializations. The coordinator selects which ones to engage and manages the flow of information between them.

05 - Analyst agents

A multi-agent layer that runs the actual structured analysis. Rather than a single "analyze this" prompt, these agents work through a deliberate reasoning loop: scenario generation, refutation, evidence weighting, assumption checking. This process mirrors how a real SAT session would operate. They also communicate with each other and can loop back to collection when they identify gaps.
More on this below.

06 - Reporter

Takes the analytical output and translates it into something a real stakeholder can act on. This agent looks at the use-case established in the intake stage, builds a mental model of the audience, and structures the report accordingly. The governing principle is BLUF (Bottom Line Up Front). The key assessment and its implications should be readable in under 30 seconds, without having to wade through caveats and methodology. Alternative scenarios are included, but the report takes a clear position and explains why.

The analysis layer

The analyst layer deserves a closer look. I wanted the agents to run through a structured reasoning loop, one that encodes the actual analytical steps that make SAT valuable in the first place:

Scenario generation
Start by generating a broad list of possible scenarios using a creative SAT: brainstorming without pre-filtering. The goal here is to avoid anchoring on the obvious explanation too early.

First refutation pass
Go through the scenario list and eliminate any that the evidence already in hand directly contradicts. This is an active pruning step.

Collection gap identification
For each scenario that survived, articulate what evidence or conditions would refute it. Then send that list back to the collection planner as new intelligence requirements. This feedback loop is intended to make the architecture more reliable.

Second refutation pass
When the new collection comes back, revisit the remaining scenarios. Does any of the new information rule something out? If so, eliminate it. Repeat until no further elimination is possible.

ACH pass
For surviving scenarios, flip the question: what in the evidence supports each one, and how strongly? This is Analysis of Competing Hypotheses (ACH): systematically evaluating how well the available information fits each remaining explanation.

Evidence weighting
Rank the remaining scenarios by the combined quality of the evidence supporting each.
This is where the trust scores assigned during collection finally come into play.

Key assumptions check
Identify the baseline assumptions underpinning each scenario. If any of those assumptions turned out to be wrong, would the scenario collapse? If multiple scenarios share the same critical assumption, that assumption itself becomes a collection target. Loop back to the planner, collect new information and try to verify or refute it.

Scenario selection
Converge on 2–3 non-overlapping, mutually alternative scenarios that have survived both refutation rounds and hold up under the assumptions check. These become the candidates for the final assessment.

Devil’s advocate
Deliberately stress-test each of the remaining scenarios. What are its weakest points? Then apply alternative thinking: offer a plausible alternative explanation for each one that the evidence could also support. This is the check against overconfidence.

Final assessment
Choose the most likely scenario based on evidence quality and resilience to assumption failure. Rank the others below it. Document supporting evidence, key weaknesses, and plausible alternatives for all of them.

This architecture attempts to operationalize the structured analysis process. The design was also made modular to make adding new SATs easy in the future. But what did the user experience look like?

User experience

I tested it with the following question: “if China moves on Taiwan, what cyber operations should we expect, directed at whom, and with what objectives?”

The intake experience was exactly what I'd hoped for. Rather than dumping a requirement into a form, the tool opened a conversation. Two short exchanges and it had a focused requirement.

Deriving the intelligence questions

Looking at the session logs told a different story. I checked the primary question and the sub-questions it had generated, and the problems were visible immediately.
RESPONSE:
{
  "primary question": "What are China's likely cyber operations, targets, and objectives in the event of a military contingency involving Taiwan?",
  "sub questions": [
    "What are China's primary strategic objectives for cyber operations during a military contingency involving Taiwan?",
    "What critical infrastructure sectors, government entities, and private organizations are China's most likely cyber targets during such a contingency?",
    "What specific types of cyber operations (e.g., espionage, disruption, destruction, influence) and attack vectors is China likely to employ?",
    "What are the anticipated operational and strategic effects of these cyber operations on targeted entities?"
  ],
  "stakeholder use case": "This intelligence will be used to prioritize detection gaps and inform the development of incident response playbooks.",
  "stakeholder persona": "Senior Cyber Security Strategist/CISO. Highly technical, focused on proactive defense, risk mitigation, and incident response preparedness.",
  "urgency": "routine"
}
11:38:53 ◄ STAKEHOLDER INTERVIEW — done

First, these questions are redundant and heavily overlapping. Good sub-questions should be mutually exclusive and collectively exhaustive: you can answer each one independently, and together the answers give you everything needed to answer the primary question. These don’t do that. They all require the same information to answer.

Second, all the questions are anticipatory: “what is China likely to do,” “what are the anticipated effects.” That is the wrong question type for collection. You cannot answer anticipatory questions by collection. You might get a rough answer through analysis, by reasoning from known doctrine, history, and capabilities. But collection gathers facts about what exists, what has happened, what has been observed. Sending an anticipatory question to a collection planner is like asking a researcher to go find evidence for something that hasn’t happened yet. The framing is wrong at the source.
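Both problems can be screened for, at least crudely, before a requirement reaches the collection planner. The sketch below is not part of the tool described in this post; it is a minimal illustration under stated assumptions. The marker words, the stopword list, and the 0.5 threshold are hypothetical choices, and lexical overlap is only a rough proxy for "requires the same information to answer".

```python
import re
from itertools import combinations

# Hypothetical marker words suggesting a question asks about the future.
ANTICIPATORY = re.compile(r"\b(likely|anticipated|expected|will|would)\b", re.IGNORECASE)

# Hypothetical stopword list; a real one would be longer.
STOPWORDS = {"what", "which", "are", "the", "and", "for", "during",
             "such", "these", "this", "that", "with", "most"}


def is_anticipatory(question: str) -> bool:
    """Flag questions whose wording asks about what might happen."""
    return ANTICIPATORY.search(question) is not None


def content_words(question: str) -> set[str]:
    """Lowercased words longer than two letters, minus stopwords."""
    words = re.findall(r"[a-z]+", question.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_pairs(questions: list[str], threshold: float = 0.5):
    """Return (i, j, score) for question pairs whose word overlap meets the threshold."""
    word_sets = [content_words(q) for q in questions]
    pairs = []
    for i, j in combinations(range(len(questions)), 2):
        score = round(jaccard(word_sets[i], word_sets[j]), 2)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

A word list like this catches the sub-questions above that contain "likely" or "anticipated", but it would miss the first one, which is anticipatory only in context. A check like this can route suspicious questions back to the interviewer for rephrasing; it cannot decide on its own whether a question is collectable.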
Collection planning

The collection planner's output was next. I'm not going to walk through every single one. Instead, you can go through them and try to spot the mistakes yourself. But let me point out the first few that immediately caught my eye.

RESPONSE:
{
  "task id": "T1.1",
  "retrieve": "Retrieve analyses of China's stated and observed strategic interests related to cyber operations in the context of regional security scenarios involving Taiwan.",
  "source id": "src rand",
  "source name": "RAND Corporation",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T1.2",
  "retrieve": "Collect reports and official statements outlining Chinese policy and strategic perspectives on cyber operations in the Asia-Pacific region.",
  "source id": "src xinhua",
  "source name": "Xinhua News Agency",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T1.3",
  "retrieve": "Gather strategic analyses from a UK perspective on nation-state cyber capabilities and their potential application in geopolitical conflicts, particularly in the Asia-Pacific.",
  "source id": "src rusi",
  "source name": "RUSI (Royal United Services Institute)",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T2.1",
  "retrieve": "Collect advisories and reports detailing observed or assessed targeting patterns by state-sponsored actors, specifically identifying critical infrastructure sectors and government entities.",
  "source id": "src cisa",
  "source name": "CISA Advisories",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T2.2",
  "retrieve": "Retrieve threat intelligence reports on APT activities, focusing on identified target sectors including critical infrastructure, government, and finance, without specific attribution to China.",
  "source id": "src mandiant",
  "source name": "Mandiant Threat Intelligence",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T2.3",
  "retrieve": "Gather intelligence reports on APT and eCrime targeting patterns, specifically identifying government, defense, and financial sector entities, without specific attribution to China.",
  "source id": "src crowdstrike",
  "source name": "CrowdStrike Intelligence",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T3.1",
  "retrieve": "Collect documentation on common and advanced TTPs, malware, and defensive measures associated with state-sponsored cyber operations.",
  "source id": "src mitre",
  "source name": "MITRE ATT&CK",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T3.2",
  "retrieve": "Retrieve technical reports and analyses on APT activities, malware analysis, and zero-day exploits, focusing on nation-state capabilities without specific attribution to China.",
  "source id": "src kaspersky",
  "source name": "Kaspersky Global Research & Analysis Team (GReAT)",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T3.3",
  "retrieve": "Gather threat intelligence on eCrime, dark-web activities, and ransomware operations, including observed TTPs and attack vectors that could be adapted for state-sponsored disruption.",
  "source id": "src group ib",
  "source name": "Group-IB Threat Intelligence",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T4.1",
  "retrieve": "Collect studies and reports analyzing the operational and strategic impacts of cyber incidents on critical infrastructure, government, and private sector entities.",
  "source id": "src rand",
  "source name": "RAND Corporation",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T4.2",
  "retrieve": "Retrieve reports detailing the observed impacts of vulnerabilities, APT activities, and dark-web operations on targeted organizations and sectors.",
  "source id": "src recorded future",
  "source name": "Recorded Future",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
},
{
  "task id": "T4.3",
  "retrieve": "Gather open-source investigations and reports on the real-world effects of disinformation campaigns and cyber-related incidents in conflict zones, focusing on operational and societal impacts.",
  "source id": "src bellingcat",
  "source name": "Bellingcat",
  "output format": "verbatim excerpts with dates and any confidence indicators present in the source"
}
11:39:12 ◄ COLLECTION PLANNER — done

First thing that jumps out is the “without specific attribution to China” instruction, which appears on several of the technical collection tasks. This prompt doesn’t meaningfully blind the collector; it just makes the phrasing awkward while leaving the direction of inquiry completely transparent.

The second problem is scope. The technical collection tasks ask for TTPs, malware and attack vectors associated with state-sponsored operations as a general category. No collection effort can meaningfully execute against a requirement that wide. Good collection tasks are specific enough that a researcher knows when they’ve found what they’re looking for and when they haven’t.

The intelligence report

Now, let’s have a look at the final report together.
==================== Intelligence Assessment ====================

INTELLIGENCE ASSESSMENT: China's Cyber Campaign in a Taiwan Contingency

Classification: UNCLASSIFIED // FOR OFFICIAL USE ONLY    Date: 2024-10-27    Urgency: routine

BOTTOM LINE UP FRONT

The most likely course of action is a massive wave of disruptive cyberattacks conducted by state-affiliated proxies like APT41, targeting Taiwan's civilian financial and media sectors. The primary objective is to induce societal collapse and create widespread chaos under a veil of plausible deniability, enabling a swift military victory. We assess this scenario with MODERATE confidence. Confidence would rise to HIGH if we observed direct tasking or coordination between the PLA SSF and known criminal proxies targeting Taiwanese infrastructure; confidence would fall further if these proxies exclusively targeted entities outside the conflict zone for financial gain.

PRIMARY ASSESSMENT: Deniable Proxy Storm (Contrarian)

This assessment projects that China will prioritize plausible deniability and societal disruption over a direct, attributable state-on-state cyber assault. The campaign will leverage the established criminal-state nexus of actors like APT41 to paralyze civilian functions, creating an environment of chaos that the People's Liberation Army (PLA) can exploit.

Preparatory phase (ongoing activity before the primary event): State-sponsored actors with a history of financially motivated crime, notably APT41, are pre-positioning access within Taiwan's financial, software development, and media sectors. This activity, currently indistinguishable from routine espionage or criminal reconnaissance, establishes the footholds necessary for a rapid, large-scale disruptive campaign.

Active phase (initial execution): At the onset of a military contingency, the PLA Strategic Support Force (SSF) will signal these proxies to activate. APT41 and associated nationalist hacktivist groups will launch a widespread campaign using ransomware and wiper attacks against Taiwanese banks, media outlets, and telecommunication providers. These attacks will be framed as criminal extortion or patriotic hacktivism, providing Beijing with deniability and complicating an international response.

Sustained phase (ongoing operations): As the proxy storm degrades public trust and paralyzes civilian services, the PLA SSF will conduct its own limited, focused cyberattacks against hardened military targets, such as the Ministry of National Defense networks. The sustained chaos in the civilian sphere is intended to sap Taiwan's will to resist and delay a coordinated political or military response from the United States and its allies.

Doctrinal Context: Adversary doctrine: 'Systems Destruction Warfare' — this scenario achieves this by paralyzing civilian society through deniable proxies, creating chaos that undermines Taiwan's will to resist before its military can fully mobilize.

Key Actors:

• APT41 (Barium, Wicked Panda): Primary executor of disruptive attacks against civilian critical infrastructure, leveraging its established criminal-state nexus for plausible deniability.
• PLA Strategic Support Force (SSF): The command-and-control element, coordinating and enabling the proxy campaign while conducting limited, direct attacks against high-value military targets.

Supporting Evidence:

• MITRE ATT&CK A1 data on APT41's dual role in state-sponsored espionage and financially motivated crime (d16721eb).
• Bellingcat C6 analysis of state-level information blackouts used to control a conflict narrative, analogous to the objective of isolating Taiwan (284d4394).

Confidence: MODERATE — This assessment is based on a combined score of 0.3968. Confidence is moderated by the high quality of evidence regarding APT41's dual nature, which is offset by a lack of direct intelligence showing state-level command and control of these proxies for a Taiwan-specific contingency.

Analytical Weaknesses:

• Evidence Gap: The assessment lacks direct evidence of command-and-control or coordination between the PLA SSF and non-state actors like APT41 for specific wartime operations against Taiwan.
• Alternative Explanation for Evidence: The primary evidence (APT41's dual state-espionage and criminal nature) is also consistent with opportunistic, uncoordinated criminal attacks during a conflict, rather than a state-directed campaign for plausible deniability.
• Untested Assumption: The scenario relies heavily on the unvalidated assumption A06 that Beijing can reliably direct proxies to achieve precise strategic effects during a high-stakes military contingency.
• Reflexive Control: No indicators of deliberate evidence manipulation identified for this scenario.

Alternative Explanation: The evidence of APT41's sophisticated espionage capabilities could alternatively be interpreted as preparation for a stealthy campaign to degrade Taiwan's military readiness (S5), rather than a noisy, deniable proxy storm against civilian targets.

----------------------------------------------------------------

TARGET MATRIX

| Phase | Actor / Group | Primary Targets | Method | Adversary Objective |
|-------|---------------|-----------------|--------|---------------------|
| Active | APT41 / Proxies | Financial Sector; Public Media Outlets (e.g., Public Television Service) | Ransomware, Wipers, DDoS | Induce societal collapse and create widespread panic with plausible deniability. |
| Active | PLA Strategic Support Force (SSF) | Ministry of National Defense networks; Tactical data links | Disruptive/destructive attacks | Sever military command authority between Heng Shan Military Command Center and field units. |
| Active | Volt Typhoon | Energy Sector Grid Operator (Taipower) | Activation of pre-positioned malware in ICS/SCADA systems | Initiate widespread, rolling blackouts to cripple military logistics and induce civil chaos. |
| Active | PLA Strategic Support Force (SSF) | Telecom Submarine Cable Operator (Chunghwa Telecom) | Destructive wiper attack on control systems | Sever Taiwan's primary data links to the global internet, isolating the island from international support. |
| Preparatory | APT41 | Technology Sector, Software/Hardware Supply Chain | Supply chain compromise | Implant backdoors in widely-used enterprise software for future activation and widespread disruption. |

----------------------------------------------------------------

ALTERNATIVE EXPLANATION 1: The Cyber Fizzle (Null Scenario)

This scenario posits that despite extensive pre-positioning by actors like Volt Typhoon, China's leadership will withhold its most destructive cyber capabilities, or the attacks will prove ineffective. This could result from a risk-averse calculation to avoid a guaranteed US military intervention, a belief that kinetic effects alone are sufficient, or highly successful defensive actions by Taiwan and its allies. In this outcome, the conflict proceeds without the anticipated strategic paralysis from cyber operations.

Supporting Evidence:

• Supported primarily by the current absence of specific, widespread technical evidence of pre-positioned destructive malware in Taiwanese critical infrastructure (absent evidence widespread pre-positioning).

Why Not Primary: This scenario underestimates China's doctrinal emphasis on integrated information warfare ("Systems Destruction Warfare") and the strategic value of creating chaos to enable a swift victory. The known, persistent pre-positioning by actors like Volt Typhoon suggests a clear intent for disruptive action, making a complete "fizzle" less likely than a proxy-based operation that manages attribution risk.

----------------------------------------------------------------

ALTERNATIVE EXPLANATION 2: Synchronized Shock and Awe

In this scenario, the PLA Strategic Support Force (SSF) would execute a massive, fully integrated cyber and kinetic assault at the onset of hostilities. Pre-positioned malware activated by Volt Typhoon would cripple the Taipower grid, while coordinated wiper attacks would sever military command and control and isolate Taiwan by targeting telecommunication operators like Chunghwa Telecom. This course of action directly mirrors the "Systems Destruction Warfare" doctrine but carries extreme risks of attribution and escalation.

Supporting Evidence:

• Bellingcat C6 analysis of state-level information blackouts, which aligns with the objective of isolating Taiwan (284d4394).

Why Not Primary: This scenario is assessed with LOW confidence. It assumes a level of operational readiness and widespread, destructive access for which we lack specific technical evidence. Furthermore, the certainty of attribution for such a large-scale, direct state attack would guarantee a swift and severe international response, undermining China's strategic goal of a rapid, decisive victory that deters third-party intervention.

----------------------------------------------------------------

KEY ASSUMPTIONS AND CAVEATS

• The primary assessment hinges on the assumption that the Chinese state can effectively command and control non-state proxies like APT41 to achieve strategic objectives. If this assumption is wrong, the "Deniable Proxy Storm" scenario collapses, making a direct state-on-state attack ("Shock and Awe") or a "Cyber Fizzle" more likely.
• All scenarios except the "Cyber Fizzle" assume that the PLA and its proxies possess the advanced capabilities and persistent access to execute devastating cyberattacks. This is a shared baseline assumption that underpins our threat assessment.
• The primary assessment would shift from a proxy-led campaign to a direct state-on-state attack if intelligence revealed widespread deployment of destructive wiper malware in Taiwanese critical infrastructure directly attributable to PLA SSF military units.

COLLECTION GAPS

• We lack signals or human intelligence on the command-and-control relationship between the PLA SSF and non-state actors like APT41. This visibility is critical to distinguishing between a state-directed proxy campaign (our primary assessment) and uncoordinated criminal opportunism. Specifically: we need to identify tasking, targeting, and financial flows between these entities.
• Our visibility into Taiwanese critical infrastructure Operational Technology (OT) networks is limited. We cannot confirm or deny the widespread pre-positioning of dormant destructive malware by actors like Volt Typhoon. Confirmation of such implants would significantly raise the confidence in the "Synchronized Shock and Awe" scenario.

=================================================================

Looking at the session logs, the collected intelligence contained nothing about China, Taiwan, APT41, PLA doctrine or any of the actors and targets that appear in the final report. These came from the model’s own training data. That is a problem, because training data is internet data, and the internet’s coverage of cyber warfare is dominated by hype and FUD: breathless threat reports, vendor marketing dressed as intelligence, apocalyptic scenario planning that gets clicks. That’s the substrate the model is drawing from.

So the scenarios read exactly like that: APT41 proxies activating on PLA SSF signal, simultaneous ransomware across Taiwanese banks and media, wiper attacks against Chunghwa Telecom’s submarine cables. All choreographed in neat phases with named actors, named targets, named objectives. Dramatically coherent. Completely ungrounded. Garbage in, garbage out. Just with a structured analytical framework wrapped around it.

This is a core problem with language models. You cannot reliably force a model to ground its judgements exclusively in the provided material. It will infer with the same confidence with which it cites real sources, and there is no reliable mechanism to prevent that. The collection pipeline, the trust scores, the verbatim retrieval requirements: these are instructions to the model, not constraints on it.
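The comparison I did by hand, reading the report against the session logs, can at least be partly automated as a check that runs after the fact. The sketch below is a crude illustration, not something the tool contained and not a fix for the grounding problem. The function names, the ignore list, and the capitalized-word heuristic are all assumptions; it uses plain substring matching, so it can only show that a name in the report never appears anywhere in the collected excerpts.

```python
import re

# Hypothetical ignore list for capitalized words that are not named entities.
IGNORE = {"The", "This", "These", "We", "At", "As", "In", "If"}


def candidate_terms(text: str) -> set[str]:
    """Crude stand-in for entity extraction: capitalized words and acronyms like APT41."""
    tokens = re.findall(r"\b[A-Z][A-Za-z0-9]+\b", text)
    return {t for t in tokens if t not in IGNORE}


def ungrounded_terms(report: str, excerpts: list[str]) -> list[str]:
    """Return report terms that never occur in any collected excerpt (case-insensitive)."""
    corpus = " ".join(excerpts).lower()
    return sorted(t for t in candidate_terms(report) if t.lower() not in corpus)


if __name__ == "__main__":
    report = "APT41 will target Chunghwa Telecom."
    excerpts = ["State-sponsored actors have targeted telecom providers."]
    print(ungrounded_terms(report, excerpts))  # ['APT41', 'Chunghwa']
```

A term passing this check says nothing about whether the claim around it is supported; the check only surfaces names that were never collected at all, which is exactly the failure the logs showed here. It detects the symptom after generation and does not constrain the model while it writes.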
Conclusion

I built something that looked like it should work, ran it against a real intelligence question, and got a report that was structurally correct and analytically hollow. The problems showed up at every stage: the sub-questions were wrong, the collection plan was flawed, the collected intelligence was thin, and the analyst agents filled the gaps with their own training data rather than stopping to say they didn’t have enough to work with.

But I want to be careful about how firmly I land this. There are too many variables I didn’t control for. Maybe my prompts were not optimized. Perhaps my scaffolding was not sufficient. Maybe my model of choice (gemini-flash) was the wrong one. A more capable model running the same architecture might behave differently. Or it might not. I genuinely don’t know.

What I do think is that the core problems are hard in a way that better prompts alone won’t fix. But I could be wrong, and I’d rather find out than assume.

I’d genuinely like to know what you find: better prompts, a different model, a different scaffolding approach. If you get something that actually works, share it.

Thanks for reading