Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

Researchers introduced MechSim, a neuro-symbolic reasoning framework that enables large language model agents to analyze the mechanisms and assumptions underlying scientific simulators rather than treating them as black boxes. The framework represents simulators through a structured schema of variables, dependencies, and execution traces, allowing LLMs to generate evidence-grounded explanations linking outcomes to their underlying mechanisms. In evaluations across high-stakes domains, MechSim improved mechanism-level explanation quality, simulator analysis, and the reliability of downstream decision-making.

arXiv:2606.04505v1

Abstract: Scientific simulators are increasingly being integrated into LLM-driven systems for high-stakes simulation-driven decision-making. However, existing frameworks primarily use LLMs to generate, calibrate, or execute simulators, treating them as black-box interfaces rather than as structured mechanistic systems that can be reasoned about. As a result, current approaches lack the ability to identify, represent, and reason about the assumptions and mechanisms underlying simulator behavior, limiting transparency, auditability, and decision justification. We introduce MechSim, a mechanism-grounded neuro-symbolic reasoning framework for executable scientific simulators. Unlike prior neuro-symbolic approaches that primarily reason over static symbolic structures, MechSim enables LLM agents to reason about the mechanisms, assumptions, and execution behavior of scientific simulators. Our framework represents simulators through a shared structured schema capturing assumptions, variables, mechanism dependencies, and execution traces. On top of this representation, LLM agents operate as constrained reasoning engines that generate structured, evidence-grounded explanations linking simulator outcomes to their underlying mechanisms. We evaluate our approach across multiple high-stakes domains and show that it improves mechanism-level explanation quality, simulator analysis, and downstream decision-making reliability.
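The abstract names four schema components (assumptions, variables, mechanism dependencies, execution traces) but does not give their concrete form. The following is a minimal sketch of what such a schema might look like, not MechSim's actual representation. Every class, field, and method name here (`Mechanism`, `SimulatorSchema`, `upstream`, `evidence`) is hypothetical, and the toy epidemic simulator and its numbers are invented purely for illustration. The idea shown is that once dependencies are explicit, an outcome can be traced back to the mechanisms and assumptions it rests on, which is the kind of evidence a constrained LLM agent could be asked to cite.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mechanism:
    """One simulator mechanism: reads input variables, writes one output variable."""
    name: str
    inputs: tuple[str, ...]
    output: str
    assumptions: tuple[str, ...] = ()  # ids of the assumptions this mechanism relies on


@dataclass
class SimulatorSchema:
    """Hypothetical structured view of a simulator (not the paper's schema)."""
    variables: dict[str, str]      # variable name -> description
    assumptions: dict[str, str]    # assumption id -> statement
    mechanisms: list[Mechanism]
    trace: list[dict[str, float]] = field(default_factory=list)  # one record per executed step

    def upstream(self, outcome: str) -> list[Mechanism]:
        """Mechanisms the outcome depends on, nearest first (breadth-first walk)."""
        by_output = {m.output: m for m in self.mechanisms}
        ordered, seen, frontier = [], set(), [outcome]
        while frontier:
            mech = by_output.get(frontier.pop(0))
            if mech is None or mech.name in seen:
                continue  # exogenous variable, or mechanism already visited
            seen.add(mech.name)
            ordered.append(mech)
            frontier.extend(mech.inputs)
        return ordered

    def evidence(self, outcome: str) -> dict:
        """Bundle the mechanisms, assumptions, and traced value behind an outcome."""
        mechs = self.upstream(outcome)
        ids = sorted({a for m in mechs for a in m.assumptions})
        return {
            "outcome": outcome,
            "final_value": self.trace[-1][outcome] if self.trace else None,
            "mechanisms": [m.name for m in mechs],
            "assumptions": {a: self.assumptions[a] for a in ids},
        }


# Toy epidemic simulator, invented for illustration only.
schema = SimulatorSchema(
    variables={
        "contact_rate": "average contacts per person per day",
        "infected": "currently infected individuals",
        "new_infections": "infections generated in one step",
        "hospital_load": "patients requiring a hospital bed",
    },
    assumptions={
        "A1": "the population mixes homogeneously",
        "A2": "a fixed fraction of new infections is hospitalized",
    },
    mechanisms=[
        Mechanism("transmission", ("contact_rate", "infected"), "new_infections", ("A1",)),
        Mechanism("hospitalization", ("new_infections",), "hospital_load", ("A2",)),
    ],
    trace=[
        {"new_infections": 50.0, "hospital_load": 5.0},
        {"new_infections": 80.0, "hospital_load": 8.0},
    ],
)

report = schema.evidence("hospital_load")
print(report)
```

In this sketch, `report` lists `hospitalization` before `transmission` and includes both A1 and A2, so an explanation of the hospital load would have to acknowledge the homogeneous-mixing assumption even though it enters only indirectly. How MechSim itself constrains the LLM agent over such a structure is not specified in the abstract.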