#
Introducing MAI-Thinking-1 Today we are introducing MAI-Thinking-1, Microsoft AI’s reasoning model. It is a medium-sized model that stands among the strongest models in its weight class. It matches leading models on key software engineering benchmarks, demonstrates advanced mathematical reasoning capabilities, and is preferred to Sonnet 4.6 in our blind human side-by-side evaluations. We trained it from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models.
MAI-Thinking-1 is a step in our broader work to build towards Humanist Superintelligence: advanced AI capabilities designed to serve people and organizations, not to replace them. The model matters on both axes: what it can do, and how it was built.
The Hill-Climbing Machine #
More than a single model, we are excited to introduce our Hill-Climbing Machine: a co-designed pipeline built to make every component of model development climbable, so capabilities improve continually and reliably over time. The aim is a repeatable system that can absorb better data, stronger rewards, more capable environments, and more compute.
Three main pillars guide our philosophy.
First, capabilities should be learned, not inherited. Although faster to acquire, inherited intelligence lacks the steerability essential for real world usage: an imitator is fundamentally tied to the design choices of its teacher and struggles to adapt to new situations. MAI-Thinking-1 was trained without distillation from third party models, forcing our model to truly learn the tasks at hand.
Second, clean data. MAI-Thinking-1 was trained on clean and appropriately licensed data, with AI-generated content excluded from pre-training. This matters for quality, provenance, and control. If we cannot account for what shaped a model, we cannot fully understand its behavior or credibly improve it.
Third, self-sufficiency across the entire stack. All the way from co-design of our models with MSFT’s own accelerators through to our reinforcement learning framework, we have focused efforts on in-house training infrastructure. This is a crucial part of building our hill-climbing machine, to ensure we can fully optimize and shape our systems end-to-end to best serve our needs.
Medium-sized model, with strong software engineering performance #
MAI-Thinking-1 is a 35B-active, ~1T-total parameters, sparse Mixture of Experts model, a smaller inference footprint than much larger models. Despite this, our model is toe-to-toe with Claude Opus 4.6 on SWE-Bench Pro. That matters for developers and enterprises because model size determines where advanced coding assistance can be deployed, how often it can be used, and whether it can move from exceptional tasks into daily workflows.
We have invested heavily in the training environments needed for agentic coding. Each verified environment is deterministic, executable, and graded by real test suites. This gives the model practice on the kind of multi-step work developers actually do: reading code, editing files, running tests, observing failures, and recovering from intermediate mistakes.
Advanced mathematical reasoning capabilities #
MAI-Thinking-1 reaches 97.0% on AIME 2025, and 94.5% on AIME 2026, showing strong mathematical and scientific reasoning for its weight class. Strong performance here gives us confidence that our training loop can create real reasoning gains – climbing all the way from the ground up – from our own data, rewards, and evaluation process, enabling this intelligence to generalize to other domains over time.
Preferred in human side-by-sides vs. Sonnet 4.6 #
People care about whether a model understands the task, follows instructions, uses the right level of detail, writes clearly, and respects their time.
We built a blind side by side human evaluation with one of our partners, Surge, using their pool of professional raters to measure various models on these traits. This comprised of 1,276 evaluations designed to test the model’s capabilities in a large variety of tasks across both single-turn and multi-turn conversations, with a focus on measuring how helpful the response to the user is and whether it actually advances the user’s goals. In these evaluations, users preferred MAI-Thinking-1 over Claude Sonnet 4.6.
This has been a core focus of post-training. We want the model to be capable without being brittle, concise without being incomplete, and helpful without overreaching. Human preference data gives us a direct signal on whether benchmark improvements translate into better experiences for users.
Enterprise ready #
MAI-Thinking-1 is built with enterprise readiness in mind. It supports long context with a 256k token window (enough to fit a 600 page document), function calling, and the flexibility to add developer instructions. We trained the model to follow multiple layers of instructions and aligned its default style to enterprise needs. It’s compatible with the widely used Chat Completions API. All MAI models come with enterprise-grade security and compliance through Microsoft Foundry.
Results #
We report results in two views: post-trained MAI-Thinking-1 evaluations, and pre-training metrics for our base model.
**Table 1. MAI-Thinking-1 metrics ** Post-trained model evaluation results on public STEM and agentic coding benchmarks. Other model numbers are taken from respective official model cards. Scores are percentages unless otherwise noted; dashes indicate unavailable model values.
Table 2. Pre-training metrics
Putting humans first #
We are building towards Humanist Superintelligence: advanced AI capabilities designed to serve people and organizations, not replace them. Our models must remain subordinate technologies under human control with the goal of upholding human autonomy and being helpful. That means our models must not refuse legitimate requests under the guise of safety and compliance as then they are not truly serving humans.
Striking the delicate balance between being helpful and safe is not easy. For MAI-Thinking-1, we aimed to achieve this balance by treating unsafe compliance and unnecessary refusal as defects in the same reward construction where aggregation is based on severity of potential of harm. Safety is trained with the same reinforcement learning infrastructure used for capability, so safety rewards are part of the same hill-climbing loop ensuring safety is always aligned to the capabilities and not incidental.
As a result, we see that our model can balance ensuring a safety bar on sensitive unsafe requests while also being helpful on non-sensitive content.
#
Build the Future With Us
We’re a lean, fast-moving lab made up of some of the world’s most talented minds. We have an exciting roadmap of compute at MAI, which is ramping quickly and extensively. And we have an ambitious mission we truly believe in. We’re also fortunate to partner with incredible product teams giving our models the chance to reach billions of users and create immense positive impact. If you’re a brilliant, highly-ambitious and low ego individual, you’ll fit right in—come and join us as we work on our next generation of models!
#
Related Stories