We're a Cambridge, UK-based AI safety organisation that’s asking: how can we build the most robust alignment initialisations for capable LLMs?
We’re one of the few non-profit organisations positioned to answer this question empirically. We have the engineering experience, and now the compute, to conduct data-intensive interventions across the model training pipeline. This post lays out our research agenda and theory of change, and what we are looking for in technical hires. Applications are open here.
TLDR: Long-horizon capabilities RL may be the most critical source of misalignment. Misalignment instilled during capabilities RL may be difficult to remove afterwards. Geodesic Research’s mission is to develop the science of providing robustly-aligned initialisations for RL, where alignment priors persist through the remainder of training.
Our seminal work on alignment pretraining showed that you can bake alignment priors into base models. Frontier labs are now using these techniques in production: for example, Anthropic's recent work heavily leans on improving alignment priors. But it’s clear that, in the face of production post-training, alignment pretraining is not a one-size-fits-all solution. So now, we are framing pre- and midtraining interventions within the rest of the model training stack.
The evidence points towards extended reinforcement learning being a likely cause of alignment failures at the frontier. RL is liable to select for undesired cognitive and behavioural habits, such as metagaming, sycophancy, apparent-success seeking, or taking unsanctioned actions to complete tasks. Models that learn these behaviours may also become broadly misaligned. In fact, these degradations have already been noted in replications of alignment pretraining, and Evan Hubinger lists this as one of the core reasons alignment remains a hard and unsolved problem. Apollo Research's recent update makes a similar diagnosis; they are now studying whether misalignment scales unfavourably with RL. Once a model has learned misaligned behaviours or goals, they may be difficult to remove with subsequent training, and more advanced models may be able to guard them from removal. For this reason, we believe it’s important to avoid their formation in the first place.
Our current research focus is on building a **robust initialisation for alignment.** What we believe the field has not yet seriously tested [1] is the size of the lever that comes before capabilities RL. Specifically: how far does a good initialisation, built through midtraining and early post-training, get us in resisting these failure modes? Can we dig out a
Concretely, we are looking for midtraining and warm-start SFT [2] mixes to create models that: (i) avoid exploring into To study this question directly, we are beginning a stress-testing exercise of various alignment techniques as robust initialisations for capabilities RL. We replicate the midtraining and SFT stack on large open-weights base models [4], then subject them to agentic production RL to study their resistance to misalignment. Since
Our theory of change is centred on our impact on the training practices used by the frontier labs. We are focused on alignment in short-timeline worlds, in which these major players have an outsized influence on the future of humanity.
This shapes the research projects we choose: we investigate simple, data- and compute-heavy interventions that can be profiled, packaged, and handed off to the labs; we take the shortest path [5] to advising on their training stacks.
Our work is enabled by a generous philanthropic grant made by Coefficient Giving (pending final logistics). This grant provides:
Our founding team consists of:
We will soon be joined by Nathalie Kirch and Nathaniel Mitrani as members of technical staff and are hiring 4 further MTS. **Applications are open here.** Broadly, we are excited about candidates with significant ML engineering and research experience who can make rapid empirical research progress and help shape our broader research agenda through their own inside views on alignment.
We are advised by Tomek Korbak (OpenAI), Alex Turner (Google DeepMind), and Alex Cloud (Anthropic). David Demtri Africa acts as our research sponsor for the UK AISI. This team helps shape our research directions and experiment design. Concretely, this looks like sharing the salient alignment threat models they and their colleagues have, and how to design experiments that scale (up and down) such that they’re sufficiently persuasive to be picked up and studied in-house. We believe that well-executed collaborations with researchers operating at the frontier can enable us to conduct useful research even when external to these organisations.
Q: Where do you source your compute from?
A: We have access to the Isambard supercomputer, a cluster of ~5k GH200s that supports UK-based research organisations. We also plan to acquire a supplementary cluster in the coming months.
Q: Are members of technical staff required to work from Cambridge?
A: We have a strong preference for our team working out of Cambridge. As an early-stage organisation, we’ve found that in-person collaboration gives a substantial uplift to productivity. We may be open to remote or hybrid roles in exceptional circumstances.
Q: Why don’t you work at the frontier labs?
A: We considered this! We decided to remain independent because we think this is where we’ll have the greatest counterfactual impact. We think it is important to conduct ambitious alignment research openly. This allows us to share all the details of our research and provide public research artefacts that are useful to both alignment researchers within labs and to the broader community. We think that, if we execute well, the net benefit of openness and transparency outweighs the headwinds of operating outside of the labs. That said, this is a genuine challenge. We aim to address it by actively engaging with frontier lab researchers and seeking their input on our research directions. This has let us shape our approach to empirical research, and our research taste, with input from frontier researchers and without access to non-public information. If you happen to be working on alignment research at a frontier lab and have research ideas you would like us to conduct in the open, please get in touch!
Q: How excited are you at this stage to pursue a wide portfolio approach of agendas, such as mechanistic interpretability and red-teaming?
A: We are inspired by the General Manager / DRI framing. That is, we view the Geodesic mission as solving a specific problem: how to provide the most robust alignment initialisation for RL post-training. We expect that all directions we pursue will be guided by this mission. Therefore, we expect Geodesic to have a relatively narrow focus in the short term.
We’d like to thank our parent organisation Meridian Impact, namely Hannes Whittingham, Adam Reynolds, and Olivia Benoit, for their help in starting up and, now, in spinning out as our own entity.
JueYan Zhang and the AGI Safety Tactical Opportunities Fund provided essential seed funding. AISTOF’s support enabled us to hire Alexandra and Kyle, which was crucial for scaling our research and securing longer-term funding.
The ERA:AI fellowship connected Kyle and Nathalie to the rest of the team, and CAISH hosted the talk where Nathaniel met Puria.
Teaching Claude Why touches on the possible benefits of preparing a good initialisation for RL, but does not emulate an adversarial RL pipeline or describe the research in enough detail for replication.
We’re moving away from full pretraining runs. In alignment pretraining, we found that midtraining can do the job, and unlocks larger, more capable open-weights base models for us to build on. Generally, our research will focus on any stage of training that provides fully off-policy process supervision, which supports finer control over the behaviours and motivations distilled into the model. We also received advice from frontier labs that it is more tractable to integrate midtraining interventions than pretraining interventions, which require training models from scratch.
Here we specifically mean misaligned cognitive patterns that are likely to be rewarded or selected for. These include metagaming, training-gaming (both terminal and instrumental), or broader fitness-seeking.
We’re currently mostly using Nemotron 3 Super, a 120B-A12B MoE model with capabilities roughly equivalent to o4-mini.