Emergent Alignment

Researchers have developed a method called Emergent Alignment that lets large language models self-correct unethical outputs by adding a conscience step to the model and an alignment term, based on Direct Preference Optimization, to the training loss. The technique needs no external judge, relying instead on a frozen copy of the model itself, and the authors report empirically that it steers training toward ethical behavior in the same code-hacking fine-tuning scenario that previously produced Emergent Misalignment.

arXiv:2606.19527v1. Abstract: Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics? And can they self-correct? We endow an LLM with a conscience step that reviews its own reasoning and outputs, and we extend the training loss with an alignment component using Direct Preference Optimization (DPO) to steer the model away from non-ethical outputs. The result is an online technique to align models in a wide range of applications: training, fine-tuning, adversarial prompting, and zero-shot learning. It does not require a weaker or stronger judge, relying instead on a frozen copy of itself. In previous work, the Emergent Misalignment scenario showed a range of emergent unethical behaviors from fine-tuning the model to hack code. Instead, we empirically show how to achieve Emergent Alignment: a single high-level introspective question steers training toward an ethical model under the same code hacking scenario.
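
To make the idea of "extending the training loss with an alignment component" concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, added to a task loss. It is not the paper's implementation: the abstract does not give the exact formulation, so the function names, the `alignment_weight` factor, the additive combination, and the assumption that the conscience step supplies a preferred and a dispreferred output are all hypothetical. The only detail taken from the abstract is that the reference model is a frozen copy of the model being trained.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response.
    The "ref" values come from the frozen reference model, which here
    is assumed to be a frozen copy of the model being trained.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favors the chosen response more strongly.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


def combined_loss(
    task_loss: float, alignment_loss: float, alignment_weight: float = 1.0
) -> float:
    """Hypothetical combination: task loss plus a weighted alignment term."""
    return task_loss + alignment_weight * alignment_loss


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    align = dpo_loss(-4.0, -8.0, -5.0, -7.0)
    print(combined_loss(task_loss=2.0, alignment_loss=align, alignment_weight=0.5))
```

When the policy and the frozen reference assign identical log-probabilities, the alignment term equals log 2, so a freshly copied model starts from a neutral value and the term shrinks only as training shifts probability toward the preferred output.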