Alignment Research Blog

Informal updates from the OpenAI team
2026
CoVal: Learning values-aware rubrics from the crowd
An experimental dataset of crowd-written rubrics that surfaces why people prefer one model output over another.
Why we are excited about confessions
Deeper analysis of confession training and comparisons to chain-of-thought monitoring.
2025
Helpful assistant features suppress emergent misalignment
Emergent misalignment not only activates misaligned personas, but also suppresses helpful assistant personas.
Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations
A pipeline to uncover unknown misaligned behavior and scale the creation of realistic evaluations.
Debugging misaligned completions with sparse-autoencoder latent attribution
Efficiently finding features that cause behaviors.
A Practical Approach to Verifying Code at Scale
We train and deploy an AI review agent optimized for precision and real-world use, enabling oversight to scale with autonomous code generation.
Hello World
Introducing our blog on alignment research.