Alignment Research Blog

Informal updates from the OpenAI team
2026
CoVal: Learning values-aware rubrics from the crowd
An experimental dataset of crowd-written rubrics that surfaces why people prefer one model output over another.
Why we are excited about confessions
Deeper analysis of confession training and comparisons to chain-of-thought monitoring.
2025
Helpful assistant features suppress emergent misalignment
Emergent misalignment not only activates misaligned personas, but also suppresses helpful assistant personas.
Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations
A pipeline to uncover unknown misaligned behavior and scale the creation of realistic evaluations.
Debugging misaligned completions with sparse-autoencoder latent attribution
Efficiently finding features that cause behaviors.
A Practical Approach to Verifying Code at Scale
We train and deploy an AI review agent optimized for precision and real-world use, enabling oversight to scale with autonomous code generation.
Hello World
Introducing our blog on alignment research.