Alignment Research Blog
Informal updates from the OpenAI team
2026
Apr 6
Introducing the OpenAI Safety Fellowship
A pilot program to support independent safety and alignment research and develop the next generation of talent.
Mar 27
How far does alignment midtraining generalize?
Preliminary experiments on alignment and misalignment midtraining, reasoning posttraining, and generalization to chat and agentic evals.
Mar 25
Introducing Model Spec Evals
A new evaluation suite for measuring how well models follow the OpenAI Model Spec.
Mar 21
Training agents to self-report misbehavior
We train agents to call a reporting tool when they covertly misbehave, sharply reducing undetected attacks.
Mar 19
How we monitor internal coding agents for misalignment
A look at how we monitor internal coding agents to detect misaligned behavior early.
Mar 16
Metagaming matters for training, evaluation, and oversight
Metagaming can complicate how we interpret behavior, and current models still give us a chance to study it directly.
Mar 11
Interpreting Black Box Reward Models
ARGO distills black-box reward models into interpretable rubrics using reinforcement learning.
Feb 6
Discovering unknown AI misalignments in real-world usage
Reasoning models can discover and characterize unknown misaligned behaviors by analyzing how users respond.
Jan 14
CoVal: Learning values-aware rubrics from the crowd
An experimental dataset of crowd-written rubrics that surfaces why people prefer one model output over another.
Jan 14
Why we are excited about confessions
Deeper analysis of confession training and comparisons to chain-of-thought monitoring.
2025
Dec 22
Helpful assistant features suppress emergent misalignment
Emergent misalignment not only activates misaligned personas, but also suppresses helpful assistant personas.
Dec 18
Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations
A pipeline to uncover unknown misaligned behavior and scale the creation of realistic evaluations.
Dec 1
Debugging misaligned completions with sparse-autoencoder latent attribution
Efficiently finding the latent features that cause misaligned behaviors.
Dec 1
A Practical Approach to Verifying Code at Scale
We train and deploy an AI review agent optimized for precision and real-world use, enabling oversight to scale with autonomous code generation.
Dec 1
Hello World
Introducing our blog on alignment research.