Post-training Research

We work on frontier challenges in AI post-training with an open science approach

Latest Research

GenRM: Generative Reward Models for AI Alignment

We introduce Generative Reward Models (GenRM), a novel approach to AI alignment that combines the strengths of human feedback and AI-generated feedback. Our research focuses on improving AI systems' ability to understand and adhere to human values and preferences across diverse contexts. By leveraging Chain-of-Thought (CoT) reasoning and innovative training techniques, GenRM aims to create more robust, generalizable, and ethically aligned AI systems.
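To make the idea concrete, here is a minimal, hypothetical sketch of how a generative reward model can express a preference: instead of producing a scalar score from a reward head, an instruction-tuned LLM is prompted to reason step by step and emit a verdict, which is then parsed from its generated text. The generate_fn callable, prompt template, and parsing rule are illustrative assumptions, not the released GenRM implementation.

```python
# Illustrative sketch of a GenRM-style generative judgment.
# `generate_fn` is a hypothetical callable (prompt text -> completion text)
# standing in for any instruction-tuned LLM.

JUDGE_TEMPLATE = """You are evaluating two responses to the same prompt.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Think step by step about which response better reflects human values and
preferences, then end with a final line "Verdict: A" or "Verdict: B"."""


def generative_reward(generate_fn, prompt, response_a, response_b):
    """Return the preferred label ('A' or 'B') plus the judge's chain-of-thought,
    with the preference read from generated text rather than a scalar head."""
    judgment = generate_fn(JUDGE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b))
    verdict = "A" if "Verdict: A" in judgment else "B"
    return verdict, judgment
```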

Get Early Access

Access frontier generative AI post-training capabilities as a research partner.

Open Source

Explore our models, contribute to research, and join our growing community of AI researchers and practitioners.

PERSONA: A Reproducible Testbed for Pluralistic Alignment

PERSONA introduces a reproducible testbed designed to evaluate and improve LLM pluralistic alignment through 1,586 synthetic personas derived from US census data. The framework encompasses 3,868 prompts and 317,200 feedback pairs, establishing both PERSONA Bench for systematic evaluation of language models' role-playing capabilities and a comprehensive dataset for developing future alignment benchmarks.
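For intuition, the sketch below shows one way persona-conditioned evaluation could be wired up: a synthetic persona is rendered as a system prompt and paired with an evaluation question, so a model's role-playing of that persona's preferences can be judged. The attribute fields and prompt wording are assumptions for illustration, not the actual PERSONA schema.

```python
# Illustrative persona-conditioned evaluation prompt, in the spirit of PERSONA.
from dataclasses import dataclass


@dataclass
class Persona:
    """A synthetic persona; these fields are illustrative, not the released schema."""
    age: int
    occupation: str
    region: str
    views: str


def persona_system_prompt(persona: Persona) -> str:
    """Render a persona as a system prompt for role-play evaluation."""
    return (
        f"Adopt the perspective of a {persona.age}-year-old {persona.occupation} "
        f"from {persona.region} who holds {persona.views} views. "
        "Answer every question as this person would."
    )


# Example usage: pair the persona prompt with an evaluation question, then
# collect feedback pairs on the model's answers.
persona = Persona(age=42, occupation="nurse", region="the Midwest", views="moderate")
eval_prompt = (persona_system_prompt(persona)
               + "\n\nQuestion: Should cities invest more in public transit?")
```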

Suppressing Pink Elephants with Direct Principle Feedback

This research introduces Direct Principle Feedback (DPF), a simplified variant of Constitutional AI that enables real-time control of language models at inference time. The approach achieves GPT-4-level performance in controlled entity substitution, significantly outperforming both Llama-2-13B-Chat and prompted baselines.
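As a rough, hypothetical sketch of the DPF idea: a response is revised directly against a single principle, and the (original, revision) pair is used as a DPO-style preference pair, skipping the separate critique stage of full Constitutional AI. The generate_fn callable and prompt wording below are assumptions for illustration, not the paper's exact pipeline.

```python
# Illustrative construction of a preference pair in the spirit of
# Direct Principle Feedback (DPF). `generate_fn` is a hypothetical LLM callable.

def dpf_preference_pair(generate_fn, user_prompt, principle, original_response):
    """Build a DPO-style preference pair directly from a principle:
    chosen = principle-compliant revision, rejected = original response."""
    revision = generate_fn(
        f"Principle: {principle}\n"
        f"User prompt: {user_prompt}\n"
        f"Original response: {original_response}\n"
        "Rewrite the response so it fully respects the principle, changing "
        "as little else as possible."
    )
    return {"prompt": user_prompt, "chosen": revision, "rejected": original_response}
```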

Join the team

Research Team

Rafael Mitkov Rafailov

Research Scientist

Selected Work

Alon Albalak

Research Scientist

Selected Work

Collaborators

EleutherAI · Stanford University

Recent Publications

Our three most recent publications

2024-10-03

GenRM: Generative Reward Models for AI Alignment

Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak

2024-07-24

PERSONA: A Reproducible Testbed for Pluralistic Alignment

Louis Castricato, Nathan Lile, Rafael Rafailov, Jan-Philipp Fränken, Chelsea Finn

2024-02-12

Suppressing Pink Elephants with Direct Principle Feedback

Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman


Interested in Collaboration?

We're always open to new collaborations and ideas. If you're interested in working with us or have any questions, please reach out!