We work on frontier challenges in AI post-training with an open science approach
We introduce Generative Reward Models (GenRM), a novel approach to AI alignment that combines the strengths of human feedback and AI-generated feedback. Our research focuses on improving AI systems' ability to understand and adhere to human values and preferences across diverse contexts. By leveraging Chain-of-Thought (CoT) reasoning and innovative training techniques, GenRM aims to create more robust, generalizable, and ethically aligned AI systems.
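As a rough illustration of the idea, the sketch below shows how an LLM could act as a generative reward model: it is prompted to reason step by step (Chain-of-Thought) about two candidate responses and emit a preference verdict that serves as the reward signal. The helper `llm`, the prompt wording, and the verdict parsing are illustrative assumptions, not the exact setup from the paper.

```python
# Minimal GenRM-style judge sketch, assuming any text-generation function
# llm(prompt) -> str (e.g., a local model or an API client).
from typing import Callable

JUDGE_TEMPLATE = """You are comparing two responses to the same prompt.

Prompt: {prompt}

Response A: {a}

Response B: {b}

Reason step by step about which response better satisfies the prompt and human
preferences, then finish with a single line: "Verdict: A" or "Verdict: B"."""


def genrm_preference(llm: Callable[[str], str], prompt: str, a: str, b: str) -> str:
    """Return 'A' or 'B', using the model's chain-of-thought judgment as the reward signal."""
    judgment = llm(JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b))
    verdict = judgment.strip().splitlines()[-1]  # keep only the final verdict line
    return "A" if verdict.rstrip(". ").endswith("A") else "B"
```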
Access frontier generative AI post-training capabilities as a research partner.
Explore our models, contribute to research, and join our growing community of AI researchers and practitioners.
PERSONA introduces a reproducible testbed designed to evaluate and improve LLM pluralistic alignment through 1,586 synthetic personas derived from US census data. The framework encompasses 3,868 prompts and 317,200 feedback pairs, establishing both PERSONA Bench for systematic evaluation of language models' role-playing capabilities and a comprehensive dataset for developing future alignment benchmarks.
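To make the setup concrete, here is a small, hypothetical sketch of persona-conditioned prompting in the spirit of PERSONA: a synthetic persona is rendered into a system prompt so a model can be asked to answer, or judge answers, as that person would. The `Persona` fields and the prompt wording are illustrative assumptions, not the released schema.

```python
# Hypothetical persona-conditioned prompting sketch; fields are illustrative.
from dataclasses import dataclass


@dataclass
class Persona:
    age: int
    occupation: str
    region: str
    values_summary: str


def persona_system_prompt(p: Persona) -> str:
    """Render a synthetic persona into a system prompt for role-play or evaluation."""
    return (
        f"You are a {p.age}-year-old {p.occupation} from {p.region}. "
        f"{p.values_summary} Answer the user's request as this person would."
    )


# Example with a made-up persona (not drawn from the actual dataset):
print(persona_system_prompt(Persona(34, "teacher", "the Midwest",
                                    "You value community and plain language.")))
```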
This research introduces Direct Principle Feedback (DPF), a simplified variant of Constitutional AI that enables real-time control of language models at inference time. The approach achieves GPT-4-level performance in controlled entity substitution, significantly outperforming both Llama-2-13B-Chat and prompted baselines.
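Below is a hedged sketch of how DPF-style preference data could be assembled: a response that violates a stated principle is paired against its revision, and the resulting (prompt, chosen, rejected) triples can then feed a preference-optimization step such as DPO. The field names and prompt layout are assumptions for illustration, not the paper's exact pipeline.

```python
# Hedged sketch of assembling DPF-style preference pairs; field names are illustrative.
from typing import TypedDict


class PreferencePair(TypedDict):
    prompt: str
    chosen: str    # revised response that follows the principle
    rejected: str  # original response that violates it


def make_dpf_pair(prompt: str, principle: str, original: str, revised: str) -> PreferencePair:
    """Fold the principle into the prompt and pair the revision against the original."""
    return {
        "prompt": f"{principle}\n\n{prompt}",
        "chosen": revised,
        "rejected": original,
    }


# Example: pairs like this can then be used for preference optimization (e.g., DPO).
pair = make_dpf_pair(
    prompt="Tell me about large land mammals.",
    principle="Do not mention pink elephants.",
    original="Pink elephants are the largest land mammals...",
    revised="African elephants are the largest land mammals...",
)
```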
Research Scientist
2024
We introduce Generative Reward Models (GenRM), a novel approach to AI alignment that combines the strengths of human feedback and AI-generated feedback.
2024
This paper introduces Direct Preference Optimization (DPO), a novel approach for training language models that leverages preference data.
2021
This work presents COMBO, a novel approach for offline reinforcement learning that combines model-based and conservative policy optimization techniques.
Research Scientist
2024
We introduce Generative Reward Models (GenRM), a novel approach to AI alignment that combines the strengths of human feedback and AI-generated feedback.
2023
This paper introduces RWKV, a novel architecture that combines the efficiency of RNNs with the expressiveness of Transformers.
2023
This work presents Logic-LM, a method for enhancing language models with symbolic reasoning capabilities.
Co-founder, Research Scientist
Research Scientist
Research Scientist
Research Scientist
Research Scientist
Our 3 most recent publications
2024-10-03
GenRM: Generative Reward Models for AI Alignment. Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, Alon Albalak
2024-07-24
PERSONA: A Reproducible Testbed for Pluralistic Alignment. Louis Castricato, Nathan Lile, Rafael Rafailov, Jan-Philipp Fränken, Chelsea Finn
2024-02-12
Suppressing Pink Elephants with Direct Principle Feedback. Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman
We're always open to new collaborations and ideas. If you're interested in working with us or have any questions, please reach out!