
Rafael Mitkov Rafailov
Research Scientist · Stanford
Rafael Mitkov Rafailov, the “DPO guy,” is a machine learning researcher and practitioner best known for inventing Direct Preference Optimization (DPO). Since its release, DPO has redefined how large language models are aligned with human and synthetic feedback, replacing heavyweight RLHF pipelines with a simple, scalable alternative. DPO is now at the core of instruction-tuned models from leading labs, such as Meta’s LLaMA and Mistral’s open-weight releases. Beyond DPO, Rafael’s research covers reinforcement learning, imitation learning, and preference modeling, where he develops alignment methods that are both theoretically elegant and deployable at scale.
















