Abstract
Generative Reward Models (GenRM) are a novel framework that combines RLHF and RLAIF to better align LLMs with human preferences, outperforming classical reward models by up to 45%. We introduce Chain-of-Thought Generative Reward Models (CoT-GenRM), a hybrid approach that combines the strengths of both paradigms with a crucial emphasis on explicit reasoning. Our STaR-DPO training method yields significant improvements on both in-distribution and out-of-distribution tasks.