Persona Evaluation

persona-bench: An Evaluation Harness for Personalization & Reproducible Pluralistic Alignment

Human vs AI Personalization Challenge

You'll be competing against a frontier AI model in crafting personalized responses.

Disclaimer: Some questions may touch on sensitive topics. Please engage thoughtfully and respectfully. If you feel uncomfortable with any question, feel free to skip it.

Current Language Models' Ability to Successfully Personalize for a Known Demographic Varies Widely

[Interactive leaderboard: results can be filtered and sorted by model, method, chart type, group, and metric.]
Want to see how your model performs?

Prompt Evaluation Tool

Evaluate your prompt performance.

API Usage

API usage is metered against a 1,000,000-token quota.
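Because usage is metered in tokens, it can help to estimate a prompt's token cost client-side before submitting. A minimal sketch using the tiktoken library (the choice of a cl100k_base-style encoding is an assumption for illustration; the tokenizer the API actually meters with may differ):

```python
import tiktoken

# Assumption: the quota is counted with a cl100k_base-style tokenizer;
# the actual server-side tokenizer may differ.
QUOTA = 1_000_000

def tokens_remaining(used: int, prompt: str) -> int:
    """Estimate how much of the quota is left after sending `prompt`."""
    enc = tiktoken.get_encoding("cl100k_base")
    cost = len(enc.encode(prompt))
    return QUOTA - used - cost

print(tokens_remaining(used=0, prompt="Rewrite this answer for the persona above."))
```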

Evaluation Type: comparison

This API-exclusive mode offers the most rigorous, grounded evaluation. Use it when you need high-confidence results, such as benchmarking against known standards or critical applications where personalization accuracy is paramount. Scores are centered at 1/2: a score below 1/2 means your prompt performs worse than a human-curated set of LLM responses, and a score above 1/2 means it performs better.
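As a rough sketch of how a score centered at 1/2 can arise (the `judge` function and names below are hypothetical illustrations, not persona-bench's actual API): compare each candidate response pairwise against the human-curated reference responses and report the fraction of comparisons won, so 1/2 marks parity with the references.

```python
from typing import Callable, Sequence

def comparison_score(
    candidate: str,
    references: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of pairwise comparisons the candidate wins.

    `judge(a, b)` is a hypothetical preference function returning True
    when response `a` is judged more personalized than response `b`.
    A score of 0.5 means parity with the human-curated references.
    """
    wins = sum(judge(candidate, ref) for ref in references)
    return wins / len(references)
```

Under this scheme a judge with no preference yields an expected score of 1/2, matching the baseline described above.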

Insert Persona Attributes:
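For illustration, persona attributes might be supplied as structured key-value pairs like the following (the field names and values are assumptions for this sketch, not the harness's required schema):

```python
# Hypothetical persona-attribute payload; field names are illustrative,
# not persona-bench's required schema.
persona = {
    "age": 34,
    "occupation": "nurse",
    "region": "Midwest, United States",
    "education": "bachelor's degree",
    "interests": ["gardening", "true crime podcasts"],
}
```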

Sign Up to Use the Evaluation Tool

Create an account to access the full features of the Evaluation Tool; new users receive complimentary credits to get started.