Suppressing Pink Elephants with Direct Principle Feedback
The Pink Elephant Problem
Large Language Models (LLMs) often struggle with a paradoxical behavior we call the “Pink Elephant Problem” - when explicitly instructed not to mention a specific topic, they frequently do the opposite and bring up that very topic. This phenomenon mirrors the psychological challenge of trying not to think about something once it's been mentioned. For AI systems deployed in real-world applications, this represents a critical controllability failure that undermines user trust and system reliability.
When users provide negative constraints - instructions about what not to discuss - models often fail spectacularly. This occurs even with state-of-the-art models: we found that baseline instruction-tuned models like OpenHermes either showed no improvement or actually became more likely to mention forbidden topics when explicitly told to avoid them.
This problem is particularly acute because it affects inference-time controllability - the ability to dynamically impose new behavioral constraints without retraining the model. Unlike static safety measures built into models during training, users often need flexible control over what topics to avoid based on context, cultural sensitivities, or specific use cases.
Direct Principle Feedback: A Simplified Approach
We developed Direct Principle Feedback (DPF) as a streamlined approach to Constitutional AI that directly addresses the Pink Elephant Problem. DPF simplifies the traditional Constitutional AI pipeline by eliminating the intermediate ranking step and directly applying preference optimization to original and revised response pairs.
Methodology Comparison
Where traditional Constitutional AI requires a complex multi-step pipeline with candidate generation and ranking, Direct Principle Feedback applies preference optimization directly to critique-revision pairs. This cuts the number of pipeline steps roughly in half while achieving comparable performance.
The traditional Constitutional AI process involves multiple steps: generating responses, critiquing them, creating revisions, generating multiple candidates, ranking them, and finally applying preference learning. DPF cuts through this complexity by treating the original response as the “dispreferred” example and the revised response as the “preferred” example, then directly applying Direct Preference Optimization (DPO).
This approach leverages the inherent preference structure in the critique-and-revision process. Since the revision explicitly addresses the problematic aspects of the original response according to a specific principle (avoiding the Pink Elephant), it naturally represents the desired behavior without requiring additional ranking steps.
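The per-pair objective DPF inherits from DPO can be written out directly. The sketch below is a minimal illustration, not the paper's training code: the revised response plays the "chosen" role, the original response the "rejected" one, and `beta` is the usual KL-strength hyperparameter.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    In DPF the revised response is 'chosen' and the original response
    (which mentioned the Pink Elephant) is 'rejected'; each log-prob is
    the summed token log-likelihood of that response under the policy
    being trained or under a frozen reference model.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): shrinks as the policy learns to prefer the revision
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical policy and reference log-probs the margin is zero and the loss is log 2; raising the policy's log-probability on the revision relative to the original drives the loss down.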
Synthetic Dataset Generation at Scale
A crucial component of our research was generating a large-scale synthetic dataset of 162,000 multi-turn conversations spanning 29 diverse domains. This dataset was designed to capture natural conversations where chatbots fail to avoid specific topics, along with improved versions that successfully redirect conversations.
Our generation process involved several sophisticated steps:
Topic and Entity Pair Creation
We used GPT-4 to generate 200 diverse conversation topics and approximately 2,500 "Pink Elephant - Grey Elephant" pairs. These pairs consist of related but distinct entities where one should be avoided (Pink Elephant) while the other represents an acceptable alternative (Grey Elephant).
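A minimal way to represent such pairs is sketched below. The example entities are invented for illustration (not drawn from the released data), and `avoidance_instruction` is a hypothetical helper showing how a pair becomes an inference-time constraint.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ElephantPair:
    topic: str
    pink: str   # entity the model must avoid
    grey: str   # acceptable stand-in

# Invented examples in the spirit of the dataset:
PAIRS = [
    ElephantPair("web frameworks", pink="Django", grey="Flask"),
    ElephantPair("soft drinks", pink="Coca-Cola", grey="Pepsi"),
    ElephantPair("streaming services", pink="Netflix", grey="Hulu"),
]

def avoidance_instruction(pair: ElephantPair) -> str:
    """Turn a pair into a system-prompt fragment imposing the constraint."""
    return (f"Discuss {pair.topic} with the user. Do not mention "
            f"{pair.pink}; if it comes up, steer toward {pair.grey} instead.")
```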
Dialogue Planning and Generation
Rather than directly generating conversations, we employed a two-step process using StableBeluga2-70B: first we generated a detailed plan for how each conversation would naturally evolve from discussing the Grey Elephant to inappropriately mentioning the Pink Elephant, and then we generated the dialogue itself by following that plan.
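The plan-then-generate flow can be sketched as follows. The prompt templates here are paraphrases for illustration, not the exact prompts used with StableBeluga2-70B, and `llm` stands in for any chat-completion call.

```python
# Two-step generation: plan the drift first, then realize the dialogue.
# `llm` is any callable that maps a prompt string to a completion string.

PLAN_PROMPT = (
    "Plan a {n_turns}-turn conversation about {topic} that starts by "
    "discussing {grey} and drifts, in the final assistant turn only, "
    "into mentioning {pink}. List one line per turn."
)

DIALOGUE_PROMPT = (
    "Write the full conversation following this plan, with alternating "
    "User/Assistant turns:\n{plan}"
)

def generate_dialogue(llm, topic, pink, grey, n_turns=6):
    plan = llm(PLAN_PROMPT.format(n_turns=n_turns, topic=topic,
                                  grey=grey, pink=pink))
    return llm(DIALOGUE_PROMPT.format(plan=plan))
```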
Critique and Revision
For each problematic dialogue, our system generated both a critique explaining why mentioning the Pink Elephant was inappropriate and a revision that successfully avoided the forbidden topic while maintaining conversational flow.
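Because the revision is preferred over the original by construction, packaging a critique-and-revision outcome as a preference example needs no ranking step. A minimal sketch (the field names are illustrative, not a fixed schema):

```python
def to_preference_example(dialogue_history: str,
                          original_final: str,
                          revised_final: str) -> dict:
    """Package one critique-and-revision outcome for preference optimization.

    The conversation up to the final turn is the prompt; the original
    final turn (which mentioned the Pink Elephant) is 'rejected' and
    the revised final turn is 'chosen'.
    """
    return {
        "prompt": dialogue_history,
        "chosen": revised_final,
        "rejected": original_final,
    }
```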
Quality Control
The dataset underwent rigorous filtering using multiple similarity metrics to ensure quality. We excluded dialogues if they mentioned the Pink Elephant before the final turn, failed to mention it in the original final turn, or still mentioned it after revision.
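The exclusion rules above translate directly into a filter. This sketch uses plain substring matching as a stand-in for the similarity metrics used in the actual pipeline:

```python
def keep_dialogue(turns: list[str], pink: str, revised_final: str) -> bool:
    """Apply the three exclusion rules to one dialogue (simplified).

    `turns` is the original dialogue with the problematic assistant
    reply as its last element; `revised_final` is the rewritten reply.
    """
    p = pink.lower()
    mentioned_early = any(p in t.lower() for t in turns[:-1])
    mentioned_in_final = p in turns[-1].lower()
    still_in_revision = p in revised_final.lower()
    # Keep only if the Pink Elephant appears exactly where intended:
    # in the original final turn, nowhere earlier, and not after revision.
    return (not mentioned_early) and mentioned_in_final and (not still_in_revision)
```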
Results: Achieving GPT-4 Level Performance
Our results demonstrated significant improvements in Pink Elephant avoidance while maintaining general capabilities, achieving performance comparable to GPT-4.
We fine-tuned OpenHermes 7B and 13B models using DPO on our synthetic dataset. The results were striking:
Model Performance Comparison
| Model | Base Score | With Prompt | Delta (↑) |
|---|---|---|---|
| GPT-4 (baseline) | 0.33 | 0.13 | +0.20 |
| OpenHermes-13B w/ DPF | 0.34 | 0.15 | +0.19 |
| OpenHermes-7B w/ DPF | 0.34 | 0.17 | +0.17 |
| Llama-2-13B-Chat | 0.33 | 0.25 | +0.08 |
| OpenHermes-13B | 0.34 | 0.34 | 0.00 |
| OpenHermes-7B | 0.33 | 0.36 | -0.03 |

Scores are the rate at which the model mentions the forbidden Pink Elephant (lower is better); Delta is the reduction achieved by adding the avoidance prompt.
These results demonstrate that DPF-trained models achieve performance comparable to GPT-4 on the Pink Elephant avoidance task, representing a significant improvement over baseline instruction-tuned models.
Qualitative analysis revealed that DPF-trained models produced more coherent and natural redirections when avoiding forbidden topics, gracefully steering conversations toward acceptable alternatives rather than simply refusing to engage.
Broader Implications and Future Directions
This research addresses critical needs in LLM deployment and control. The ability to dynamically impose behavioral constraints at inference time enables much more flexible and context-aware AI systems. Rather than requiring separate models for different use cases or extensive retraining for new constraints, a single model can adapt to diverse requirements.
The implications extend beyond simple topic avoidance. Our methodology could be applied to other behavioral specifications such as preventing hallucination in specific domains, enforcing particular communication styles, or adhering to cultural or regulatory requirements that vary by deployment context.
For practical applications, this work enables more trustworthy conversational AI systems that can reliably respect user boundaries and organizational policies. A global AI product could be adapted to local sensitivities without requiring entirely new models, making AI deployment more efficient and culturally appropriate.
Technical Innovations
Beyond the core DPF method, we introduced several technical innovations. Our use of dialogue planning as an intermediate step in synthetic data generation significantly improved the naturalness and quality of training conversations compared to direct generation approaches.
We also developed a comprehensive evaluation framework, including the validation of GPT-4 as a reliable evaluator for this task, which provides a template for assessing similar behavioral modifications in LLMs. Our multi-metric filtering approach for dataset quality control offers best practices for synthetic data curation.
Future work could extend this approach to more complex constraints, such as simultaneously avoiding multiple topics or handling hierarchical avoidance rules. We also suggest exploring how well the learned avoidance behavior generalizes to categories of entities rather than specific pairs.
