Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user’s preferences, we argue that factuality is an equally important yet often overlooked dimension.
Let x be a user query, P = {p₁, …, pₙ} the set of user preferences (features), and M a language model.
Given input (x, P), the model outputs:
$y = M(x, P)$
We define the following binary indicator functions:
- $\text{Acc}(y) = 1$ if $y$ is factually correct w.r.t. $x$; else 0.
- $\text{PrefRel}(x, P) = 1$ if some $p_i \in P$ is relevant to $x$; else 0.
- $\text{Followed}(y, P) = 1$ if $y$ incorporates a relevant $p_i \in P$; else 0.
A model M is said to be robust iff:
(1) It maintains factual accuracy while conditioning on the relevant pᵢ ∈ P for any given query x.
(2) It ignores irrelevant user features within the feature set P for any given query x.
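Using the binary functions above, these two conditions can be sketched as follows, writing $M(x, \emptyset)$ for the response generated without any preferences; this is a compact paraphrase rather than an exact formalization:

$$
\text{PrefRel}(x, P) = 1 \;\Rightarrow\; \text{Acc}(M(x, P)) = \text{Acc}(M(x, \emptyset)) \,\wedge\, \text{Followed}(M(x, P), P) = 1
$$

$$
\text{PrefRel}(x, P) = 0 \;\Rightarrow\; \text{Acc}(M(x, P)) = \text{Acc}(M(x, \emptyset))
$$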
We show our dataset curation pipeline below (see the paper for more details).
Our version is available for download at data/robuset_main.csv.
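A minimal sketch for loading the released file with pandas; the column schema is not reproduced here, so the snippet simply loads the CSV and inspects it:

```python
import pandas as pd

# Load the released dataset from the path given above.
df = pd.read_csv("data/robuset_main.csv")

# Inspect the shape, columns, and a few rows before building an evaluation
# loop; the exact schema (query, preference, relevance label, etc.) is
# defined by the released file itself.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```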
We introduce four complementary error-based metrics. Lower values (closer to zero) across all metrics indicate more robust, stable, and consistent behavior.
Breakage Rate
Measures how often personalization causes the model to fail on inputs that it handles correctly without any preference conditioning.
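A sketch of this definition in our own notation, writing $C_\emptyset$ for the set of queries the model answers correctly without preferences and $C_P$ for the set it answers correctly when conditioned on $P$:

$$
\text{Breakage Rate} = \frac{|C_\emptyset \setminus C_P|}{|C_\emptyset|}
$$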
Alignment Failure
Measures, among the examples that the model answers correctly without personalization, how often its personalized response fails to align with the user's preferences.
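In the same notation, one consistent formalization (a sketch, using $\text{Followed}$ from above) is:

$$
\text{Alignment Failure} = \frac{|\{x \in C_\emptyset : \text{Followed}(M(x, P), P) = 0\}|}{|C_\emptyset|}
$$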
Robustness Error
Robustness Error is computed over the union of the breakage and alignment-failure sets: it measures how often the model either fails to answer correctly or fails to align with the user preference.
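Writing $B$ and $A$ for the breakage and alignment-failure sets defined above, a sketch of this metric is:

$$
\text{Robustness Error} = \frac{|B \cup A|}{|C_\emptyset|}
$$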
Performance Variation
Measures the divergence in correctness with and without personalization.
Similar to the Jaccard distance, it compares the sets of queries answered correctly with and without personalization.
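A Jaccard-distance-style sketch consistent with this description, in the notation above:

$$
\text{Performance Variation} = 1 - \frac{|C_\emptyset \cap C_P|}{|C_\emptyset \cup C_P|}
$$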
Some of the key research questions and results are summarized below:
Q: Are LLMs robust when we include a relevant user preference?
Answer: No. Models exhibit varying levels of breakage and alignment failures, which can lead to a combined robustness error as high as 34% in some of the less robust models, and still as high as 9% in some of the more robust models.
Q: How robust are LLMs when both relevant and irrelevant preferences are present?
Answer: Irrelevant preferences amplify robustness errors.
Q: What types of failure patterns do models exhibit?
Answer: Question and preference categories significantly influence robustness.
We introduce Pref-Aligner, a two-stage agentic framework that decouples generation from personalization, with an agent specialized for each task. In the first stage, a generation agent responds to the user query without considering any defined preferences. In the second stage, an aligner agent takes the unconditioned response from the generation agent together with the user preference(s) and produces an aligned response (if needed). This way, we eliminate the inconsistencies introduced by preference signals during initial generation.
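A minimal sketch of this two-stage flow, assuming a generic `chat(prompt)` helper that wraps whatever LLM backend is in use; the function names and prompts below are illustrative rather than the exact implementation:

```python
from typing import Callable, List, Optional

# `chat` is a placeholder for any LLM call, wrapped to take a single prompt
# string and return the model's text response.
ChatFn = Callable[[str], str]


def generation_agent(chat: ChatFn, query: str) -> str:
    """Stage 1: answer the query with no preference conditioning."""
    return chat(f"Answer the following question accurately.\n\nQuestion: {query}")


def aligner_agent(chat: ChatFn, query: str, draft: str,
                  preferences: Optional[List[str]]) -> str:
    """Stage 2: adapt the draft to the user's preferences, if any apply."""
    if not preferences:
        return draft  # nothing to align to
    prefs = "\n".join(f"- {p}" for p in preferences)
    return chat(
        "Rewrite the draft answer so that it respects any user preferences "
        "that are relevant to the question, while keeping all facts unchanged. "
        "If no preference is relevant, return the draft as is.\n\n"
        f"Question: {query}\n\nUser preferences:\n{prefs}\n\nDraft answer:\n{draft}"
    )


def pref_aligner(chat: ChatFn, query: str,
                 preferences: Optional[List[str]] = None) -> str:
    """Generation is decoupled from personalization: draft first, align second."""
    draft = generation_agent(chat, query)
    return aligner_agent(chat, query, draft, preferences)
```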
Results show that our framework consistently improves robustness across the representative models we evaluated: Llama3-8B, Llama3-70B, Mixtral-8x7B, and Gemma-2-9B. Notably, the breakage rate for Llama3-70B drops from 5.6% to 1.3% in the relevant-preference setting and remains consistent even in the mixed and irrelevant settings, highlighting the effectiveness of our proposed framework under diverse conditions.
| Model | Method | Robustness Error (%) |
|---|---|---|
| Llama3-8B | Naive Prompting | 20.9 |
| Llama3-8B | Pref-Aligner (ours) | 18.1 |
| Llama3-70B | Naive Prompting | 9.0 |
| Llama3-70B | Pref-Aligner (ours) | 6.5 |
| Mixtral-8x7B | Naive Prompting | 26.1 |
| Mixtral-8x7B | Pref-Aligner (ours) | 18.9 |
| Gemma-2-9B | Naive Prompting | 12.6 |
| Gemma-2-9B | Pref-Aligner (ours) | 6.8 |
Table: Robustness Error comparison between Naive Prompting (Zero-Shot) and Pref-Aligner across four models. Pref-Aligner consistently reduces robustness error across all models, achieving a minimum relative reduction of 13% (Llama3-8B) and up to 46% (Gemma-2-9B).
| Method | Relevant (%) | Mixed (%) | Irrelevant (%) |
|---|---|---|---|
| Naive Prompting | 5.6 | 6.9 | 5.5 |
| Pref-Aligner (ours) | 1.1 | 1.2 | 1.2 |
Table: Breakage Rate of Pref-Aligner compared to Zero-Shot for Llama3-70B in three preference-relevance settings. Pref-Aligner shows significant improvement over naive prompting across all settings, and its performance remains consistent irrespective of the preference setting.
Current LLMs are not fully robust; preference signals can:
- be ignored (misalignment), or
- degrade factual reliability (breakage).
This work provides important insights into an often overlooked aspect of personalization evaluation, factual correctness, and offers practical guidance on model selection for user-adaptive applications.