Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user’s preferences, we argue that factuality is an equally important yet often overlooked dimension.
Let x be a user query, P = {p₁, …, pₙ} the set of user preferences (features), and M a language model.
Given input (x, P), the model outputs:
$y = M(x, P)$
We define the following binary indicator functions:
- $\text{Acc}(y) = 1$ if $y$ is factually correct w.r.t. $x$; else 0.
- $\text{PrefRel}(x, P) = 1$ if some $p_i \in P$ is relevant to $x$; else 0.
- $\text{Followed}(y, P) = 1$ if $y$ incorporates a relevant $p_i \in P$; else 0.
A model M is said to be robust iff:
(1) It maintains factual accuracy while conditioning on the relevant pᵢ ∈ P for any given query x.
(2) It ignores irrelevant user features within the feature set P for any given query x.
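Using the binary functions above, these two conditions can be sketched as follows, writing $M(x, \emptyset)$ for the response generated without any preferences; this is a compact paraphrase rather than an exact formalization:

$$
\text{PrefRel}(x, P) = 1 \;\Rightarrow\; \text{Acc}(M(x, P)) = \text{Acc}(M(x, \emptyset)) \,\wedge\, \text{Followed}(M(x, P), P) = 1
$$

$$
\text{PrefRel}(x, P) = 0 \;\Rightarrow\; \text{Acc}(M(x, P)) = \text{Acc}(M(x, \emptyset))
$$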
We show our dataset curation pipeline below (see the paper for more details).
Our version is available for download at data/robuset_main.csv.
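A minimal sketch for loading the released file with pandas; the column schema is not reproduced here, so the snippet simply loads the CSV and inspects it:

```python
import pandas as pd

# Load the released dataset from the path given above.
df = pd.read_csv("data/robuset_main.csv")

# Inspect the shape, columns, and a few rows before building an evaluation
# loop; the exact schema (query, preference, relevance label, etc.) is
# defined by the released file itself.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```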
We introduce four complementary error-based metrics. Lower values (closer to zero) across all metrics indicate more robust, stable, and consistent behavior.
Breakage Rate
Measures how often personalization causes the model to fail on inputs that it handles correctly without any preference conditioning.
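A sketch of this definition in our own notation, writing $C_\emptyset$ for the set of queries the model answers correctly without preferences and $C_P$ for the set it answers correctly when conditioned on $P$:

$$
\text{Breakage Rate} = \frac{|C_\emptyset \setminus C_P|}{|C_\emptyset|}
$$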
Alignment Failure
Measures, among the examples that the model answers correctly without personalization, how often its personalized response fails to align with the user's preferences.
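In the same notation, one consistent formalization (a sketch, using $\text{Followed}$ from above) is:

$$
\text{Alignment Failure} = \frac{|\{x \in C_\emptyset : \text{Followed}(M(x, P), P) = 0\}|}{|C_\emptyset|}
$$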
Robustness Error
Robustness Error is computed over the union of the breakage and alignment-failure sets: it measures how often the model either fails to answer correctly or fails to align with the user preference.
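Writing $B$ and $A$ for the breakage and alignment-failure sets defined above, a sketch of this metric is:

$$
\text{Robustness Error} = \frac{|B \cup A|}{|C_\emptyset|}
$$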
Performance Variation
Measures the divergence in correctness with and without personalization.
Similar to the Jaccard distance, it compares the sets of queries answered correctly with and without personalization.
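A Jaccard-distance-style sketch consistent with this description, in the notation above:

$$
\text{Performance Variation} = 1 - \frac{|C_\emptyset \cap C_P|}{|C_\emptyset \cup C_P|}
$$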
Some of the key research questions and results are summarized below:
Q: Are LLMs robust when we include a relevant user preference?
Answer: No. Models exhibit varying levels of breakage and alignment failures, which can lead to a combined robustness error as high as 34% in some of the less robust models, and still as high as 9% in some of the more robust models.
Q: How robust are LLMs when both relevant and irrelevant preferences are present?
Answer: Irrelevant preferences amplify robustness errors.
Q: What types of failure patterns do models exhibit?
Answer: Question and preference categories significantly influence robustness.
We introduce Pref-Aligner, a two-stage agentic framework that decouples generation from personalization, with an agent specialized for each task. In the first stage, a generation agent responds to the user query without considering any defined preferences. In the second stage, an aligner agent takes the unconditioned response from the generation agent together with the user preference(s) and produces an aligned response (if needed). This way, we eliminate the inconsistencies introduced by preference signals during initial generation.
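A minimal sketch of this two-stage flow, assuming a generic `chat(prompt)` helper that wraps whatever LLM backend is in use; the function names and prompts below are illustrative rather than the exact implementation:

```python
from typing import Callable, List, Optional

# `chat` is a placeholder for any LLM call, wrapped to take a single prompt
# string and return the model's text response.
ChatFn = Callable[[str], str]


def generation_agent(chat: ChatFn, query: str) -> str:
    """Stage 1: answer the query with no preference conditioning."""
    return chat(f"Answer the following question accurately.\n\nQuestion: {query}")


def aligner_agent(chat: ChatFn, query: str, draft: str,
                  preferences: Optional[List[str]]) -> str:
    """Stage 2: adapt the draft to the user's preferences, if any apply."""
    if not preferences:
        return draft  # nothing to align to
    prefs = "\n".join(f"- {p}" for p in preferences)
    return chat(
        "Rewrite the draft answer so that it respects any user preferences "
        "that are relevant to the question, while keeping all facts unchanged. "
        "If no preference is relevant, return the draft as is.\n\n"
        f"Question: {query}\n\nUser preferences:\n{prefs}\n\nDraft answer:\n{draft}"
    )


def pref_aligner(chat: ChatFn, query: str,
                 preferences: Optional[List[str]] = None) -> str:
    """Generation is decoupled from personalization: draft first, align second."""
    draft = generation_agent(chat, query)
    return aligner_agent(chat, query, draft, preferences)
```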
Results show that our framework consistently improves robustness across the representative models we evaluated: Llama3-8B, Llama3-70B, Mixtral-8x7B, and Gemma-2-9B. Notably, the breakage rate for Llama3-70B drops from 5.6% to 1.3% in the relevant-preference setting and remains consistent even in the mixed and irrelevant settings, highlighting the effectiveness of our proposed framework under diverse conditions.
| Model | Method | Robustness Error (%) |
|---|---|---|
| Llama3-8B | Naive Prompting | 20.9 |
| Llama3-8B | Pref-Aligner (ours) | 18.1 |
| Llama3-70B | Naive Prompting | 9.0 |
| Llama3-70B | Pref-Aligner (ours) | 6.5 |
| Mixtral-8x7B | Naive Prompting | 26.1 |
| Mixtral-8x7B | Pref-Aligner (ours) | 18.9 |
| Gemma-2-9B | Naive Prompting | 12.6 |
| Gemma-2-9B | Pref-Aligner (ours) | 6.8 |
Table: Robustness Error comparison between Naive Prompting (Zero-Shot) and Pref-Aligner across four models. Pref-Aligner consistently reduces robustness error across all models, achieving a minimum relative reduction of 13% (Llama3-8B) and up to 46% (Gemma-2-9B).
| Method | Relevant (%) | Mixed (%) | Irrelevant (%) |
|---|---|---|---|
| Naive Prompting | 5.6 | 6.9 | 5.5 |
| Pref-Aligner (ours) | 1.1 | 1.2 | 1.2 |
Table: Breakage Rate of Pref-Aligner compared to Zero-Shot for Llama3-70B in three preference-relevance settings. Pref-Aligner shows significant improvement over naive prompting across all settings, and its performance remains consistent irrespective of the preference setting.
Current LLMs are not fully robust; preference signals can:
- be ignored (misalignment), or
- degrade factual reliability (breakage).
This work provides important insights into an often overlooked aspect of personalization evaluation, factual correctness, and offers practical guidance on model selection for user-adaptive applications.