🌐 Project | 📑 Paper | 🤗 Hugging Face Models | 🤗 Hugging Face Datasets
Large language models (LLMs) often struggle with context fidelity, producing inconsistent or hallucinated answers even when relevant information is present.
We propose CARE, a native retrieval-augmented reasoning framework that integrates in-context evidence directly into the reasoning chain.
This work represents a step toward making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.
Figure 1: Comparison of model performance across different settings. CARE demonstrates improved results over baselines on multiple QA benchmarks.
Figure 2: A schematic illustration of the training data and process. The upper part shows SFT data generation (fact injection and special tokens), while the lower part shows the SFT training process together with reinforcement learning (RL) using multiple rewards.
Requirements:

- Python 3.9+
- `requirements.txt` includes:

  ```
  transformers>=4.51.0
  flash-attn>=2.4.3
  vllm>=0.8.3
  ```
Clone and install:

```bash
git clone https://github.com/FoundationAgents/CARE
cd CARE
pip install -r requirements.txt
pip install -e .
```
Use the provided helper scripts to download the Qwen models and datasets (DROP, MuSiQue):

```bash
python CARE/scripts/load_script/load_dataset.py
python CARE/scripts/load_script/load_model.py
```

This will save all resources under `CARE/datasets/`.
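If you prefer to fetch the resources by hand, the Hugging Face CLI can do the same job. The snippet below is a minimal sketch, not the project's own loader: the dataset repo ids and target directories are assumptions, so align them with whatever `load_model.py` and `load_dataset.py` expect.

```bash
# Manual-download sketch. Dataset repo ids are placeholders; the helper
# scripts in CARE/scripts/load_script/ remain the reference.
pip install -U "huggingface_hub[cli]"

# Base model (example: Qwen2.5-7B-Instruct) into a local checkpoint directory.
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
  --local-dir CARE/models/Qwen2.5-7B-Instruct

# QA datasets (replace the placeholders with the repo ids the load scripts use).
huggingface-cli download <drop-repo-id> --repo-type dataset \
  --local-dir CARE/datasets/drop
huggingface-cli download <musique-repo-id> --repo-type dataset \
  --local-dir CARE/datasets/musique
```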
We provide ready-to-run training examples. For Qwen2.5-7B with DROP + MuSiQue:
```bash
bash CARE/scripts/training_examples/run_qwen2_5_7b_retrieve_mix_musique.sh
```
Edit the script to change the following (see the sketch after this list):
- `MODEL_PATH` → local checkpoint path or Hugging Face repo id.
- `data.train_files` / `data.extra_files` / `data.val_files` → training, extra, and validation dataset paths.
- `SYSTEM_PROMPT` → reasoning-style system prompt.
- `trainer.max_steps` / `trainer.n_gpus_per_node` → training length and GPUs per node.
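The example scripts pass these values to the trainer as verl-style dot-notation overrides. The excerpt below is a hedged sketch of what the edited section might look like; the entry point, override keys, and file paths are assumptions and should be checked against the shipped script.

```bash
# Hypothetical excerpt of run_qwen2_5_7b_retrieve_mix_musique.sh after editing.
# Key names follow verl's dot-notation overrides; verify them against the
# actual script before running.
MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"   # or a local checkpoint directory
SYSTEM_PROMPT="Reason step by step, quoting supporting evidence from the context before answering."
# How SYSTEM_PROMPT is injected into the trainer is script-specific; see the original script.

python3 -m verl.trainer.main_ppo \
  actor_rollout_ref.model.path="${MODEL_PATH}" \
  data.train_files=CARE/datasets/musique/train.parquet \
  data.extra_files=CARE/datasets/drop/train.parquet \
  data.val_files=CARE/datasets/musique/dev.parquet \
  trainer.max_steps=500 \
  trainer.n_gpus_per_node=8
```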
| Model | Method | MFQA | HotpotQA | 2WikiMQA | MuSiQue | Average |
|---|---|---|---|---|---|---|
| LLaMA-3.1 8B | Original | 45.57 | 54.64 | 45.87 | 32.08 | 44.54 |
| | R1-Searcher | 28.44 | 53.71 | 67.10 | 41.41 | 47.67 |
| | CRAG | 44.04 | 37.88 | 25.95 | 24.10 | 32.99 |
| | CARE | 49.94 | 63.09 | 75.29 | 51.00 | 59.83 |
| Qwen2.5 7B | Original | 46.94 | 58.47 | 46.96 | 30.78 | 45.79 |
| | R1-Searcher | 28.36 | 55.43 | 65.79 | 47.09 | 49.17 |
| | CARE | 48.11 | 63.45 | 70.11 | 45.57 | 56.81 |
| Qwen2.5 14B | Original | 47.58 | 61.94 | 59.05 | 37.99 | 51.64 |
| | CRAG | 50.89 | 44.74 | 34.68 | 28.17 | 39.62 |
| | CARE | 48.81 | 67.75 | 78.68 | 51.27 | 61.63 |
Table 1: Evaluation on real-world QA datasets. Results are grouped by the base LLM. The best and second-best results are shown in bold and underline, respectively. Slash (/) indicates unavailable checkpoints or unsupported models.
Setting | SFT | RL | Retrieval | Curriculum | MFQA | HotpotQA | 2WikiMQA | MuSiQue | CofCA | Average |
---|---|---|---|---|---|---|---|---|---|---|
Baseline | ✗ | ✗ | ✗ | ✗ | 46.64 | 58.47 | 46.96 | 30.78 | 58.38 | 48.25 |
SFT Only | ✓ | ✗ | ✗ | ✗ | 42.24 | 47.08 | 61.51 | 33.82 | 59.21 | 48.77 |
No Retrieval | ✓ | ✓ | ✗ | ✗ | 37.66 | 62.59 | 70.57 | 43.85 | 57.26 | 54.39 |
No Curriculum | ✓ | ✓ | ✓ | ✗ | 38.33 | 64.10 | 70.69 | 47.49 | 60.60 | 56.24 |
CARE | ✓ | ✓ | ✓ | ✓ | 48.11 | 63.45 | 70.11 | 45.57 | 64.56 | 58.36 |
Table 2: Ablation studies on QA tasks based on Qwen2.5-7B. The best and second-best results are shown in bold and underline, respectively. “Retrieval” indicates the retrieval reward, and “Curriculum” indicates curriculum learning.
📌 Whether to enable curriculum learning can be controlled in `verl/trainer/config.py`.
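The exact field name is not reproduced here because it may differ between versions; a quick keyword search of the config file will locate it:

```bash
# Find the curriculum-learning switch before toggling it.
grep -n -i "curriculum" verl/trainer/config.py
```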