
CARE: Improving Context Fidelity via Native Retrieval-Augmented Reasoning

🌐 Project   |    📑 Paper    |    🤗 Hugging Face Models   |    🤗 Hugging Face Datasets

Large language models (LLMs) often struggle with context fidelity, producing inconsistent or hallucinated answers even when relevant information is present.
We propose CARE, a native retrieval-augmented reasoning framework that integrates in-context evidence directly into the reasoning chain.
This work represents a step toward making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.

Results Overview

CARE Results

Figure 1: Comparison of model performance across different settings. CARE demonstrates improved results over baselines on multiple QA benchmarks.

Method Overview

Figure 2: A schematic illustration of the training data and process. The upper part shows SFT data generation (fact injection and special tokens), while the lower part shows the SFT training process together with reinforcement learning (RL) using multiple rewards.


🔧 Installation

Requirements:

  • Python 3.9+
  • requirements.txt includes:
    • transformers>=4.51.0
    • flash-attn>=2.4.3
    • vllm>=0.8.3

Clone and install:

git clone https://github.com/FoundationAgents/CARE
cd CARE
pip install -r requirements.txt
pip install -e .
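
To confirm that the core dependencies installed correctly, a quick sanity check (assuming a working CUDA environment for flash-attn and vLLM) is:

python -c "import torch, transformers, vllm, flash_attn; print(torch.__version__, transformers.__version__, vllm.__version__, flash_attn.__version__)"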

📥 Data and Model Download

Use the provided helper scripts to download the Qwen models and the datasets (DROP, MuSiQue):

python CARE/scripts/load_script/load_dataset.py
python CARE/scripts/load_script/load_model.py

This will save all resources under CARE/datasets/.
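
If you prefer to fetch resources manually, the Hugging Face CLI works as well. The repo ids and target directories below are illustrative assumptions (the helper scripts define the exact ones used by CARE), shown for a Qwen2.5 checkpoint and the DROP dataset:

huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir CARE/datasets/Qwen2.5-7B-Instruct
huggingface-cli download ucinlp/drop --repo-type dataset --local-dir CARE/datasets/drop

MuSiQue can be fetched the same way once you know its dataset repo id.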


🚀 Reinforcement Learning

We provide ready-to-run training examples. For Qwen2.5-7B with DROP + MuSiQue:

bash CARE/scripts/training_examples/run_qwen2_5_7b_retrieve_mix_musique.sh

Edit the script to adjust the following (an illustrative sketch of these overrides appears after the list):

  • MODEL_PATH → local checkpoint path or Hugging Face repo id.
  • data.train_files / data.extra_files / data.val_files → datasets.
  • SYSTEM_PROMPT → reasoning style prompt.
  • trainer.max_steps / trainer.n_gpus_per_node → training setup.
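
As a rough orientation, the edited portion of such a script might look like the sketch below. All values are examples, and the verl-style command and override keys are assumptions about the CARE fork rather than its documented defaults; consult the actual script in the repository for the real settings.

# example values only; the script in CARE/scripts/training_examples/ is authoritative
MODEL_PATH=Qwen/Qwen2.5-7B-Instruct   # local checkpoint path or Hugging Face repo id
SYSTEM_PROMPT="Reason step by step, citing the given context, before answering."   # how this is wired into the prompt template is script-specific

python -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=$MODEL_PATH \
    data.train_files=CARE/datasets/drop/train.parquet \
    data.extra_files=CARE/datasets/musique/train.parquet \
    data.val_files=CARE/datasets/musique/dev.parquet \
    trainer.max_steps=500 \
    trainer.n_gpus_per_node=8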

📊 Results


Benchmark Comparison

Model          Method        MFQA   HotpotQA  2WikiMQA  MuSiQue  Average
LLaMA-3.1 8B   Original      45.57  54.64     45.87     32.08    44.54
LLaMA-3.1 8B   R1-Searcher   28.44  53.71     67.10     41.41    47.67
LLaMA-3.1 8B   CRAG          44.04  37.88     25.95     24.10    32.99
LLaMA-3.1 8B   CARE          49.94  63.09     75.29     51.00    59.83
Qwen2.5 7B     Original      46.94  58.47     46.96     30.78    45.79
Qwen2.5 7B     R1-Searcher   28.36  55.43     65.79     47.09    49.17
Qwen2.5 7B     CARE          48.11  63.45     70.11     45.57    56.81
Qwen2.5 14B    Original      47.58  61.94     59.05     37.99    51.64
Qwen2.5 14B    CRAG          50.89  44.74     34.68     28.17    39.62
Qwen2.5 14B    CARE          48.81  67.75     78.68     51.27    61.63

Table 1: Evaluation on real-world QA datasets. Results are grouped by the base LLM. Baseline rows are omitted where checkpoints are unavailable or the model is unsupported.

Ablation Study

Setting        MFQA   HotpotQA  2WikiMQA  MuSiQue  CofCA  Average
Baseline       46.64  58.47     46.96     30.78    58.38  48.25
SFT Only       42.24  47.08     61.51     33.82    59.21  48.77
No Retrieval   37.66  62.59     70.57     43.85    57.26  54.39
No Curriculum  38.33  64.10     70.69     47.49    60.60  56.24
CARE           48.11  63.45     70.11     45.57    64.56  58.36

Table 2: Ablation studies on QA tasks based on Qwen2.5-7B. "SFT Only" trains without the RL stage, "No Retrieval" removes the retrieval reward, and "No Curriculum" removes curriculum learning.

📌 Whether to enable curriculum learning can be controlled in
verl/trainer/config.py.
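
To locate the exact option rather than guessing its name, a quick search inside that file works, e.g.:

grep -n -i curriculum verl/trainer/config.py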
