🌐 Project | 📑 Paper | 🤗 Hugging Face Models | 🤗 Hugging Face Datasets
Large language models (LLMs) often struggle with context fidelity, producing inconsistent or hallucinated answers even when relevant information is present.
We propose CARE, a native retrieval-augmented reasoning framework that integrates in-context evidence directly into the reasoning chain.
This work represents a step toward making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.
Figure 1: Comparison of model performance across different settings. CARE demonstrates improved results over baselines on multiple QA benchmarks.
Figure 2: A schematic illustration of the training data and process. The upper part shows SFT data generation (fact injection and special tokens), while the lower part shows the SFT training process together with reinforcement learning (RL) using multiple rewards.
Requirements:

- Python 3.9+
- `requirements.txt` includes:

  ```
  transformers>=4.51.0
  flash-attn>=2.4.3
  vllm>=0.8.3
  ```
Clone and install:

```bash
git clone https://github.com/FoundationAgents/CARE
cd CARE
pip install -r requirements.txt
pip install -e .
```
Use the provided helper scripts to download the Qwen models and datasets (DROP, MuSiQue):

```bash
python CARE/scripts/load_script/load_dataset.py
python CARE/scripts/load_script/load_model.py
```

This will save all resources under `CARE/datasets/`.
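If you prefer to fetch the resources by hand, the Hugging Face CLI can do the same job. The snippet below is a minimal sketch, not the project's own loader: the dataset repo ids and target directories are assumptions, so align them with whatever `load_model.py` and `load_dataset.py` expect.

```bash
# Manual-download sketch. Dataset repo ids are placeholders; the helper
# scripts in CARE/scripts/load_script/ remain the reference.
pip install -U "huggingface_hub[cli]"

# Base model (example: Qwen2.5-7B-Instruct) into a local checkpoint directory.
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
  --local-dir CARE/models/Qwen2.5-7B-Instruct

# QA datasets (replace the placeholders with the repo ids the load scripts use).
huggingface-cli download <drop-repo-id> --repo-type dataset \
  --local-dir CARE/datasets/drop
huggingface-cli download <musique-repo-id> --repo-type dataset \
  --local-dir CARE/datasets/musique
```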
We provide ready-to-run training examples. For Qwen2.5-7B with DROP + MuSiQue:
```bash
bash CARE/scripts/training_examples/run_qwen2_5_7b_retrieve_mix_musique.sh
```
Edit the script to change the following (see the sketch after this list):
- `MODEL_PATH` → local checkpoint path or Hugging Face repo id.
- `data.train_files` / `data.extra_files` / `data.val_files` → training, extra, and validation dataset paths.
- `SYSTEM_PROMPT` → reasoning-style system prompt.
- `trainer.max_steps` / `trainer.n_gpus_per_node` → training length and GPUs per node.
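The example scripts pass these values to the trainer as verl-style dot-notation overrides. The excerpt below is a hedged sketch of what the edited section might look like; the entry point, override keys, and file paths are assumptions and should be checked against the shipped script.

```bash
# Hypothetical excerpt of run_qwen2_5_7b_retrieve_mix_musique.sh after editing.
# Key names follow verl's dot-notation overrides; verify them against the
# actual script before running.
MODEL_PATH="Qwen/Qwen2.5-7B-Instruct"   # or a local checkpoint directory
SYSTEM_PROMPT="Reason step by step, quoting supporting evidence from the context before answering."
# How SYSTEM_PROMPT is injected into the trainer is script-specific; see the original script.

python3 -m verl.trainer.main_ppo \
  actor_rollout_ref.model.path="${MODEL_PATH}" \
  data.train_files=CARE/datasets/musique/train.parquet \
  data.extra_files=CARE/datasets/drop/train.parquet \
  data.val_files=CARE/datasets/musique/dev.parquet \
  trainer.max_steps=500 \
  trainer.n_gpus_per_node=8
```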
| Model | Method | MFQA | HotpotQA | 2WikiMQA | MuSiQue | Average |
|---|---|---|---|---|---|---|
| LLaMA-3.1 8B | Original | 45.57 | 54.64 | 45.87 | 32.08 | 44.54 |
| | R1-Searcher | 28.44 | 53.71 | 67.10 | 41.41 | 47.67 |
| | CRAG | 44.04 | 37.88 | 25.95 | 24.10 | 32.99 |
| | CARE | 49.94 | 63.09 | 75.29 | 51.00 | 59.83 |
| Qwen2.5 7B | Original | 46.94 | 58.47 | 46.96 | 30.78 | 45.79 |
| | R1-Searcher | 28.36 | 55.43 | 65.79 | 47.09 | 49.17 |
| | CARE | 48.11 | 63.45 | 70.11 | 45.57 | 56.81 |
| Qwen2.5 14B | Original | 47.58 | 61.94 | 59.05 | 37.99 | 51.64 |
| | CRAG | 50.89 | 44.74 | 34.68 | 28.17 | 39.62 |
| | CARE | 48.81 | 67.75 | 78.68 | 51.27 | 61.63 |
Table 1: Evaluation on real-world QA datasets. Results are grouped by the base LLM. The best and second-best results are shown in bold and underline, respectively. Slash (/) indicates unavailable checkpoints or unsupported models.
Setting | SFT | RL | Retrieval | Curriculum | MFQA | HotpotQA | 2WikiMQA | MuSiQue | CofCA | Average |
---|---|---|---|---|---|---|---|---|---|---|
Baseline | ✗ | ✗ | ✗ | ✗ | 46.64 | 58.47 | 46.96 | 30.78 | 58.38 | 48.25 |
SFT Only | ✓ | ✗ | ✗ | ✗ | 42.24 | 47.08 | 61.51 | 33.82 | 59.21 | 48.77 |
No Retrieval | ✓ | ✓ | ✗ | ✗ | 37.66 | 62.59 | 70.57 | 43.85 | 57.26 | 54.39 |
No Curriculum | ✓ | ✓ | ✓ | ✗ | 38.33 | 64.10 | 70.69 | 47.49 | 60.60 | 56.24 |
CARE | ✓ | ✓ | ✓ | ✓ | 48.11 | 63.45 | 70.11 | 45.57 | 64.56 | 58.36 |
Table 2: Ablation studies on QA tasks based on Qwen2.5-7B. The best and second-best results are shown in bold and underline, respectively. “Retrieval” indicates the retrieval reward, and “Curriculum” indicates curriculum learning.
📌 Whether to enable curriculum learning can be controlled in `verl/trainer/config.py`.
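The exact field name is not reproduced here because it may differ between versions; a quick keyword search of the config file will locate it:

```bash
# Find the curriculum-learning switch before toggling it.
grep -n -i "curriculum" verl/trainer/config.py
```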