🌪️ [GFPO]: implement GFPO in GRPOTrainer #3989
Conversation
Thanks for the PR! Just went through the paper. I feel like the idea can be implemented without any modification in the codebase, leveraging the fact that, when a reward function returns `None` for a completion, that reward is simply ignored. For example:

```python
def math_reward_func(completions, **kwargs):
    prompts = kwargs["prompts"]
    rewards = []
    for idx, completion in enumerate(completions):
        if idx % 2 == 0:
            # check_math_solution is a placeholder verifier
            correct = check_math_solution(prompts[idx], completion)
            reward = 1.0 if correct else -1.0
            rewards.append(reward)
        else:
            # Returning None masks this completion's reward
            rewards.append(None)
    return rewards
```

Can you confirm my intuition? |
I believe that filtering out unsatisfied completions by masking their rewards does not align with the core solution proposed by GFPO. The paper gives its reasons for filtering completions without changing rewards in sections two and three:
- The idea of GFPO is data filtration.
- GFPO's metrics validate reward integrity.
|
Sorry but I still don't get the difference. How is it different from doing this, for example:

```python
from collections import defaultdict

def reward_func(prompts, completions_ids, **kwargs):
    num_remains_in_group = 2
    rewards = [1.0] * len(prompts)  # default reward

    # Group indices by prompt
    groups = defaultdict(list)
    for idx, prompt in enumerate(prompts):
        groups[prompt].append(idx)

    # For each group, deactivate the longest completions
    for prompt, indices in groups.items():
        # Sort indices in this group by completion length (ascending)
        sorted_indices = sorted(indices, key=lambda i: len(completions_ids[i]))
        # Deactivate everything beyond the `num_remains_in_group` shortest
        for i in sorted_indices[num_remains_in_group:]:
            rewards[i] = None
    return rewards

prompts = ["P1", "P1", "P1", "P2", "P2", "P2"]
completions_ids = [
    [11, 12, 13],
    [14, 15, 16, 17],       # longest in group, reward=None
    [18, 19],
    [21, 22, 23, 24, 25],   # longest in group, reward=None
    [26, 27],
    [28, 29, 30],
]
print(reward_func(prompts, completions_ids))
# [1.0, None, 1.0, None, 1.0, 1.0]
```
|
Suppose your rewards come from several reward functions, some rule-based and others model-based. You aggregate the rewards to get the overall reward for each completion, then filter the completions based on response length and token efficiency (reward/length). Is this approach feasible with the example you provided above? |
I think so. You'd just replace `rewards = [1.0] * len(prompts)` with something like `rewards = [reward_func1(p, c) + reward_func2(p, c) for p, c in zip(prompts, completions)]`. The only limitation is that the trainer won't be able to log these rewards separately, since the aggregation happens inside a single reward function. |
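For concreteness, here is a minimal sketch of that combined approach. `reward_func1` and `reward_func2` are hypothetical placeholder scorers (standing in for the rule-based and model-based rewards), and character length is used as a rough proxy for token count:

```python
from collections import defaultdict

def reward_func1(prompt, completion):
    # Placeholder rule-based scorer
    return 1.0 if "answer" in completion.lower() else 0.0

def reward_func2(prompt, completion):
    # Placeholder stand-in for a model-based scorer
    return min(len(completion) / 100, 1.0)

def aggregated_filtering_reward(prompts, completions, **kwargs):
    num_remains_in_group = 2
    # Aggregate several reward signals into one overall reward per completion
    rewards = [reward_func1(p, c) + reward_func2(p, c) for p, c in zip(prompts, completions)]

    # Group completion indices by prompt
    groups = defaultdict(list)
    for idx, prompt in enumerate(prompts):
        groups[prompt].append(idx)

    # In each group, keep the completions with the best token efficiency
    # (reward / length) and mask out the rest with None
    for indices in groups.values():
        ranked = sorted(indices, key=lambda i: rewards[i] / max(len(completions[i]), 1), reverse=True)
        for i in ranked[num_remains_in_group:]:
            rewards[i] = None
    return rewards

prompts = ["P1"] * 4
completions = ["The answer is 42.", "Let me think... the answer is 42.", "42", "I am not sure."]
print(aggregated_filtering_reward(prompts, completions))
# Completions with the best reward/length keep their reward; the rest become None
```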
I have some reservations about your approach (see `trl/trainer/grpo_trainer.py`, lines 1008 to 1010 at `cb84da0`).
Even when rewards are set to `None`, the subsequent aggregation turns them into zero (`trl/trainer/grpo_trainer.py`, line 1403 at `cb84da0`).
Relative advantages will still be computed over both the satisfied and the filtered completions, which significantly affects the mean and standard deviation of the group rewards. For example, if you mask rewards and keep the top 2 of 8 completions, the rewards in the group would look like `[0., 0., 2., 0., 0., 0., 6., 0.]`, whereas if you actually filter down to the top 2 completions, the rewards in the new group should look like `[2., 6.]`.
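To make the effect on the group statistics concrete, here is a small numeric sketch of the example above, assuming a plain `(reward - mean) / std` normalization with population std (the trainer's exact formula may differ slightly). The masked-but-kept completions still receive non-zero advantages, and the advantages of the kept completions shift as well:

```python
import statistics

masked = [0., 0., 2., 0., 0., 0., 6., 0.]  # rewards masked to zero, all 8 completions kept
filtered = [2., 6.]                        # only the top-2 completions kept

def advantages(rewards):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [round((r - mean) / std, 2) for r in rewards]

print(advantages(masked))    # [-0.5, -0.5, 0.5, -0.5, -0.5, -0.5, 2.5, -0.5]
print(advantages(filtered))  # [-1.0, 1.0]
```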
Ultimately, completions that should have been filtered out in the first place still go through both the forward and backward passes. In effect, you're training the model on all completions (good completions with the right reward, filtered completions with a zero reward), and the relative advantages are not right (see `trl/trainer/grpo_trainer.py`, lines 1560 to 1571 at `cb84da0`).
I believe the core concept of GFPO is to introduce data filtration as a critical new dimension, complementing reward mechanisms for better completion preference alignment. |
What do you think? @qgallouedec |
@LeonEricsson My thoughts don't quite align with @qgallouedec's. Could you please take a look together with us? |
I agree with this (see `trl/trainer/grpo_trainer.py`, line 1441 at `e8b8499`).
However, I believe that we can still implement GFPO as a reward function with minimal changes to the trainer. Note that in practice, assuming a reward function like:

```python
from collections import defaultdict

def reward_length(prompts, completions, **kwargs):
    num_remains_in_group = 2
    # reward_func1 / reward_func2 as in the earlier aggregation example
    rewards = [reward_func1(p, c) + reward_func2(p, c) for p, c in zip(prompts, completions)]

    # Group completion indices by prompt
    groups = defaultdict(list)
    for idx, prompt in enumerate(prompts):
        groups[prompt].append(idx)

    # In each group, keep only the shortest `num_remains_in_group` completions
    for _, indices in groups.items():
        sorted_indices = sorted(indices, key=lambda i: len(completions[i]))
        for i in sorted_indices[num_remains_in_group:]:
            rewards[i] = None
    return rewards
```

Here, filtered samples end up with a reward of `None` (see how the trainer currently handles this in `trl/trainer/grpo_trainer.py`, lines 1045 to 1059 at `e8b8499`),
but I propose we treat this as a GFPO-specific case and filter these samples out during the advantage calculation. A boolean config flag could distinguish intentional GFPO filtering from accidental `None` rewards. @Peter-Chou what do you think? |
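A rough sketch of what that flag-guarded filtering could look like during advantage computation. This is purely illustrative and not the trainer's actual code; it assumes filtered completions show up as `NaN` after reward aggregation and that the proposed flag is called `gfpo_filtering`:

```python
import torch

def compute_group_advantages(rewards: torch.Tensor, group_size: int, gfpo_filtering: bool) -> torch.Tensor:
    # rewards: flat tensor of shape (num_prompts * group_size,), NaN = filtered completion
    grouped = rewards.view(-1, group_size)
    if gfpo_filtering:
        kept = ~torch.isnan(grouped)
        num_kept = kept.sum(-1, keepdim=True)
        # Group statistics over the retained completions only
        mean = torch.where(kept, grouped, 0.0).sum(-1, keepdim=True) / num_kept
        centered = torch.where(kept, grouped - mean, 0.0)
        std = (centered.pow(2).sum(-1, keepdim=True) / num_kept).sqrt()
        # Filtered completions get a zero advantage, so they contribute nothing
        # to the policy-gradient term
        advantages = centered / (std + 1e-4)
    else:
        mean = grouped.mean(-1, keepdim=True)
        std = grouped.std(-1, keepdim=True)
        advantages = (grouped - mean) / (std + 1e-4)
    return advantages.view(-1)

# Example: 1 prompt, 8 completions, only the 3rd and 7th kept (cf. the [2., 6.] case above)
rewards = torch.tensor([float("nan")] * 8)
rewards[2], rewards[6] = 2.0, 6.0
print(compute_group_advantages(rewards, group_size=8, gfpo_filtering=True))
# roughly [0, 0, -1, 0, 0, 0, 1, 0]
```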
@LeonEricsson Thanks for your reply, but I still think implementing the filter by masking rewards inside a reward function is not the ideal approach. For example, if the reward is derived by aggregating the outputs of multiple reward models (some rule-based and some model-based), this approach would not align with the procedure I described earlier.
You would need to maintain the reward model and its tokenizer inside the reward function on each node (and using a reward model when DeepSpeed ZeRO-3 is activated introduces extra complexity). Additionally, the filter function serves as another way to incorporate preferences over completions: it can consider factors beyond just completion length, as noted in the paper. Separating the filter function from the reward function is preferable; I don't think turning the reward function into a Swiss Army knife is a good idea. The purpose of a reward function should be specific and straightforward. |
I see your point; this is a fair argument. Decoupling filtering from the reward functions seems like the sustainable solution. |
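To illustrate the kind of decoupling being argued for, a standalone group filter might look like the sketch below. The name and signature are hypothetical, not part of TRL's API: the filter only decides which completions of a group survive, independently of how rewards are computed.

```python
def length_group_filter(group_completions, group_rewards, num_remains_in_group=2):
    """Return the indices of the completions to keep within one group.

    Filtering here is based purely on completion length, but any criterion
    (e.g. token efficiency) could be used without touching the reward functions.
    """
    ranked = sorted(range(len(group_completions)), key=lambda i: len(group_completions[i]))
    return ranked[:num_remains_in_group]

# Example: keep the two shortest completions of a 4-completion group
kept = length_group_filter(["a" * 30, "a" * 10, "a" * 50, "a" * 20], [1.0, 0.5, 1.0, 0.0])
print(kept)  # [1, 3]
```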
Hey, we now have an experimental submodule. I think GFPO could be added to this submodule for now. See https://huggingface.co/docs/trl/main/en/experimental. Example: #3898 |
@qgallouedec Thank you for your advice. But I wonder: is it really necessary to move the GFPO implementation into the experimental submodule? If GFPO must be placed in `experimental`, should it be implemented as a callback? |
Sorry, it was probably not clear: the suggestion was not about having a callback, but rather about having the code live under `trl/experimental`. I'd recommend something like:

```python
# trl/experimental/gfpo/grpo_config.py
from dataclasses import dataclass, field
from typing import Optional

from ...trainer.grpo_config import GRPOConfig as _GRPOConfig


@dataclass
class GRPOConfig(_GRPOConfig):
    num_remains_in_group: Optional[int] = field(
        default=None,
        metadata={
            "help": "Number of completions kept per group after the group filter function; "
            "`num_remains_in_group` must be >= 2 if given."
        },
    )
```

```python
# trl/experimental/gfpo/grpo_trainer.py
from ...trainer.grpo_trainer import GRPOTrainer as _GRPOTrainer


class GRPOTrainer(_GRPOTrainer):
    def __init__(self, model, reward_funcs, group_filter_func, args, train_dataset, eval_dataset,
                 processing_class, reward_processing_classes, callbacks, optimizers, peft_config):
        super().__init__(model, reward_funcs, args, train_dataset, eval_dataset, processing_class,
                         reward_processing_classes, callbacks, optimizers, peft_config)
        self.group_filter_func = group_filter_func

    def _generate_and_score_completions(self, inputs):
        ...
```
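As a hedged usage sketch, such an experimental subclass could be wired up roughly as below. The import path, the `group_filter_func` signature, and the assumption that the remaining constructor arguments keep their usual `GRPOTrainer` defaults all follow the suggestion above and may differ from the final implementation in this PR:

```python
# Illustrative only: paths and argument names follow the sketch above.
from datasets import load_dataset
from trl.experimental.gfpo import GRPOConfig, GRPOTrainer

def keep_two_shortest(group_completions, group_rewards):
    # GFPO-style group filter: keep the two shortest completions per group
    ranked = sorted(range(len(group_completions)), key=lambda i: len(group_completions[i]))
    return ranked[:2]

def accuracy_reward(completions, **kwargs):
    # Placeholder reward function for the sketch
    return [1.0 if "42" in completion else 0.0 for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")
training_args = GRPOConfig(output_dir="gfpo-demo", num_remains_in_group=2)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    group_filter_func=keep_two_shortest,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```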
I completely agree. The proposed solution requires copying a lot of GRPO code—but, as surprising as it sounds, that’s actually intentional: the experimental submodule is meant to stress-test how customizable TRL really is. |
@qgallouedec I integrated the GFPO implementation into the `trl/experimental` submodule. |
💯 agree, that's why we need to refactor GRPO to avoid this. |
Can you add a short subsection in the docs (section Experimental) as well? |
@qgallouedec Yes, the GFPO documentation has been added to the Experimental section of the docs. |
What does this PR do?
This PR implements GFPO in `GRPOTrainer`, as proposed in the paper Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning.
GFPO aims to train an LLM that produces efficient CoT (Chain of Thought) reasoning without significant performance degradation.