feat: allow BPETrainer to be seeded with a set of initial tokens #1862
This PR introduces an optional parameter `initial_tokens` (or `seed_tokens`) to the BPETrainer class. The basic idea is that these tokens act as "seeds": the corresponding tokens and merges are added before the main training starts, ensuring that all of the initial tokens are present. Training then proceeds, creating merges as usual (including merges involving these initial tokens). Essentially this just "jump starts" the process with a desired set of tokens, ensuring they never get broken down in undesirable ways.

The motivation for this is that we have been building custom tokenizers for working with assembly code, where we have very structured sequences of instructions. It is desirable for us to be able to merge sub-instructions into single tokens, so our general approach is to normalize instructions, e.g. `"mov r14, rdi"` might be normalized to `"mov[SP]r14,[SP]rdi<EOI>"`.
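As a rough sketch, a normalization step along these lines (using just the markers from the example above) could be:

```python
def normalize_instruction(instr: str) -> str:
    # Replace spaces with an explicit separator marker and append an
    # end-of-instruction marker, so whitespace never drives tokenization.
    return instr.replace(" ", "[SP]") + "<EOI>"

assert normalize_instruction("mov r14, rdi") == "mov[SP]r14,[SP]rdi<EOI>"
```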
This is largely fine, but we might also sometimes split on certain values to ensure tokens don't merge over them (see the example below). What we have noticed is that an "unlucky" sequence of merges can lead to very undesirable behaviour, with the base tokens themselves being split apart. Here is a motivating demo script:
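(The script is shown here as a simplified, illustrative sketch using the Python `BpeTrainer` bindings: the corpus strings and trainer settings are stand-ins, and the marker spellings follow the `initial_tokens` example further down.)

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# An artificial corpus of "normalized" instructions: spaces replaced by a
# separator marker, an end-of-instruction marker appended, and literal
# values wrapped in <value>...</value> so merges shouldn't cross them.
corpus = [
    "mov<//sp>r14,<//sp>rdi<eoi>",
    "mov<//sp>rax,<//sp><value>0x10</value><eoi>",
    "add<//sp>rax,<//sp><value>0x8</value><eoi>",
    "call<//sp>rbx<eoi>",
] * 100

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=64, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# With an unlucky merge order, markers such as "<//sp>" or "<eoi>" can end up
# only partially merged, so encodings break them into fragments of their own.
print(tokenizer.encode("call<//sp>rbx<eoi>").tokens)
```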
There's nothing unexpected here (and of course I've constructed an artificial/unrealistic training corpus just for demonstration purposes); it's just undesirable to split these base tokens up in this way. (In the actual corpus we trained on, we noticed some rare cases where an unlucky merge order meant that `"<sp//><eoi>"` was, in some contexts, being completely broken down into individual tokens.)

If we run with our new option, `initial_tokens=["<//sp>", "<eoi>", "<value>", "</value>"]`, none of the seed tokens get broken apart in the resulting vocabulary.
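Concretely, the only change to the sketch above is the trainer construction (the keyword name here is the option this PR proposes; everything else is unchanged):

```python
from tokenizers.trainers import BpeTrainer

# Continuing the sketch above: only the trainer changes. The seed tokens (and
# the merges needed to build them) are added before the main training starts.
trainer = BpeTrainer(
    vocab_size=64,
    special_tokens=["[UNK]"],
    initial_tokens=["<//sp>", "<eoi>", "<value>", "</value>"],  # option proposed in this PR
)
tokenizer.train_from_iterator(corpus, trainer=trainer)
print(tokenizer.encode("call<//sp>rbx<eoi>").tokens)
```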
So we still allow merges between the initial tokens, but remove any possibility of getting very undesirable merges. It also makes training significantly faster for us, since the corpus is effectively initialized with a load of merges that would have been expensive to find from scratch.
I realize this is quite a niche use case, but we've tested it fairly thoroughly and it shouldn't interfere with any existing functionality. If you are open to merging this, it would obviously be useful from our POV not to have to maintain a separate fork in the future, and I am happy to make any changes or add some detailed tests if necessary.