@henrycharlesworth henrycharlesworth commented Sep 6, 2025

This PR introduces an optional parameter, initial_tokens (or seed_tokens), to the BpeTrainer class. The basic idea is that these tokens act as "seeds": the corresponding tokens/merges are added before the main training starts, ensuring that all of the initial tokens are present in the vocabulary. Training then proceeds, creating merges as usual (including merges involving these initial tokens). Essentially this just "jump-starts" the process with a desired set of tokens, ensuring they never get broken down in undesirable ways.
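To make the seeding idea concrete, here is a simplified Python sketch of what happens conceptually (illustration only; the actual implementation lives in the Rust trainer, and seed_merges is just a made-up name for this write-up):

# Simplified sketch of the seeding step (illustration only, not the actual Rust code):
# each initial token is decomposed into characters and a left-to-right chain of merges
# is emitted, so the complete token exists before normal training begins.
def seed_merges(initial_tokens):
    merges, vocab = [], set()
    for token in initial_tokens:
        pieces = list(token)            # start from single characters
        vocab.update(pieces)
        while len(pieces) > 1:
            pair = (pieces[0], pieces[1])
            if pair not in merges:
                merges.append(pair)
            merged = pair[0] + pair[1]
            vocab.add(merged)
            pieces = [merged] + pieces[2:]
    return merges, vocab

print(seed_merges(["<eoi>"])[0])
# [('<', 'e'), ('<e', 'o'), ('<eo', 'i'), ('<eoi', '>')]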

The motivation for this is that we have been building custom tokenizers for working with assembly code, where we have very structured sequences of instructions. It is desirable for us to be able to merge sub-instructions into single tokens, and so our general approach is to normalize instructions, e.g.:
"mov r14, rdi" might be normalized to "mov[SP]r14,[SP]rdi<EOI>"
This is largely fine, but we also sometimes split on certain values to ensure tokens don't merge across them (see the example below). What we have noticed is that an "unlucky" sequence of merges can lead to very undesirable behaviour. Here is a motivating demo script:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

if __name__ == "__main__":
    tokenizer = Tokenizer(models.BPE())
    # Place split boundaries immediately after "<value>" and immediately before "</value>",
    # so the content between them becomes its own pre-token and merges never cross those boundaries.
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.Split("<value>", behavior="merged_with_previous"),
        pre_tokenizers.Split("</value>", behavior="merged_with_next"),
    ])

    trainer = trainers.BpeTrainer(
        vocab_size=100,
        min_frequency=2,
        show_progress=False
    )

    # Artificial corpus constructed purely to force an "unlucky" merge order.
    training_data = [
        "rax<//sp><value>3</value>]<//sp><eoi>",
        "</value>",
        "</value>",
        "//sp>,",
        "//sp>,",
        "//sp>,",
        "//sp>,",
        "//sp>,",
        "//sp>,",
        "]<//",
        "]<//",
        "]<//",
        "]<//",
        "]<//",
        ",<//sp><eoi>",
        ",<//sp><eoi>"
        "rax<//sp><value>3</value>"
    ]
    tokenizer.train_from_iterator(training_data, trainer=trainer)

    out = tokenizer.encode("rax<//sp><value>3</value>]<//sp><eoi>")
    print(out.tokens)

which gives us:

['rax<//sp><value>', '3', '</value>', ']<', '//sp><', 'eoi>']

There's nothing unexpected here (and of course I've constructed an artificial/unrealistic training corpus purely for demonstration purposes); it's just undesirable to split these base tokens up in this way. In the actual corpus we trained on, we noticed some rare cases where an unlucky merge order meant that "<//sp><eoi>" in some contexts was being completely broken down into individual tokens.

If we run with our new option:
initial_tokens=["<//sp>", "<eoi>", "<value>", "</value>"]
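i.e. the only change to the demo script above is the trainer construction (assuming the parameter is exposed as a keyword argument on the Python bindings):

trainer = trainers.BpeTrainer(
    vocab_size=100,
    min_frequency=2,
    show_progress=False,
    initial_tokens=["<//sp>", "<eoi>", "<value>", "</value>"]
)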

We get:

['rax<//sp><value>', '3', '</value>', ']', '<//sp><eoi>']

So we still allow merges involving the initial tokens, but remove any possibility of getting very undesirable merges. It also makes training significantly faster for us, since training effectively starts with a set of merges that would have been expensive to discover from scratch.

I realize this is quite a niche use case, but we've tested it fairly thoroughly and it shouldn't interfere with any existing functionality. If you are open to merging this, it would save us from maintaining a separate fork in the future, and I am happy to make any changes / add some detailed tests if necessary.

@henrycharlesworth henrycharlesworth changed the title allow BPETrainer to be seeded with a set of initial tokens feat: allow BPETrainer to be seeded with a set of initial tokens Sep 6, 2025