@henrycharlesworth henrycharlesworth commented Sep 6, 2025

This PR introduces an optional parameter, initial_tokens (or seed_tokens), to the BpeTrainer class. The basic idea is that these tokens act as "seeds": the corresponding tokens/merges are added before the main training starts, ensuring that all of the initial tokens are present in the vocabulary. Training then proceeds, creating merges as usual (including merges involving these initial tokens). Essentially this just "jump-starts" the process with a desired set of tokens, ensuring they never get broken down in undesirable ways.
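To make the seeding idea concrete, here is a simplified Python sketch of what happens conceptually (illustration only; the actual implementation lives in the Rust trainer, and seed_merges is just a made-up name for this write-up):

# Simplified sketch of the seeding step (illustration only, not the actual Rust code):
# each initial token is decomposed into characters and a left-to-right chain of merges
# is emitted, so the complete token exists before normal training begins.
def seed_merges(initial_tokens):
    merges, vocab = [], set()
    for token in initial_tokens:
        pieces = list(token)            # start from single characters
        vocab.update(pieces)
        while len(pieces) > 1:
            pair = (pieces[0], pieces[1])
            if pair not in merges:
                merges.append(pair)
            merged = pair[0] + pair[1]
            vocab.add(merged)
            pieces = [merged] + pieces[2:]
    return merges, vocab

print(seed_merges(["<eoi>"])[0])
# [('<', 'e'), ('<e', 'o'), ('<eo', 'i'), ('<eoi', '>')]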

The motivation for this is that we have been building custom tokenizers for working with assembly code, where we have very structured sequences of instructions. It is desirable for us to be able to merge sub-instructions into single tokens, and so our general approach is to normalize instructions, e.g.:
"mov r14, rdi" might be normalized to "mov[SP]r14,[SP]rdi<EOI>"
This is largely fine, but we also sometimes split on certain values to ensure tokens don't merge across them (see the example below). What we have noticed is that an "unlucky" sequence of merges can lead to very undesirable behaviour. Here is a motivating demo script:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

if __name__ == "__main__":
    tokenizer = Tokenizer(models.BPE())
    # Place split boundaries immediately after "<value>" and immediately before "</value>",
    # so the content between them becomes its own pre-token and merges never cross those boundaries.
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.Split("<value>", behavior="merged_with_previous"),
        pre_tokenizers.Split("</value>", behavior="merged_with_next"),
    ])

    trainer = trainers.BpeTrainer(
        vocab_size=100,
        min_frequency=2,
        show_progress=False
    )

    # Artificial corpus constructed purely to force an "unlucky" merge order.
    training_data = [
        "rax<//sp><value>3</value>]<//sp><eoi>",
        "</value>",
        "</value>",
        "//sp>,",
        "//sp>,",
        "//sp>,",
        "//sp>,",
        "//sp>,",
        "//sp>,",
        "]<//",
        "]<//",
        "]<//",
        "]<//",
        "]<//",
        ",<//sp><eoi>",
        ",<//sp><eoi>"
        "rax<//sp><value>3</value>"
    ]
    tokenizer.train_from_iterator(training_data, trainer=trainer)

    out = tokenizer.encode("rax<//sp><value>3</value>]<//sp><eoi>")
    print(out.tokens)

which gives us:

['rax<//sp><value>', '3', '</value>', ']<', '//sp><', 'eoi>']

There's nothing unexpected here (and of course I've constructed an artificial/unrealistic training corpus purely for demonstration purposes); it's just undesirable to split these base tokens up in this way. In the actual corpus we trained on, we noticed some rare cases where an unlucky merge order meant that "<//sp><eoi>" in some contexts was being completely broken down into individual tokens.

If we run with our new option:
initial_tokens=["<//sp>", "<eoi>", "<value>", "</value>"]
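i.e. the only change to the demo script above is the trainer construction (assuming the parameter is exposed as a keyword argument on the Python bindings):

trainer = trainers.BpeTrainer(
    vocab_size=100,
    min_frequency=2,
    show_progress=False,
    initial_tokens=["<//sp>", "<eoi>", "<value>", "</value>"]
)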

We get:

['rax<//sp><value>', '3', '</value>', ']', '<//sp><eoi>']

So we still allow merges involving the initial tokens, but remove any possibility of getting very undesirable merges. It also makes training significantly faster for us, since training effectively starts with a set of merges that would have been expensive to discover from scratch.

I realize this is quite a niche use case, but we've tested it fairly thoroughly and it shouldn't interfere with any existing functionality. If you are open to merging this, it would save us from maintaining a separate fork in the future, and I am happy to make any changes / add some detailed tests if necessary.

@henrycharlesworth henrycharlesworth changed the title allow BPETrainer to be seeded with a set of initial tokens feat: allow BPETrainer to be seeded with a set of initial tokens Sep 6, 2025