
Proposal: Replace regex in whitespace.rs with manual code for speed improvements #1825

@8ria

Description

Hi Hugging Face Tokenizers Team,

I’ve been exploring the whitespace.rs pre-tokenizer code and noticed that it relies on a regex for splitting tokens by whitespace.

I’d like to propose replacing this regex-based approach with a manual implementation based on byte-level scanning (e.g., memchr or a custom iterator), which could significantly improve performance by cutting regex-engine overhead and allocations. A minimal sketch of the idea follows the list below.

Key points:

  • The manual approach is expected to bring substantial speed improvements, especially on large volumes of text.
  • However, this change may not always produce exactly the same tokenization results as the current regex in edge cases (e.g., Unicode whitespace such as U+00A0 NO-BREAK SPACE, which `char::is_whitespace` treats as whitespace but a byte-level ASCII check would not).
  • This trade-off might be acceptable depending on the use case, or it could be offered as an optional alternative pre-tokenizer mode.
  • I’m happy to prototype and benchmark this approach to demonstrate the performance benefits.
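To make the idea concrete, here is a minimal sketch of what the byte-level scanning could look like. This is a hypothetical illustration, not the current whitespace.rs code; the function name and the offset-pair return type are my own, and it deliberately handles only ASCII whitespace:

```rust
// Hypothetical sketch (not the actual whitespace.rs implementation):
// scan the bytes directly and yield (start, end) byte offsets of
// non-whitespace runs, with no regex engine involved.
fn split_ascii_whitespace_offsets(text: &str) -> Vec<(usize, usize)> {
    let bytes = text.as_bytes();
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in bytes.iter().enumerate() {
        if b.is_ascii_whitespace() {
            if let Some(s) = start.take() {
                spans.push((s, i)); // close the current token
            }
        } else if start.is_none() {
            start = Some(i); // open a new token
        }
    }
    if let Some(s) = start {
        spans.push((s, bytes.len())); // trailing token
    }
    spans
}

fn main() {
    let text = "Hello  world\tfoo";
    assert_eq!(
        split_ascii_whitespace_offsets(text),
        vec![(0, 5), (7, 12), (13, 16)]
    );
}
```

Note that `u8::is_ascii_whitespace` only covers ASCII; a fully Unicode-aware version would need to decode chars and fall back to `char::is_whitespace`, which is exactly where the edge-case divergence mentioned above comes from, and also where some of the speed advantage would be traded away.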

Would you be open to discussing this optimization and its potential impact? I’d love to hear your thoughts on whether it aligns with the goals of the tokenizers repo, and on how you weigh output consistency against speed in this context.

Thanks for your time and for creating such an amazing project!

Best regards,
AndriaK
