
Proposal: Replace regex in whitespace.rs with manual code for speed improvements #1825

@8ria

Description

Hi Hugging Face Tokenizers Team,

I’ve been exploring the whitespace.rs pre-tokenizer code and noticed that it relies on a regex for splitting tokens by whitespace.

I’d like to propose replacing this regex-based approach with a manual implementation based on byte-level scanning (e.g., memchr or a custom iterator), which could significantly improve performance by cutting regex-engine overhead and allocations. A minimal sketch of the idea follows the list below.

Key points:

  • The manual approach is expected to bring substantial speed improvements, especially on large volumes of text.
  • However, this change may not always produce exactly the same tokenization results as the current regex in edge cases (e.g., Unicode whitespace such as U+00A0 NO-BREAK SPACE, which `char::is_whitespace` treats as whitespace but a byte-level ASCII check would not).
  • This trade-off might be acceptable depending on the use case, or it could be offered as an optional alternative pre-tokenizer mode.
  • I’m happy to prototype and benchmark this approach to demonstrate the performance benefits.
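To make the idea concrete, here is a minimal sketch of what the byte-level scanning could look like. This is a hypothetical illustration, not the current whitespace.rs code; the function name and the offset-pair return type are my own, and it deliberately handles only ASCII whitespace:

```rust
// Hypothetical sketch (not the actual whitespace.rs implementation):
// scan the bytes directly and yield (start, end) byte offsets of
// non-whitespace runs, with no regex engine involved.
fn split_ascii_whitespace_offsets(text: &str) -> Vec<(usize, usize)> {
    let bytes = text.as_bytes();
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in bytes.iter().enumerate() {
        if b.is_ascii_whitespace() {
            if let Some(s) = start.take() {
                spans.push((s, i)); // close the current token
            }
        } else if start.is_none() {
            start = Some(i); // open a new token
        }
    }
    if let Some(s) = start {
        spans.push((s, bytes.len())); // trailing token
    }
    spans
}

fn main() {
    let text = "Hello  world\tfoo";
    assert_eq!(
        split_ascii_whitespace_offsets(text),
        vec![(0, 5), (7, 12), (13, 16)]
    );
}
```

Note that `u8::is_ascii_whitespace` only covers ASCII; a fully Unicode-aware version would need to decode chars and fall back to `char::is_whitespace`, which is exactly where the edge-case divergence mentioned above comes from, and also where some of the speed advantage would be traded away.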

Would you be open to discussing this optimization and its potential impact? I’d love to hear your thoughts on whether it aligns with the goals of the tokenizers repo, and on how you weigh output consistency against speed in this context.

Thanks for your time and for creating such an amazing project!

Best regards,
AndriaK
