Hi Hugging Face Tokenizers Team,
I’ve been exploring the whitespace.rs
pre-tokenizer code and noticed that it relies on a regex to split tokens on whitespace.
I’d like to propose replacing this regex-based approach with a manual implementation using byte-level scanning (e.g., memchr
or a custom iterator), which could significantly improve performance by cutting regex-engine overhead and allocations.
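To make the idea concrete, here is a minimal, self-contained sketch of the byte-level approach (the name split_whitespace_bytes is purely illustrative and not tied to the tokenizers internals): it checks ASCII bytes directly and only decodes a full char for non-ASCII input, so Unicode whitespace is still recognized without a regex engine. A memchr-/SIMD-style scan or a 256-entry lookup table could replace the per-byte check for further speed.

```rust
/// Illustrative sketch: split a string on whitespace by scanning bytes,
/// falling back to full char decoding only for non-ASCII sequences.
fn split_whitespace_bytes(text: &str) -> Vec<&str> {
    let bytes = text.as_bytes();
    let mut pieces = Vec::new();
    let mut start: Option<usize> = None;
    let mut i = 0;
    while i < bytes.len() {
        // Fast path: ASCII bytes are classified directly.
        let (is_ws, len) = if bytes[i].is_ascii() {
            (bytes[i].is_ascii_whitespace(), 1)
        } else {
            // Slow path: decode one char to handle Unicode whitespace (e.g. U+00A0).
            let ch = text[i..].chars().next().unwrap();
            (ch.is_whitespace(), ch.len_utf8())
        };
        if is_ws {
            if let Some(s) = start.take() {
                pieces.push(&text[s..i]);
            }
        } else if start.is_none() {
            start = Some(i);
        }
        i += len;
    }
    if let Some(s) = start {
        pieces.push(&text[s..]);
    }
    pieces
}

fn main() {
    // Non-breaking space (U+00A0) is treated as whitespace, like the regex path would.
    assert_eq!(
        split_whitespace_bytes("Hello  world\tfoo\u{00A0}bar"),
        vec!["Hello", "world", "foo", "bar"]
    );
    println!("ok");
}
```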
Key points:
- The manual approach is expected to bring substantial speed improvements, especially on large volumes of text.
- However, this change may not reproduce the current regex’s output exactly in edge cases (e.g., certain Unicode whitespace or unusual characters).
- This trade-off might be acceptable depending on the use case, or it could be offered as an optional alternative pre-tokenizer mode.
- I’m happy to prototype and benchmark this approach to demonstrate the performance benefits; a rough benchmark sketch follows this list.
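As a starting point for the benchmark, something like the Criterion setup below could compare the two paths. The file name, the `\S+` pattern, and the use of the `regex` crate are stand-ins (whitespace.rs uses its own pattern and regex backend), and std’s `split_whitespace` stands in for the hand-rolled scanner sketched above.

```rust
// benches/whitespace_split.rs -- hypothetical Criterion benchmark layout.
use criterion::{criterion_group, criterion_main, Criterion};
use regex::Regex;
use std::hint::black_box;

fn bench_whitespace_split(c: &mut Criterion) {
    // Synthetic input: ~450 KB of ASCII text with regular whitespace.
    let text = "The quick brown fox jumps over the lazy dog. ".repeat(10_000);
    let re = Regex::new(r"\S+").unwrap();

    // Baseline: regex-driven token extraction.
    c.bench_function("regex_find_iter", |b| {
        b.iter(|| re.find_iter(black_box(text.as_str())).count())
    });

    // Candidate: manual byte/char scanning (std's split_whitespace as a proxy
    // for the custom scanner above).
    c.bench_function("manual_byte_scan", |b| {
        b.iter(|| black_box(text.as_str()).split_whitespace().count())
    });
}

criterion_group!(benches, bench_whitespace_split);
criterion_main!(benches);
```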
Would you be open to discussing this optimization and its potential impact? I’d love to hear whether it aligns with the goals of the tokenizers
repo and how you weigh output consistency versus speed in this context.
Thanks for your time and for creating such an amazing project!
Best regards,
AndriaK