Skip to content

Conversation

ngoldbaum
Copy link

I'm opening this to see what the maintainer's appetite is for this sort of change.

I'm particularly curious if the maintainers are OK with the change to migrate from __getstate__ and __setstate__ to use __getnewargs__. The latter doesn't require a &mut self reference. I had to add an optional bytes argument to __new__, which seemed like it would be fine but I guess technically it is a public API change.

See https://pyo3.rs/v0.26.0/class.html#frozen-classes-opting-out-of-interior-mutability for more about this. In the future it's likely that PyO3 will make frozen pyclasses the default.

From the perspective of multithreaded use and the free-threaded build, this change makes it impossible for people to see runtime borrow-checker errors if two threads race to mutate and read the state of a PyEncoder, or a single thread tries to mutably borrow a PyEncoder twice. The latter is no longer a problem because it's no longer possible to mutably borrow PyEncoder. It's still possible to mutate the internal state, but all state is synchronized (in the sense of Send and Sync) by the mutex.

@Narsil
Copy link
Collaborator

Narsil commented Sep 4, 2025

Not particularly fan of adding a Mutex. That's basically negating the interest of having multithreaded code in the first place.

Yes that makes things "threadsafe", but if they are not usable in parallel, that doesn't make them that interesting does it ? Given the size of a tokenizer, I think that making them explicitly not threadsafe (compile time in Rust, runtime in Python) is a much better solution that making it work with a mutex.

Now I know there are already many mutexes in that library, but imho that's no reason to add even more. This library should be responsible for the parallelization with Rayon, which imho does a much better job at parallelization than even freethreaded Python. If users want to be in control in Python, they should have duplicates of the tokenizers in each thread to avoid locks.

tl;dr: frozen classes: great. Mutex: not great.

@ngoldbaum
Copy link
Author

tl;dr: frozen classes: great. Mutex: not great.

Thanks, this feedback is really useful! There are a few pyclasses in tokenizers where I can trivially mark them as frozen, I'll send in a PR for that.

I think there are two paths forward for the classes that rely on mutation:

  • Document that shared mutation might lead to runtime borrow-checker errors. Also add tests using the Python threading module for supported workflows. This might look like a pattern where you build an object on one thread and then share it between others to do a computation.

  • Add new APIs that follow the builder pattern. For example, you could add new PyEncoderBuilder and PyEncoderFinal structs - the PyEncoderBuilder would allow be a mutable pyclass and explicitly not support multiple threads building a shared encoder (enforced by runtime borrow-checking) and then the PyEncoderFinal struct would be a frozen pyclass. In the future, you could deprecate PyEncoder to avoid all the complexity of supporting both approaches, but there would be an interim period where you'd have some additional complexity in the implementation. I'm not sure if there's any appetite for that kind of API change.

I'm going to close this PR and work on testing and documentation instead of focusing on the tokenizers implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants