Mark PyEncoder as frozen and use interior mutability #1860

ngoldbaum · 2025-09-02T20:40:48Z

I'm opening this to see what the maintainer's appetite is for this sort of change.

I'm particularly curious if the maintainers are OK with the change to migrate from __getstate__ and __setstate__ to use __getnewargs__. The latter doesn't require a &mut self reference. I had to add an optional bytes argument to __new__, which seemed like it would be fine but I guess technically it is a public API change.

See https://pyo3.rs/v0.26.0/class.html#frozen-classes-opting-out-of-interior-mutability for more about this. In the future it's likely that PyO3 will make frozen pyclasses the default.

From the perspective of multithreaded use and the free-threaded build, this change makes it impossible for people to see runtime borrow-checker errors if two threads race to mutate and read the state of a PyEncoder, or a single thread tries to mutably borrow a PyEncoder twice. The latter is no longer a problem because it's no longer possible to mutably borrow PyEncoder. It's still possible to mutate the internal state, but all state is synchronized (in the sense of Send and Sync) by the mutex.

Narsil · 2025-09-04T13:35:32Z

Not particularly fan of adding a Mutex. That's basically negating the interest of having multithreaded code in the first place.

Yes that makes things "threadsafe", but if they are not usable in parallel, that doesn't make them that interesting does it ? Given the size of a tokenizer, I think that making them explicitly not threadsafe (compile time in Rust, runtime in Python) is a much better solution that making it work with a mutex.

Now I know there are already many mutexes in that library, but imho that's no reason to add even more. This library should be responsible for the parallelization with Rayon, which imho does a much better job at parallelization than even freethreaded Python. If users want to be in control in Python, they should have duplicates of the tokenizers in each thread to avoid locks.

tl;dr: frozen classes: great. Mutex: not great.

ngoldbaum · 2025-09-04T15:47:02Z

tl;dr: frozen classes: great. Mutex: not great.

Thanks, this feedback is really useful! There are a few pyclasses in tokenizers where I can trivially mark them as frozen, I'll send in a PR for that.

I think there are two paths forward for the classes that rely on mutation:

Document that shared mutation might lead to runtime borrow-checker errors. Also add tests using the Python threading module for supported workflows. This might look like a pattern where you build an object on one thread and then share it between others to do a computation.
- We might be able to make the user experience here nicer in PyO3 by updating the runtime borrow checking scheme so it can tell the difference between a single thread borrowing a value twice and two threads racing. See Explore whether we can generate more friendly runtime borrow-checker errors PyO3/pyo3#5412.
Add new APIs that follow the builder pattern. For example, you could add new PyEncoderBuilder and PyEncoderFinal structs - the PyEncoderBuilder would allow be a mutable pyclass and explicitly not support multiple threads building a shared encoder (enforced by runtime borrow-checking) and then the PyEncoderFinal struct would be a frozen pyclass. In the future, you could deprecate PyEncoder to avoid all the complexity of supporting both approaches, but there would be an interim period where you'd have some additional complexity in the implementation. I'm not sure if there's any appetite for that kind of API change.

I'm going to close this PR and work on testing and documentation instead of focusing on the tokenizers implementation.

ngoldbaum added 2 commits September 2, 2025 14:24

Replace __getstate__ and __setstate__ with __getnewargs__ in PyEncoder

53a8267

Mark PyEncoder class as frozen and hide state behind a Mutex

8187350

ngoldbaum mentioned this pull request Sep 2, 2025

Add 3.13t CI using pytest-run-parallel #1809

Closed

fix clippy

9f312bb

ngoldbaum closed this Sep 4, 2025

ngoldbaum mentioned this pull request Sep 5, 2025

Mark immutable pyclasses as frozen #1861

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mark PyEncoder as frozen and use interior mutability #1860

Mark PyEncoder as frozen and use interior mutability #1860

Uh oh!

ngoldbaum commented Sep 2, 2025

Uh oh!

Narsil commented Sep 4, 2025

Uh oh!

ngoldbaum commented Sep 4, 2025

Uh oh!

Uh oh!

Mark PyEncoder as frozen and use interior mutability #1860

Mark PyEncoder as frozen and use interior mutability #1860

Uh oh!

Conversation

ngoldbaum commented Sep 2, 2025

Uh oh!

Narsil commented Sep 4, 2025

Uh oh!

ngoldbaum commented Sep 4, 2025

Uh oh!

Uh oh!