oko256 (Contributor) commented Sep 8, 2025

Issue #2300 describes a randomly occurring crash in zpoller. I investigated it and have only seen it happen on Windows, but the underlying cause could easily trigger it on other platforms too.

When zsock_resolve() tries to resolve the type of the passed pointer, it first checks against zactor and zsock, which works fine because the tag is the first member of those structs and small enough. But that's not necessarily the case with a ZMQ socket object. When a pointer to a native SOCKET is passed to the resolve function, zmq_getsockopt() tries to check the tag by accessing memory outside the variable that holds the native socket number, which (very often) causes an access violation on Windows.

The solution I came up with is to reorder the checks so that the native socket check runs before the zmq_getsockopt() call, which seems to work. Any ideas if this could break some edge case?
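The reordering can be illustrated with a small POSIX-only sketch. It is not the actual CZMQ code: `is_native_socket`, `stub_zmq_probe`, and `resolve_sketch` are hypothetical names, the probe is a stub standing in for the `zmq_getsockopt()` call, and returning `NULL` for a native descriptor is an assumption of the sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

//  Hypothetical helper: true if fd is an open native socket. Only the
//  descriptor value is handed to the OS; no memory behind it is read.
static bool
is_native_socket (int fd)
{
    int sock_type = -1;
    socklen_t len = sizeof (sock_type);
    return getsockopt (fd, SOL_SOCKET, SO_TYPE, &sock_type, &len) == 0;
}

//  Hypothetical stand-in for the zmq_getsockopt() probe. It only counts
//  calls so the ordering can be observed; it never dereferences self.
static int stub_probe_calls = 0;

static bool
stub_zmq_probe (void *self)
{
    (void) self;
    stub_probe_calls++;
    return false;
}

//  Sketch of the reordered resolve: native socket check first, then the
//  probe that would read deeper into the pointed-to object.
static void *
resolve_sketch (void *self)
{
    assert (self);
    //  Reads sizeof (int) bytes, which fit inside a descriptor variable
    if (is_native_socket (*(int *) self))
        return NULL;            //  Native socket (assumed convention)
    if (stub_zmq_probe (self))
        return self;            //  Would be treated as a ZMQ socket
    return self;                //  Something else, returned as-is
}
```

With a pointer to an open socket descriptor, `resolve_sketch` returns before the probe runs, so nothing beyond the descriptor variable is touched. One property of this ordering worth keeping in mind: any object whose first bytes happen to equal an open descriptor number would be classified as a native socket in this sketch. Whether that can occur for real ZMQ socket objects is not established here.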

The other commit is a small fix to the ztimerset selftest, which also failed every now and then on Windows. This is because Windows Sleep() typically has a resolution of only 16 ms, and the test slept for exactly the time remaining until the timer expired. I just gave it some margin to sleep long enough.
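The idea of sleeping past the deadline, not exactly up to it, can be sketched as follows. This is a POSIX illustration, not the selftest change itself: `mono_ms` and `sleep_with_margin` are hypothetical helpers, and the margin value is left to the caller because the value used in the commit is not stated here.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

//  Monotonic clock reading in milliseconds
static int64_t
mono_ms (void)
{
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return (int64_t) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

//  Hypothetical helper: sleep for the time remaining until a timer
//  expires plus a safety margin, so a coarse sleep resolution cannot
//  wake the caller before the deadline has passed.
static void
sleep_with_margin (int remaining_ms, int margin_ms)
{
    int total_ms = remaining_ms + margin_ms;
    if (total_ms <= 0)
        return;

    struct timespec req;
    req.tv_sec = total_ms / 1000;
    req.tv_nsec = (long) (total_ms % 1000) * 1000000L;

    //  Resume with the unslept remainder if a signal interrupts the sleep
    while (nanosleep (&req, &req) == -1 && errno == EINTR)
        ;
}
```

A test would then call something like `sleep_with_margin (timeout, margin)` before asserting that the timer has fired, where `margin` is chosen to be at least the worst sleep resolution expected on the platform.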

Especially on Windows (at least on builds using MSVC), `zsock_resolve`
randomly (but quite often) segfaults if the passed pointer actually
points to a native `SOCKET` instead of a zactor, zsock, or ZMQ socket.
This happens because `zmq_getsockopt` checks whether, after casting,
the pointer points to an object with a specific tag (`check_tag()`).
Unfortunately, reading this area of memory is often an access violation
if the pointer actually points to a `SOCKET` (which is a smaller data
type).

Checking whether the pointer refers to a native socket before checking
whether it is a ZMQ socket seems to fix this, so these checks were
reordered.

Resolves zeromq#2300

On Windows, the resolution of the `Sleep()` function (which is used by
`zclock_sleep()`) can typically be as coarse as 16 ms. This caused the
selftest to fail randomly because it waited exactly the time remaining
until the timer expired before asserting the result. Made this more
reliable by adding some margin to the sleep.

oko256 (Contributor, Author) commented Sep 9, 2025

Likely resolves #2298 as well.
