Root causing our "weird" crash logs

This is not the first time I've been determined to get to the bottom of why we always have some "weird" crash reports, and it probably won't be the last, but I'd like to discuss it again anyway.

Tagging @asmagill for his excellent thoughts.

Here are the "weird" crash reports received thus far for 0.9.84:

- https://sentry.io/organizations/hammerspoon/issues/2264596621/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2264749660/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2264676679/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2263890376/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2264076080/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2263909041/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2252842063/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2262301597/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2260433876/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d
- https://sentry.io/organizations/hammerspoon/issues/2254695465/?project=5220516&query=is%3Aunresolved+release%3Alatest&statsPeriod=14d

Here are some evidence points I have so far:

1. They all start with an OS callback that leads to us calling `protectedCallAndTraceback`. There are a couple of others that happen during the first load of `init.lua`, but we'll ignore those for now.
2. Generally speaking, they all involve pointers not being what Lua is expecting (either NULL or garbage), which strongly suggests (imo) that this is some kind of lifecycle issue, i.e. something is holding a reference to an object that has since been deallocated.
3. We don't get very many of these - in the week since 0.9.84 was released these particular crashes only number around 20, out of thousands of user sessions. Sentry doesn't break down a ton of detail here (in part because we deliberately don't enable active session tracking), but they report well over 99% of sessions are crash free. Still, it annoys me to know that these crashes all hint at some kind of lifecycle bug we haven't found yet.
4. Looking at the backtraces, they mostly fall into either hs.timer or hs.eventtap callbacks, but there are others, which suggests to me that it's not a specific bug in either of those modules, but rather a systemic flaw.
5. We have protections against (at least):
    * trying to call Lua off the main thread
    * lua_State having been replaced since the callback was set up, at least in hs.timer (via the relatively new checkLuaSkinInstance in 0.9.83)
    * Lua stack overgrowth via the `_lua_stackguard` macros
6. Looking at the breadcrumb logs from the crashes, the crash doesn't generally happen immediately after a LuaSkin reload.
7. Because they all start from C, they call `[LuaSkin sharedWithState:NULL]`, which guarantees them the main Lua state, so they're not coming in on a co-routine state (this kind of crash report predates co-routines anyway AFAIK, but we don't have historical data to back that up). This would also be covered by the LuaSkin UUID check, at least in hs.timer.
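
To make the kind of guard described in points 5 and 7 concrete, here is a minimal sketch in plain C. It is not LuaSkin's code: `FakeSkin`, `CallbackRecord`, `guarded_call` and the other names are hypothetical stand-ins, the "stack" is just an integer, and the main-thread check is left out. It only illustrates the two ideas of refusing to fire a callback that was registered against an older Lua instance, and noticing when a callback leaves the stack unbalanced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared Lua environment (not LuaSkin's real type). */
typedef struct {
    unsigned long instance_id; /* changes every time the environment is rebuilt */
    int stack_top;             /* stand-in for what lua_gettop() would report */
} FakeSkin;

/* Hypothetical per-callback record, captured when the callback is registered. */
typedef struct {
    unsigned long registered_instance_id; /* instance the callback belongs to */
    int callback_ref;                     /* stand-in for a registry ref; < 0 means unset */
} CallbackRecord;

typedef enum {
    CALL_OK,
    CALL_SKIPPED_STALE, /* environment was replaced since registration */
    CALL_SKIPPED_NOREF, /* no callback reference stored */
    CALL_STACK_LEAK     /* callback left the stack unbalanced */
} CallResult;

typedef void (*CallbackBody)(FakeSkin *skin);

/* Remember which instance the callback was created against. */
CallbackRecord register_callback(const FakeSkin *skin, int ref) {
    CallbackRecord rec = { skin->instance_id, ref };
    return rec;
}

/* Simulate a reload: new instance id, fresh stack. */
void reload_skin(FakeSkin *skin) {
    skin->instance_id++;
    skin->stack_top = 0;
}

/* Run the callback only if it still belongs to the current instance,
 * then verify that the stack height is unchanged afterwards. */
CallResult guarded_call(FakeSkin *skin, const CallbackRecord *rec, CallbackBody body) {
    assert(skin != NULL && rec != NULL && body != NULL);
    if (rec->registered_instance_id != skin->instance_id) return CALL_SKIPPED_STALE;
    if (rec->callback_ref < 0) return CALL_SKIPPED_NOREF;

    int top_before = skin->stack_top;
    body(skin);
    if (skin->stack_top != top_before) {
        skin->stack_top = top_before; /* restore so later calls start clean */
        return CALL_STACK_LEAK;
    }
    return CALL_OK;
}

/* Two example bodies: one balanced, one that leaves an extra value behind. */
void balanced_body(FakeSkin *skin) { skin->stack_top += 2; skin->stack_top -= 2; }
void leaky_body(FakeSkin *skin)    { skin->stack_top += 1; }
```

Note what this does not catch: a callback that passes both checks but dereferences a pointer owned by some other object that has already gone away. That gap is consistent with the crashes above not clustering around reloads.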

So, it seems relatively safe to say that these aren't happening when Lua is being torn down, and their relative consistency makes me suspect they're not stack/heap corruption, although I can't rule that out. I have some hypotheses:

1.  They are happening at some transition period between a co-routine lua_State and the main lua_State (I have no idea how this works under the hood, maybe it's not possible)
2.  Somewhere we're running the Objective-C event loop in one of our modules, causing it to advance and process incoming events while we're in the middle of a Lua->C call, and the Lua state gets confused
3.  We just have one or more memory corruption bugs in some C code somewhere
4.  We have a *very* subtle Lua API interaction bug
5.  We're hitting genuine bugs in Lua (seems unlikely that they would persist for this long without other, much fancier users of Lua noticing)

My bet is on 2, 3 or 4, but I don't have any evidence to support that yet.
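
For anyone who wants hypothesis 2 spelled out, here is a toy model in plain C of how a nested event loop could bite us. Everything in it is hypothetical (`Loop`, `loop_run_pending`, the two handlers); it does not use the real run loop or Lua, and it only flags the hazard with a boolean instead of actually touching freed memory. The point is the ordering: a handler that spins the loop partway through its own work lets a second handler run and invalidate state the first one still assumes is valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS 8

typedef struct Loop Loop;
typedef void (*Handler)(Loop *loop);

/* Hypothetical toy event loop; not a model of any real API. */
struct Loop {
    Handler queue[MAX_EVENTS];
    size_t head, tail;        /* pending events live in queue[head..tail) */
    int depth;                /* number of handler frames currently active */
    int max_depth;            /* deepest nesting seen so far */
    bool timer_alive;         /* stand-in for an object a callback points at */
    bool touched_dead_object; /* set when a handler resumes with stale state */
};

/* Queue a handler to run on a later dispatch. */
void loop_post(Loop *loop, Handler h) {
    assert(loop->tail < MAX_EVENTS);
    loop->queue[loop->tail++] = h;
}

/* Drain all pending events. Calling this from inside a handler is the
 * toy equivalent of spinning the event loop in the middle of a callback. */
void loop_run_pending(Loop *loop) {
    while (loop->head < loop->tail) {
        Handler h = loop->queue[loop->head++];
        loop->depth++;
        if (loop->depth > loop->max_depth) loop->max_depth = loop->depth;
        h(loop);
        loop->depth--;
    }
}

/* Second event: tears down the object the first handler is still using. */
void stop_timer_handler(Loop *loop) {
    loop->timer_alive = false;
}

/* First event: checks the object, spins the loop, then carries on as if
 * nothing could have changed in between. */
void timer_fired_handler(Loop *loop) {
    bool alive_on_entry = loop->timer_alive;
    loop_run_pending(loop); /* nested spin: stop_timer_handler runs here */
    if (alive_on_entry && !loop->timer_alive) {
        /* Real code would now be dereferencing a dead pointer. */
        loop->touched_dead_object = true;
    }
}
```

Posting `timer_fired_handler` followed by `stop_timer_handler` and then calling `loop_run_pending` once ends with `max_depth == 2` and `touched_dead_object == true`. If something like this is happening for real, tracking a depth counter of this kind around our Lua entry points might be one cheap way to get evidence for or against it.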

So....... thoughts? :)

Root causing our "weird" crash logs #2759