vbotbuildovich
Collaborator

Backport of PR #16850

`raft::state_machine_manager` uses a background apply fiber to
individually apply batches to state machines that are behind the main
apply fiber. While the background apply fiber is active, it holds the
mutex and reads and applies batches up to the current committed offset.
While the mutex is held, the main apply fiber does not consider the stm
to be up to date.
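
The check described above can be pictured with a small standalone model. This is only a sketch and not the Redpanda code: `toy_mutex`, `toy_stm`, and `main_apply` are hypothetical names, offsets are simplified to one record per batch, and the mutex is reduced to a unit counter that can be queried without waiting.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy single-unit mutex (hypothetical): it only models the question
// "is a unit available right now?", which is all the main fiber asks.
struct toy_mutex {
    int units = 1;
    bool ready() const { return units > 0; }
    bool try_acquire() {
        if (units == 0) {
            return false;
        }
        --units;
        return true;
    }
    void release() { ++units; }
};

// Hypothetical managed state machine tracking its last applied offset.
struct toy_stm {
    int64_t last_applied = -1;
    toy_mutex background_apply_mutex;
};

// Main apply step for a single-record batch at `offset`.
// Returns the number of state machines the batch was applied to.
int main_apply(std::vector<toy_stm>& stms, int64_t offset) {
    int applied = 0;
    for (auto& stm : stms) {
        // A held mutex means a background fiber owns this stm,
        // so the main fiber treats it as not up to date and skips it.
        if (!stm.background_apply_mutex.ready()) {
            continue;
        }
        // Assumption of this sketch: an stm that is behind is also skipped.
        if (stm.last_applied + 1 != offset) {
            continue;
        }
        stm.last_applied = offset;
        ++applied;
    }
    return applied;
}
```

With two state machines, one of which has its mutex held, `main_apply` advances only the other one. The skipped state machine depends entirely on its background fiber to catch up.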

The code was prone to a very rare race condition: the background apply
finished in one continuation, but the units were released in a
subsequent `finally` block. Normally this approach is harmless, because
the semaphore is waited on and is signaled after the `finally` block is
executed. In `state_machine_manager` we only check whether the
semaphore is available, which makes the solution vulnerable to timing.
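
The ordering problem can be reproduced deterministically in a toy scheduler. The sketch below is an illustration under stated assumptions, not the actual fix: continuations are modelled as a FIFO queue of callbacks, `model` and `simulate` are hypothetical names, and releasing the unit in the same continuation as the apply is just one way to close the window in this model.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical model of one lagging stm and its mutex.
struct model {
    int units = 0;             // 0: the background fiber holds the mutex
    int64_t committed = 5;     // current committed offset
    int64_t stm_applied = -1;  // last offset applied by the lagging stm
    std::deque<std::function<void()>> tasks;  // pending continuations
};

// Runs the scenario and returns the stm's last applied offset.
int64_t simulate(bool release_with_apply) {
    model m;
    // Continuation 1: the background apply catches up to the committed offset.
    m.tasks.push_back([&m, release_with_apply] {
        m.stm_applied = m.committed;
        if (release_with_apply) {
            m.units = 1;  // unit returned before anything else can run
        } else {
            // The `finally` block runs as a later continuation.
            m.tasks.push_back([&m] { m.units = 1; });
        }
    });
    // Continuation 2: the main apply fiber handles a newly committed batch.
    // It only checks availability; it never waits for the unit.
    m.tasks.push_back([&m] {
        m.committed = 6;
        if (m.units > 0 && m.stm_applied + 1 == m.committed) {
            m.stm_applied = m.committed;
        }
    });
    while (!m.tasks.empty()) {
        auto task = std::move(m.tasks.front());
        m.tasks.pop_front();
        task();
    }
    return m.stm_applied;
}
```

In this model `simulate(false)` leaves the stm at offset 5 while the committed offset is 6, because the main fiber ran between the apply and the `finally` continuation and saw the mutex as held. `simulate(true)` ends at offset 6. Nothing in the toy restarts the background fiber, which is what keeps the stm behind.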

Signed-off-by: Michal Maslanka <[email protected]>
(cherry picked from commit df97290)

Added a state machine manager test that waits for batches to be applied
after each replicate. The test is designed to detect a situation in
which the background apply fiber finishes but the managed stm is still
behind the others.
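
The shape of such a test can be sketched against a toy manager. Everything here is hypothetical (`toy_manager`, its `stuck` field, and `replicate_and_check` are not taken from the PR), and the toy applies batches synchronously, so the check is immediate where a real test would wait asynchronously with a timeout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical manager holding the last applied offset of each stm.
struct toy_manager {
    int64_t committed = -1;
    std::vector<int64_t> applied;
    int stuck = -1;  // index of an stm that never advances; -1 means none

    explicit toy_manager(std::size_t n) : applied(n, -1) {}

    // Commits one batch and applies it to every stm except the stuck one.
    int64_t replicate() {
        ++committed;
        for (std::size_t i = 0; i < applied.size(); ++i) {
            if (static_cast<int>(i) == stuck) {
                continue;  // models an stm left behind by the race
            }
            applied[i] = committed;
        }
        return committed;
    }

    // True when every stm has applied at least `offset`.
    bool all_applied(int64_t offset) const {
        return std::all_of(
          applied.begin(), applied.end(),
          [offset](int64_t a) { return a >= offset; });
    }
};

// Replicates `rounds` batches, checking after each one that all
// state machines reached the replicated offset.
bool replicate_and_check(toy_manager& m, int rounds) {
    for (int i = 0; i < rounds; ++i) {
        const int64_t offset = m.replicate();
        if (!m.all_applied(offset)) {
            return false;  // some stm is behind the others
        }
    }
    return true;
}
```

Checking after every replicate, instead of once at the end, is what lets the test catch a state machine that falls behind only briefly.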

Signed-off-by: Michal Maslanka <[email protected]>
(cherry picked from commit a55d01a)
@vbotbuildovich vbotbuildovich added this to the v23.3.x-next milestone Mar 5, 2024
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Mar 5, 2024
Labels
area/redpanda kind/backport PRs targeting a stable branch