@hx235 hx235 commented Aug 26, 2025

I realized that more tests assume WAL write IO errors can be auto-recovered. I need to think about this further, and I wonder why the previous stress test rarely failed on the CF inconsistency.

Context/Summary:
When atomic_flush = false and the DB has multiple column families, a WAL-related IO error can lead to data inconsistency: auto recovery flushes each CF individually, so some column families advance past the corruption point while others remain behind (caught by track_and_verify_wals=1), which prevents a successful database restart. Therefore we disable auto recovery in this case by setting a higher severity, Status::Severity::kFatalError, and we disable that testing combination in the db crash test.
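
The severity decision described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual RocksDB change: the `Severity` enum only mirrors a few levels of `Status::Severity`, and `WalErrorContext`, `ClassifyWalWriteError`, `AutoRecoveryAllowed`, and the choice of `kHardError` as the lower, auto-recoverable level are all assumptions made for the example.

```cpp
// Hypothetical stand-in for a few levels of Status::Severity.
enum class Severity { kNoError, kSoftError, kHardError, kFatalError };

// Hypothetical description of the DB configuration when a WAL write IO error
// is reported.
struct WalErrorContext {
  bool atomic_flush;
  int num_column_families;
};

// With atomic_flush disabled and more than one column family, per-CF flushes
// during auto recovery could leave CFs at different points relative to the
// bad WAL, so the error is escalated to a severity that is not auto-recovered.
constexpr Severity ClassifyWalWriteError(const WalErrorContext& ctx) {
  return (!ctx.atomic_flush && ctx.num_column_families > 1)
             ? Severity::kFatalError
             : Severity::kHardError;
}

// Assumption for this sketch: only severities below kFatalError are eligible
// for auto recovery.
constexpr bool AutoRecoveryAllowed(Severity s) {
  return s < Severity::kFatalError;
}

// Non-atomic flush with three CFs: auto recovery is disabled.
static_assert(!AutoRecoveryAllowed(ClassifyWalWriteError({false, 3})),
              "multi-CF, non-atomic flush must not auto-recover");
// Atomic flush keeps the CFs consistent, so auto recovery stays enabled.
static_assert(AutoRecoveryAllowed(ClassifyWalWriteError({true, 3})),
              "atomic flush may auto-recover");
```

The point of escalating the severity, rather than special-casing the recovery path, is that the existing severity check then blocks auto recovery without any further changes.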

This PR also fixes a bug in the stress test, which treated Status::Severity::kFatalError as retryable.
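
The stress-test fix amounts to keeping fatal errors out of the retry path. Below is a minimal sketch of such a check. The `IsRetryable` helper and the stand-in `Severity` enum are hypothetical, and treating soft and hard errors as the retryable ones is an assumption; the real stress-test code is not shown in this description.

```cpp
// Hypothetical stand-in for a few levels of Status::Severity.
enum class Severity {
  kNoError,
  kSoftError,
  kHardError,
  kFatalError,
  kUnrecoverableError
};

// Hypothetical helper: should the stress test retry an operation that failed
// with a background error of this severity? Fatal and unrecoverable errors
// are never retried, since auto recovery will not clear them.
constexpr bool IsRetryable(Severity s) {
  return s == Severity::kSoftError || s == Severity::kHardError;
}

static_assert(IsRetryable(Severity::kHardError),
              "hard errors are assumed retryable in this sketch");
static_assert(!IsRetryable(Severity::kFatalError),
              "fatal errors must not be retried");
```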

Test plan:

  • Rehearsal stress test

@meta-cla meta-cla bot added the CLA Signed label Aug 26, 2025
@facebook-github-bot

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D81056359.

@hx235 hx235 force-pushed the debug_track_verify_wal_error branch from 70bb732 to 53ad5c8 August 26, 2025 23:14
@hx235 hx235 marked this pull request as draft August 27, 2025 07:19
@hx235 hx235 changed the title from "Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test" to "[WIP]Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test" Aug 27, 2025
@hx235 hx235 changed the title from "[WIP]Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test" to "[DRAFT] Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test" Aug 27, 2025