cloudstorage: fix a bug causing context cancelled errors and stuck sink #12276
Hi @zurakutsia. Thanks for your PR. I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Welcome @zurakutsia! |
Summary of Changes
Hello @zurakutsia, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've addressed a critical bug in the TiCDC cloud storage sink, specifically impacting Azure Blob storage, which was causing frequent "context cancelled" errors and leading to a "stuck sink" state. The core issue stemmed from an incorrect context cancellation mechanism within the Open method of extStorageWithTimeout, where the context was prematurely canceled, invalidating subsequent read operations. My fix refactors this by ensuring the context associated with the reader is only canceled when the reader itself is explicitly closed, preventing these disruptive errors and stabilizing the sink.
Highlights
- Bug Fix for Cloud Storage Sink: Resolved an issue where the TiCDC cloud storage sink, particularly on Azure Blob, experienced "context cancelled" errors and "stuck sink" behavior due to premature context cancellation during file Open operations.
- Context Handling Refinement: Modified the extStorageWithTimeout.Open method to no longer immediately cancel the context. Instead, a new readerWithCancel wrapper is introduced, which defers the context cancellation until the reader's Close() method is invoked, ensuring the reader's lifetime is respected.
- Enhanced Test Coverage: Added new unit tests (TestExtStorageOpenDoesNotCancelReaderContext and TestExtStorageOpenReaderRespectsCallerCancel) to specifically validate the corrected context behavior, ensuring the reader's context remains valid for its duration and properly responds to external cancellations.
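The wrapper idea described in the highlights can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ctxReader` and `openWithTimeout` are hypothetical names, and `io.ReadCloser` stands in for the real storage reader interface, which is assumed to be richer. Only `readerWithCancel` and the cancel-on-`Close()` behavior come from the description above.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
	"time"
)

// ctxReader is a hypothetical backend reader whose Read calls fail
// once the context it was opened with is done.
type ctxReader struct {
	ctx context.Context
	src io.Reader
}

func (c *ctxReader) Read(p []byte) (int, error) {
	if err := c.ctx.Err(); err != nil {
		return 0, err
	}
	return c.src.Read(p)
}

func (c *ctxReader) Close() error { return nil }

// readerWithCancel defers context cancellation until Close is called,
// so the context lives as long as the reader does.
type readerWithCancel struct {
	io.ReadCloser
	cancel context.CancelFunc
}

func (r *readerWithCancel) Close() error {
	err := r.ReadCloser.Close()
	r.cancel() // release the timeout context only now
	return err
}

// openWithTimeout is an assumed stand-in for the fixed Open method:
// it derives a timeout context but does NOT defer cancel().
func openWithTimeout(ctx context.Context, data string, d time.Duration) io.ReadCloser {
	ctx, cancel := context.WithTimeout(ctx, d)
	inner := &ctxReader{ctx: ctx, src: strings.NewReader(data)}
	return &readerWithCancel{ReadCloser: inner, cancel: cancel}
}

// readAll reads the whole payload after the open call has returned.
func readAll() (string, error) {
	rc := openWithTimeout(context.Background(), "hello", time.Minute)
	defer rc.Close()
	b, err := io.ReadAll(rc)
	return string(b), err
}

// readAfterCallerCancel cancels the caller's context before reading.
func readAfterCallerCancel() error {
	parent, cancel := context.WithCancel(context.Background())
	rc := openWithTimeout(parent, "hello", time.Minute)
	defer rc.Close()
	cancel() // caller-side cancellation propagates to the derived context
	_, err := io.ReadAll(rc)
	return err
}

func main() {
	s, err := readAll()
	fmt.Println(s, err)
	fmt.Println(readAfterCallerCancel())
}
```

The two helper functions loosely mirror the properties the new tests are said to check: reads keep working after the open call returns, and a cancellation from the caller still stops the reader.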
Code Review
This pull request correctly addresses a critical bug causing context cancellation errors by ensuring the context for a storage reader remains valid throughout its lifecycle. The approach of wrapping the reader to manage the context's cancellation in the Close method is sound. The accompanying tests are well-designed and validate the fix effectively. I have one suggestion to enhance the robustness of the new readerWithCancel.Close() method.
/ok-to-test |
Can you create an issue to describe what you encountered and how to reproduce this bug? @zurakutsia |
/retest |
/retest |
/retest |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: asddongmen, hongyunyan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
/retest |
|
/retest |
/retest |
/merge |
@hongyunyan: We have migrated to builtin 👉 Please use
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
/run-check-issue-triage-complete |
/check-issue-triage-complete |
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created to branch |
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created to branch |
In response to a cherrypick label: new pull request created to branch |
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created to branch |
/cherrypick release-7.5-20250617-v7.5.6 |
@hongyunyan: new pull request created to branch In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
What problem does this PR solve?
Issue Number: close #12277
Fixes TiCDC cloud storage sink flapping (start/stop spam) on Azure Blob caused by premature context cancellation during reads.
Symptoms included:
The regression seems to be introduced by 0e6782b71. Switching to GetExternalStorageWithDefaultTimeout wrapped Open with a timeout and canceled it immediately, breaking subsequent Read() calls that reuse the Open() context.
What is changed and how it works?
Before: Open wrapped the passed context with context.WithTimeout and deferred cancel(). Since many backends bind the reader's lifetime to the Open context, the deferred cancel immediately invalidated the reader's future Read() calls, causing "context canceled" errors.
Now: Open passes the caller's context through without wrapping or canceling it. This prevents premature cancellation while keeping existing timeouts for other APIs.
extStorageWithTimeout.Open returns a readerWithCancel wrapper struct which calls cancel() on Close().
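The "Before" behavior can be reproduced in isolation with a small sketch. It assumes a hypothetical `boundReader` that checks its Open context on every read, as the description says many backends do; `openBuggy` is an illustrative name, not the actual TiCDC function.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// boundReader is a hypothetical reader bound to the context used at open time.
type boundReader struct {
	ctx context.Context
	src io.Reader
}

func (b *boundReader) Read(p []byte) (int, error) {
	if err := b.ctx.Err(); err != nil {
		return 0, err
	}
	return b.src.Read(p)
}

// openBuggy mirrors the pre-fix pattern: wrap the context with a timeout
// and defer cancel(), which fires as soon as this function returns.
func openBuggy(ctx context.Context, data string) io.Reader {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()
	return &boundReader{ctx: ctx, src: strings.NewReader(data)}
}

// readAfterBuggyOpen returns the error seen by the first read.
func readAfterBuggyOpen() error {
	_, err := io.ReadAll(openBuggy(context.Background(), "payload"))
	return err
}

// isCanceled reports whether err is (or wraps) context.Canceled.
func isCanceled(err error) bool {
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println(readAfterBuggyOpen()) // prints "context canceled"
}
```

The timeout never elapses here; the reader fails because the deferred `cancel()` has already run by the time the caller issues its first `Read()`.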
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note