Skip to content

Conversation

vldmit
Copy link
Contributor

@vldmit vldmit commented Feb 13, 2025

Close #1574
Problem Statement

etcd client would only run Sync after AutoSyncInterval (30 seconds), which makes tikv client vulnerable to the failure of endpoints provided in addrs before the first sync happens. Specific failure scenario:

  1. tidb is initialized with a --path=endpoint
  2. tidb successfully established connection to the endpoint
  3. n < 30 seconds after, endpoint fails (e.g. k8s control plane is upgrading the pod)
  4. etcd client is no longer connected
  5. Safe checkpoint expires and CheckVisibility in KVStore start to error out

Fix

We explicitly synchronize client endpoints with endpoints from etcd membership during client initialization phase. etcd client would continue to do periodic sync, we just force first sync to happen in the init phase.

@ti-chi-bot ti-chi-bot bot added the dco-signoff: yes Indicates the PR's author has signed the dco. label Feb 13, 2025
Copy link

ti-chi-bot bot commented Feb 13, 2025

Welcome @vldmit!

It looks like this is your first PR to tikv/client-go 🎉.

I'm the bot to help you request reviewers, add labels and more, See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to tikv/client-go. 😃

@ti-chi-bot ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 13, 2025
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Feb 14, 2025
@zyguan
Copy link
Contributor

zyguan commented Feb 17, 2025

@cfzjywxk PTAL

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Feb 17, 2025
Copy link

ti-chi-bot bot commented Feb 17, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, zyguan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

ti-chi-bot bot commented Feb 17, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-02-14 01:27:25.878162368 +0000 UTC m=+579088.274384427: ☑️ agreed by zyguan.
  • 2025-02-17 03:08:27.519627564 +0000 UTC m=+844349.915849626: ☑️ agreed by cfzjywxk.

@cfzjywxk
Copy link
Contributor

/retest

@ti-chi-bot ti-chi-bot bot merged commit 279dcd5 into tikv:master Feb 17, 2025
11 checks passed
@cfzjywxk
Copy link
Contributor

@vldmit Thanks for the contribution, the #1574 could be closed.
The client-go dependency for tidb would be updated later to make it take effect.

rleungx added a commit to rleungx/client-go that referenced this pull request Mar 6, 2025
rleungx added a commit to rleungx/client-go that referenced this pull request Mar 6, 2025
@rleungx
Copy link
Member

rleungx commented Mar 6, 2025

I'm trying to upgrade the client-go to the latest version in TiDB pingcap/tidb#59757. But some CI are failed. The log keeps printing missing address, see https://do.pingcap.net/jenkins/blue/organizations/jenkins/pingcap%2Ftidb%2Fpull_lightning_integration_test/detail/pull_lightning_integration_test/7104/pipeline
After I revert this commit in #1605, it succeeds, see https://do.pingcap.net/jenkins/blue/organizations/jenkins/pingcap%2Ftidb%2Fpull_lightning_integration_test/detail/pull_lightning_integration_test/7120/pipeline

But I haven't investigated why it failed. /cc @zyguan

@zyguan
Copy link
Contributor

zyguan commented Mar 7, 2025

I'm trying to upgrade the client-go to the latest version in TiDB pingcap/tidb#59757. But some CI are failed. The log keeps printing missing address, see https://do.pingcap.net/jenkins/blue/organizations/jenkins/pingcap%2Ftidb%2Fpull_lightning_integration_test/detail/pull_lightning_integration_test/7104/pipeline After I revert this commit in #1605, it succeeds, see https://do.pingcap.net/jenkins/blue/organizations/jenkins/pingcap%2Ftidb%2Fpull_lightning_integration_test/detail/pull_lightning_integration_test/7120/pipeline

But I haven't investigated why it failed. /cc @zyguan

@rleungx It seems local.NewBackend uses config.PDAddr (instead of pdAddrs) here, which is empty due to it's generated by genConfig. This PR just exposes the issue.

@rleungx
Copy link
Member

rleungx commented Mar 10, 2025

I'm trying to upgrade the client-go to the latest version in TiDB pingcap/tidb#59757. But some CI are failed. The log keeps printing missing address, see https://do.pingcap.net/jenkins/blue/organizations/jenkins/pingcap%2Ftidb%2Fpull_lightning_integration_test/detail/pull_lightning_integration_test/7104/pipeline After I revert this commit in #1605, it succeeds, see https://do.pingcap.net/jenkins/blue/organizations/jenkins/pingcap%2Ftidb%2Fpull_lightning_integration_test/detail/pull_lightning_integration_test/7120/pipeline
But I haven't investigated why it failed. /cc @zyguan

@rleungx It seems local.NewBackend uses config.PDAddr (instead of pdAddrs) here, which is empty due to it's generated by genConfig. This PR just exposes the issue.

For some mock client, the addr is empty and sync will report an error, which fails the test.

@you06
Copy link
Contributor

you06 commented Mar 12, 2025

@rleungx It seems local.NewBackend uses config.PDAddr (instead of pdAddrs) here, which is empty due to it's generated by genConfig. This PR just exposes the issue.

What about wrapping Client.Sync and only run it when there are non-empty addresses?

@zyguan
Copy link
Contributor

zyguan commented Mar 12, 2025

I've confirmed the issue with @D3Hunter, it should use pdAddrs here. I also pushed some commits to address the issue yesterday, however they've been reverted by recent force-push.

@rleungx
Copy link
Member

rleungx commented Mar 12, 2025

I've confirmed the issue with @D3Hunter, it should use pdAddrs here. I also pushed some commits to address the issue yesterday, however they've been reverted by recent force-push.

Oh, my bad, I wrongly force pushed a commit and overwrote yours. I have changed it back.

serprex added a commit to PeerDB-io/tikv-client-go that referenced this pull request Jun 30, 2025
* *: bump pd client (tikv#1575)

 

Signed-off-by: Ryan Leung <[email protected]>

* OWNERS: Auto Sync OWNERS files from community membership (tikv#1576)

 

Signed-off-by: Ti Chi Robot <[email protected]>

* sync etcd endpoints immediately after initializing the client (tikv#1573)

 

Signed-off-by: Vlad Dmitriev <[email protected]>

* p-dml: resolve locks concurrently (tikv#1584)

close tikv#1577

Signed-off-by: you06 <[email protected]>

* memdb: prevent iterator invalidation (tikv#1563)

ref pingcap/tidb#59153

Signed-off-by: ekexium <[email protected]>

* locate: refactor RegionRequestSender.SendReqCtx (tikv#1565)

 

Signed-off-by: zyguan <[email protected]>

* locate: fix the default settings of circuit breaker (tikv#1593)

ref tikv/pd#8678

Signed-off-by: Ryan Leung <[email protected]>

* util: define and implement core interfaces for async api (tikv#1591)

ref tikv#1586

Signed-off-by: zyguan <[email protected]>

* Add a retry when getting ts from PD for validating read ts (tikv#1600)

 

Signed-off-by: MyonKeminta <[email protected]>

* pdclient: Add caller info to pd client (tikv#1516)

ref tikv/pd#8593

Signed-off-by: okJiang <[email protected]>

* locate: fix TestTiKVClientReadTimeout (tikv#1601)

 

Signed-off-by: zyguan <[email protected]>

* *: update pd client (tikv#1605)

 

Signed-off-by: Ryan Leung <[email protected]>

* Validate ts only for stale read (tikv#1607)

ref pingcap/tidb#59402

Signed-off-by: ekexium <[email protected]>

* execdetails: export scheduler write details (tikv#1606)

 

Signed-off-by: Neil Shen <[email protected]>

* Update pd client (tikv#1615)

Signed-off-by: disksing <[email protected]>

* ci: allow use label to skip integration tests (tikv#1616)

 

Signed-off-by: disksing <[email protected]>

* remove useless metric tidb_tikvclient_cop_duration_seconds_bucket (tikv#1602)

 

Signed-off-by: XuHuaiyu <[email protected]>

* client: implement SendRequestAsync for RPCClient (tikv#1604)

ref tikv#1586

Signed-off-by: zyguan <[email protected]>

* execdetails: export grpc process and wait time to time details (tikv#1614)

 

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: Bisheng Huang <[email protected]>

* Refine pessimistic lock related metrics and stats (tikv#1620)

 

Signed-off-by: yibin87 <[email protected]>

* metrics: adjust bucket count to reduce metrics data (tikv#1609)

 

Signed-off-by: Lynn <[email protected]>

* update tidb for integration tests (tikv#1621)

 

Signed-off-by: disksing <[email protected]>

* support redact key in logs (tikv#1612)

ref pingcap/tidb#59279

Signed-off-by: tangenta <[email protected]>

Co-authored-by: you06 <[email protected]>

* update integration_test/go.mod (tikv#1624)

 

Signed-off-by: tangenta <[email protected]>

* memdb: introduce snapshot interface (tikv#1623)

 

Signed-off-by: you06 <[email protected]>

Co-authored-by: ekexium <[email protected]>

* pd: enable OutputMustContainAllKeyRange (tikv#1632)

 

Signed-off-by: lhy1024 <[email protected]>

* Fix backoff lose info when forked (tikv#1627)

ref pingcap/tidb#60271

Signed-off-by: yibin87 <[email protected]>

* tikv: disable health-feedback in next-gen (tikv#1635)

 

Signed-off-by: zyguan <[email protected]>

* enable ts validation for normal read (tikv#1619)

 

Signed-off-by: ekexium <[email protected]>

* Add txn write conflict metrics (tikv#1551)

close tikv#1550

Signed-off-by: sujuntao <[email protected]>

Co-authored-by: sujuntao <[email protected]>

* apicodec: fix a typo when encoding request for CmdMvccGetByKey (tikv#1638)

 

Signed-off-by: tiancaiamao <[email protected]>

Co-authored-by: cfzjywxk <[email protected]>

* *: update kvproto version (tikv#1636)

ref tikv#1631

Signed-off-by: Chao Wang <[email protected]>

* txn: provide more information in commit RPC / log mvcc debug info when commit failed for `TxnLockNotFound` (tikv#1640)

ref tikv#1631

Signed-off-by: Chao Wang <[email protected]>

* txn: handle undetermined error in client go (tikv#1642)

close tikv#1641

Signed-off-by: Chao Wang <[email protected]>

* txn: fix the implemention of undetermined error (tikv#1644)

close tikv#1641

Signed-off-by: Chao Wang <[email protected]>

* locate: implement SendReqAsync for RegionRequestSender (tikv#1618)

ref tikv#1586

Signed-off-by: zyguan <[email protected]>

* update pd client for resource group and keyspace (tikv#1645)

 

Signed-off-by: lhy1024 <[email protected]>

* tests: bump tidb to fix integration tests (tikv#1650)

 

Signed-off-by: zyguan <[email protected]>

* Fix some metrics that miss const labels (tikv#1652)

 

Signed-off-by: yibin87 <[email protected]>

* Replace etcd safe point with txn safe point for read safety check (tikv#1634)

 

Signed-off-by: MyonKeminta <[email protected]>

* Fix stale read metrics (tikv#1649)

close tikv#1648

Signed-off-by: you06 <[email protected]>

* *: support async batch get (tikv#1646)

ref tikv#1586

Signed-off-by: zyguan <[email protected]>

* Update kvproto dependancy and set keyspace name for rpc context (tikv#1667)

close tikv#1668

Signed-off-by: yibin87 <[email protected]>

* ci: add next-gen integration tests (tikv#1661)

 

Signed-off-by: ekexium <[email protected]>

* snapshot: set `ReplicaRead` to false when `ReplicaReadType` fallbacks to `ReplicaReadLeader` (tikv#1663)

ref pingcap/tidb#61745

Signed-off-by: you06 <[email protected]>

* resource_control: support collecting cross AZ traffic in ru consumption (tikv#1669)

 

Signed-off-by: glorv <[email protected]>

* txnkv: prevent some actions from being interrupted by kill (tikv#1665)

fix pingcap/tidb#61454

Signed-off-by: zyguan <[email protected]>

* region_cache: add ForceRefreshAllStores function (tikv#1686)

 

Signed-off-by: guo-shaoge <[email protected]>

* upgrade gRPC to allow consumption by peerdb

* Bump gprc version to 1.73.0

Remove references to `grpc.NewSharedBufferPool()`, it was removed from
the grpc package and causes build to fail - it's always on.
(https://pkg.go.dev/google.golang.org/grpc/experimental#WithBufferPool)

Signed-off-by: Tiago Scolari <[email protected]>

* go mod tidy

---------

Signed-off-by: Ryan Leung <[email protected]>
Signed-off-by: Ti Chi Robot <[email protected]>
Signed-off-by: Vlad Dmitriev <[email protected]>
Signed-off-by: you06 <[email protected]>
Signed-off-by: ekexium <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: MyonKeminta <[email protected]>
Signed-off-by: okJiang <[email protected]>
Signed-off-by: Neil Shen <[email protected]>
Signed-off-by: disksing <[email protected]>
Signed-off-by: XuHuaiyu <[email protected]>
Signed-off-by: yibin87 <[email protected]>
Signed-off-by: Lynn <[email protected]>
Signed-off-by: tangenta <[email protected]>
Signed-off-by: lhy1024 <[email protected]>
Signed-off-by: Chao Wang <[email protected]>
Signed-off-by: glorv <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: Tiago Scolari <[email protected]>
Co-authored-by: Ryan Leung <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>
Co-authored-by: Vlad Dmitriev <[email protected]>
Co-authored-by: you06 <[email protected]>
Co-authored-by: ekexium <[email protected]>
Co-authored-by: zyguan <[email protected]>
Co-authored-by: MyonKeminta <[email protected]>
Co-authored-by: okJiang <[email protected]>
Co-authored-by: Neil Shen <[email protected]>
Co-authored-by: disksing <[email protected]>
Co-authored-by: HuaiyuXu <[email protected]>
Co-authored-by: Bisheng Huang <[email protected]>
Co-authored-by: yibin <[email protected]>
Co-authored-by: Lynn <[email protected]>
Co-authored-by: tangenta <[email protected]>
Co-authored-by: lhy1024 <[email protected]>
Co-authored-by: JT <[email protected]>
Co-authored-by: sujuntao <[email protected]>
Co-authored-by: tiancaiamao <[email protected]>
Co-authored-by: cfzjywxk <[email protected]>
Co-authored-by: 王超 <[email protected]>
Co-authored-by: glorv <[email protected]>
Co-authored-by: guo-shaoge <[email protected]>
Co-authored-by: Kevin Biju <[email protected]>
Co-authored-by: Tiago Scolari <[email protected]>
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Jul 9, 2025
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: cannot checkout release-8.5: error checking out release-8.5: exit status 1. output: error: pathspec 'release-8.5' did not match any file(s) known to git

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delayed etcd endpoint sync causing KVStore errors
6 participants