Skip to content

Conversation

crazycs520
Copy link
Contributor

No description provided.

@@ -555,6 +555,8 @@ func (c *batchCommandsClient) waitConnReady() (err error) {
cancel()
break
}
// Trigger idle connection to reconnection
c.conn.Connect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be an experimental API which may not be stable enough?

Copy link
Contributor

@cfzjywxk cfzjywxk Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, it seems each time the API is triggered a new goroutine would be created for each sub-channel which could be a risk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. For now, I've only found Connect API to work, I try to use the TiKV RPC interface such as heartbeat or health check RPC API, but I didn't find such an interface.

Anyway, I just added a test about this.

Copy link
Contributor

@zyguan zyguan Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually GetState and WaitForStateChange are both experimental, I think it's ok to use Connect here. How about to trigger Connect only once (maybe just before the for loop) and only when the state is IDLE (I'm not sure if we need to connect when the state is TRANSIENT_FAILURE since there is an internal backoff for reconnecting by grpc itself)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connect API internal already checks the state, and only do connect when state is idle.

Copy link
Contributor

@cfzjywxk cfzjywxk Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WaitForStateChange has been used for some time and the risk is relatively low.
Maybe we'd better check the gRPC client source code in detail about the state management and how could the Connect impact the underlying gRPC logic, and make sure the risk is acceptable and the usage is correct.

Signed-off-by: crazycs520 <[email protected]>
Copy link
Contributor

@cfzjywxk cfzjywxk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that there is no standard operation for this reconnection. The transient failure state will automatically be converted to connecting for retry and backoff internally. I have approved it for now, but we need to continue monitoring whether there are any abnormalities in terms of availability testing, stress testing, and disaster recovery testing. In addition, we also need to further organize and improve the code of batch client/region cache for maintainability.

@crazycs520
Copy link
Contributor Author

release 7.1 doesn't have this issue, so no need to cherry-pick to tidb-7.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants