improve region request log for diagnose #1300
Conversation
LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?
Nice catch. Since I only collect replica access paths when an error is met, in most normal cases this won't introduce performance overhead. Anyway, I triggered a sysbench performance test; let's check out the results later.
internal/locate/region_request.go (Outdated)
```go
// ReplicaAccessStats records the first 20 access info, after more than 20 records, count them by `OverflowCount` field.
type ReplicaAccessStats struct {
	AccessInfos   []ReplicaAccessInfo
	OverflowCount int
```
How about merging the access info by peer when there are many accesses? OverflowCount alone could not provide useful information.
Since ReplicaAccessStats currently only records for this round, and in the normal case each replica should only be accessed once (the leader replica may be accessed up to 10 times), OverflowCount should normally always be 0.
What about the follower read or stale read cases? The OverflowCount still does not seem very helpful. Consider such a case: the NotLeader error is encountered several times because of a region election; after the new leader information is returned to the kv-client, some new errors caused by disk IO problems are encountered, while the related information is only recorded as OverflowCount.
great, done. PTAL
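For illustration of the direction discussed above, a minimal sketch of per-peer merging could look like the following. The `ReplicaAccessInfo` fields and the helper names here are assumptions made for the sketch, not necessarily what the PR ended up with.

```go
package locate

// ReplicaAccessInfo fields below are hypothetical; they only illustrate the
// reviewer's suggestion of merging access records per peer.
type ReplicaAccessInfo struct {
	PeerID    uint64
	StoreID   uint64
	ErrorCode string // empty when the access succeeded
}

type peerAccessSummary struct {
	StoreID     uint64
	AccessCount int
	Errors      map[string]int // error code -> occurrence count
}

// mergeByPeer folds an arbitrarily long list of access records into one
// summary per peer, so that late errors are not collapsed into a bare
// overflow counter.
func mergeByPeer(infos []ReplicaAccessInfo) map[uint64]*peerAccessSummary {
	merged := make(map[uint64]*peerAccessSummary)
	for _, info := range infos {
		s, ok := merged[info.PeerID]
		if !ok {
			s = &peerAccessSummary{StoreID: info.StoreID, Errors: make(map[string]int)}
			merged[info.PeerID] = s
		}
		s.AccessCount++
		if info.ErrorCode != "" {
			s.Errors[info.ErrorCode]++
		}
	}
	return merged
}
```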
```go
if replicaRead {
	tp = ReqReplicaRead
} else if staleRead {
```
Should replicaRead be prior to staleRead? Or is it guaranteed that replicaRead and staleRead must be exclusive?
According to the current tests, normally the replicaRead and staleRead flags must be exclusive.
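As a side note, if the two flags were ever set together, the branch order above would silently prefer replica read. A small order-independent sketch (the string labels are placeholders, not the PR's actual request-type constants) makes that case explicit:

```go
package main

import "fmt"

// requestType is a hypothetical, order-independent classification of the read
// type; it only illustrates the exclusivity question raised in the review.
func requestType(replicaRead, staleRead bool) string {
	switch {
	case replicaRead && staleRead:
		// Not expected if the flags are exclusive; flag the combination
		// instead of silently preferring one of the two.
		return "invalid: replica-read and stale-read both set"
	case replicaRead:
		return "replica-read"
	case staleRead:
		return "stale-read"
	default:
		return "leader-read"
	}
}

func main() {
	fmt.Println(requestType(true, false))  // replica-read
	fmt.Println(requestType(false, true))  // stale-read
	fmt.Println(requestType(false, false)) // leader-read
}
```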
* improve region request log for diagnose
* rename struct
* refine region error metric with store id label and add rpc error metric
* refine comment
* refine code
* restrict log
* refine code
* refine
* refine
* refine log
* refine code
* fix test
* address comment
* refine
* refine
* refine log

Signed-off-by: crazycs520 <[email protected]>
Close #1315.
What changed and how does it work?
1. Reduce `region_request` log

Before this PR: after 1 TiKV server goes down during a benchmark test, TiDB will output tens of thousands of log entries about `region_request` in a matter of minutes.

This PR reduces that log volume.
2. Make the log more friendly for diagnosing slow queries.

For example, grep the TiDB log by `Connection ID` and `txn timestamp` to get the related log entries; from that log information, we can learn several things about how this query was executed.
3. Add RPC Error stats.

Before this PR, a plan with execution info may look like the following, where the `TableReader_17` execution info is:

`time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}`

With this PR, the `TableReader_17` execution info becomes:

`time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , rpc_errors:{data_is_not_ready:2, not_leader:2, context deadline exceeded:1, context canceled:1}, backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}`
From the `rpc_errors` info, we can know the cop task met the `data_is_not_ready` error 2 times, the `not_leader` error 2 times, `context deadline exceeded` 1 time, and `context canceled` 1 time.
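As a rough sketch of the kind of bookkeeping behind such a field, one could count each RPC error by name and render the counts as an `rpc_errors:{...}` fragment. The function and map below are assumptions for illustration, not the PR's actual implementation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatRPCErrors renders per-error counts as an rpc_errors:{...} fragment.
// Sorting the error names keeps the output stable for logs and tests.
func formatRPCErrors(errCounts map[string]int) string {
	if len(errCounts) == 0 {
		return ""
	}
	names := make([]string, 0, len(errCounts))
	for name := range errCounts {
		names = append(names, name)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, fmt.Sprintf("%s:%d", name, errCounts[name]))
	}
	return "rpc_errors:{" + strings.Join(parts, ", ") + "}"
}

func main() {
	counts := map[string]int{
		"data_is_not_ready":         2,
		"not_leader":                2,
		"context deadline exceeded": 1,
		"context canceled":          1,
	}
	fmt.Println(formatRPCErrors(counts))
	// rpc_errors:{context canceled:1, context deadline exceeded:1, data_is_not_ready:2, not_leader:2}
}
```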
4. Refine Region Error metric and add RPC error metric