Description
Motivation
Run-away queries are queries that consume more resources than the user expects. This can be caused by an improper SQL statement or a suboptimal plan.
Run-away queries can degrade overall performance if they are not managed properly, so we need to manage them effectively: long-running operations should be identified and aborted.
Currently, a deadline mechanism is already pushed down to the TiKV layer, so that a single coprocessor request does not execute in TiKV for more than 60s by default. But a run-away query may not spend much time on any single coprocessor request, so the deadline mechanism can't help avoid run-away queries. At the same time, the deadline can't be made too small, otherwise normal requests would be aborted prematurely.
How to identify run-away queries?
The resource manager can take action when a query exceeds a specified amount of elapsed time. The elapsed time is the time the query spends being processed, excluding waiting time.
Differentiating run-away queries from queries that really need to perform a full table/index scan is hard, and there is no absolute rule. So we let users define the rule that identifies run-away queries and tweak it to their own needs. For now, the only criterion is execution time; more dimensions may be added later.
TiKV sends back the scan detail in coprocessor responses. If the total elapsed time of the query exceeds the threshold, it is recognized as a run-away query (statement).
Task Breakdown
- Extend resource group meta with runaway config
- Update kvproto @Connor1996 resource control: Add runaway settings kvproto#1114
- Extend create/alter resource group statement @CabinfeverB
- parser part *: Introduce runaway statement in resource group #43843
- pd part
- Make alter patch style resource_group: support patch for altering resource group #44322
- Extend admin table information_schema.resource_groups ddl, I_S: support runaway attribute in resource group #43877
- Identify runaway in cop client and perform action @Connor1996
- Introduce runaway checker domain: Introduce runaway manager #44339
- Use const default resource group name resource_control: use const default resource group name #44526
- Update kvproto to add override_priority resource control: Add runaway settings kvproto#1114
- Override resource group priority on tikv side resource_control: use override priority if specified tikv/tikv#14926
- Fix mock PD client mock: impl resource manager get default resource group tikv/client-go#839
- Introduce admin table information_schema.runaway_queries
- Persist runaway records in kv with flush mechanism domain: record runaway and quarantine query #44654 @CabinfeverB
- clean history records domain: support GC runaway record #44784 @CabinfeverB
- Quarantine runaway queries @CabinfeverB
- Identify later matched SQL or similar SQL and reject with error executor: impl runaway watch check #44474
- Provide admin table information_schema.qurantined_watch domain: record runaway and quarantine query #44654
- Provide a way to mark runaway manually
- Add query watch statement *: add query watch stmt for manul management of runaway watch #45500
- Let watch records sync among TiDBs *: global runaway watch by system table and impl exector for query watch #45465
Misc
- Introduce a metric "max query elapsed time by resource groups" metrics: add max query duration per resource group metrics #44746
- Add user document resource_control: add runaway queries docs-cn#14242
- Publish RFC resource_control: publish runaway management rfc #44745