
stale-read query latency 10x spike caused by information schema cache miss #53428

@crazycs520

Description


Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

High schema cache miss.

Related metrics:

(two metric screenshots, not reproduced here)

Analyze

The large number of load snapshot schema operations must come from InfoSchema cache misses, but what causes the cache misses?

Almost all TiDB instances generated a large number of load snapshot schema operations at the same time, which is odd. I picked one TiDB instance (tidb-11) with a large number of these operations for log analysis.

grep 'load InfoSchema' tidb-11.log

[2024/05/09 15:00:46.337 +00:00] [INFO] [domain.go:224] ["diff load InfoSchema success"] [currentSchemaVersion=585441] [neededSchemaVersion=585442] ["start time"=1.734944ms] [phyTblIDs="[]"] [actionTypes="[]"]

[2024/05/09 15:00:46.521 +00:00] [INFO] [domain.go:224] ["diff load InfoSchema success"] [currentSchemaVersion=585442] [neededSchemaVersion=585443] ["start time"=503.56µs] [phyTblIDs="[]"] [actionTypes="[]"]

[2024/05/09 15:01:34.130 +00:00] [INFO] [domain.go:224] ["diff load InfoSchema success"] [currentSchemaVersion=585443] [neededSchemaVersion=585445] ["start time"=4.158976ms] [phyTblIDs="[454637]"] [actionTypes="[8]"]
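The jump from 585443 to 585445 in the last log line can be spotted mechanically. The following is a minimal sketch, not part of TiDB: the helper names `parseVersions` and `skippedVersions` are hypothetical, and the regular expression assumes the two version fields appear next to each other as in the lines above. A skipped version is only a candidate for a missing diff, since a single load may legitimately apply several diffs at once.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// versionRe matches the adjacent version fields of a
// "diff load InfoSchema success" log line (assumed layout).
var versionRe = regexp.MustCompile(`\[currentSchemaVersion=(\d+)\] \[neededSchemaVersion=(\d+)\]`)

// parseVersions extracts (current, needed) from one log line.
// Hypothetical helper, for illustration only.
func parseVersions(line string) (current, needed int64, ok bool) {
	m := versionRe.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, false
	}
	cur, err1 := strconv.ParseInt(m[1], 10, 64)
	need, err2 := strconv.ParseInt(m[2], 10, 64)
	if err1 != nil || err2 != nil {
		return 0, 0, false
	}
	return cur, need, true
}

// skippedVersions returns every version strictly between current and
// needed, i.e. versions that never appeared as a load target.
func skippedVersions(lines []string) []int64 {
	var skipped []int64
	for _, line := range lines {
		cur, need, ok := parseVersions(line)
		if !ok {
			continue
		}
		for v := cur + 1; v < need; v++ {
			skipped = append(skipped, v)
		}
	}
	return skipped
}

func main() {
	// Fragments taken from the three log lines above.
	lines := []string{
		`["diff load InfoSchema success"] [currentSchemaVersion=585441] [neededSchemaVersion=585442]`,
		`["diff load InfoSchema success"] [currentSchemaVersion=585442] [neededSchemaVersion=585443]`,
		`["diff load InfoSchema success"] [currentSchemaVersion=585443] [neededSchemaVersion=585445]`,
	}
	fmt.Println(skippedVersions(lines)) // prints [585444]
}
```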

The log shows schema versions 585442, 585443, and 585445, but 585444 does not exist. This hole causes schema cache misses, so TiDB has to load the snapshot schema from TiKV. The related code is the following:

if h.cache[i-1].infoschema.SchemaMetaVersion() == is.infoschema.SchemaMetaVersion()+1 && uint64(h.cache[i-1].timestamp) > ts {
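To see why a hole defeats this condition, here is a simplified sketch of a timestamp lookup built around the same check. It is not the actual TiDB cache: the `entry` type, the `lookupByTimestamp` function, and the small timestamps are assumptions made for illustration. The cache is assumed to be ordered from newest to oldest version.

```go
package main

import "fmt"

// entry is a hypothetical cache slot: a schema version plus the
// timestamp at which it became visible (0 means unknown).
type entry struct {
	version   int64
	timestamp uint64
}

// lookupByTimestamp returns the schema version valid at ts.
// An older entry is trusted only if the next newer entry is exactly
// one version ahead and became visible after ts; otherwise some
// unknown version could have been current at ts.
func lookupByTimestamp(cache []entry, ts uint64) (int64, bool) {
	for i, e := range cache {
		if e.timestamp == 0 || ts < e.timestamp {
			continue
		}
		if i == 0 {
			return e.version, true
		}
		newer := cache[i-1]
		if newer.version == e.version+1 && newer.timestamp > ts {
			return e.version, true
		}
		break
	}
	return 0, false
}

func main() {
	// Versions from the log above; 585444 is missing. Timestamps are made up.
	cache := []entry{
		{version: 585445, timestamp: 300},
		{version: 585443, timestamp: 200},
		{version: 585442, timestamp: 100},
	}
	fmt.Println(lookupByTimestamp(cache, 350)) // 585445 true
	fmt.Println(lookupByTimestamp(cache, 250)) // 0 false (miss caused by the hole)
	fmt.Println(lookupByTimestamp(cache, 150)) // 585442 true
}
```

In this sketch, any stale read whose timestamp falls between 585443 and 585445 misses the cache, because the cache cannot prove that 585443 was still the current version at that time.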

So why does schema version 585444 not exist? In the current implementation, GenSchemaVersion generates the schema version in its own transaction, while SetSchemaDiff runs in a different transaction. If the GenSchemaVersion transaction succeeds but the SetSchemaDiff transaction fails, the schema version is left with a gap, as seen in this issue.

Why use two separate transactions? This change was introduced by the PR "support concurrent ddl".
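The failure mode can be reproduced with a toy model. This is a sketch under stated assumptions, not TiDB code: `metaStore`, `runJob`, and `missingDiffs` are hypothetical names, and each method call stands in for one independently committed transaction.

```go
package main

import (
	"errors"
	"fmt"
)

// metaStore is a hypothetical stand-in for the schema metadata in TiKV.
type metaStore struct {
	base    int64            // version before any job ran
	version int64            // latest generated schema version
	diffs   map[int64]string // schema diff recorded per version
}

// genSchemaVersion models the first transaction: it always commits.
func (s *metaStore) genSchemaVersion() int64 {
	s.version++
	return s.version
}

// setSchemaDiff models the second transaction, which may fail.
func (s *metaStore) setSchemaDiff(v int64, diff string, fail bool) error {
	if fail {
		return errors.New("job transaction failed")
	}
	s.diffs[v] = diff
	return nil
}

// runJob bumps the version, then tries to record the diff. A failure
// in the second step leaves the bumped version without a diff.
func runJob(s *metaStore, diff string, fail bool) (int64, error) {
	v := s.genSchemaVersion()
	return v, s.setSchemaDiff(v, diff, fail)
}

// missingDiffs lists generated versions that have no recorded diff.
func missingDiffs(s *metaStore) []int64 {
	var missing []int64
	for v := s.base + 1; v <= s.version; v++ {
		if _, ok := s.diffs[v]; !ok {
			missing = append(missing, v)
		}
	}
	return missing
}

func main() {
	s := &metaStore{base: 585441, version: 585441, diffs: map[int64]string{}}
	runJob(s, "job A", false) // 585442
	runJob(s, "job B", false) // 585443
	runJob(s, "job C", true)  // 585444: version consumed, diff never written
	runJob(s, "job D", false) // 585445
	fmt.Println(missingDiffs(s)) // prints [585444]
}
```

If both steps ran in a single transaction, a failure would roll back the version bump as well and no version would be left without a diff.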

tidb/pkg/ddl/ddl.go

Lines 448 to 453 in 5d990c6

err = kv.RunInNewTxn(kv.WithInternalSourceType(context.Background(), kv.InternalTxnDDL), store, true, func(_ context.Context, txn kv.Transaction) error {
	var err error
	m := meta.NewMeta(txn)
	schemaVersion, err = m.GenSchemaVersion()
	return err
})

err = t.SetSchemaDiff(diff)

tidb/domain/domain.go

Lines 423 to 424 in 2abf83d

// Empty diff means the txn of generating schema version is committed, but the txn of `runDDLJob` is not or fail.
// It is safe to skip the empty diff because the infoschema is new enough and consistent.

2. What did you expect to see? (Required)

No information schema cache miss.

3. What did you see instead (Required)

Information schema cache misses, which cause many load snapshot schema operations for stale-read queries.

4. What is your TiDB version? (Required)

nightly: e1a6b1d

Metadata


Labels

affects-6.5: This bug affects the 6.5.x(LTS) versions.
affects-7.1: This bug affects the 7.1.x(LTS) versions.
affects-7.5: This bug affects the 7.5.x(LTS) versions.
affects-8.1: This bug affects the 8.1.x(LTS) versions.
affects-8.5: This bug affects the 8.5.x(LTS) versions.
component/ddl: This issue is related to DDL of TiDB.
severity/moderate
type/bug: The issue is confirmed as a bug.
