-
Notifications
You must be signed in to change notification settings - Fork 6k
Closed
Labels
type/enhancementThe issue or PR belongs to an enhancement.The issue or PR belongs to an enhancement.
Description
Enhancement
Background
Now, when a task encounters a fatal error, the following keys are written:
$prefix/last_error/$task_name → kvproto::brpb::StreamBackupError
$prefix/paused/$task_name → <nothing>
So far, we have suffered some pains due to this like:
- When someone pauses the task, it is impossible to track the reason for pausing.
- This happens when a task is paused by a new feature, like auto pausing tasks with huge lags.
- When a user mistakenly pauses a task, we cannot figure out when and how he paused it.
- Putting pause and error isn't atomic. This may cause a fatal error to be treated as retryable and the user may recover an unrecoveriable task.
Design
Here I propose a new format of pausing:
$prefix/paused/$task_name → PauseReason
Where `PauseReason` looks like:
{
"severity": "ERROR" | "REGULAR_OPERATION",
"operation_hostname": string,
"operation_pid": int,
"operation_time": datetime,
"payload_type": "application/x-protobuf;messageType=brpb.StreamBackupError" | "text/plain;charset=UTF-8",
"payload": bytes,
}
Notes:
- Though
payload_type
uses the same format as MIME types, for safety and simplicity, only the messagebrpb.StreamBackupError
and UTF-8 strings are accepted bylog status
. The BR CLI will refuse to parse the payload if the type isn't one of them. - I think
PauseReason
can be JSON instead of Protocol Buffer Message, because in almost all cases, the information it provides is for humans. JSON is more flexible and more friendly to humans.
Compatibility / Feasibility
For now, the pause key was used by both TiKV and the Coordinator. But neither of them reads the value of this key. So modifications to values are safe.
The error key was written by TiKV and only reads by the BR CLI (log status).
A possible update route:
- Update TiDB codebase, allowing BR CLI to read information from Pause V2.
- We just read error messages from both V1 and V2.
- In this step, the base library for Pause V2 can be added.2. Allowing the coordinator and BR CLI to pause a task by V2.
- Maybe allow the user to attach a custom message to this pause.
- Allowing the TiKV to upload errors by V2.
Metadata
Metadata
Assignees
Labels
type/enhancementThe issue or PR belongs to an enhancement.The issue or PR belongs to an enhancement.