Skip to content

br/stream: save more information to pause (pause v2) #59660

@YuJuncen

Description

@YuJuncen

Enhancement

Background

Now, when a task encounters a fatal error, the following keys are written:

$prefix/last_error/$task_name → kvproto::brpb::StreamBackupError
$prefix/paused/$task_name     → <nothing>

So far, we have suffered some pains due to this like:

  • When someone pauses the task, it is impossible to track the reason for pausing.
    • This happens when a task is paused by a new feature, like auto pausing tasks with huge lags.
  • When a user mistakenly pauses a task, we cannot figure out when and how he paused it.
  • Putting pause and error isn't atomic. This may cause a fatal error to be treated as retryable and the user may recover an unrecoveriable task.

Design

Here I propose a new format of pausing:

$prefix/paused/$task_name → PauseReason
Where `PauseReason` looks like:
{
    "severity": "ERROR" | "REGULAR_OPERATION",
    "operation_hostname": string,
    "operation_pid": int,
    "operation_time": datetime,
    "payload_type": "application/x-protobuf;messageType=brpb.StreamBackupError" | "text/plain;charset=UTF-8",
    "payload": bytes,
}

Notes:

  • Though payload_type uses the same format as MIME types, for safety and simplicity, only the message brpb.StreamBackupError and UTF-8 strings are accepted by log status. The BR CLI will refuse to parse the payload if the type isn't one of them.
  • I think PauseReason can be JSON instead of Protocol Buffer Message, because in almost all cases, the information it provides is for humans. JSON is more flexible and more friendly to humans.

Compatibility / Feasibility

For now, the pause key was used by both TiKV and the Coordinator. But neither of them reads the value of this key. So modifications to values are safe.

The error key was written by TiKV and only reads by the BR CLI (log status).

A possible update route:

  1. Update TiDB codebase, allowing BR CLI to read information from Pause V2.
    • We just read error messages from both V1 and V2.
    • In this step, the base library for Pause V2 can be added.2. Allowing the coordinator and BR CLI to pause a task by V2.
  2. Maybe allow the user to attach a custom message to this pause.
  3. Allowing the TiKV to upload errors by V2.

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/enhancementThe issue or PR belongs to an enhancement.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions