Skip to content

changefeed task failed after inject s3 failure last for 1mins #10710

@Lily2025

Description

@Lily2025

What did you do?

1、br restore data
2、create changefeed with redo log
config: "memory-quota = 2147483648\n[consistent]\nlevel = "eventual"\nflush-interval = 500\nstorage = "s3://tmp/test-changefeed?access-key=xxxn&secret-access-key=xxx&endpoint=http%3a%2f%2fminio-peer%3a9000""

3、inject s3 failure last for 1mins
[2024/03/04 21:00:07.457 +08:00] [INFO] [chaos.go:74] ["Run chaos"] [chaosId="ns=endless-ha-test-ticdc-tps-7170215-1-752,kind=failure,name=pod-failure-dkneisww,spec=&k8s.ChaosIdentifier{Namespace:"endless-ha-test-ticdc-tps-7170215-1-752", Name:"pod-failure-dkneisww", Spec:FailureExperimentSpec{Duration: "", Scheduler: }}"] [name=failure] [selectors="[endless-ha-test-ticdc-tps-7170215-1-752/minio-0]"] [selectorsRetainPolicy(selectors)="[endless-ha-test-ticdc-tps-7170215-1-752/minio-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="FailureExperimentSpec{Duration: "", Scheduler: }"]
[2024/03/04 21:01:07.458 +08:00] [INFO] [chaos.go:92] ["Clean chaos"] [name=failure] [chaosId="ns=endless-ha-test-ticdc-tps-7170215-1-752,kind=failure,name=pod-failure-dkneisww,spec=&k8s.ChaosIdentifier{Namespace:"endless-ha-test-ticdc-tps-7170215-1-752", Name:"pod-failure-dkneisww", Spec:FailureExperimentSpec{Duration: "", Scheduler: }}"]

ticdc logs:
ticdc-1.tar.gz

What did you expect to see?

changefeed can recover automatically after fault recover

What did you see instead?

changefeed task failed after inject s3 failure last for 1mins

{
"upstream_id": 7342169415863000245,
"namespace": "default",
"id": "ticdc-task",
"sink_uri": "mysql://root:xxxxx@cdc-downstream-tc-tidb.endless-ha-test-ticdc-tps-7170215-1-752:4000",
"create_time": "2024-03-04 00:42:24.389",
"start_ts": 448131011512107011,
"resolved_ts": 0,
"target_ts": 0,
"checkpoint_tso": 448150164062077038,
"checkpoint_time": "2024-03-04 21:00:05.437",
"sort_engine": "unified",
"state": "failed",
"error": null,
"error_history": null,
"creator_version": "v8.0.0-alpha"
}

ticdc logs:
[2024/03/04 21:00:08.356 +08:00] [ERROR] [manager.go:269] ["redo: failed to create redo log writer"] [namespace=default] [changefeed=ticdc-task] [duration=1.046949ms] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.NewLogWriter\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/mem_log_writer.go:53\ngithub.com/pingcap/tiflow/cdc/redo/writer/factory.NewRedoLogWriter\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/factory/factory.go:53\ngithub.com/pingcap/tiflow/cdc/redo.(*logManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/manager.go:267\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func5\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:654\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.357 +08:00] [INFO] [manager.go:590] ["redo manager closed"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.357 +08:00] [INFO] [shared_stream.go:149] ["event feed going to create grpc stream"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=13] [storeID=1] [addr=tc-tikv-0.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160]
[2024/03/04 21:00:08.357 +08:00] [INFO] [shared_stream.go:149] ["event feed going to create grpc stream"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=14] [storeID=4] [addr=tc-tikv-2.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160]
[2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.373 +08:00] [INFO] [tz.go:34] ["Use the timezone of the TiCDC server machine"] [timezoneName=System] [timezone=Asia/Shanghai]
[2024/03/04 21:00:08.379 +08:00] [ERROR] [changefeed.go:299] ["changefeed tick failed"] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).preStart\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:140\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:170\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func6\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:663\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.379 +08:00] [ERROR] [changefeed.go:306] ["an error occurred in Owner"] [namespace=default] [changefeed=ticdc-task] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).preStart\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:140\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:170\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func6\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:663\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.379 +08:00] [INFO] [ddl_puller.go:756] ["close the ddl puller"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller]
[2024/03/04 21:00:08.379 +08:00] [INFO] [ddl_sink.go:281] ["owner ddl sink background loop exits"] [namespace=default] [changefeed=ticdc-task] [error="context canceled"]
[2024/03/04 21:00:08.379 +08:00] [INFO] [shared_stream.go:157] ["event feed grpc stream exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=13] [storeID=1] [addr=tc-tikv-0.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160] [canceled=true]
[2024/03/04 21:00:08.379 +08:00] [INFO] [shared_stream.go:157] ["event feed grpc stream exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=14] [storeID=4] [addr=tc-tikv-2.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160] [canceled=true]
[2024/03/04 21:00:08.379 +08:00] [INFO] [shared_client.go:268] ["event feed exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller]
[2024/03/04 21:00:08.379 +08:00] [INFO] [multiplexing_puller.go:242] ["MultiplexingPuller exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller]
[2024/03/04 21:00:08.380 +08:00] [WARN] [server.go:499] ["topic handler returned error"] [error="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled"] [errorVerbose="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister.func1\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:230\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:252\ngithub.com/pingcap/tiflow/pkg/p2p.(*MessageServer).run.func2\n\tgithub.com/pingcap/tiflow/pkg/p2p/server.go:275\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.380 +08:00] [INFO] [coordinator.go:261] ["schedulerv3: coordinator closed"] [ownerRev="{"revision":244137}"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.380 +08:00] [INFO] [changefeed.go:780] ["changefeed closed"] [namespace=default] [changefeed=ticdc-task] [isRemoved=false]
[2024/03/04 21:00:08.380 +08:00] [WARN] [server.go:297] ["handler not found"] [topic=changefeed/default/ticdc-task/scheduler]
[2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:854] ["Processor drop sort engine successfully"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:860] ["Processor try to close agent"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.423 +08:00] [WARN] [server.go:499] ["topic handler returned error"] [error="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled"] errorVerbose="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled[ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister.func1\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:230\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:252\ngithub.com/pingcap/tiflow/pkg/p2p.(*MessageServer).run.func2\n\tgithub.com/pingcap/tiflow/pkg/p2p/server.go:275\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:869] ["Processor closed agent successfully"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:878] ["processor closed"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.423 +08:00] [WARN] [server.go:297] ["handler not found"] [topic=changefeed/default/ticdc-task/agent]

Versions of the cluster

./cdc version
Release Version: v8.0.0-alpha
Git Commit Hash: 31dd24f
Git Branch: heads/refs/tags/v8.0.0-alpha
UTC Build Time: 2024-03-01 11:38:18
Go Version: go version go1.21.6 linux/amd64
Failpoint Build: false

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

Metadata

Metadata

Assignees

Labels

affects-6.1This bug affects the 6.1.x(LTS) versions.affects-6.5This bug affects the 6.5.x(LTS) versions.affects-7.1This bug affects the 7.1.x(LTS) versions.affects-7.5This bug affects the 7.5.x(LTS) versions.area/ticdcIssues or PRs related to TiCDC.severity/majortype/bugThe issue is confirmed as a bug.

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions