Make CI great again!

## Enhancement

The unstable CI has been a hindrance to our daily development. Sometimes we have to run it again and again, and again ... to merge a PR.
It takes a very long time to run the CI, and if an unstable test fails, all that time is wasted.

We made many bad decisions in the past:

- To make it run faster, we parallelized the tests in the wrong way
- Auto re-run to compensate for the unstable test cases
- No timeout limit on a single unit test
- High coupling between the code and the running environment
- Multiple teams maintain the CI, so responsibilities are unclear

Let's review them one by one.

We parallelized the tests in the wrong way. Some test cases are not free of side effects, for example:

- enabling and disabling a failpoint
- modifying the configuration
- changing a global variable within a single process
- etc.

If test cases with side effects run in parallel with others, some tests can fail unexpectedly.
When they're in a single OS process, making them parallel is not a big deal.
Parallelism makes the CI run a bit faster, but when some tests fail, it takes us a lot of time to investigate the root cause.
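To make the interference concrete, here is a minimal sketch. Everything in it is hypothetical: `globalLimit` stands in for any process-wide state (a config value, a failpoint switch), and the "other test" is simulated deterministically instead of using real goroutines.

```go
package main

import "fmt"

// globalLimit is a hypothetical stand-in for process-wide state,
// such as a config value or a failpoint switch.
var globalLimit = 10

// withLimit overrides the global, runs fn, then restores the old value.
// This is the typical shape of a test case with a side effect.
func withLimit(v int, fn func()) {
	old := globalLimit
	globalLimit = v
	defer func() { globalLimit = old }()
	fn()
}

// observeDuringOverride returns what another test in the same process
// would read if it ran while the override was still active.
func observeDuringOverride() int {
	seen := 0
	withLimit(1, func() {
		seen = globalLimit // the "other" test reads the global right now
	})
	return seen
}

func main() {
	fmt.Println("during override:", observeDuringOverride())
	fmt.Println("after restore:", globalLimit)
}
```

The other test expects 10 but reads 1 whenever its timing overlaps with the override. With real in-process parallelism that overlap depends on scheduling, which is why the failures look random.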

We use the CI bot to auto re-run and compensate for the unstable tests. IMHO, this is another failure.
Auto re-run tends to make us ignore the unstable cases. See https://github.com/pingcap/tidb/issues/25899
As time goes on, there are more and more unstable test cases, and test cases fail almost randomly!
Later, if someone files a PR and all the tests pass without a retry, they might think: WOW, today I'm so lucky~

Back to parallelism: why do we want to make the tests parallel? We want them to run faster because they're slow.
So why are they slow? Because we keep adding more and more test cases.
Also, I found that a lot of the code is poorly written. Some test cases call `sleep()` at will, and a single test case may run for more than 120s!
The problem is, there was no timeout limit on it, so I added one.
However, the timeout makes the CI more unstable: https://github.com/pingcap/tidb/issues/26717

Parallelism requires more machine resources, and together with retrying, it can eat up all of them!
The high load on the CI environment makes the test cases more likely to fail.
The situation goes from bad to worse.

So the high coupling between the code and the running environment makes the running time of a single test case unpredictable.
A timeout limit on a single unit test doesn't work well.
I can see a test case always finish within 3s on my own computer, but that doesn't hold in the CI environment.

========================

### What's the solution?

- [x] Step1: remove all the in-process parallelism [#30692](https://github.com/pingcap/tidb/pull/30692)
- [x] Step2: Add back parallelism by running a single test case in an isolated OS process [#30828](https://github.com/pingcap/tidb/issues/30828)
    - [x] polish the `ut` tool: exit code, output log, etc.
    - [x] fix some failed test cases
    - [ ] limit the timeout for a single test case
    - [x] #31885
    - [x] #31893

- [ ] Step3: Make a bootstrapped store snapshot to avoid `Bootstrap()` in every single test


For step2, we can specify the test case by name manually:

```
go test -run TestXXX
```

And we can run different test cases in parallel, each in its own OS process, so the code is isolated and free of side effects.
