Description
Enhancement
The unstable CI has been a hindrance to our daily development. Sometimes we have to run it again and again, and again ... to merge a PR.
It takes a very long time to run the CI, and if an unstable test fails, all that time is wasted.
We made many bad decisions in the past:
- To make it run faster, we parallelized the tests in the wrong way
- Auto re-run to compensate for the unstable test cases
- No timeout limit on a single unit test
- High coupling between the code and the running environment
- Multiple teams maintain the CI, so responsibilities are unclear
Let's review them one by one.
We parallelized the tests in the wrong way. Some test cases are not free of side effects, for example:
- enabling and disabling a failpoint
- modifying the configuration
- changing a global variable within a single process
- etc...
If test cases with side effects run in parallel with others, some tests can fail unexpectedly.
When they're all in a single OS process, making them run in parallel is not a big deal.
Parallelism makes the CI run a bit faster, but when some tests fail, it takes us a lot of time to investigate the root cause.
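To make the side-effect problem concrete, here is a minimal sketch. It does not use real goroutines; it replays the two possible orderings by hand so the result is deterministic. The names `globalLimit`, `testAStart` and `testBCheck` are made up for illustration and are not taken from the code base.

```go
package main

import "fmt"

// globalLimit stands in for process-wide state such as a config value
// or a failpoint switch (hypothetical name).
var globalLimit = 10

// testAStart plays the first half of a test that changes the global.
// It returns the function that restores the old value.
func testAStart() (restore func()) {
	old := globalLimit
	globalLimit = 1
	return func() { globalLimit = old }
}

// testBCheck plays a test that silently relies on the default value.
func testBCheck() bool {
	return globalLimit == 10
}

func main() {
	// Sequential order: A finishes (and restores) before B starts.
	restore := testAStart()
	restore()
	fmt.Println("sequential, B passes:", testBCheck())

	// Interleaved order: B runs while A still holds the modified state.
	restore = testAStart()
	fmt.Println("interleaved, B passes:", testBCheck())
	restore()
}
```

With real in-process parallelism, which of the two orderings happens depends on scheduling, and that is why the failure looks random.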
We employ auto re-run using the CI bot to compensate for the unstable tests. IMHO, this is another failure.
The auto re-run tends to make us ignore the unstable cases. See #25899
As time goes on, there are more and more unstable test cases, and test cases fail almost randomly!
Later, if someone files a PR and all the tests pass without a retry, maybe they would think: WOW, today I'm so lucky~
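A rough back-of-the-envelope sketch of why a clean run starts to feel like luck. It assumes every flaky test fails independently with the same probability, which is a simplification, and the numbers are invented for illustration rather than measured from our CI.

```go
package main

import (
	"fmt"
	"math"
)

// passProbability returns the chance that a whole run is green when
// flakyTests tests each fail independently with probability failRate.
func passProbability(failRate float64, flakyTests int) float64 {
	return math.Pow(1-failRate, float64(flakyTests))
}

func main() {
	// Hypothetical numbers: each flaky test fails 1% of the time.
	for _, n := range []int{10, 50, 100} {
		fmt.Printf("%3d flaky tests -> P(green run) = %.3f\n", n, passProbability(0.01, n))
	}
}
```

Under these assumptions, 50 tests that each fail only 1% of the time already bring the chance of a green run down to roughly 60%, and every ignored flaky test pushes it lower.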
Back to the point about parallelism: why do we want to run the tests in parallel? We want them to run faster, because they're slow.
So why are they slow? Because we keep adding more and more test cases.
Well, I found that much of the code is poorly written. Some test cases call sleep() at will, and a single test case may run for more than 120s!
The problem was that there was no timeout limit on it, so I added one.
However, the timeout makes the CI more unstable #26717
Parallelism requires more machine resources, and with retrying, that can eat up all the machine resources!
The high load on the CI environment makes the test cases more likely to fail.
The situation goes from bad to worse.
So the high coupling between the code and the running environment makes the running time of a single test case unpredictable.
A timeout limit on a single unit test doesn't work well.
I can see a test case always finish within 3s on my own computer, but that's not the case in the CI environment.
========================
What's the solution?
- Step1: remove all the in-process parallelism #30692
- Step2: Add back parallelism by running each test case in an isolated OS process #30828
  - polish the ut tool: exit code, output log, etc.
  - fix some failed test cases
  - limit the timeout for a single test case
  - handle the coverage.txt #31885
  - integrate with the CI script #31893
- Step3: Make a bootstrapped store snapshot to avoid Bootstrap() in every single test
For step2, we can specify the test case by name manually:
go test -run TestXXX
And we can run different test cases in parallel, each in its own OS process, so the code is isolated and free of side effects.