Make CI great again! #30822

@tiancaiamao

Enhancement

The unstable CI has been a hindrance to our daily development. Sometimes we have to run it again and again, and again ... to merge a PR.
It takes a very long time to run the CI, and if an unstable test fails, all that time is wasted.

We made many bad decisions in the past:

  • To make the CI run faster, we parallelized the tests in the wrong way
  • Auto re-run to compensate for unstable test cases
  • No timeout limit on a single unit test
  • Tight coupling between the code and the running environment
  • Multiple teams maintain the CI, so responsibilities are unclear

Let's review them one by one.

We parallelize in the wrong way. Some test cases are not free of side effects, for example:

  • enabling and disabling a failpoint
  • modifying the configuration
  • changing a global variable within a single process
  • etc.

If test cases with side effects run in parallel with others, some tests can fail unexpectedly.
When they're in a single OS process, making them parallel is not a big deal.
Parallelism makes the CI run a bit faster, but when some tests fail, it takes us a lot of time to investigate the root cause.
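To make the side-effect problem concrete, here is a minimal sketch. It is not code from our repository: `enableFeature`, `compute`, and the two "test" functions are hypothetical stand-ins for a failpoint or config value and the tests that touch it. The interleaving is simulated sequentially so the outcome is deterministic; in a real parallel run it would depend on scheduling.

```go
package main

import "fmt"

// enableFeature stands in for process-wide state such as a failpoint
// or a configuration value (hypothetical name).
var enableFeature bool

// compute is the code under test; its result depends on the global.
func compute() string {
	if enableFeature {
		return "fast path"
	}
	return "slow path"
}

// testWithFeatureSetup flips the global, like a test that enables a
// failpoint, and returns a func that restores the old value.
func testWithFeatureSetup() (restore func()) {
	old := enableFeature
	enableFeature = true
	return func() { enableFeature = old }
}

// testDefaultPasses expects the default behaviour and reports whether
// its assertion would hold.
func testDefaultPasses() bool {
	return compute() == "slow path"
}

func main() {
	// Run alone, the test passes.
	fmt.Println("alone:", testDefaultPasses())

	// Interleaved with the other test in the same process, it fails,
	// even though neither test is wrong on its own.
	restore := testWithFeatureSetup()
	fmt.Println("interleaved:", testDefaultPasses())
	restore()

	// Once the other test has cleaned up, it passes again.
	fmt.Println("after restore:", testDefaultPasses())
}
```

The failure only shows up when the two tests overlap, which is why such cases look random in CI and are hard to trace back to a root cause.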

We use the CI bot to auto re-run and compensate for the unstable tests. IMHO, this is another failure.
Auto re-run tends to make us ignore the unstable cases. See #25899
As time goes on, there are more and more unstable test cases, and test cases fail almost at random!
Later, if someone files a PR and all the tests pass without a retry, they might think: WOW, today I'm so lucky~

Back to the point of parallelism: why do we want to make the tests parallel? We want them to run faster because they're slow.
So why are they slow? Because we keep adding more and more test cases.
Well, I found that a lot of the code is poorly written. Some test cases call sleep() at will, and a single test case may run for more than 120s!
The problem is, there was no timeout limit on it, so I added one.
However, the timeout makes the CI more unstable, see #26717

Parallelism requires more machine resources, and with retrying, it can eat up all of them!
The high load on the CI environment makes the test cases more likely to fail.
The situation goes from bad to worse.

So the tight coupling between the code and the running environment makes the running time of a single test case unpredictable.
A timeout limit on a single unit test doesn't work well.
I can see a test case always finish within 3s on my own computer, but that's not the case in the CI environment.

========================

What's the solution?

For step 2, we can specify the test case by name manually:

go test -run TestXXX

And we can run different test cases in parallel, each in its own OS process, so the code is isolated and free of side effects.
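The idea of one OS process per test case can be sketched roughly as follows. This is only an outline under assumptions, not an existing tool: `testArgs`, `runIsolated`, and the worker limit are hypothetical, and the package path and test names come from the command line. Note that `-run` takes a regular expression, so the name is anchored with `^` and `$` to select exactly one test.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"sync"
)

// testArgs builds the `go test` arguments that select exactly one test.
func testArgs(pkg, name string) []string {
	return []string{"test", "-run", "^" + regexp.QuoteMeta(name) + "$", pkg}
}

// result holds the outcome of one isolated test process.
type result struct {
	name   string
	output []byte
	err    error
}

// runIsolated starts one `go test` process per test name, with at most
// `workers` processes running at the same time.
func runIsolated(pkg string, names []string, workers int) []result {
	results := make([]result, len(names))
	sem := make(chan struct{}, workers) // bounds concurrency
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			out, err := exec.Command("go", testArgs(pkg, name)...).CombinedOutput()
			results[i] = result{name: name, output: out, err: err}
		}(i, name)
	}
	wg.Wait()
	return results
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: runner <package> <TestName>...")
		return
	}
	for _, r := range runIsolated(os.Args[1], os.Args[2:], 2) {
		status := "PASS"
		if r.err != nil {
			status = "FAIL"
		}
		fmt.Printf("%s %s\n", status, r.name)
	}
}
```

Because every test runs in a separate process, a failpoint or global variable changed by one test cannot leak into another. The worker limit keeps the machine from being overloaded. Invoking `go test` once per test repeats build work; compiling the test binary once with `go test -c` and running it with `-test.run` is one possible way to avoid that.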

Labels: type/enhancement