WEBVTT

00:00.000 --> 00:14.760
Okay. Hello everyone. Welcome to my talk, "Zap the Flakes". Actually, it doesn't

00:14.760 --> 00:21.880
have anything to do with laser guns, if that's what you were thinking. It's just about trying to eliminate those

00:21.880 --> 00:29.960
pesky flakes that are ever present. My name is Daniel Hiller. I'm going to

00:29.960 --> 00:36.320
give you a short overview of flakes today, show you how we are trying to detect them

00:36.320 --> 00:44.000
before merge, and why that system doesn't work as well as we wanted. Then I'm going to talk about the

00:44.000 --> 00:49.680
approach that I've been looking at and that we're trying to implement and then I'm going

00:49.680 --> 00:55.680
to talk about the plans that we have for it. I hope I have a bit of question-and-answer

00:55.720 --> 01:01.360
time, but I'll try to go as fast as I can. Hopefully I'm not too fast, and hopefully everything

01:01.360 --> 01:05.920
works as expected. As I said, my name is Daniel Hiller. I'm a software engineer on the

01:05.920 --> 01:11.440
OpenShift Virtualization team at Red Hat. My main concerns are KubeVirt, the CI system, and

01:11.440 --> 01:18.720
automation in general. So let's dive right in. I think a couple of people might have been at my

01:18.800 --> 01:26.440
talk last year, hopefully. Okay, so I need to give you a bit of an idea of what this is about.

01:26.440 --> 01:33.280
What you see here is a couple of test runs for a set of test lanes. Please

01:33.280 --> 01:38.080
pay attention to this number. This is the Git commit ID. So what you're seeing here is a

01:38.080 --> 01:46.080
couple of test runs all on the same commit. Who can tell me what is wrong with that? Yeah, exactly.

01:47.040 --> 01:53.440
Because this one failed and this one passed. So on the same commit, nothing has changed inside

01:53.440 --> 01:59.520
the code base, yet something has been making the tests fail somehow. And that's exactly

01:59.520 --> 02:06.000
what we're talking about. That's what we're trying to find. These are flaky tests. A flake is a test

02:06.000 --> 02:10.960
that, without any code change, will either fail or pass in successive runs. By the way, you don't

02:10.960 --> 02:14.880
need to take photos. I think you can download the slides right away. There is also a

02:14.880 --> 02:22.480
recording, which includes the slides. So, just a bit of statistics. There is a survey

02:22.480 --> 02:28.960
on flaky tests that I talked about last year: 90% of the developers claimed to deal with flaky tests.

02:30.000 --> 02:39.920
Of those, some 23% found that they were a serious problem. And there were the 15% of developers

02:40.800 --> 02:47.600
who thought it was a frequently encountered problem, one they were dealing with daily. So I

02:47.600 --> 02:54.320
think it is important that you care about your flaky tests, and don't just say "this is fine" and move on.

02:56.640 --> 03:02.160
Okay, so just in a nutshell: flaky tests cause problems not only for individual

03:02.160 --> 03:08.240
contributors, but also inside the community. For individual contributors, you see that

03:08.320 --> 03:14.480
we have prolonged feedback cycles for them: they might rerun test lanes again and again because

03:14.480 --> 03:21.200
they have failing tests which are obviously caused by flakes. So they start to

03:21.200 --> 03:27.280
lose trust in the tests. For the project community itself, this slows down everyone,

03:27.280 --> 03:33.120
because those rerun test lanes consume resources that could be put to better use elsewhere.

03:33.120 --> 03:41.120
It also reverses acceleration effects. For example, if you're doing an octopus

03:41.120 --> 03:46.800
merge, trying to test everything together, one flaky test can blow up the whole batch.

03:48.240 --> 03:52.320
Also, as I said, it wastes CI resources in general.

03:54.800 --> 04:00.080
We're not Google, right? Okay, what we currently have in our

04:00.240 --> 04:04.880
KubeVirt project is a lane that actually runs on every

04:06.400 --> 04:11.680
commit that you push on each PR. We are running the set of changed tests

04:13.840 --> 04:20.080
five times in random order, because that gives us roughly an 88% chance to catch a flaky test.

04:20.080 --> 04:28.800
But the thing is, flakes are similar to Heisenbugs, right? You can run a test 10,000 times

04:28.880 --> 04:33.120
and the failure might only show up on the 10,001st run. So that's a problem.

04:35.360 --> 04:43.120
The problem with how we implemented it is that the majority of our tests

04:43.120 --> 04:48.880
take in the range of 10 seconds to two minutes. We are talking about end-to-end tests here, which

04:48.880 --> 04:54.800
take especially long, because the full system needs to spin up before

04:54.880 --> 05:04.560
you can run your test. And it is only an 88% chance, so there is a 12% chance

05:04.560 --> 05:12.000
that you might miss them. Also, we currently just use a shotgun approach to finding the changed tests.

05:12.000 --> 05:16.880
The thing is, we are only looking at the test files; we are not doing an AST traversal,

05:16.880 --> 05:24.240
which might be a point for improvement in itself. And normally, our rerun test lanes on average

05:24.240 --> 05:30.480
take around one hour, which is faster than the majority of the remaining lanes, but it's still

05:30.480 --> 05:38.160
an hour that you need to wait. And if you have touched too many files,

05:38.160 --> 05:42.640
obviously you have to rerun all of them, and with hundreds of tests that might

05:42.640 --> 05:46.560
take longer than a couple of hours, at which point the run needs to get capped.

05:47.120 --> 05:54.560
So I was researching, trying to find something about that, and I stumbled upon the

05:54.560 --> 06:02.640
CANNIER approach. This was described in a paper from 2022, and the authors stated that they were

06:02.640 --> 06:09.920
able to reduce the time and monetary costs of rerunning by an average of 88%, which sounds

06:10.000 --> 06:17.440
very promising. And they do that by leveraging machine learning. In this case, it's a random forest,

06:17.440 --> 06:25.280
which they use to give you a prediction of whether a test is flaky or not. So they have upper and

06:25.280 --> 06:32.720
lower thresholds, which you see here. When a test's score is below the lower threshold,

06:33.280 --> 06:39.840
we just say it's stable, it's okay. Above the upper threshold, we say it's definitely flaky.

06:40.560 --> 06:48.160
And in between is where we need to rerun. So we are trying to narrow the tests down

06:48.160 --> 06:57.280
to a smaller set which we have to look at and rerun, because the only way to find out

06:57.280 --> 07:06.560
whether a test is really flaky or not is to run it as often as you can. So, if you're a bit familiar

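The two-threshold scheme described above can be sketched as a tiny classifier. The names and the threshold values here are illustrative assumptions, not the ones from the CANNIER paper:

```go
package main

import "fmt"

// Verdict classifies a test based on the model's flakiness score.
type Verdict string

const (
	Stable Verdict = "stable" // below the lower threshold: no rerun needed
	Flaky  Verdict = "flaky"  // above the upper threshold: treated as flaky
	Rerun  Verdict = "rerun"  // in between: only these tests get rerun
)

// classify applies the lower/upper thresholds to a predicted score in [0,1].
func classify(score, lower, upper float64) Verdict {
	switch {
	case score < lower:
		return Stable
	case score > upper:
		return Flaky
	default:
		return Rerun
	}
}

func main() {
	// Example thresholds (assumed values for illustration).
	lower, upper := 0.1, 0.7
	for _, s := range []float64{0.05, 0.40, 0.90} {
		fmt.Printf("score %.2f -> %s\n", s, classify(s, lower, upper))
	}
}
```

The point of the middle band is exactly what the talk describes: only the uncertain tests pay the cost of rerunning.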
07:06.560 --> 07:13.760
with how a random forest works: in general, you have a set of features, which

07:13.760 --> 07:19.440
are all properties of the test. So you have a vector of properties, which are most of the time just float values.

07:20.720 --> 07:27.040
They give you numbers about how the tests are constructed. What you see here in

07:27.040 --> 07:34.560
the upper part, for example, are runtime features. These need to be measured at runtime.

07:34.560 --> 07:39.840
I'll just give you an example or two, like the maximum memory usage.

07:40.880 --> 07:47.040
But you also have the other features, which are constructed from

07:47.840 --> 07:52.240
static code analysis, for example looking at the AST depth or the

07:52.240 --> 07:58.480
cyclomatic complexity. If you don't know what that means: cyclomatic complexity is

07:58.480 --> 08:09.120
simply, as an aside, the number of execution paths inside a method. So, what we want to do:

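The feature vector described above could look roughly like this in Go. The field names are hypothetical illustrations, not the actual feature set from the CANNIER paper:

```go
package main

import "fmt"

// TestFeatures is an illustrative feature vector for one test case,
// mixing runtime measurements with static code analysis results.
type TestFeatures struct {
	// Runtime features: only available after executing the test.
	PeakMemoryBytes float64
	RuntimeSeconds  float64

	// Static features: computed from the source without running it.
	ASTDepth             float64
	CyclomaticComplexity float64 // number of execution paths in the test
}

// Vector flattens the features into the float slice a model consumes.
func (f TestFeatures) Vector() []float64 {
	return []float64{f.PeakMemoryBytes, f.RuntimeSeconds, f.ASTDepth, f.CyclomaticComplexity}
}

func main() {
	f := TestFeatures{
		PeakMemoryBytes:      64 << 20, // assumed sample values
		RuntimeSeconds:       12.5,
		ASTDepth:             9,
		CyclomaticComplexity: 4,
	}
	fmt.Println(f.Vector())
}
```

The split between runtime and static fields is what drives the "when do we capture the data" question discussed later: the static half can be computed on every PR, the runtime half only exists after at least one execution.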
08:10.000 --> 08:16.800
we want to implement that. We want to have it as a replacement for our existing approach

08:16.880 --> 08:23.840
of just rerunning things, which we want to cut down. So the thought was: what problems do we have?

08:23.840 --> 08:29.440
The thing is that CANNIER is implemented, but in Python. And we are talking about

08:29.440 --> 08:34.960
KubeVirt, which is fully implemented in Go. The CI code base is also implemented in

08:34.960 --> 08:42.560
Go, so we can't use it directly; we need to reimplement it. Then we also have the question: where

08:42.640 --> 08:50.320
do we store the runtime data? That might not be strictly necessary, but since we want to optimize

08:50.320 --> 08:55.840
a bit, we should probably think about it. The other question is: when do we capture that data?

08:55.840 --> 09:02.400
Do we always have to run the tests beforehand, so that we have the data for the feature vector,

09:02.400 --> 09:09.360
or do we fetch it from somewhere else? And the last thing is data science in Go.

09:10.080 --> 09:16.800
Python has well-known and established frameworks; for Go, we actually didn't know the state of things.

09:17.600 --> 09:26.240
That's the point. So, our plan at the moment: this should be the implementation

09:26.240 --> 09:33.200
of what we want to do. As I said, this is the fetching of the prediction.

09:33.360 --> 09:42.080
So what we want is to take the exact set of changed tests, and for each

09:42.080 --> 09:50.720
test compute the features, put them into the random forest model to get the prediction

09:50.720 --> 09:58.480
of how flaky it is, and then end up with the set of tests that we want to rerun. And as a side effect,

09:58.480 --> 10:08.080
we get the positively and negatively predicted labels. Now, I must admit, I wanted to

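The pipeline described above, from changed tests to a rerun set plus predicted labels, might be wired together like this. All names, scores, and thresholds are hypothetical placeholders; in particular, predictFlakiness is a stub standing in for a call to the real trained model:

```go
package main

import "fmt"

// predictFlakiness stands in for the random forest model. A real
// implementation would compute a feature vector and query a trained
// model; here it scores tests by name for illustration only.
func predictFlakiness(test string) float64 {
	scores := map[string]float64{
		"TestMigration": 0.85, // assumed sample scores
		"TestStartVM":   0.03,
		"TestHotplug":   0.40,
	}
	return scores[test]
}

// triage splits the changed tests into predicted-flaky, predicted-stable,
// and needs-rerun sets using lower/upper thresholds. The first two sets
// are the "side effect" labels; only the last one gets rerun.
func triage(changed []string, lower, upper float64) (flaky, stable, rerun []string) {
	for _, t := range changed {
		switch score := predictFlakiness(t); {
		case score > upper:
			flaky = append(flaky, t)
		case score < lower:
			stable = append(stable, t)
		default:
			rerun = append(rerun, t)
		}
	}
	return
}

func main() {
	changed := []string{"TestMigration", "TestStartVM", "TestHotplug"}
	flaky, stable, rerun := triage(changed, 0.1, 0.7)
	fmt.Println("flaky: ", flaky)
	fmt.Println("stable:", stable)
	fmt.Println("rerun: ", rerun)
}
```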
10:08.080 --> 10:12.880
have a prototype and show you a bit of it, but I didn't make it in time. I'm sorry for that, but

10:13.520 --> 10:20.080
I'll give you a pointer to a pull request that I'm working on at the moment. Largely,

10:20.080 --> 10:25.920
this is four components: the test extraction, the feature extraction, the model prediction,

10:25.920 --> 10:29.840
and the model generation, and obviously also the model hosting, which we're also doing in Go.

10:31.280 --> 10:36.240
Then we have the test lane. The system we are currently using is

10:36.240 --> 10:42.640
Kubernetes Prow, where most of the time we just have bash scripts that run our

10:42.640 --> 10:49.680
tests as the base execution. Obviously, we're using the Go Ginkgo framework, but that's just a detail.

10:50.640 --> 10:57.440
And obviously, there is the model deployment on the cluster, which we need to access at

10:57.440 --> 11:09.520
some point from all the lanes. That's the plan for v2 so far. Then we also have an improvement

11:10.080 --> 11:15.680
that we want to plan on top of that, which will be a Prow external plugin that we can

11:15.680 --> 11:24.480
use alongside the test lane. It can run on the side and give you additional information about

11:24.480 --> 11:30.160
the features that have actually been observed for the test cases. For example, the

11:30.160 --> 11:35.840
cyclomatic complexity is something you might consider as room for improvement in your tests,

11:35.840 --> 11:40.080
and that's valuable information that you can give back to the user, and that's what we're going to do.

11:40.720 --> 11:54.800
Two minutes. Okay. Yeah, so as I said... sorry, wrong direction, I was a bit confused. Yeah, that's

11:54.800 --> 12:00.160
what I already said. Further improvements, as I said: the components of the feature

12:00.160 --> 12:05.600
vector can provide insight and advice to the contributor, as I said, like the cyclomatic complexity.

12:05.680 --> 12:11.040
But the other thing is that we can obviously increase the number of reruns we are doing;

12:11.040 --> 12:16.560
at the moment we are just running five times, but if we have a largely reduced set, we can do more

12:17.280 --> 12:26.560
and thus be more exact in finding the flakes. That's all I have. I have a couple of links

12:26.560 --> 12:33.520
here for KubeVirt resources; there is the link to my initial pull request. I have given another

12:33.520 --> 12:38.640
presentation about flakes, which you can also find here. There is the CANNIER link for the

12:38.640 --> 12:43.440
paper and for the implementation. Those of you who are using Python have something you

12:43.440 --> 12:48.240
can already use, which is great, and hopefully you can use something for Go soon.

12:50.480 --> 12:56.080
So yeah, that's it from me. There's probably a bit of time for questions, hopefully. No? Sorry.

