WEBVTT

00:00.000 --> 00:16.640
Hello everyone, my name is Paweł Wieczorek, I work with Collabora and I joined

00:16.640 --> 00:20.480
the KernelCI development team in July 2023.

00:20.480 --> 00:26.360
Today, I would like to share with you what we've been up to, what's still in the

00:26.360 --> 00:34.360
works, and how you, kernel developers, kernel maintainers, automation or quality assurance

00:34.360 --> 00:44.600
engineers, and everyone who's involved or just interested in similar efforts, could benefit from it.

00:44.600 --> 00:52.000
I will start with just a short introduction to what KernelCI actually is as a project.

00:52.000 --> 00:58.600
Next, I will describe the current status of things in detail, and after that, I will talk

00:58.600 --> 01:03.720
about potential next steps that we could take.

01:03.720 --> 01:09.960
I will also show you how you could try everything out yourself and finally, I will

01:09.960 --> 01:13.520
show you a few closing thoughts.

01:13.520 --> 01:21.640
So let's start by explaining what the KernelCI project really is, and if we go way back

01:21.640 --> 01:30.200
to its early days in 2014, it started as an independent initiative by arm-soc maintainers.

01:30.200 --> 01:38.600
It later became an automated system for kernel builds and boots, and for testing them on embedded

01:38.600 --> 01:44.200
ARM platforms, through collaboration with Linaro developers.

01:44.200 --> 01:51.720
Its dashboard is something that I believe many of you might already be familiar with.

01:51.720 --> 01:58.200
Throughout the years, requirements for the system grew, in many cases to the point that

01:58.200 --> 02:03.680
the classic, now legacy, architecture could no longer support them.

02:03.680 --> 02:10.320
This resulted in timed-out queries, high infrastructure costs, and also high maintenance

02:10.320 --> 02:11.640
costs.

02:11.640 --> 02:20.840
It eventually led to the idea of rethinking the design for a new KernelCI system and

02:20.840 --> 02:28.760
its instance hosted on kernelci.org, which I will later refer to as the service.

02:28.760 --> 02:35.720
Also, putting a new system in place is why the former one is now being referred to as

02:35.720 --> 02:38.920
legacy.

02:38.920 --> 02:46.920
The system and the service, however, are not the only components of the KernelCI project.

02:46.920 --> 02:56.040
Together with the growing requirements, the whole ecosystem also expanded: from improving test

02:56.040 --> 03:04.600
quality, which is an effort that makes use of KernelCI systems, but whose scope

03:04.680 --> 03:09.560
is far different from just test execution automation,

03:09.560 --> 03:18.600
through, for example, preparing GitLab CI pipeline templates for kernel development

03:18.600 --> 03:28.920
and testing, which is a project backed by KernelCI developers, but currently a separate

03:28.920 --> 03:29.920
effort.

03:29.920 --> 03:37.480
There is also another, relatively separate, effort

03:37.480 --> 03:45.840
of collecting all the different test results from various systems, not only KernelCI

03:45.840 --> 03:55.200
specifically, and delivering them to any party interested in receiving or processing them

03:55.280 --> 03:59.360
further: KCIDB.
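(For illustration: a submission to a results collector like KCIDB is essentially one structured document bundling checkouts, builds, and tests from a single origin. The sketch below only approximates the real, versioned KCIDB schema; the field names here are assumptions, not the authoritative format.)

```python
# Illustrative sketch of a KCIDB-style submission: a single document
# that bundles checkouts, builds and tests from one origin (one CI
# system). Field names only approximate the real, versioned schema.

def make_submission(origin, checkout_id, builds, tests):
    """Group results from one CI system into a single submission."""
    return {
        "version": {"major": 4, "minor": 0},  # schema version (illustrative)
        "checkouts": [{"id": checkout_id, "origin": origin}],
        "builds": [dict(b, origin=origin, checkout_id=checkout_id) for b in builds],
        "tests": [dict(t, origin=origin) for t in tests],
    }

submission = make_submission(
    origin="example_ci",  # hypothetical origin name
    checkout_id="example_ci:checkout-1",
    builds=[{"id": "example_ci:build-1", "architecture": "arm64", "valid": True}],
    tests=[{"id": "example_ci:test-1", "build_id": "example_ci:build-1",
            "path": "baseline.login", "status": "PASS"}],
)
print(sorted(submission))  # top-level sections of the document
```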

03:59.360 --> 04:07.440
One more thing: if you'd like to learn more about these efforts,

04:07.440 --> 04:14.560
go ahead and have a look at the linked materials. The one in the middle, the GitLab CI pipeline

04:14.560 --> 04:22.960
definitions, is a pretty fresh one; it was updated just last week. And now back to the

04:23.120 --> 04:25.920
growing ecosystem.

04:25.920 --> 04:30.800
This was actually the main reason behind the new system.

04:30.800 --> 04:38.000
It had to be designed with extensibility in mind in order to easily integrate with

04:38.000 --> 04:42.200
all of those various services.

04:42.200 --> 04:49.440
But what's important to remember from this very short introduction to the KernelCI project

04:49.520 --> 04:56.880
is that it has expanded and evolved into an umbrella for various efforts related to Linux

04:56.880 --> 05:06.640
kernel testing; the system and the service are just components of the whole stack, and the

05:06.640 --> 05:13.920
service's scaling and maintenance costs led to redesigning the whole system.

05:13.920 --> 05:23.840
Now, once we have this out of the way, let's review the current status of things

05:23.840 --> 05:30.880
and get a better perspective on the various testing efforts under the KernelCI project umbrella.

05:34.160 --> 05:40.880
Don't worry, we'll dissect this whole testing landscape diagram in a second, and starting from

05:41.600 --> 05:49.520
the top left, we've got the inputs, so the git trees that the KernelCI system monitors.

05:50.400 --> 05:57.280
Below that, we've got the component that is responsible for storing the test definitions

05:59.440 --> 06:10.480
and also for building all of the artifacts that will be used later and dispatching all the different

06:10.480 --> 06:18.080
tasks to other components. Speaking of dispatching tasks, we've got the labs component,

06:19.120 --> 06:26.480
which is responsible for the actual execution, on physical hardware, of the tests that should be run

06:26.480 --> 06:38.560
by KernelCI. On the bottom right, we've also got other systems related to kernel testing,

06:38.560 --> 06:48.160
but not specifically part of the KernelCI umbrella project, like Intel's 0-day, Red Hat's CKI, or

06:48.160 --> 06:59.520
syzbot, which also feed their results to KCIDB, in the middle right, which, as I mentioned before,

06:59.520 --> 07:08.240
collects all of those test results. And finally, we've also got a new web dashboard for

07:08.240 --> 07:16.880
presenting results to end users. Let's start going into details with this web dashboard.

07:16.880 --> 07:25.360
It might well be the first point of contact for developers who are just starting to interact with the new

07:25.360 --> 07:35.520
KernelCI system, and it attempts to provide only relevant information instead of being a

07:35.600 --> 07:43.520
data overload for newcomers. It's available at dashboard.kernelci.org.

07:43.520 --> 07:52.320
It is currently under heavy development, so if there is something you're missing or something

07:52.320 --> 08:01.120
you'd like to see improved, please let us know. And we'll get back to it in a moment.

08:01.120 --> 08:09.120
But where does the data for the web dashboard come from? Or maybe, before we get to that, let's

08:10.320 --> 08:17.440
talk about regression tracking. The web dashboard is important, but what we can get from it,

08:18.000 --> 08:26.560
and the information that should be easily accessible through it, would be the data on

08:26.560 --> 08:33.760
regressions found in all of those systems. There were several attempts at improving the

08:33.760 --> 08:42.000
state of things, and several POCs, proofs of concept, were developed recently. Currently, we're

08:42.000 --> 08:50.160
moving the most promising ones to present their data back in the web dashboard, so that it's

08:50.160 --> 08:59.600
more easily accessible. And now let's go back to where these components take the data from.

09:00.320 --> 09:08.160
As I mentioned before, KCIDB is a results collector from various sources, not only KernelCI

09:08.160 --> 09:16.720
systems, and it's a single source of truth for results delivery and also for reporting or finding

09:16.880 --> 09:28.480
regressions. Its data can be accessed through the web dashboard I showed you in the previous slides,

09:28.480 --> 09:38.480
but it also exposes its contents as a Grafana dashboard. But where do all of these test results

09:38.560 --> 09:50.480
come from? For that, we also have the labs component, for actual test execution on physical hardware,

09:50.480 --> 09:59.600
but we do not limit ourselves only to physical hardware; virtual devices are also present in those labs.

10:00.560 --> 10:13.280
Labs also cover the compute resources for build farms, and although currently most of the hardware

10:13.280 --> 10:22.800
labs are based on LAVA, the Linaro Automated Validation Architecture, the new system does not limit

10:23.280 --> 10:35.200
itself just to LAVA labs. It's just the easiest way to get started and connect new hardware to

10:35.200 --> 10:47.440
KernelCI. But again, the new system removed this limitation, which was one of the drawbacks of the

10:47.520 --> 10:59.280
legacy one. And those labs can execute tests, but who dispatches them? Instead of the legacy

10:59.280 --> 11:08.240
hard-to-scale monolithic application, the new system provides just thin abstraction

11:09.040 --> 11:21.360
layers for specific tasks. These tasks are, for example, tree monitoring, to watch for changes

11:21.360 --> 11:30.640
in relevant git trees, but also processing API events and scheduling predefined tasks

11:31.040 --> 11:42.320
defined in the Maestro component, as well as reporting back the results collected from labs and

11:42.320 --> 11:55.680
submitting them, for example, to KCIDB or to the developer who created a change processed by Maestro.
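(As a rough sketch of that loop, watching trees and dispatching predefined tasks when a head moves, here is a minimal model. The class, method, and task names are mine for illustration, not Maestro's actual API.)

```python
# Minimal sketch of the tree-monitoring / dispatch loop described above.
# Names are illustrative, not Maestro's real API.

class TreeMonitor:
    def __init__(self, trees):
        self.trees = trees          # tree name -> last seen head commit
        self.events = []            # dispatched task descriptions

    def poll(self, current_heads):
        """Compare current heads against the last seen ones and
        schedule predefined tasks for every tree that moved."""
        for tree, head in current_heads.items():
            if self.trees.get(tree) != head:
                self.trees[tree] = head
                for task in ("build", "boot", "test"):
                    self.events.append({"tree": tree, "commit": head, "task": task})
        return self.events

monitor = TreeMonitor({"mainline": "abc123"})
monitor.poll({"mainline": "def456", "next": "aaa111"})
print(len(monitor.events))  # 6: three tasks for each of the two updated trees
```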

11:56.640 --> 12:07.040
As you can see, there is also an email report component at the very end of the pipeline,

12:07.040 --> 12:17.440
but that's something that caused some issues in the past, mainly with not entirely reliable

12:18.400 --> 12:26.320
notifications, so it's still under development, and we've got the email reports

12:26.320 --> 12:35.200
currently going to the KernelCI mailing list, in order not to flood anyone with false positives.

12:36.800 --> 12:46.000
And now that we've covered the automated execution, let's go to the on-demand one.

12:46.560 --> 12:55.760
For that, we've got the new kci-dev tool, which is a standalone utility to interact with KernelCI.

12:56.320 --> 13:05.440
It supports custom submissions, even for arbitrary commits, so not only monitoring the git

13:05.440 --> 13:15.440
trees from kernel maintainers. And knowing that not everyone might be a huge fan of

13:16.480 --> 13:25.760
web interfaces, it also allows you to retrieve the results in a machine-readable format for further processing.
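(Assuming the results arrive as JSON, here is a sketch of the kind of further processing this enables, with a made-up record layout; see kci.dev for the actual output format.)

```python
import json

# Sketch of post-processing machine-readable results, e.g. JSON emitted
# by a tool such as kci-dev. The record layout below is an assumption;
# consult kci.dev for the actual output format.

raw = json.dumps([
    {"test": "baseline.login", "status": "PASS"},
    {"test": "kselftest.net", "status": "FAIL"},
    {"test": "ltp.syscalls", "status": "PASS"},
])

def failed_tests(payload):
    """Return names of tests that did not pass."""
    return [r["test"] for r in json.loads(payload) if r["status"] != "PASS"]

print(failed_tests(raw))  # ['kselftest.net']
```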

13:26.720 --> 13:37.840
It will also provide the ability to run automated bisections on found

13:37.840 --> 13:46.000
regressions; this feature is currently being finalized, but is still under development.
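(At its core, automated bisection is a binary search over the commit range between a known-good and a known-bad revision. A minimal self-contained sketch of the idea, not kci-dev's actual implementation:)

```python
# Minimal bisection sketch: binary search for the first bad commit in a
# linear history. Illustrates the idea only; a real tool would drive
# actual builds and test runs instead of calling a predicate.

def bisect(commits, is_bad):
    """Given commits ordered oldest to newest, with all good commits
    followed by all bad ones, return the first bad commit."""
    lo, hi = 0, len(commits) - 1
    if not is_bad(commits[hi]):
        return None                 # no regression in this range
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid                # first bad commit is at mid or earlier
        else:
            lo = mid + 1            # regression introduced after mid
    return commits[hi]

history = ["c1", "c2", "c3", "c4", "c5"]
print(bisect(history, lambda c: c >= "c4"))  # 'c4' introduced the regression
```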

13:46.560 --> 13:51.840
If you'd like to learn more about it, go to kci.dev.

13:52.400 --> 14:05.360
The key points I'd like to highlight once again about this new system and its integrations are

14:05.360 --> 14:14.000
the extensibility, which was the main idea behind the system redesign and which also resulted

14:14.160 --> 14:23.360
in much improved scaling of the whole system, and something that I would like to stress once more.

14:23.360 --> 14:34.640
The new system is no longer bound to LAVA hardware laboratories, and that opens up a

14:36.080 --> 14:40.160
whole new way of interacting with physical devices.

14:40.800 --> 14:53.920
As for the next steps we could take with this system: if you'd like to have a closer look

14:53.920 --> 15:03.280
at how things currently are, the easiest way to do that would be to just access the

15:03.360 --> 15:13.200
managed instances, so either the production service or, if you're more interested in the most recent

15:13.200 --> 15:25.280
changes, our staging instance of the new system. You could also try the guided way of creating

15:25.360 --> 15:35.040
your own local instance of the new KernelCI system, and there is also a semi-automatic way of deploying

15:36.560 --> 15:45.920
many, but not all, of the components in this new system, stored in our kernelci-deploy

15:46.240 --> 16:01.280
repository under the local installs directory. And if you do, please let us know, I mean the

16:01.280 --> 16:11.440
KernelCI project, what you find that might still be missing from these components but that you'd

16:11.440 --> 16:19.280
like to see in the development workflow. Maybe there's something that might have been

16:19.280 --> 16:32.880
overlooked, and there is already a tested component that would suit the testing workflow better,

16:33.600 --> 16:41.360
or maybe there's some new hardware that you'd like to see being tested by KernelCI,

16:41.360 --> 16:48.160
and connected as a hardware testing laboratory to the whole pipeline.

16:49.920 --> 16:59.840
If you have any comments on any of those topics, you can either send an email to the mailing list,

16:59.920 --> 17:09.520
or let us know on the IRC channel or the Matrix channel, or simply hop onto our Discord server.

17:11.440 --> 17:20.800
And all of the slides have already been uploaded to the FOSDEM event page,

17:20.800 --> 17:28.400
so all the links are also available there. And now just to summarize it all,

17:29.280 --> 17:37.360
I wanted to share with you today that the new system is steadily going through

17:37.360 --> 17:43.760
a stabilization phase, which also comes with sunsetting the legacy one.

17:44.800 --> 17:51.840
If there is any feature that you depend on from that setup that has not yet been migrated to

17:51.840 --> 18:04.320
the new system, please do let us know. This whole new setup was focused on delivering reliable

18:04.320 --> 18:13.520
test results and only relevant reports. The main idea was to prevent

18:13.520 --> 18:23.120
maintainer burnout from increasing, and hopefully to help solve it. But addressing the CI needs

18:23.120 --> 18:33.280
of the Linux kernel community is not really just a technical challenge. In large part, it's also a

18:33.280 --> 18:42.000
community challenge, and that is why it's crucial to continuously discuss what can be further

18:42.000 --> 18:52.800
improved. That is why I hope to hear from you about your experience with the system.

18:54.080 --> 19:00.960
And with that, thanks for your attention. If there are any questions, I will be happy to answer them.

19:12.960 --> 19:32.240
The question was about the test database and how tests can be described to be executed.

19:32.960 --> 19:42.400
So depending on your test needs, the go-to answer would be to try to write a KUnit test, or

19:42.800 --> 19:46.640
look whether there is an LTP test that already supports your use case.

19:50.080 --> 19:57.760
As for the tests that are currently executed, they are, like I said, KUnit tests, LTP tests,

19:57.840 --> 20:07.360
simple boot tests, and also, as most of the physical hardware labs use LAVA as their

20:07.360 --> 20:12.960
underlying system, they are just LAVA test job definitions.

20:14.080 --> 20:21.600
Many test job definitions are available in the test-definitions repository.

20:21.600 --> 20:27.440
I can share the link later if you would be interested in having a closer look.

20:52.080 --> 21:10.320
So the question is about scaling, and how, with expanded use cases, we grow all of the components.

21:11.280 --> 21:21.520
All right. So yes, scaling, as always, is not an easy task. That's why the new system

21:21.520 --> 21:32.720
started really small, with a selected feature set and a small number of use cases being supported.

21:33.120 --> 21:45.200
There are several ongoing efforts aimed at improving, for example, KCIDB performance,

21:45.920 --> 21:57.200
so that the queries that you'd run on this component would not simply time out, but return

21:57.440 --> 22:03.920
relevant results to you. As for growing all of the systems:

22:06.960 --> 22:19.600
Yes, one has to be cautious about adding new use cases to be supported and adding more hardware,

22:19.600 --> 22:34.160
more test suites. So far KernelCI as a project has hit a few limits, but it's not something

22:34.160 --> 22:43.600
that the new setup does not support. It's more of throwing more compute resources at the

22:43.600 --> 22:47.040
system kind of problem.

22:47.040 --> 23:15.360
All right. So the question was about the data coming from different instances like staging and production.

23:16.320 --> 23:25.680
Data coming from the staging instance is not something that goes to KCIDB, which is later

23:26.960 --> 23:36.080
accessed by other post-processing components. Staging data is throwaway. If we want to make sure

23:36.080 --> 23:45.600
that everything's fine, we compare it with previous results, and if it matches, that means

23:47.360 --> 23:57.840
the new release on staging is good to go and the changes can be migrated to production. But overall,

23:58.400 --> 24:10.960
the data coming from staging is assumed to possibly be faulty. So before new deployments,

24:10.960 --> 24:22.080
we check that the key indicators are fine. We've got several specific trees

24:22.080 --> 24:30.320
monitored on the staging instance that even have branches with known issues, so that we can also

24:30.320 --> 24:36.640
test if the new deployment of the system still catches those regressions.
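(The promotion check just described amounts to comparing key indicators between the previous results and the staging run, and verifying that seeded known issues are still caught. A toy version, with made-up indicator names:)

```python
# Toy version of the staging-vs-production promotion check described
# above. Indicator names and the record layout are made up.

def ready_for_production(production, staging, seeded_regressions):
    """A staging deployment is good to go when its key indicators
    match production and it still catches every seeded known issue."""
    indicators_match = all(
        staging.get(key) == value for key, value in production.items()
    )
    catches_seeded = all(
        r in staging.get("regressions_found", []) for r in seeded_regressions
    )
    return indicators_match and catches_seeded

prod = {"trees_monitored": 5, "builds_ok": 120}
stag = {"trees_monitored": 5, "builds_ok": 120, "regressions_found": ["known-issue-1"]}
print(ready_for_production(prod, stag, ["known-issue-1"]))  # True
```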

