WEBVTT

00:00.000 --> 00:10.680
Hi everyone, I'm Jake Hillion. I was hoping to present this with Johannes Bechberger,

00:10.680 --> 00:15.160
but he hasn't quite made it yet due to Deutsche Bahn delays mainly but he should be

00:15.160 --> 00:19.920
here hopefully by the end to say hello. We're here to talk today about a side project

00:19.920 --> 00:25.680
we're working on involving concurrency testing using custom Linux schedulers. I work at

00:25.680 --> 00:31.520
Meta and I work on schedulers mainly custom Linux schedulers so this is pretty related

00:31.520 --> 00:35.480
to what I do. Johannes is an OpenJDK developer who recently had to spend a lot of

00:35.480 --> 00:40.120
time debugging a race condition so we hope we could put those two things together and we've

00:40.120 --> 00:44.920
got a bit of a proof of concept today that we can show you and explain how it works

00:44.920 --> 00:51.200
to attempt to make these a little bit more likely to occur which should make them easier to

00:51.200 --> 00:59.040
debug. So, Heisenbugs. I imagine lots of us are familiar: you've got the same input, you

00:59.040 --> 01:04.200
hope for the same output but instead you get a crash. This is not great and it's especially

01:04.200 --> 01:10.560
not great when it happens 1 in 10,000 or 1 in 100,000 or 1 in a million invocations. As an

01:10.560 --> 01:15.720
application owner, debugging that from reports is very tricky. We'll go through a simple example

01:15.800 --> 01:21.440
now. Very simple because these things get complex in reality but imagine we've got some

01:21.440 --> 01:26.000
data being produced from a producer thread and we're consuming it in a consumer thread.

01:26.000 --> 01:30.640
In our case, and in our example later on, there's an explicit expiry date on that data, which

01:30.640 --> 01:35.640
isn't what really happens in production. More likely you've got some reference to a pointer

01:35.640 --> 01:39.640
that you might clear in some other thread. There are all sorts of reasons why that data

01:39.640 --> 01:46.360
might no longer be valid at some point in the future. It doesn't crash, the vast majority
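
To make the shape of this concrete, here's a minimal sketch in C (the demo later in the talk is in Java; this toy is purely illustrative and every name in it is made up): a producer publishes data with an explicit expiry time, and the consumer asserts that it reads the data before that expiry, so a long enough scheduling delay on the consumer thread becomes a crash.

    #include <assert.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <time.h>

    struct item {
        uint64_t expires_ns;   /* the data is only valid until this time */
        int      payload;
    };

    static struct item shared;
    static volatile int ready;

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static void *producer(void *arg)
    {
        shared.payload    = 42;
        shared.expires_ns = now_ns() + 10 * 1000 * 1000;  /* valid for 10 ms */
        ready = 1;
        return NULL;
    }

    static void *consumer(void *arg)
    {
        while (!ready)
            ;   /* toy spin-wait for the producer */
        /* If the scheduler delays this thread past the expiry, this fires. */
        assert(now_ns() < shared.expires_ns);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }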

01:46.360 --> 01:54.520
of the time. The reason for this is that schedulers are pretty good but when that interaction

01:54.520 --> 01:57.840
happens, when your machine gets a bit busy, when some processes get in the way that you

01:57.840 --> 02:02.760
weren't expecting, when the network gets slow, all of these things can just add extra delay.

02:02.760 --> 02:07.920
So, a large cause of these race conditions showing up is scheduling. We see this, for example, in the

02:07.920 --> 02:12.200
rr debugger, which has a chaos mode that's supposed to make these a lot more likely, but

02:12.200 --> 02:18.240
that has its own issues. What is scheduling then? What do we actually do? In this case

02:18.240 --> 02:22.800
we're talking about CPU scheduling. It's one of the more common types. The problem we have,

02:22.800 --> 02:27.800
we've got many processes, likely in the order of thousands and some number of CPUs, like

02:27.800 --> 02:33.160
in the order of tens nowadays, and we need to somehow make sure all those processes

02:33.160 --> 02:38.760
work successfully on that CPU to share the system. The simplest way we might do this is

02:38.760 --> 02:43.640
just to schedule process A, and whenever it stops, we'll schedule process B. Unfortunately,

02:43.640 --> 02:48.360
that really does not work. There are classes of non-preemptive schedulers, sometimes it

02:48.360 --> 02:51.480
makes sense, but the vast majority of the time we're going to need to schedule B a bit

02:51.480 --> 02:55.000
sooner, or the issues we were talking about before will happen more often: network

02:55.000 --> 03:00.040
timeouts, all that sort of thing. So instead, we slice things up over time: we'll stop scheduling

03:00.120 --> 03:04.440
A for a bit. B might not be ready the next time, so we might schedule A again. We'll schedule

03:04.440 --> 03:08.120
B, we'll schedule A, we'll flip back and forth, and we're doing this on the scale of many

03:08.120 --> 03:13.720
thousands of processes, likely we have several ready at any point in time. On an actual system,

03:13.720 --> 03:17.160
it might look something like this. We're not going to be able to look at any of the detail

03:17.160 --> 03:22.360
on this chart, but on the left, on the y-axis, we have which CPU we're looking at. As we go

03:22.360 --> 03:27.080
across, we're looking at what's happening on that CPU, whether a process is scheduled, the different

03:27.080 --> 03:31.480
colors are the different processes. It all gets quite complex, but these sorts of charts are super

03:31.480 --> 03:37.400
interesting. This is a 6.12 Linux system, just running EEVDF, the default scheduler. We can see

03:37.400 --> 03:40.760
processes are darting around all over the place, they're coming in, they're running for a short

03:40.760 --> 03:45.240
time, some of them are long running, they move about a bit, there's all sorts of complexity,

03:45.240 --> 03:50.760
and this is even on a pretty quiet system. When you start looking at big systems, hundreds of CPUs,

03:50.760 --> 03:57.000
all the interactions just get way more complicated. So when we look at our race condition,

03:57.000 --> 04:01.960
you're replicating this on your lovely dev machine: you've got a 32-core processor,

04:01.960 --> 04:05.240
it's nice and quiet, you don't want anything getting in the way of your testing,

04:05.240 --> 04:11.320
the bug never happens, ever. It's a nightmare. You know it's happening, people are reporting it,

04:11.320 --> 04:15.000
you're running the standard scheduler, and the bug never happens. You can try running

04:15.000 --> 04:19.560
stress tests in the background, and that might make it a little bit more likely, but the bug still never

04:19.560 --> 04:24.920
happens. Working on custom schedulers at Meta, I got to work with some schedulers that are

04:25.000 --> 04:29.240
not too good, which is great. It turns out when you write a scheduler yourself, and you

04:29.240 --> 04:33.640
add loads of configuration options, there are many ways to configure that scheduler badly.

04:35.000 --> 04:38.520
And here's a fun one I had a while ago: I was working on a service, I was writing a scheduler,

04:38.520 --> 04:43.320
I got the configuration terribly wrong, and the service failed, it really didn't work.

04:43.880 --> 04:48.120
But there were three parts to the service, two of them hit massive time-out errors,

04:48.120 --> 04:55.400
but they came back to life, one of them crashed, so 250 hosts died, all at once, because of my

04:55.400 --> 04:59.960
scheduler. It turns out this was a race condition, and someone was storing the value of a shared

04:59.960 --> 05:04.680
pointer instead of copying the shared pointer in C++, for efficiency reasons, and they got it wrong.

05:04.680 --> 05:09.000
So if there was a long enough delay in scheduling, the service would crash. This does happen

05:09.000 --> 05:13.800
in the real application, but it's so rare that nobody would ever look at it. Or even if you did, you'd

05:14.440 --> 05:20.600
really struggle to replicate it. So what if we wrote a scheduler that was deliberately bad?

05:20.600 --> 05:25.560
What if it was deliberately erratic and got us into these states more often, where these errors

05:25.560 --> 05:31.320
are likely to happen? That's what we've got a demo of today. But how would you write an erratic scheduler?

05:31.320 --> 05:35.560
Well, there's some options here. You could write it in the Linux kernel.

05:37.720 --> 05:41.800
You might have a hard time doing that in general. The scheduler is very sensitive,

05:41.800 --> 05:46.120
if you get it slightly wrong, your system will hit the soft lock-up detector and immediately reboot,

05:46.120 --> 05:51.320
which is a bit of a pain. It's hard to do. If you get it wrong in memory-unsafe ways,

05:51.320 --> 05:56.600
your system will crash even more quickly. And if you get it right, you're still waiting,

05:56.600 --> 06:01.320
well, maybe in the tens of seconds for a kexec every time you want to change your kernel.

06:01.320 --> 06:06.520
This is a bit awkward. Nowadays, we can do it in user space. Johannes was going

06:06.520 --> 06:10.040
to talk about Java. He's got a project that I'm not going to give enough credit to in this

06:10.040 --> 06:13.960
presentation, because I don't know enough about it, where you can write these schedulers in Java.

06:13.960 --> 06:17.880
I'm more familiar with the Rust ones. You can also do it in C, if that's what you like.

06:18.760 --> 06:25.400
And it's all because of BPF, which is the bee, and sched_ext, which is the, I think we're supposed to

06:25.400 --> 06:32.760
call it 'sched-ext', which is an interesting choice of logo we've got there. And then this is the

06:34.120 --> 06:38.760
additional logo on the right. This is a photo of Brendan Gregg,

06:38.760 --> 06:43.560
supposedly shouting at hard drives. But there's a quote about putting JavaScript into the

06:43.560 --> 06:49.720
Linux kernel here. There are many similarities between eBPF, the way it runs in the kernel, and a virtual

06:49.720 --> 06:54.760
machine for JavaScript you might have in your browser. Here's a photo of him looking slightly

06:54.760 --> 07:00.680
more normal. I think he'd prefer that one. eBPF: we're not going to go into it. It's not super

07:00.680 --> 07:04.680
important how it works, but there are a few details that we need to cover just for

07:04.680 --> 07:08.520
understanding. When you develop an eBPF program, you're going to write your source code in

07:08.520 --> 07:15.160
some language. There are a few options. C is the standard one. Rust works reasonably

07:15.160 --> 07:20.520
well. There are some more academic languages you can choose as well. And then there's the Java

07:20.520 --> 07:25.640
transpiler that Johannes has got, which is quite exciting. You compile that into BPF bytecode.

07:25.640 --> 07:31.080
It's like assembly, but it's its own language that works on all Linux systems, effectively.

07:31.800 --> 07:37.640
We make a syscall to bpf() to ask it to load our program into the kernel. You need a lot of

07:37.640 --> 07:42.280
privilege for this. It's a root-only operation. I think it allowed unprivileged users for a while,

07:42.280 --> 07:48.760
but now it's all root. It goes through the verifier. The verifier is a magic black box that's

07:48.760 --> 07:54.200
supposed to make sure your program is safe in certain ways. So you can't access memory in a

07:54.200 --> 07:59.400
bad way that will cause your system to crash. You can't have unbounded loops that don't terminate

07:59.400 --> 08:03.560
because we're running this in the scheduler hot path. If you have non-terminating code there,

08:04.120 --> 08:11.160
you're in trouble. Stuff like that. It's a bit of a beast to work with. But once it's verified,

08:11.720 --> 08:17.480
it gets JIT compiled and loaded into the kernel. You can hook into sockets, network interfaces, scheduling now;

08:18.040 --> 08:24.360
you're loaded as an x86 program, or Arm, whichever system you're on. There's no

08:24.360 --> 08:28.760
further runtime attached, basically. Then you communicate with that, mostly using

08:28.760 --> 08:32.360
syscalls at the minute; we're getting some new stuff called arenas, which are more like mapped

08:32.360 --> 08:37.800
memory, and you can communicate back to user space. We can write an application across user space

08:37.800 --> 08:42.920
and kernel space, which is pretty cool. The general way we write our production schedulers is

08:42.920 --> 08:48.520
we write some Rust that talks to the BPF, and then the BPF runs in the kernel and makes quick scheduling

08:48.680 --> 08:55.240
decisions. So, that's BPS, how do we use that for scheduling? Recently,

08:55.240 --> 09:00.200
Recently: I mentioned that it's kernel 6.12 and we're now on 6.13, so this is pretty recent. sched_ext.

09:00.200 --> 09:06.200
sched_ext is the extensible scheduler framework for jumping in as a scheduler from BPF.

09:07.560 --> 09:13.160
This is Tejun, the creator. A few key features: I mentioned some of the troubles of

09:13.160 --> 09:17.560
working in the kernel before, and the idea is that sched_ext makes them better. They're

09:17.560 --> 09:22.920
not perfect, but it certainly makes them better. So, ease of experimentation: we have a repo with

09:23.800 --> 09:29.160
in the order of ten schedulers now, maybe a few more. The Linux kernel has two-ish

09:29.160 --> 09:33.640
schedulers, and even then the old one has to be ripped out to make way for the new one. So we don't,

09:33.640 --> 09:39.080
we don't have a lot of optionality in the kernel, but you can run many different SCX schedulers on

09:39.080 --> 09:43.080
your machine, switching between them just by running a program and pressing Ctrl-C. It's

09:43.080 --> 09:50.280
super easy. Customization, too. We can talk to user space. You can do basically anything you want

09:50.280 --> 09:54.920
in these Schedulers. Sure, some of it has to avoid the hot path, and you've got to communicate with

09:54.920 --> 09:59.720
user space a little bit. Turns out that's not as bad as we might think, but you can make loads of

09:59.720 --> 10:04.120
choices. You can use information from nvidia-smi that the Linux kernel is never going to use,

10:04.120 --> 10:09.080
and stuff like that. And finally, rapid scheduler deployment: deploying a new kernel at scale

10:09.160 --> 10:15.080
is tricky. We have to get it to millions of machines, and it takes in the order of weeks to get

10:15.080 --> 10:20.680
that kernel out. Deploying a new scheduler can take a day. It's really easy, and running it and

10:20.680 --> 10:25.160
stopping it is also easy. You don't have to reboot. So, if we find out weeks later that our

10:25.160 --> 10:29.320
scheduler is kind of bad, we can just stop it, and we go back to the default and everyone's safe.

10:29.320 --> 10:32.760
We don't have to worry about how much we've broken all the systems.

10:33.160 --> 10:41.080
In an SCX scheduler, maybe don't worry too much about the details, but we have a few bits that we

10:41.080 --> 10:46.680
have to worry about. On each CPU, we have a local FIFO queue, it's just first in first out,

10:46.680 --> 10:51.320
and that's effectively read from by the kernel. If you've put stuff in that queue, the kernel

10:51.320 --> 10:58.360
side of SCX will make sure it gets run on that CPU, in that order. Quite convenient. In SCX,

10:58.440 --> 11:03.640
we generally have global queues as well when we write our own Schedulers. In this picture,

11:03.640 --> 11:08.040
we've got one. You can have a dozen, you can have as many as you like, and those queues

11:08.040 --> 11:12.120
can mean different things. In some schedulers, we might have a different queue per LLC.

11:13.000 --> 11:17.240
In some schedulers, we have a different queue for how much we want to prioritise the workload,

11:17.240 --> 11:23.000
and various different things like this. The job of the scheduler we write in SCX is to move things

11:23.000 --> 11:28.520
from global queues into local queues and accept new processes, make decisions based on them,

11:28.520 --> 11:34.760
and let them run in the order we like. To view a super simple scheduler written against this

11:34.760 --> 11:40.360
framework, first step, well, first step is to license everything as GPL. That's an absolute

11:40.360 --> 11:45.960
requirement with BPF, which is pretty cool. Alongside the license, we're going to define this

11:45.960 --> 11:52.600
constant for a shared DSQ ID, nice and easy. We'll create a shared DSQ, which we need to be able

11:52.600 --> 11:58.040
to handle tasks in a more uniform way; handling them per CPU would end up with separate scheduling

11:58.040 --> 12:02.520
issues. So we'll create that DSQ, and now we've got our queue, and that's it. Next one:

12:03.640 --> 12:09.800
enqueue. This happens when you receive a task that is now runnable. You've got a task; ideally,

12:09.800 --> 12:14.520
you want to put it on a CPU, but if you can't put it on a CPU directly, we're going to enqueue it.

12:14.520 --> 12:18.760
In this case, we're using another kfunc, scx_bpf_dispatch. We're taking our task,

12:19.320 --> 12:23.560
we're putting it in our shared DSQ. We're saying next time it runs, it can have up to five

12:23.560 --> 12:28.200
milliseconds, and then we're just passing through the enqueue flags. It's pretty simple so far.

12:30.040 --> 12:35.160
And the final one is dispatch. This is what's called when a CPU goes idle. You have your CPU,

12:35.160 --> 12:38.360
it's finished doing whatever it was doing. It doesn't know what to run next, because its little

12:38.360 --> 12:44.200
queue is empty. So we just run scx_bpf_consume on the shared queue, which takes a task from the shared

12:44.200 --> 12:49.720
queue and just runs it on that CPU for us. That's it. That's a whole scheduler.
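
For reference, the three callbacks just described roughly follow the upstream scx_simple example; a sketch, using the scx_bpf_dispatch/scx_bpf_consume kfunc names mentioned here and the helper macros from the scx repo's common.bpf.h, might look like this.

    #include <scx/common.bpf.h>

    char _license[] SEC("license") = "GPL";   /* GPL licence is mandatory */

    #define SHARED_DSQ 0

    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
        /* Create the shared dispatch queue that every CPU will pull from. */
        return scx_bpf_create_dsq(SHARED_DSQ, -1);
    }

    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
    {
        /* A task is runnable but didn't get a CPU directly: queue it with
         * a 5 ms slice and pass the enqueue flags straight through. */
        scx_bpf_dispatch(p, SHARED_DSQ, 5000000ULL, enq_flags);
    }

    void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
    {
        /* A CPU's local queue ran dry: pull the next task off the shared DSQ. */
        scx_bpf_consume(SHARED_DSQ);
    }

    SCX_OPS_DEFINE(simple_ops,
                   .init     = (void *)simple_init,
                   .enqueue  = (void *)simple_enqueue,
                   .dispatch = (void *)simple_dispatch,
                   .name     = "simple");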

12:49.720 --> 12:55.560
It's not a very good scheduler. We're using FIFOs everywhere. There's no priority for any processes. Everything is

12:55.560 --> 13:00.600
completely equivalent, which it turns out doesn't work very well. There's also only one global

13:00.600 --> 13:05.400
queue, so on any sort of complicated CPU topology, that will really struggle. If you've got two

13:05.400 --> 13:11.000
sockets on certain Intel machines, this will kill the machine, because the cross-socket communication

13:11.080 --> 13:14.760
is so slow that if you try and run this scheduler, you hit the soft lock-up detector

13:15.720 --> 13:19.960
before the scheduler can get kicked out. It's normally very safe. Normally,

13:19.960 --> 13:24.120
with these schedulers, if you don't schedule stuff, it just gets kicked out and you go back to normal. But those

13:24.120 --> 13:29.000
Intel machines are so slow, you can't actually get kicked out, because that bit of kernel code can't

13:29.000 --> 13:34.760
run in time. So that's quite interesting. But in the general case, you're pretty safe. This will run,

13:35.240 --> 13:41.400
and then you can extend it as you like. Producing erratic scheduling orders: that's what

13:41.400 --> 13:47.320
this was all about. How can we make our race condition fire more likely? We have, let's see,

13:48.360 --> 13:53.400
let's go, first of all, to the example. We have an example here written, I believe, in Java

13:53.400 --> 14:00.520
again. It's a super simple thing to crash. We just consume things from a queue that are only valid for

14:00.520 --> 14:04.680
a certain amount of time. It's missing a little bit of the code here, and I won't go and find it,

14:04.680 --> 14:09.320
because it gets, you always need a bit of plumbing to make these things work. But effectively, we get

14:09.320 --> 14:13.640
a task coming from this producer thread. We've set a time on it that's just a limit, and if we

14:13.640 --> 14:18.920
try and read it beyond that, we're going to crash, and then we just keep reading it. On a quiet system,

14:18.920 --> 14:24.200
this is fine. This will run for days at a time, and it will never crash. Even on a busy system,

14:24.200 --> 14:29.560
we haven't yet seen a crash, but it can theoretically happen. We had to get quite simple with these

14:29.560 --> 14:36.600
examples to make them fit. Basically, I've got a video to show you from Johannes that I'm

14:36.600 --> 14:46.200
going to have to talk over, I believe. Thank you. Okay, so we started our scheduler. We've got a

14:46.200 --> 14:51.800
scheduler .sh script, which just launches our scheduler with the correct arguments, and then we run the queue .sh script.

14:51.800 --> 14:56.040
Here's our sample script, the Java race example, and we're also getting some extra

14:56.280 --> 14:59.960
verbosity out of it. Every time we make a scheduling decision, we're printing it here.

15:00.680 --> 15:05.880
And the way we've set this up is that it's going to sleep things for way longer than it needs to.

15:06.520 --> 15:12.200
We'll take runnable tasks that would get a CPU immediately on a normal scheduler and not schedule them

15:12.200 --> 15:16.200
for whatever amount of time this is saying. So this is going between kind of half a second and a second

15:16.200 --> 15:20.680
and a half of not scheduling tasks that could be scheduled. Then when it does run one, we run it

15:20.680 --> 15:29.160
for 80 milliseconds, something like that, and it normally finishes. And there we saw it crash, which is pretty good.
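
One simple way to get that kind of erratic behaviour from the enqueue path is to randomise how long each task runs before it's preempted; a hypothetical sketch, reusing the shared-DSQ setup from the earlier example, is below. The long half-second-plus delays described above need extra machinery (a timer or a user-space component) that isn't shown, and the numbers here are illustrative only.

    /* Hypothetical "chaos" enqueue: every time a task becomes runnable it
     * gets a random slice between 1 ms and 20 ms, so preemption points land
     * in unpredictable places. bpf_get_prandom_u32() is a standard BPF helper. */
    void BPF_STRUCT_OPS(chaos_enqueue, struct task_struct *p, u64 enq_flags)
    {
        u64 slice_ns = 1000000ULL + bpf_get_prandom_u32() % 19000000ULL;

        scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, enq_flags);
    }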

15:29.160 --> 15:34.280
This program, again: we've left it running on these machines for hours at a time, and

15:34.280 --> 15:38.840
it doesn't crash. It just never hits these edge cases, but those edge cases are there, and they

15:38.840 --> 15:45.000
can be hit, and if you scaled this sufficiently, if that thread with the random delay was actually

15:45.000 --> 15:50.040
a network request, and your systems got slow, that sort of thing would be occurring on those. This would crash.

15:50.760 --> 15:55.400
So we've got a lot of times where this could happen. If someone reported it, you wouldn't be

15:55.400 --> 16:03.000
able to debug it comfortably on your machine. And that's it, we've got a crash. I think we've got a

16:03.000 --> 16:09.640
few minutes where I can briefly scroll through the code. There isn't too much of it, surprisingly,

16:09.640 --> 16:28.600
again. So we go back to, before I do that, have we got any questions?

16:29.400 --> 16:42.680
So the question was, are we running this in CI or just locally at the minute? At this point,

16:42.680 --> 16:48.120
this is very local. It's quite constrained, this scheduler, it only works on small machines,

16:48.120 --> 16:53.960
at the minute, and we use a lot of those FIFO queues. It's very new, it's very early. What we were

16:54.040 --> 16:59.240
excited about seeing was whether we could make it crash, and we can. So the next steps with this

16:59.240 --> 17:04.280
would be to productionise it a bit more, get it able to run on a big service. That example

17:04.280 --> 17:07.960
I mentioned earlier: if I tried it on that machine, it's one of the new AMD chips, so it's got loads

17:07.960 --> 17:12.120
of LLCs, it's all a bit complicated. If we try and run this scheduler on there, it just doesn't work.

17:12.920 --> 17:17.640
It gets kicked out. The machine survives, but it doesn't work. So we need a more complex hierarchy in

17:17.640 --> 17:22.440
the scheduler, and then to inject the randomness, and we also need some seeding and bits like

17:22.440 --> 17:27.080
that to try and get it more consistent, and probably a bit of searching to find the right

17:27.080 --> 17:31.560
conditions to make it crash. So this is still very early. We'd be happy about contributions too.

17:35.560 --> 17:40.600
Have we tried it on an Arm machine? No, but it does work. sched_ext in general we have

17:40.600 --> 17:46.200
tried on Arm machines. It works fine, there's nothing super worrying about it, which is great,

17:46.200 --> 17:49.960
so yeah, we were a bit concerned, as we've got a lot of Arm to cover at the minute.

17:52.440 --> 17:57.160
Can you, from the scheduler, for example, go into the process memory to see what

17:57.160 --> 18:02.440
state it has, whether it has these kinds of bits set, so that you know whether you have to

18:04.040 --> 18:12.360
schedule it to get a CPU versus another process? So the question was about looking at the

18:12.360 --> 18:16.760
memory of the processes and what more information we can use from them to make a scheduling decision.

18:16.760 --> 18:21.720
That's super interesting. We haven't looked at that yet. We can do filtering at the minute;

18:21.720 --> 18:27.480
it's based on parent PIDs, effectively. So, the way we're running the

18:27.480 --> 18:32.520
scheduler is, we schedule the whole machine, but we only care about messing up this specific process,

18:32.520 --> 18:37.480
because otherwise we'll start finding race conditions in the shell and we'll be in trouble.

18:38.520 --> 18:44.200
So the reason we do it like that is it's just easier. We use the parent PID at the minute; you can filter

18:44.200 --> 18:49.800
on other things. In other schedulers, we filter on things like comm, the process name, thread groups, all these things.
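
A rough sketch of that kind of filtering, inside the same BPF scheduler and assuming its usual includes (vmlinux.h, bpf_core_read.h): all names here are hypothetical, and the target PID would be filled in from user space before the scheduler is attached.

    /* Only perturb the process under test and its direct children;
     * everything else gets normal, well-behaved scheduling. */
    const volatile s32 target_tgid;   /* set by the loader before attach */

    static bool is_target(struct task_struct *p)
    {
        return BPF_CORE_READ(p, tgid) == target_tgid ||
               BPF_CORE_READ(p, real_parent, tgid) == target_tgid;
    }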

18:49.800 --> 18:54.040
But I've never seen the option of actually looking at process memory; that's super

18:54.040 --> 18:59.880
interesting. We are doing some stuff where the application can tell the scheduler what it wants

18:59.880 --> 19:06.040
in a more fine-grained way. Currently we just use niceness, which is a bit weak, it's not very

19:06.040 --> 19:11.320
rich. So we're doing more communication from the process to the scheduler in our production

19:11.320 --> 19:16.680
schedulers. And I think that would be possible here too. We're running as BPF, with root privileges, so

19:16.680 --> 19:18.680
you can kind of do what you want, which is cool.

19:26.040 --> 19:31.960
It's an excellent question. The question was about reproducing the crashes and how we can

19:31.960 --> 19:37.000
make that happen. The answer at the minute is no, we don't have that. It's a big part of the goal of this

19:37.000 --> 19:41.720
project, but when we were running through it, we started looking at how we could build this scheduler

19:41.720 --> 19:50.120
to get rid of a huge amount of the non-determinism in the process. The way we saw we were scheduling

19:50.120 --> 19:54.360
it, we were going to slow it down too much if we tried to get rid of too much of the non-determinism,

19:54.360 --> 19:58.200
because we want to be in the position long term where we can run this on production applications

19:58.200 --> 20:04.040
without slowing them down to the point where they stop serving traffic. And that meant we

20:04.040 --> 20:10.600
made some compromises. The main one is we schedule things on cores pretty quickly now, when the

20:10.600 --> 20:15.000
original plan was to go one thread per process, but it just isn't scalable at that point.

20:15.000 --> 20:19.240
So there's definitely some work to be done on getting it seeded and making it more reproducible.

20:23.240 --> 20:27.240
I was wondering about the scheduling points, where you can decide to make a scheduling decision.

20:27.240 --> 20:32.360
It sounds like the only one is when the process enters the kernel. Is that the point where you can do the scheduling,

20:32.360 --> 20:36.760
or if you have something bad, let's say a variable you want to watch, can you preempt it

20:37.080 --> 20:42.680
in a more fine-grained fashion? How do you manage the timing of anything?

20:42.680 --> 20:49.160
Yes. So the question, effectively summarized, was about when we can preempt things. Any time,

20:49.160 --> 20:55.240
which is good. We have full control over it, basically. So we can cut the slices down, which helps,

20:55.240 --> 21:00.680
but we also have a kfunc called scx_bpf_kick_cpu that kicks the CPU quickly,

21:01.560 --> 21:05.080
which is pretty cool. How we'd integrate that is a different question. We haven't done it yet.

21:05.080 --> 21:09.080
We're purely working with slices at the minute, so that does, sorry.

21:10.440 --> 21:14.760
That does get the preemptive scheduler to kick in and stop the process, and we will get these

21:14.760 --> 21:19.000
interleavings eventually. But if you could look at the memory and see a bit

21:19.000 --> 21:24.440
that's flipped, and then kick it, we have that option, which would be very exciting in the future.
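
For context, the kick kfunc mentioned above can be used roughly like this inside the scheduler (a sketch; cpu is whichever CPU you want to interrupt):

    /* Ask the target CPU to re-enter the scheduler immediately rather than
     * waiting for the current task's slice to run out. */
    scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);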

21:24.440 --> 21:29.080
I'm hoping that we've opened up sched_ext to the world of testing, and that everyone will have

21:29.080 --> 21:34.040
great ideas now, because I'm a scheduler developer and Johannes is an OpenJDK developer who

21:34.120 --> 21:39.720
likes schedulers, so it would be really cool for other people to see that schedulers are

21:39.720 --> 21:45.560
available to them, and not completely impossible to write now, and use that in testing more widely.

21:45.640 --> 21:56.760
Can you prevent this soft lock-up from killing Linux if, let's say, you want to explore all

21:56.760 --> 22:02.680
possible schedules which can occur in the system, and then it would crash?

22:02.680 --> 22:07.640
So there are two parts to that. We have two of these lock-up detectors; I glossed over it earlier,

22:07.640 --> 22:12.680
but sched_ext itself: if you give it a task that's runnable and you wait more than 30 seconds

22:12.680 --> 22:17.000
and don't run it, the SCX scheduler will get kicked, and all those tasks will move back to the

22:17.000 --> 22:21.560
fair scheduler in the kernel. There's also the soft lock-up detector, which happens a bit later,

22:21.560 --> 22:25.000
where, if the machine isn't, I'm not super sure on the details, I just know we hit it.

22:25.000 --> 22:29.960
If the machine isn't making reasonable progress, and it's not much later, I think it's maybe 40 seconds,

22:29.960 --> 22:36.920
45, and then it just reboots the machine. And that one, I would say turning that off probably isn't

22:36.920 --> 22:42.040
super productive, because if you were to hit that under any normal scheduler, without your scheduler,

22:42.040 --> 22:48.360
the machine would do the same thing. The SCX one, we haven't needed to turn off, because

22:48.360 --> 22:53.560
30 seconds is such a long time. Technically, if we were making kind of network requests,

22:53.560 --> 22:57.240
they could take longer to come back than that, but for the vast majority of bugs,

22:57.240 --> 23:01.400
30 seconds should be plenty. If you do want to change it, there's a number in the kernel,

23:01.400 --> 23:05.960
and you can always recompile, and that will get longer, but you've got to be careful, there are many

23:05.960 --> 23:07.960
systems that kick in to make that stop.

23:24.920 --> 23:28.200
Yeah, it's a good question. The question was about more erratic behavior, instead of just

23:29.080 --> 23:35.080
scheduling timings. The short answer is no, basically. There's stuff we're interested in, like how

23:35.080 --> 23:40.600
memory latency changes on systems as they get more loaded. We haven't done any work to

23:40.600 --> 23:45.480
try and cause things like that, and it's not easy to do with an SCX scheduler. There are ways to do it.

23:45.480 --> 23:49.880
You can kind of force things to mess up their caches more often with scheduling decisions,

23:50.600 --> 23:55.320
and introduce extra processes that do that too. But I think those races are a lot finer-grained,

23:56.040 --> 24:01.080
and we haven't started looking at that yet. That's great. Thank you very much.

