WEBVTT

00:00.000 --> 00:09.000
All right, if I may have your attention, please.

00:09.000 --> 00:11.000
So this is a lightning talk.

00:11.000 --> 00:13.000
It is 10 minutes.

00:13.000 --> 00:14.000
It's going to go fast.

00:14.000 --> 00:15.000
Questions?

00:15.000 --> 00:17.000
You can catch me after the talk.

00:17.000 --> 00:19.000
My name is Mathieu Desnoyers.

00:19.000 --> 00:20.000
I work at EfficiOS.

00:20.000 --> 00:23.000
I'm here to discuss the problem of figuring out

00:23.000 --> 00:25.000
from a container perspective:

00:25.000 --> 00:28.000
how many CPUs can I use within that container,

00:28.000 --> 00:32.000
if I'm on a vast machine where I'm actually constrained

00:32.000 --> 00:34.000
to a small subset of that machine?

00:34.000 --> 00:38.000
So there are various use cases for knowing

00:38.000 --> 00:41.000
the current CPU number and the maximum number of CPUs

00:41.000 --> 00:43.000
in user space.

00:43.000 --> 00:45.000
Some of those use cases: tracing ring buffers.

00:45.000 --> 00:47.000
That's a personal interest of mine.

00:47.000 --> 00:50.000
Memory allocators, such as tcmalloc and jemalloc,

00:50.000 --> 00:52.000
the GNU C library malloc.

00:52.000 --> 00:54.000
Caches, such as the NPTL

00:54.000 --> 00:57.000
thread stack cache in glibc.

00:57.000 --> 01:00.000
Schedulers and user space statistics counters.

01:00.000 --> 01:05.000
And user space generally using the observable number of CPUs

01:05.000 --> 01:08.000
to automatically scale the number of threads.
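
For illustration, a minimal sketch of what user space can observe today on Linux with glibc: sysconf() gives the machine-wide view, while sched_getaffinity() reflects a cpuset-constrained container; neither reflects a pure bandwidth limit.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	cpu_set_t set;

	/* Machine-wide view: configured and online CPUs. */
	printf("configured CPUs: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
	printf("online CPUs:     %ld\n", sysconf(_SC_NPROCESSORS_ONLN));

	/* Container view: CPUs this thread may actually run on. */
	if (sched_getaffinity(0, sizeof(set), &set) == 0)
		printf("affinity CPUs:   %d\n", CPU_COUNT(&set));
	return 0;
}
```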

01:08.000 --> 01:10.000
So the context of this problem.

01:10.000 --> 01:15.000
It becomes much clearer when you look at modern machines

01:15.000 --> 01:19.000
with 512 or more hardware threads.

01:19.000 --> 01:22.000
So the memory footprint that it brings,

01:22.000 --> 01:26.000
when you use per-CPU data structures in user space within

01:26.000 --> 01:28.000
a cpuset-constrained container,

01:28.000 --> 01:29.000
that's an issue.

01:29.000 --> 01:32.000
And also scaling the number of worker threads.

01:32.000 --> 01:35.000
So I'm a maintainer of the restartable sequences

01:35.000 --> 01:37.000
system call in Linux.

01:37.000 --> 01:43.000
And in Linux 6.3, I've contributed an additional field

01:43.000 --> 01:48.000
that allows figuring out, for each thread, an index

01:48.000 --> 01:52.000
that is bounded by the maximum number,

01:52.000 --> 01:58.000
the maximum number of concurrently running threads within the process.
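
The field in question is the rseq concurrency ID (mm_cid). A minimal sketch of reading it, assuming a Linux 6.3+ kernel, glibc 2.35+ (which registers rseq and exports __rseq_offset), and kernel headers that declare the mm_cid member:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/rseq.h>	/* struct rseq, __rseq_offset, __rseq_size */

/* glibc places each thread's rseq area at a fixed offset from the
 * thread pointer. */
static inline volatile struct rseq *rseq_area(void)
{
	return (volatile struct rseq *)((char *)__builtin_thread_pointer() +
					__rseq_offset);
}

int main(void)
{
	volatile struct rseq *r = rseq_area();

	if (!__rseq_size) {
		fprintf(stderr, "rseq not registered\n");
		return 1;
	}
	/* mm_cid lies in [0, max concurrently running threads), so it
	 * can index compact per-CPU-like data structures. */
	printf("cpu_id=%u mm_cid=%u\n", r->cpu_id, r->mm_cid);
	return 0;
}
```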

01:58.000 --> 02:04.000
So the current approaches to bound that maximum number of running threads.

02:04.000 --> 02:08.000
So one approach is to use CPU set within your container.

02:08.000 --> 02:12.000
So this provides a memory usage upper bound when you limit the

02:12.000 --> 02:14.000
containers to a cpuset.

02:14.000 --> 02:18.000
But it's not ideal to describe constraints in a cloud-native way.

02:18.000 --> 02:20.000
It's bound to the machine topology.

02:20.000 --> 02:22.000
It's hard to compose containers

02:22.000 --> 02:26.000
that are expressed with cpuset constraints.

02:26.000 --> 02:30.000
And it's tricky with big.LITTLE, P-core/E-core CPUs.

02:30.000 --> 02:33.000
Because then, when you pin yourself on a specific CPU,

02:33.000 --> 02:35.000
you're actually telling, well,

02:35.000 --> 02:39.000
I want to use a performance core or an energy efficient core.

02:39.000 --> 02:40.000
So this is a semantic

02:40.000 --> 02:42.000
you may not want to convey when

02:42.000 --> 02:46.000
restricting your application.

02:46.000 --> 02:50.000
Another approach you have currently when you want to limit the bandwidth

02:50.000 --> 02:54.000
or the quantity of CPU time you use within a container is to use the CPU

02:54.000 --> 02:55.000
cgroup controller.

02:55.000 --> 02:59.000
This allows limiting containers to a specified portion of a time slice.

02:59.000 --> 03:03.000
For instance, the cpu.max interface file allows you to say,

03:03.000 --> 03:08.000
I want to use 2,000 microseconds per 1,000-microsecond slice.

03:09.000 --> 03:15.000
That means 200% of CPU time, but you can be migrated anywhere on that system.

03:15.000 --> 03:20.000
So it can be either 2 CPUs running the workload 100% of the time,

03:20.000 --> 03:23.000
or 200 CPUs running 1% of the time.

03:23.000 --> 03:28.000
So that's not a good way to let your system know that you want to be

03:28.000 --> 03:32.000
run on a subset of the machine's cores.
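
For concreteness, a sketch of setting the cpu.max limit just described; the cgroup path is a placeholder for a delegated cgroup v2 hierarchy, and the file takes "$MAX $PERIOD" in microseconds.

```c
#include <stdio.h>

int main(void)
{
	/* Placeholder path: adjust to your cgroup v2 mount. */
	FILE *f = fopen("/sys/fs/cgroup/mygroup/cpu.max", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* 2000us of quota per 1000us period: 200% of one CPU's time,
	 * spreadable across any CPUs of the machine. */
	fprintf(f, "2000 1000\n");
	return fclose(f) != 0;
}
```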

03:32.000 --> 03:39.000
So the proposal I have, and I discussed that at LPC 2024,

03:39.000 --> 03:43.000
I'm bringing this again to gather feedback.

03:43.000 --> 03:47.000
It's to introduce a new cpu.max.concurrency interface file.

03:47.000 --> 03:52.000
That would define the maximum number of concurrently running threads for the cgroup.
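
Hypothetical usage of the proposed knob; cpu.max.concurrency does not exist in any released kernel, and the path and value are purely illustrative.

```c
#include <stdio.h>

int main(void)
{
	/* Proposed (not yet existing) interface: cap the cgroup at 4
	 * concurrently running threads, wherever they are placed. */
	FILE *f = fopen("/sys/fs/cgroup/mygroup/cpu.max.concurrency", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "4\n");
	return fclose(f) != 0;
}
```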

03:52.000 --> 03:58.000
So what needs to be done within Linux is to extend the Linux scheduler,

03:58.000 --> 04:04.000
migration and load balancer to track the number of CPUs that are concurrently used by the cgroup,

04:04.000 --> 04:09.000
and to constrain migration to the currently used set when the number of concurrently

04:09.000 --> 04:12.000
used CPUs reaches the maximum threshold.

04:12.000 --> 04:17.000
There are some additions that would need to be done in the cgroup data structures,

04:17.000 --> 04:22.000
such as counting the number of threads in each runqueue belonging to the cgroup with

04:22.000 --> 04:26.000
per-CPU counters within each of the CPUs.

04:26.000 --> 04:30.000
Counting the number of CPUs used in a global counter within the cgroup,

04:30.000 --> 04:35.000
and keeping track of the set of CPUs in use in a cpumask within the cgroup.
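
A hypothetical sketch of that state and the migration-time check, modeled in user-space C for readability; all names are made up, and the real implementation would live in the kernel's cgroup and scheduler structures.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CPUS 64

/* Hypothetical per-cgroup concurrency state. */
struct group_concurrency {
	/* per-runqueue count of the group's runnable threads */
	atomic_int	nr_running[MAX_CPUS];
	/* global count of CPUs the group currently occupies */
	atomic_int	nr_cpus_used;
	/* bitmask of CPUs currently in use by the group */
	_Atomic uint64_t cpus_used;
	/* limit from the proposed cpu.max.concurrency file */
	int		max_concurrency;
};

/* Migration/load-balancing check: once the group already occupies
 * max_concurrency CPUs, only allow destination CPUs inside the
 * currently used set, leaving the scheduler fast path untouched. */
static bool concurrency_allows_cpu(struct group_concurrency *g, int cpu)
{
	if (atomic_load(&g->nr_cpus_used) < g->max_concurrency)
		return true;
	return (atomic_load(&g->cpus_used) >> cpu) & 1;
}
```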

04:35.000 --> 04:44.000
So the tricky part here is handling errors when combining users that want to set their scheduling affinity

04:44.000 --> 04:50.000
and use cpusets in combination with the max concurrency limits.

04:50.000 --> 04:56.000
And so this is where I have some ideas, but I also have some things that would need further

04:56.000 --> 04:57.000
clarification.

04:57.000 --> 05:02.000
So some of the ideas are to make sched_setaffinity and cpuset changes fail

05:02.000 --> 05:10.000
if the request for a given set goes beyond the available concurrency limit.

05:10.000 --> 05:14.000
Decreasing the max concurrency could also fail,

05:14.000 --> 05:20.000
if decreasing the max concurrency under a certain threshold cannot be fulfilled,

05:20.000 --> 05:23.000
given the current set of constraints.

05:23.000 --> 05:26.000
Increasing the max concurrency should never fail.

05:26.000 --> 05:31.000
It's just giving more room for scheduling opportunities and migration opportunities.

05:31.000 --> 05:33.000
So that's not such an issue.
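
Putting those rules together, a hypothetical write handler for the proposed limit, reusing the group_concurrency sketch above; constraints_fit_within() is an assumed helper standing in for the hard feasibility check discussed next.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed helper (hypothetical): can the group's current affinity and
 * cpuset constraints still be satisfied with only new_max CPUs? */
static bool constraints_fit_within(struct group_concurrency *g, int new_max);

static int set_max_concurrency(struct group_concurrency *g, int new_max)
{
	/* Raising the limit only adds scheduling and migration
	 * opportunities, so it never fails. */
	if (new_max >= g->max_concurrency) {
		g->max_concurrency = new_max;
		return 0;
	}
	/* Lowering it must not contradict existing constraints. */
	if (!constraints_fit_within(g, new_max))
		return -EBUSY;
	g->max_concurrency = new_max;
	return 0;
}
```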

05:33.000 --> 05:37.000
What I expect though is that we may have some worries about,

05:37.000 --> 05:41.000
basically, what I suspect is an NP-complete problem,

05:41.000 --> 05:46.000
which is basically if you have partially disjoint sets of affinities.

05:46.000 --> 05:51.000
So that's where I'm not entirely sure how we want to tackle that problem.

05:51.000 --> 05:57.000
Because I want to handle this at the point where the user is modifying the affinity

05:57.000 --> 06:04.000
or setting the max concurrency, so that it's basically just an impact on the set,

06:04.000 --> 06:07.000
used for the allowed CPU mask.

06:07.000 --> 06:10.000
So we don't have any impact on the fast path of the scheduler.

06:10.000 --> 06:15.000
It's only load balancing and migration that get affected with additional constraints.

06:15.000 --> 06:18.000
But in order to do that, so if we have disjoint sets,

06:18.000 --> 06:22.000
that's a part where we may have to rely on some statistics.

06:22.000 --> 06:25.000
So feedback would be good on this point.

06:25.000 --> 06:30.000
So someone saying, well, I want to pin my first thread on CPUs zero and one,

06:30.000 --> 06:34.000
second thread on one and two, third thread on two and three,

06:34.000 --> 06:40.000
then what's the set of one or two CPUs that would allow scheduling all of those threads?

06:40.000 --> 06:44.000
So at that point it becomes a bit trickier.
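
To make the difficulty concrete, a toy brute force for the example above: find the smallest set of CPUs that intersects every thread's affinity mask. It is exponential in the CPU count, which is why doing this exactly at every affinity change is worrying.

```c
#include <stdbool.h>
#include <stdio.h>

#define NCPUS 4

/* Affinity masks from the example: {0,1}, {1,2}, {2,3}. */
static const unsigned int masks[] = { 0x3, 0x6, 0xc };
static const int nthreads = 3;

/* Does the candidate CPU set give every thread somewhere to run? */
static bool feasible(unsigned int cpus)
{
	for (int t = 0; t < nthreads; t++)
		if (!(masks[t] & cpus))
			return false;
	return true;
}

int main(void)
{
	/* Try every subset of CPUs, smallest first. */
	for (int k = 1; k <= NCPUS; k++)
		for (unsigned int s = 1; s < (1u << NCPUS); s++)
			if (__builtin_popcount(s) == k && feasible(s)) {
				printf("feasible with %d CPUs (mask 0x%x)\n",
				       k, s);
				return 0;
			}
	return 1;
}
```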

06:44.000 --> 06:51.000
I suspect that the typical case would be having some kind of a general superset.

06:51.000 --> 06:58.000
So a general mask that allows free movement within a given set,

06:58.000 --> 07:04.000
and then subsets, so pinning specific tasks to specific CPUs, that's one easy case.

07:04.000 --> 07:07.000
So those might be kind of the two main use cases,

07:07.000 --> 07:14.000
but I'm interested to hear about whether there are many uses of those kinds of disjoint sets,

07:14.000 --> 07:16.000
and how that should be handled.

07:16.000 --> 07:19.000
So that's what I have.

07:19.000 --> 07:23.000
We even have three minutes for questions and, I guess, setting up the next speaker.

07:23.000 --> 07:32.000
Okay, are there any questions?

07:48.000 --> 07:52.000
Is there any overhead to exposing this new field,

07:52.000 --> 07:55.000
because you have to track this concurrency field all the time?

07:55.000 --> 07:58.000
Have you calculated the overhead of this?

07:58.000 --> 08:05.000
I don't expect that to have much overhead because that would be limited to migration and load balancing.

08:05.000 --> 08:07.000
So there would be some overhead on that,

08:07.000 --> 08:12.000
but I would be very careful about not impacting the scheduler fast path,

08:12.000 --> 08:19.000
because as soon as we start adding constraints that need to be taken into account in the fast path of the scheduler,

08:19.000 --> 08:22.000
that would be NAK'd by the scheduler maintainers.

08:22.000 --> 08:24.000
So that's the target.

08:28.000 --> 08:36.000
Do you have time for one or two questions?

08:36.000 --> 08:38.000
Do you already have patches?

08:38.000 --> 08:39.000
Not yet.

08:39.000 --> 08:44.000
I'm actually, so I've started discussions with some,

08:44.000 --> 08:48.000
well, it's not at the top of my priority stack at the moment.

08:49.000 --> 08:53.000
All right.

08:53.000 --> 08:54.000
All right, thank you.

08:54.000 --> 08:55.000
Thank you so much.

