WEBVTT

00:00.000 --> 00:16.000
I will talk about what I learned by writing my own container manager from scratch, going

00:16.000 --> 00:17.000
rootless.

00:17.000 --> 00:22.000
Some lessons that I learned I want to share with you.

00:22.000 --> 00:27.000
I'm Luca Di Maio, 89luca89 on GitHub, and I'm a software engineer at

00:27.000 --> 00:31.000
Chainguard. Lots of open source, I like it.

00:31.000 --> 00:35.000
These are my contacts for various things.

00:35.000 --> 00:40.000
So what's Lilipod? Lilipod is the container manager that I wrote,

00:40.000 --> 00:45.000
because I needed something for my other open source project that I work on,

00:45.000 --> 00:51.000
which is called Distrobox, where I needed a fallback solution

00:51.000 --> 00:55.000
for when a host doesn't have Podman or Docker for some reason.

00:55.000 --> 00:59.000
And I wanted something that was self-contained, so easy to install.

00:59.000 --> 01:04.000
Ideally, just a single binary without any other external dependencies,

01:04.000 --> 01:08.000
being lightweight in size.

01:08.000 --> 01:14.000
And as fast as possible. I didn't need everything that Podman and Docker do.

01:14.000 --> 01:22.000
So just the bare minimum that will let me boot up a simple container for my

01:22.000 --> 01:24.000
other projects.

01:24.000 --> 01:28.000
And also I wanted to learn about containers and improve my Go.

01:28.000 --> 01:30.000
So that's what I wrote.

01:30.000 --> 01:33.000
So what are containers?

01:33.000 --> 01:37.000
Containers are a way to virtualize a system without

01:37.000 --> 01:43.000
resorting to emulation or virtualization of a whole OS.

01:43.000 --> 01:48.000
So in a VM, you have your host operating system running a

01:48.000 --> 01:54.000
hypervisor, where you run another operating system from the bootloader up,

01:54.000 --> 01:59.000
and then in that operating system, you run whatever service or application you need.

01:59.000 --> 02:03.000
With a container, you have your host operating system,

02:03.000 --> 02:11.000
and then you use your operating system's main kernel to isolate other

02:11.000 --> 02:12.000
rootfses.

02:12.000 --> 02:18.000
We will see what they are, but basically other file systems where you run your apps or services.

02:18.000 --> 02:24.000
As you see, it basically removes a whole layer

02:24.000 --> 02:27.000
between the host and the guest.

02:27.000 --> 02:33.000
That's a pro because then it's very fast, very light, and it's easy to

02:33.000 --> 02:36.000
dispose of containers, just scrap them, create them.

02:36.000 --> 02:39.000
Very easy. On the security side,

02:39.000 --> 02:44.000
you're sharing your kernel between your host and workloads.

02:44.000 --> 02:49.000
It has implications, but we don't care about that now.

02:49.000 --> 02:52.000
There are some building blocks of containers.

02:52.000 --> 02:55.000
So you have a rootfs, or base file system, for the container.

02:55.000 --> 03:01.000
You have namespaces, which is how we separate it from the main operating system.

03:01.000 --> 03:07.000
Capabilities, which are what stuff inside the container can and cannot do.

03:07.000 --> 03:11.000
Cgroups, a way to constrain and separate resources.

03:11.000 --> 03:17.000
So I can say this container cannot take more RAM than a set value.

03:17.000 --> 03:20.000
Seccomp filters, for even more sandboxing.

03:20.000 --> 03:26.000
We can filter out a set of syscalls that

03:26.000 --> 03:29.000
we can deny to certain workloads.

03:29.000 --> 03:34.000
And then integration with various security modules,

03:34.000 --> 03:36.000
SELinux, AppArmor, or whatever.

03:36.000 --> 03:39.000
The first building block we will see is the rootfs.

03:39.000 --> 03:44.000
It's the base file system that is used by a Linux userland.

03:44.000 --> 03:49.000
In the case of Lilipod, I wanted to tap into the OCI registries,

03:49.000 --> 03:54.000
because this is the most widespread and biggest ecosystem, where I can find everything.

03:54.000 --> 03:59.000
There is Docker Hub, Quay, GHCR, blah blah blah.

03:59.000 --> 04:01.000
Many of them.

04:01.000 --> 04:06.000
When we query an OCI registry, what we get is a manifest.

04:06.000 --> 04:12.000
We can see it with a little docker manifest inspect of an Ubuntu and an Nginx image.

04:12.000 --> 04:18.000
We have a set of layers as objects in this JSON.

04:18.000 --> 04:24.000
This set of layers will give you a way to download each layer.

04:24.000 --> 04:27.000
And layers are shipped as tarballs.

04:27.000 --> 04:32.000
And you will have the checksum of the layer.

04:32.000 --> 04:37.000
So you can always verify that you haven't downloaded a corrupted layer.
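
NOTE
Editor's sketch in Go of reading the layer list out of a manifest, as described in the cues above.
It assumes a single-platform image manifest with a "layers" array of mediaType, size and digest.
Manifest, Layer and ParseLayers are hypothetical names for illustration, not Lilipod's code.
```go
package main
import (
	"encoding/json"
	"fmt"
)
// Layer mirrors one entry of the "layers" array in an image manifest.
type Layer struct {
	MediaType string `json:"mediaType"`
	Size      int64  `json:"size"`
	Digest    string `json:"digest"`
}
// Manifest holds only the fields this sketch needs.
type Manifest struct {
	SchemaVersion int     `json:"schemaVersion"`
	Layers        []Layer `json:"layers"`
}
// ParseLayers decodes a manifest and returns its layer digests in order.
func ParseLayers(raw []byte) ([]string, error) {
	var m Manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, fmt.Errorf("decoding manifest: %w", err)
	}
	digests := make([]string, 0, len(m.Layers))
	for _, l := range m.Layers {
		digests = append(digests, l.Digest)
	}
	return digests, nil
}
```
Each returned digest identifies one tarball to download and is also the checksum to verify it against.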

04:37.000 --> 04:40.000
This makes it also easy to do that.

04:40.000 --> 04:45.000
So what I did with Lilipod is to use Crane as a library.

04:45.000 --> 04:50.000
It's very handy to interface with OCI Container Registries.

04:50.000 --> 04:58.000
You can pull down the manifest and open it and basically go and download all layers.

04:58.000 --> 05:07.000
What I do is use the checksum, obviously, to know that what I downloaded is not corrupted.

05:07.000 --> 05:12.000
But I also rename each layer to the checksum itself.

05:12.000 --> 05:20.000
So for example, if I have an Ubuntu image and an Ubuntu-based Nginx image, the base Ubuntu layer will basically be named the same.

05:20.000 --> 05:27.000
I will know it, and I can use something like hard links to deduplicate identical layers.

05:27.000 --> 05:31.000
So you have a storage advantage.
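
NOTE
Editor's sketch in Go of the checksum idea above: verify a downloaded layer, then name it after its checksum.
It assumes sha256 digests of the form "sha256:<hex>" and a ".tar.gz" suffix chosen only for illustration.
VerifyLayer and LayerPath are hypothetical names, not Lilipod's code.
```go
package main
import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
	"strings"
)
// VerifyLayer checks blob against a digest of the form "sha256:<hex>".
func VerifyLayer(blob []byte, digest string) error {
	if !strings.HasPrefix(digest, "sha256:") {
		return fmt.Errorf("unsupported digest %q", digest)
	}
	sum := sha256.Sum256(blob)
	if got := hex.EncodeToString(sum[:]); got != strings.TrimPrefix(digest, "sha256:") {
		return fmt.Errorf("checksum mismatch: got %s", got)
	}
	return nil
}
// LayerPath names the stored layer after its checksum, so identical layers share one path.
func LayerPath(storeDir, digest string) string {
	return filepath.Join(storeDir, strings.TrimPrefix(digest, "sha256:")+".tar.gz")
}
```
Because the path depends only on the content, a layer that is already present can be hard-linked instead of stored twice.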

05:32.000 --> 05:35.000
So how can we use the rootfs?

05:35.000 --> 05:37.000
You can chroot into that.

05:37.000 --> 05:41.000
chroot is a very old

05:41.000 --> 05:48.000
Unix syscall that lets you change the root file system for a given process.

05:48.000 --> 05:54.000
You can basically enter it and basically it's the new root of that process.

05:54.000 --> 05:57.000
It's good for restricting file system access.

05:57.000 --> 06:02.000
I don't want, I don't know, this process to access my whole host file system.

06:02.000 --> 06:09.000
And it's useful to bring your own dependencies, libraries and stuff like that for a given process.

06:09.000 --> 06:14.000
So we go and chroot, but that doesn't work without root.

06:14.000 --> 06:17.000
Permission denied, operation not permitted.

06:17.000 --> 06:21.000
Because chroot needs us to be the root user.

06:21.000 --> 06:26.000
And for convenience, we also want to mount additional file systems.

06:26.000 --> 06:32.000
For example, a sysfs, a procfs, various tmpfs, or stuff like that.

06:32.000 --> 06:35.000
Well, I want to mount something inside.

06:35.000 --> 06:43.000
What we can use to go rootless is the other building block of containers,

06:43.000 --> 06:45.000
which are namespaces.

06:45.000 --> 06:55.000
So namespaces are a technology provided by the kernel itself,

06:55.000 --> 07:04.000
to have some sort of isolated view of resources only for a given process.

07:04.000 --> 07:09.000
There are various ways, various types of namespaces.

07:09.000 --> 07:12.000
And it's basically how containers contain, basically.

07:12.000 --> 07:22.000
So we have the mount namespace, which basically gives the process a local copy of the mount tree of the file system.

07:22.000 --> 07:29.000
So it can be manipulated by the process without affecting the mount tree of the whole system,

07:29.000 --> 07:31.000
just for the process.

07:31.000 --> 07:33.000
The same goes for users.

07:33.000 --> 07:39.000
And UTS is for host names; then PID, IPC, network and time namespaces.

07:39.000 --> 07:41.000
I think that's the newest one.

07:41.000 --> 07:49.000
So what we can do is call the unshare syscall to fork the process into a new namespace.

07:49.000 --> 07:56.000
In the case of chroot, we need the user namespace and the mount namespace.

07:56.000 --> 08:06.000
And in that namespace, the process that you are launching is able to change the mount tree and the user tree,

08:06.000 --> 08:11.000
because it's just a local modification that doesn't affect the rest of the system.

08:11.000 --> 08:13.000
This is a little example.

08:13.000 --> 08:20.000
When we unshare or when we clone, we can, in the new namespace, map the user to something else.

08:20.000 --> 08:31.000
Like, Alice can become UID 96, www-data, or Bob can become foo, number 1000, and stuff like that.

08:31.000 --> 08:38.000
For chroot, we just need Alice to become root.

08:39.000 --> 08:45.000
We unshare the mount namespace and the user namespace, and map ourselves to the root user.

08:45.000 --> 08:49.000
We chroot to our tmp, sorry, to our rootfs.

08:49.000 --> 08:58.000
And success, we are in our new, very, very rudimentary and simple container.

08:58.000 --> 09:08.000
In Lilipod, we are using the SysProcAttr to unshare the various namespaces, and because it's configurable,

09:08.000 --> 09:11.000
we can share some and unshare others.
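
NOTE
Editor's sketch in Go of configuring namespaces through the standard library's syscall.SysProcAttr (Linux only).
NamespaceAttr and its shareNetwork option are hypothetical names for illustration, not Lilipod's code.
The single-ID mapping mirrors the "map ourselves to root" step from the talk.
```go
package main
import (
	"os"
	"syscall"
)
// NamespaceAttr builds the attributes for a child started in new namespaces.
// shareNetwork shows that one namespace can stay shared while others are unshared.
func NamespaceAttr(shareNetwork bool) *syscall.SysProcAttr {
	flags := uintptr(syscall.CLONE_NEWUSER | syscall.CLONE_NEWNS | syscall.CLONE_NEWPID | syscall.CLONE_NEWUTS)
	if !shareNetwork {
		flags |= syscall.CLONE_NEWNET
	}
	return &syscall.SysProcAttr{
		Cloneflags: flags,
		// Map the calling user and group to ID 0 (root) inside the new user namespace.
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getuid(), Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getgid(), Size: 1}},
	}
}
```
The result would be assigned to the SysProcAttr field of an exec.Cmd before starting the child process.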

09:11.000 --> 09:19.000
And you will see here that I'm not using chroot, but pivot_root, which is another syscall

09:19.000 --> 09:26.000
with a similar purpose, because chroot can be escaped easily.

09:27.000 --> 09:33.000
With that simple line of code, you can basically escape any chroot.

09:33.000 --> 09:39.000
Because the mount tree inside the chroot is not changing.

09:39.000 --> 09:44.000
With pivot_root, it's different, because it can leverage the fact that we are in a mount namespace

09:44.000 --> 09:52.000
that we can manipulate as we want, to really switch the root instead of just changing it,

09:52.000 --> 09:56.000
and then remove the original rootfs, so it's not accessible anymore.

09:56.000 --> 10:02.000
So you can't just escape by going up and up, right?

10:02.000 --> 10:07.000
How it works: we start with our rootfs on /.

10:07.000 --> 10:12.000
And we have a new root, which is what we want to pivot to.

10:12.000 --> 10:21.000
With the syscall, we swap the new root with the old root at the same time,

10:21.000 --> 10:28.000
and you're left with the new rootfs as the root, with the old root still mounted underneath it.

10:28.000 --> 10:36.000
But we can leverage the fact that we are in a mount namespace, and we can unmount the old root,

10:36.000 --> 10:43.000
and it disappears, so it's not accessible anymore from the process inside this namespace.
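
NOTE
Editor's sketch in Go of the pivot sequence described above, using the standard syscall package (Linux only).
It assumes it runs inside a fresh mount namespace; the ".old_root" directory name is a hypothetical choice.
PivotInto and OldRootPath are illustrative names, not Lilipod's code.
```go
package main
import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)
// oldRootName is a hypothetical directory used to park the old root during the pivot.
const oldRootName = ".old_root"
// OldRootPath returns where the old root is parked inside newRoot.
func OldRootPath(newRoot string) string {
	return filepath.Join(newRoot, oldRootName)
}
// PivotInto swaps the root to newRoot and detaches the old one.
func PivotInto(newRoot string) error {
	// Keep mount changes from propagating back to the host.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("making / private: %w", err)
	}
	// pivot_root needs newRoot to be a mount point, so bind-mount it onto itself.
	if err := syscall.Mount(newRoot, newRoot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind-mounting new root: %w", err)
	}
	if err := os.MkdirAll(OldRootPath(newRoot), 0o700); err != nil {
		return err
	}
	if err := syscall.PivotRoot(newRoot, OldRootPath(newRoot)); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	if err := os.Chdir("/"); err != nil {
		return err
	}
	// Detach the old root so nothing inside the namespace can reach it.
	if err := syscall.Unmount("/"+oldRootName, syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmounting old root: %w", err)
	}
	return os.Remove("/" + oldRootName)
}
```
After the unmount there is no path left that leads back to the host file system.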

10:43.000 --> 10:48.000
We go for it, and, for example, in Ubuntu, we have problems.

10:48.000 --> 10:51.000
So what's happening here?

10:51.000 --> 10:56.000
It cannot setgroups, it cannot setgid and setuid,

10:56.000 --> 11:01.000
and there is this _apt user inside here,

11:01.000 --> 11:08.000
because APT uses a sandbox of its own to download stuff for security purposes.

11:08.000 --> 11:19.000
And here we have setgroups set to deny, and we have only one user in this uid_map.

11:19.000 --> 11:25.000
We only mapped ourselves to root, and we didn't do anything else.

11:25.000 --> 11:34.000
So what we need here is to be able to map multiple users and to have the setgroups primitive enabled.

11:34.000 --> 11:36.000
We cannot do that.

11:36.000 --> 11:39.000
The only way to do that is to be root.

11:39.000 --> 11:45.000
So we have to use a little trick, which is using these newgidmap

11:45.000 --> 11:53.000
and newuidmap tools that are in the shadow package.

11:53.000 --> 11:58.000
Given that only root can do this stuff,

11:58.000 --> 12:05.000
those are all setuid binaries, so they actually run as root.

12:05.000 --> 12:13.000
But it's just for a brief moment, only for one thing, which is to map IDs for the child process.

12:13.000 --> 12:15.000
This is the unshared process.

12:15.000 --> 12:22.000
It's a very small tool with very little codebase, so it's easily auditable.

12:22.000 --> 12:34.000
And it has security checks, where for example, newuidmap and newgidmap can be called only by a parent

12:34.000 --> 12:37.000
on its own child and nothing else.

12:37.000 --> 12:43.000
So I cannot just change the mappings of whatever PID I want,

12:43.000 --> 12:47.000
just from the parent to the child.

12:47.000 --> 12:52.000
So what we will do: in the host namespace, we have our main.

12:52.000 --> 12:58.000
We clone and unshare into a new namespace, and we have the child process.

12:58.000 --> 13:04.000
Immediately we use our helper that launches newuidmap on it.

13:04.000 --> 13:10.000
So it has the right maps and mappings for users, setgroups and whatever.
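
NOTE
Editor's sketch in Go of how a parent could build the newuidmap (or newgidmap) command line for its child.
The tools take the child PID followed by triples of inside ID, outside ID and count.
IDMap and NewuidmapArgs are hypothetical names for illustration, not Lilipod's code.
```go
package main
import "strconv"
// IDMap is one mapping triple: first ID inside, first ID outside, and how many IDs.
type IDMap struct {
	Inside, Outside, Count int
}
// NewuidmapArgs returns the child PID followed by one triple per mapping.
func NewuidmapArgs(pid int, maps []IDMap) []string {
	args := []string{strconv.Itoa(pid)}
	for _, m := range maps {
		args = append(args, strconv.Itoa(m.Inside), strconv.Itoa(m.Outside), strconv.Itoa(m.Count))
	}
	return args
}
```
The parent would pass these arguments to exec.Command("newuidmap", args...) right after cloning the child.
The outside ranges have to be granted to the user, typically in /etc/subuid and /etc/subgid.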

13:10.000 --> 13:15.000
Then we do the pivot_root and run our entry point.

13:15.000 --> 13:22.000
It's very basic, so we are literally calling a shell command to it.

13:22.000 --> 13:25.000
That's the command that is run.

13:25.000 --> 13:31.000
But when we do that, the mappings here are changed.

13:31.000 --> 13:34.000
What are these numbers? Let's take a look.

13:34.000 --> 13:44.000
So the first number is the start of the range of IDs inside the namespace.

13:44.000 --> 13:51.000
This other number is the start of the IDs outside of the namespace, so in our host.

13:51.000 --> 13:55.000
And this is the size of the range, so how many of them.

13:55.000 --> 14:07.000
The first line is what we also had before, so it's mapping user zero inside the namespace, so root,

14:07.000 --> 14:16.000
to the user 1000 outside of the namespace, so ourselves most likely, for a range of one.

14:16.000 --> 14:19.000
So it's just a one-to-one mapping.

14:19.000 --> 14:27.000
User zero to 1000 outside, so we are actually root inside the namespace, even if we are not root outside.

14:27.000 --> 14:34.000
And then it says from user 1 to 65,000, whatever.

14:34.000 --> 14:39.000
It maps them to the user 100,000 and more.

14:39.000 --> 14:46.000
User 1 inside the container will be 100,000 outside of the container, user 2 will be 100,001, and so on.

14:46.000 --> 14:53.000
So it's very far apart, so it doesn't interfere with real users on the host.

14:53.000 --> 15:00.000
But we have actually a mapping of all the users to real users on the host.
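
NOTE
Editor's sketch in Go of reading the three columns just described and translating an ID to the host.
It assumes the text format of /proc/<pid>/uid_map: inside start, outside start, count.
IDMap, ParseIDMap and ToHost are hypothetical names for illustration.
```go
package main
import (
	"fmt"
	"strings"
)
// IDMap is one line of the map: first ID inside, first ID outside, and how many IDs.
type IDMap struct {
	Inside, Outside, Count int
}
// ParseIDMap reads one mapping per line from text.
func ParseIDMap(text string) ([]IDMap, error) {
	var maps []IDMap
	for _, line := range strings.Split(strings.TrimSpace(text), "\n") {
		var m IDMap
		if _, err := fmt.Sscan(line, &m.Inside, &m.Outside, &m.Count); err != nil {
			return nil, fmt.Errorf("parsing %q: %w", line, err)
		}
		maps = append(maps, m)
	}
	return maps, nil
}
// ToHost translates an ID inside the namespace to the ID seen on the host.
func ToHost(maps []IDMap, inside int) (int, bool) {
	for _, m := range maps {
		if inside >= m.Inside && inside < m.Inside+m.Count {
			return m.Outside + (inside - m.Inside), true
		}
	}
	return 0, false
}
```
With the two lines from the talk, ID 0 inside resolves to 1000 on the host and ID 1 resolves to 100000.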

15:00.000 --> 15:03.000
And now APT works, we have setgroups.

15:03.000 --> 15:12.000
The _apt user can run, so it can download packages and it's happy.

15:12.000 --> 15:28.000
The same thing can be done for the PID namespace, so I want to make sure that the process inside our unshared namespace cannot look at other PIDs outside.

15:28.000 --> 15:39.000
So it can't interfere, I don't know, with my own processes, or with the real PID 1 or PID 2 or PID 3 or whatever.

15:39.000 --> 15:45.000
So it clones the process tree and maps it back to 1.

15:45.000 --> 15:55.000
So then all the children of 1 will be the children of 65 here, and are mapped to 2, 3 and 4 in the PID namespace.

15:56.000 --> 15:59.000
And the same can be done for the network namespace.

15:59.000 --> 16:07.000
It's a little bit tricky here because when we unshare the network namespace, we are just left with localhost and nothing else.

16:07.000 --> 16:24.000
So for now, Lilipod doesn't do this type of networking, but you can then create bridges and interfaces here and here, and then control the network access of the namespace.

16:24.000 --> 16:37.000
But that's not enough. Then we have capabilities. So what are capabilities? They were introduced in kernel 2.2, so they're quite old.

16:37.000 --> 16:48.000
But basically, before kernel 2.2, the root user could do everything and a non-root user could do just a subset of things.

16:48.000 --> 17:01.000
After that, they basically split all the privileges that the root user has into little capabilities that you can also hold individually.

17:01.000 --> 17:12.000
And right now, what's happening is I have this set of capabilities, and when the container starts, it's inheriting all of them.

17:12.000 --> 17:20.000
And that's dangerous because capabilities are a way to access a set of syscalls.

17:20.000 --> 17:30.000
For example, with CAP_SYS_MODULE, you can modprobe stuff. With CAP_SYS_CHROOT, you can chroot.

17:30.000 --> 17:46.000
With CAP_SYS_PTRACE, you can ptrace, or you can chown, and other dangerous stuff. And we don't want that because this can lead to escaping a container pretty easily.

17:46.000 --> 17:52.000
And so we want to drop whatever is not needed to run our own container.

17:52.000 --> 18:00.000
This is a list of capabilities that I very much copied from Docker.

18:00.000 --> 18:12.000
It's a very restricted set of stuff, but it lets you do whatever is needed to enter a container or start whatever you have inside.

18:12.000 --> 18:20.000
And we can drop them, and we have fewer now, so it's more secure.
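
NOTE
Editor's sketch in Go of computing which capabilities to drop from a list of those to keep.
The keep list follows Docker's documented default set, with numbers from linux/capability.h.
Lilipod's exact list is not shown in the transcript, so this is an assumption; KeepMask and ToDrop are hypothetical names.
```go
package main
// keep lists the capabilities to retain, keyed by name, with their kernel numbers.
var keep = map[string]uint{
	"CAP_CHOWN": 0, "CAP_DAC_OVERRIDE": 1, "CAP_FOWNER": 3, "CAP_FSETID": 4,
	"CAP_KILL": 5, "CAP_SETGID": 6, "CAP_SETUID": 7, "CAP_SETPCAP": 8,
	"CAP_NET_BIND_SERVICE": 10, "CAP_NET_RAW": 13, "CAP_SYS_CHROOT": 18,
	"CAP_MKNOD": 27, "CAP_AUDIT_WRITE": 29, "CAP_SETFCAP": 31,
}
// KeepMask returns a bitmask with one bit set per retained capability.
func KeepMask() uint64 {
	var mask uint64
	for _, n := range keep {
		mask |= 1 << n
	}
	return mask
}
// ToDrop lists every capability number from 0 to lastCap that is not retained.
func ToDrop(lastCap uint) []uint {
	mask := KeepMask()
	var drop []uint
	for n := uint(0); n <= lastCap; n++ {
		if mask&(1<<n) == 0 {
			drop = append(drop, n)
		}
	}
	return drop
}
```
The highest capability number on a running kernel can be read from /proc/sys/kernel/cap_last_cap.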

18:20.000 --> 18:31.000
We're almost there. So we can pull from our registry, we get our own images, we create our own rootfs, we can run it, we are root inside it.

18:31.000 --> 18:42.000
We cannot see processes outside of it. We have the right mapping, we have the right setgroups, we have the right capabilities, and we don't have network.

18:42.000 --> 18:49.000
Like we wanted at the beginning, but that's not enough to have a really secure container.

18:49.000 --> 19:05.000
We are still missing cgroups, so we can limit the resource access of a container. We are still missing seccomp filters to limit the syscalls that the container can make.

19:05.000 --> 19:13.000
We are missing SELinux or AppArmor integration, so that we can go even further beyond.

19:13.000 --> 19:20.000
This is the repo on GitHub, it's under Lilipod, and thanks.

19:20.000 --> 19:30.000
Are there any questions?

19:30.000 --> 19:38.000
There's one.

19:38.000 --> 19:50.000
Okay, so I'm pretty curious, you mentioned that you needed root access for this helper, you know, for a group mapping, and so do you know how projects like Podman do that?

19:50.000 --> 19:51.000
The same thing.

19:51.000 --> 20:04.000
So they use newuidmap, just for that brief moment, and then they are rootless otherwise.

20:04.000 --> 20:06.000
Actually, it's not truly rootless.

20:06.000 --> 20:11.000
It's rootless as much as it can.

20:11.000 --> 20:25.000
Actually, just making a quick comment on that one because it's kind of funny: Alex, who was standing there, and myself, we've been working for probably a couple of years now on fully isolated unprivileged user namespaces.

20:25.000 --> 20:35.000
So with that, every single user on the system will be able to get an entire, well, an entire 32-bit UID and GID map, without needing the privileged helper at all.

20:35.000 --> 20:45.000
It will still be a bit tricky for file system access because you need to map that, but you'll be able to use FUSE, or tmpfs, or whatever, and remove that path completely.

20:45.000 --> 20:50.000
I look forward to it.

20:50.000 --> 20:53.000
Hey, I have a question.

20:53.000 --> 21:03.000
Basically, do you plan to add any Distrobox functionality that can be used only by Lilipod instead of Podman?

21:03.000 --> 21:09.000
I want to keep Distrobox container-manager agnostic.

21:09.000 --> 21:14.000
So there won't be any special treatment for Lilipod.

21:14.000 --> 21:25.000
It will just be the fallback because it's like an 8 megabyte binary without external dependencies outside of the basic syscalls that we are making.

21:25.000 --> 21:32.000
So this is the fallback solution for when I cannot find Docker or Podman.

21:32.000 --> 21:33.000
Thank you.

21:33.000 --> 21:36.000
Thank you.

21:36.000 --> 21:38.000
Anyone else?