WEBVTT

00:00.000 --> 00:11.400
Now that it is, you know what I'm doing, it is time for the last talk today. Remy is

00:11.400 --> 00:16.120
unfortunately sick, but Baker is going to take over because he worked with Remy on this

00:16.120 --> 00:22.400
project, making dnsdist actually fit on tiny devices with not a lot of CPU and

00:23.400 --> 00:34.200
so enjoy. Yes, thank you. Thank you. As Peter said, I worked with Remy on this project. This slide

00:34.200 --> 00:40.280
deck was written by him and focuses on the bits he did. I will just try to get through them as

00:40.280 --> 00:52.320
well as I can. What is dnsdist? Who here knows what dnsdist is? Who doesn't? Okay, a couple

00:52.320 --> 00:58.120
of people. dnsdist is a DNS proxy. I think there's a diagram. Right, so most people

00:58.120 --> 01:02.520
deploy it like this. You have a bunch of clients. These might be laptops or resolvers on the

01:02.520 --> 01:08.840
internet. They speak to dnsdist, which load-balances traffic to a bunch of backends. This is

01:08.840 --> 01:16.560
the common dnsdist deployment model at ISPs and hosters, etc. However, dnsdist can

01:16.560 --> 01:23.160
also sit on the client end. Imagine, in this case, dnsdist living on your laptop or on your

01:23.160 --> 01:28.760
home router or whatever, talking to a backend resolver somewhere out there, your ISP's or

01:28.760 --> 01:36.320
Quad9 or whatever. And it is that use case for which dnsdist on your router might

01:36.320 --> 01:44.960
make sense. So in 2021, a bunch of people said that dnsdist would

01:45.040 --> 01:52.520
be great because it would enable DNS encryption from the router to the provider or Quad9,

01:52.520 --> 01:58.180
Quad8, whatever. And dnsdist already supports all the encrypted protocols, unlike

01:58.180 --> 02:04.000
dnsmasq, which doesn't support as many. So we said, well, it's not that big, so we can

02:04.000 --> 02:13.240
just do that. Well, we were wrong. So in 2022, OpenWrt supported devices as small as this:

02:13.280 --> 02:21.720
four megabytes of flash, 32 megabytes of RAM. In 2025, four megabytes is unlikely to even fit

02:21.720 --> 02:29.000
a useful kernel, but back then they really tried. So we bought a bunch of these TP-Link boxes,

02:29.000 --> 02:34.760
which was a picture, but it looks like any TP-Link router you've seen: it's black, with three antennas.

02:35.640 --> 02:40.840
It's slightly better specced. It has like four times the numbers I mentioned.

02:42.680 --> 02:49.400
So would we be able to boot a kernel, run OpenWrt on it, have a web interface, and fit dnsdist

02:49.400 --> 03:01.640
in with that? Turns out dnsdist is not small. The binary size in a compilation that relies on

03:01.720 --> 03:09.080
shared libraries is like 9 megabytes today, plus the shared libraries, which may not be present

03:09.080 --> 03:21.320
on your router. So we realized that we were underestimating or overestimating our ability to fit

03:21.400 --> 03:28.680
inside a router, and also we didn't know what we were doing to begin with. On

03:28.680 --> 03:36.600
OpenWrt, two things matter: the size of your binary in the flash file system,

03:36.600 --> 03:42.840
which uses compression, and the amount of memory you use that is only yours.

03:42.840 --> 03:54.520
So this memory definition is called the proportional set size, and it turns out it is

03:54.520 --> 03:59.240
super hard to measure. Like all things in memory, you would think if you have a device with

03:59.240 --> 04:04.520
16 megabytes of memory, counting a few bytes here and there would become easier, and it is easier,

04:04.920 --> 04:12.760
but it's still not easy. Your memory is split up in a bunch of areas that all have different properties:

04:13.000 --> 04:17.960
some are just yours, some are swapped in from your compressed binary, some are

04:19.240 --> 04:25.160
swapped in from a shared library that you may or may not be the only user of, or that part of the

04:25.160 --> 04:31.480
library may or may not be just for you. So it turns out it's quite hard to actually count these things.
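
NOTE
A minimal sketch of one way to count this on Linux, assuming the kernel exposes per-mapping "Pss:" lines in /proc/<pid>/smaps. The helper names total_pss_kib and read_pss_kib are hypothetical, and this is not necessarily the counting method used in the talk.
```python
import re
def total_pss_kib(smaps_text: str) -> int:
    """Sum the 'Pss:' values (reported in kB) of every mapping in an smaps dump."""
    # Anchoring on 'Pss:' skips related lines such as 'Pss_Anon:' or 'Rss:'.
    return sum(int(m.group(1)) for m in re.finditer(r"^Pss:\s+(\d+) kB$", smaps_text, re.MULTILINE))
def read_pss_kib(pid: int) -> int:
    """Read /proc/<pid>/smaps and return the proportional set size in kB."""
    with open(f"/proc/{pid}/smaps") as handle:
        return total_pss_kib(handle.read())
```
Pages shared by several processes are divided among them, so a library mapped by two processes contributes only half of its shared pages to each total.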

04:33.160 --> 04:40.680
But we managed to define a way of counting, and at that point, this is what we found.

04:40.680 --> 04:46.120
dnsdist itself needed about two megabytes of memory, libcrypto also needed about two

04:46.120 --> 04:52.760
megabytes of memory, and then we needed another one and a half megabytes for other libraries we

04:52.760 --> 04:59.480
needed, including libssl. You'll notice that OpenSSL is actually the biggest memory user

04:59.480 --> 05:07.080
in this scenario, and then we needed two more megabytes just for storing our data, state, etc.

05:07.560 --> 05:15.400
So that's quite a bit more than four megabytes. So we tried a few easy things.
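
NOTE
A small tally of the approximate figures quoted above, assuming the third figure is one and a half megabytes. The dictionary keys are descriptive labels chosen here, not names from the slides.
```python
# Approximate memory figures quoted in the talk, in megabytes.
usage_mb = {"dnsdist": 2.0, "libcrypto": 2.0, "other libraries incl. libssl": 1.5, "data and state": 2.0}
total_mb = sum(usage_mb.values())
# Flash size of the smallest supported devices mentioned earlier, for comparison.
flash_mb = 4.0
print(f"total: {total_mb} MB, {total_mb - flash_mb} MB more than {flash_mb} MB")
```
Under these assumptions the starting point is about 7.5 megabytes.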

05:16.520 --> 05:22.760
The dnsdist defaults are for big setups: machines with four gigabytes of RAM, handling thousands of

05:22.760 --> 05:28.040
clients, etc. Your home router is not like that. So you take all these numbers that are default

05:28.040 --> 05:34.120
and tune them down. Only 50 outgoing queries at a single time, a single TCP thread, a single

05:34.120 --> 05:42.040
DoH thread, very small buffers in which we keep track of recent queries, etc. This helps some.

05:43.320 --> 05:51.240
Next up, dnsdist has a shit ton of features, many of which you might not need on your home

05:51.240 --> 05:58.040
router, like the CDB or LMDB support. Those were built for doing million-entry blocklists,

05:59.000 --> 06:05.880
which you may want on your router, but not everybody would want it. So we took the OpenWrt

06:06.440 --> 06:13.080
dnsdist package and added a whole bunch of extra flags to it to allow you to get rid of features

06:13.080 --> 06:20.040
completely at compile time. Then there are the compiler and linker flags. Those influence binary

06:20.040 --> 06:27.640
size and memory usage a lot. As most of you would know, you can tell GCC or Clang or whatever to

06:27.720 --> 06:33.560
optimize zero, optimize one, optimize two. There's a funny little other one called optimize

06:33.560 --> 06:41.160
for size, which does something between one and two, I think, except when it would make the binary

06:41.160 --> 06:48.520
bigger. And this helps some. Hiding symbols just reduces the size of symbol tables, which is helpful.

06:48.520 --> 06:54.200
Link-time optimization is quite an interesting one because it removes dead code,

06:54.840 --> 07:00.920
even from libraries you might be linking statically. And I'll get to it later, but linking libraries

07:00.920 --> 07:07.800
statically might actually reduce disk usage because you can get rid of code that nobody's using.

07:09.880 --> 07:17.320
The bottom one was a bit of a pity: disabling position-independent, I think it's, execution,

07:17.320 --> 07:28.040
because PIE offers more security, but it makes the binary bigger. So that was quite good.

07:28.040 --> 07:34.520
The binary dropped below two megabytes, compressed by the file system choices OpenWrt makes,

07:35.240 --> 07:39.320
and memory dropped a bit as well. So that was decent.

07:41.160 --> 07:46.120
Okay, right. So then let's figure out what is happening, where is all this memory going?

07:47.400 --> 07:52.920
The heap is the most important bit because it cannot be swapped out or swapped in. Most routers do not have

07:52.920 --> 07:57.960
swap configured. And of course, the binary and the libraries are basically swapped in and out

07:57.960 --> 08:02.280
when necessary, but heap memory will just sit there, in physical memory.

08:06.040 --> 08:10.440
So Remy used Valgrind to investigate some of that.

08:11.400 --> 08:21.000
I have to say I haven't seen the picture before myself, but it also might be hard to read, but the

08:21.000 --> 08:29.160
big reddish thing at the bottom is CRYPTO_malloc. So a lot of memory is really going to

08:29.160 --> 08:36.120
libcrypto allocating things, keeping state around for whatever reason. Then there's heaptrack,

08:37.000 --> 08:41.320
which makes these nice flame graphs of where memory is being allocated,

08:43.320 --> 08:46.120
which I also have yet to use myself, so I can't tell you much about it.

08:48.440 --> 08:55.160
But realizing that OpenSSL was doing a lot of the allocating helped us strip some more

08:57.080 --> 09:04.600
unused stuff from the binary. So we don't load ciphers and digests we don't need, or error messages.

09:04.680 --> 09:09.480
OpenSSL, although they're not quite good, can give quite extensive error messages,

09:09.480 --> 09:16.360
and we figured we don't really need those. Apparently, some things would allocate big and then

09:16.360 --> 09:23.240
be shrunk, which we could skip. And there are links to all the code for these changes down there.

09:24.840 --> 09:30.600
Then there's libh2o, which at the time we used to offer DoH. That library is dead;

09:30.600 --> 09:35.880
it's not being maintained anymore in any useful capacity, but back then it is what we had.

09:36.920 --> 09:42.760
It turned out that library also contains a bunch of things that we didn't need for running on a

09:42.760 --> 09:51.400
router. And indeed, the slide also mentions, we no longer use h2o. We now use nghttp2, which sadly is

09:51.400 --> 09:57.880
what everybody uses for DoH. So there's a bit of an ecosystem problem there in that if nghttp2

09:57.960 --> 10:02.440
has a bug, then all open source DoH implementations out there will have that bug.

10:04.040 --> 10:09.480
So I hope something else arrives at some point, but right now this is the state of things.

10:10.920 --> 10:16.680
And again, we reduced a buffer because 8 kilobytes is more than enough for most DoH requests.

10:18.760 --> 10:25.640
We tried using wolfSSL, which is a nice project, and it has an OpenSSL compatibility layer.

10:26.520 --> 10:36.040
But it did not really reduce memory, and adding this extra dependency does not help the moment

10:36.040 --> 10:41.160
some other program does want OpenSSL on the same machine, because then you have both libraries.

10:44.120 --> 10:47.480
So that switch did not do anything for us, but it was worth a shot.

10:47.640 --> 10:58.920
We tried UPX, which compresses binaries, not in the file system layer, but on a

10:58.920 --> 11:04.040
different layer. The problem with that is that it decompresses into memory, or, if you're

11:04.040 --> 11:10.520
unlucky, even onto a temp file system, which means you actually lose the benefits of demand paging.

11:10.520 --> 11:18.520
All right, next step is to try even harder. There's this tool called Bloaty; I think there's

11:18.520 --> 11:27.320
output here, yeah. So we build a binary with the features we want, we copy it, we strip it,

11:27.320 --> 11:33.000
and we run Bloaty on the stripped copy, because that's the one we want to measure sizes in,

11:33.000 --> 11:36.120
but we use the original binary on the side to steal the debug symbols.

11:37.080 --> 11:46.040
And we found out that a lot of our memory was going to Lua. Lua is the programming language we use for

11:46.040 --> 11:50.920
writing dnsdist configurations in, so it cannot just go, but still, perhaps there was some

11:52.840 --> 12:00.120
room we could get back. So we realized that some of the structures in memory were padded

12:00.120 --> 12:11.640
inefficiently for memory purposes, and we realized that preventing false sharing, which means

12:11.640 --> 12:19.000
having unrelated variables not living close together in memory, costs a lot of memory.
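
NOTE
A minimal sketch of both effects using Python's ctypes, which lays out structures the way a C compiler would. The structure and field names are hypothetical, and the 64-byte cache line is an assumption about the target CPU, not a figure from the talk.
```python
import ctypes
class Padded(ctypes.Structure):
    # small, large, small: alignment padding is inserted around 'counter'
    _fields_ = [("flag_a", ctypes.c_uint8), ("counter", ctypes.c_uint64), ("flag_b", ctypes.c_uint8)]
class Reordered(ctypes.Structure):
    # largest field first: the two small fields share the tail of the structure
    _fields_ = [("counter", ctypes.c_uint64), ("flag_a", ctypes.c_uint8), ("flag_b", ctypes.c_uint8)]
CACHE_LINE = 64  # assumed cache line size in bytes
class Separated(ctypes.Structure):
    # each counter gets its own assumed cache line, which prevents false sharing
    _fields_ = [("hits", ctypes.c_uint64), ("pad_a", ctypes.c_uint8 * (CACHE_LINE - 8)),
                ("misses", ctypes.c_uint64), ("pad_b", ctypes.c_uint8 * (CACHE_LINE - 8))]
class Together(ctypes.Structure):
    # same two counters packed next to each other: smaller, but they share a cache line
    _fields_ = [("hits", ctypes.c_uint64), ("misses", ctypes.c_uint64)]
for layout in (Padded, Reordered, Separated, Together):
    print(layout.__name__, ctypes.sizeof(layout))
```
The exact sizes of the first two layouts depend on the platform's alignment rules, but the reordered one should not be larger than the padded one.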

12:19.000 --> 12:25.560
So if you put those variables together, you can save memory at the cost of some performance on

12:25.640 --> 12:31.800
big multi-threaded machines, which your router is not. Then there's the number of threads,

12:31.800 --> 12:37.880
the config I showed earlier did some of this, but it turns out we could strip even more threads

12:37.880 --> 12:42.600
from the process, which saves a lot of memory, because each thread comes with a stack,

12:42.600 --> 12:46.440
and the things it is doing also include a lot of state, of course.

12:46.920 --> 12:58.040
So for OpenWrt, we implemented a simpler thread model, where for many things, we don't even have

12:58.040 --> 13:08.520
multiple threads capable of doing the handling, but just one. Then we found out that we were

13:08.680 --> 13:18.360
fragmenting memory, and as Andre said, the best way to synchronize threads is to not synchronize

13:18.360 --> 13:24.760
them, the best way to allocate memory is to not allocate it. So it turned out there were a bunch

13:24.760 --> 13:31.480
of allocations we could get rid of. The Lua garbage collector is not very aggressive by default,

13:31.480 --> 13:37.320
so we now trigger it a couple of times during startup, which just goes and cleans up all those temporary

13:37.400 --> 13:43.960
objects that we don't need anymore. And of course, the number of threads helps a lot.

13:45.080 --> 13:52.760
We also tried linking the C++ standard library in. This made memory usage slightly smaller,

13:52.760 --> 13:59.800
but it means we now are the distributor of libstdc++, and again, if a second program wants to use

13:59.800 --> 14:08.120
the same library, then we are causing more memory usage, which is waste. So that was roughly where

14:08.120 --> 14:15.560
this ended up. We were slightly over target, but way smaller than before. The return on investment

14:15.560 --> 14:26.280
of other tricks we thought we could do would be small. So we set up an OpenWrt feed on our website,

14:26.360 --> 14:35.080
and we have contributed all of this upstream to the OpenWrt project, and this is the current status

14:35.640 --> 14:42.280
of dnsdist for the current OpenWrt stable release. The binary, compressed, is one and a half

14:42.280 --> 14:48.120
megabytes, or five and a half uncompressed, which means five and a half megabytes of maximum memory use for mapping

14:48.120 --> 14:54.440
the binary in. That's with all the features. Yes, so we now have two builds, a full one,

14:54.520 --> 15:01.400
and a not full one. The difference is roughly the list I showed before: CDB, LMDB, etc.

15:02.760 --> 15:09.480
And the memory usage total, so that's binary, libraries, and heap, is around four megabytes now,

15:10.760 --> 15:16.280
which is still a lot on your home router, but it is a lot better than it was before.

15:17.480 --> 15:23.240
While we were doing this, we also added UCI integration. If you ever worked with OpenWrt,

15:23.400 --> 15:29.160
you will have seen UCI or LuCI, which allows you to configure the software that runs on your

15:29.160 --> 15:34.600
system. Before we did that, all you could do was edit the dnsdist config, and there was no integration.

15:35.480 --> 15:41.960
So this will make things a lot better for OpenWrt users with dnsdist. It's quite a big PR,

15:41.960 --> 15:45.800
so they haven't gotten around to reviewing it yet, but I'm sure they will soon.

15:46.760 --> 15:53.560
Here's an example of the UCI config. I'm not even sure that's big enough to read here,

15:53.560 --> 16:02.360
but you can read it later. Oh, this is a fun one. We also added DDR. Designated...

16:03.400 --> 16:04.200
Somebody help me.

16:04.440 --> 16:06.440
Okay.

16:12.280 --> 16:18.120
Discovery of Designated Resolvers, right? Right. Yeah.

16:18.120 --> 16:22.760
This allows dnsdist to tell your clients, like your iPhones and your Android devices,

16:23.400 --> 16:29.160
that they can use DoT or DoH to talk to your dnsdist, and you get a situation where

16:29.960 --> 16:36.440
DNS is encrypted from your mobile to your router, and then encrypted from your router to some upstream,

16:37.960 --> 16:44.040
which is better than no encryption, but also a bit weird to have that re-encryption step in between.

16:44.040 --> 16:47.320
However, this allows you to do filtering on that device.
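
NOTE
A minimal sketch of the client side of DDR, assuming the mechanism from RFC 9462: the client asks its configured resolver for the SVCB record (type 64) of _dns.resolver.arpa and learns the encrypted endpoints from the answer. The helper names encode_name and build_ddr_query are hypothetical, and the sketch only builds the query bytes without sending them.
```python
import struct
SVCB = 64  # resource record type SVCB
CLASS_IN = 1  # Internet class
def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"
def build_ddr_query(query_id: int = 0x1234) -> bytes:
    """Build the wire format of a DDR discovery query."""
    # Header: ID, flags with only RD set, one question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = encode_name("_dns.resolver.arpa") + struct.pack("!HH", SVCB, CLASS_IN)
    return header + question
```
A resolver that supports DDR answers with SVCB records describing where its DoT or DoH service can be reached.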

16:51.400 --> 16:52.520
And that is the last slide.

16:52.920 --> 16:54.520
Yes.

16:55.560 --> 16:57.560
Question.

17:01.080 --> 17:02.280
Thank you. Yes.

17:10.280 --> 17:13.320
If I could remove the Lua part, how much memory would I save?

17:14.600 --> 17:21.800
Not a lot, because OpenWrt already ships Lua, although they're getting rid of that.

17:21.880 --> 17:27.480
So it might be an interesting experiment for the next OpenWrt version.

17:36.920 --> 17:41.080
Have we tried linking statically against libssl with LTO enabled?

17:43.480 --> 17:51.000
I think we did, but given that OpenSSL is already installed, this will never give us benefits.

17:51.480 --> 17:57.000
If OpenSSL is not installed, and we knew we were the only user, this would probably be the right

17:57.000 --> 17:59.240
course of action.

18:03.240 --> 18:09.240
If I use dnsdist on OpenWrt, am I still using dnsmasq, so that dnsdist

18:09.240 --> 18:15.960
is just between dnsmasq and my provider, or do you replace dnsmasq?

18:16.440 --> 18:22.760
So the way I've been running this at home for quite some time is I, sorry, the question is, does this

18:22.760 --> 18:30.120
replace dnsmasq, or sit beside it? It could do either. You need a DHCP server.

18:30.840 --> 18:35.400
However, you can also use odhcpd, which is actually a better DHCP server, I found.

18:36.200 --> 18:41.400
So I've been running it at home. My office has been behind an OpenWrt box,

18:41.480 --> 18:46.760
with no dnsmasq on it for quite some time, using odhcpd and dnsdist.

18:48.360 --> 18:53.720
And some of the UCI integration we did also offers host name,

18:53.720 --> 18:58.600
for LAN resolving, where dnsdist pulls that information from the DHCP server.

19:02.280 --> 19:08.040
I guess that was the question earlier, from Matrix: the question is about DHCP integration,

19:08.360 --> 19:12.360
which is indeed in dnsmasq, and whether it is implemented in dnsdist somehow.

19:12.360 --> 19:17.480
Right, I did indeed just answer exactly that, very good. Anybody else?

19:20.360 --> 19:21.880
Okay, thank you all.

