WEBVTT

00:00.000 --> 00:19.760
So, my presentation is about DNS, but not only about DNS, and actually

00:19.760 --> 00:24.360
this idea originated as an internal company contest.

00:24.360 --> 00:29.340
we had a contest for every employee of our company: take any data

00:29.340 --> 00:35.600
that you like and make an application out of it. It was just for one or two days, and

00:35.600 --> 00:42.040
apparently I was the only participant in this contest, so I had a lot of fun, and I have

00:42.040 --> 00:43.040
this presentation.

00:43.040 --> 00:56.280
So, I need an idea, I need a nice dataset, and I need a way to visualize and analyze

00:56.280 --> 01:02.080
this dataset. To get a dataset, the easiest way is to get it from the internet, and

01:02.080 --> 01:10.040
we will get an internet-scale dataset. So, how to get it from DNS? Just the basics.

01:10.040 --> 01:14.280
You can do forward DNS requests, and the easiest way to do it from a command line is to

01:14.280 --> 01:20.760
just use the host command-line tool, and for google.com it will output an IPv4 address, an IPv6 address,

01:20.760 --> 01:26.940
and something else. And you can do a reverse DNS request for an IP address, and it will output something,

01:26.940 --> 01:36.520
for example, a name. But for Google, for the same IP address, it outputs something

01:36.520 --> 01:44.240
strange: 1e100.net. A question for you: what does it mean, 1e100?

01:44.240 --> 01:53.860
1e100 is the scientific notation for the number called googol. So the point is, regardless

01:53.860 --> 02:00.460
of the DNS type, records don't have to point to the original name; they actually cannot, because

02:00.460 --> 02:08.860
a lot of names can point to a single address, and actually they can contain just about anything.

02:08.860 --> 02:15.580
And you can also use the more advanced tool dig, where you specify the type of record

02:15.580 --> 02:22.340
you want to get, and you will get the answer from your DNS server.
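
As a sketch of what such a reverse lookup does under the hood (the helper name here is my own, assuming a bash-like shell): the IP's octets are reversed and suffixed with .in-addr.arpa, and that name's PTR record is what gets queried.

```shell
# Build the reverse-lookup name for an IPv4 address: reverse the octets
# and append .in-addr.arpa; this is the name whose PTR record is queried.
reverse_name() {
  IFS=. read -r a b c d <<< "$1"
  echo "$d.$c.$b.$a.in-addr.arpa"
}

reverse_name 142.250.74.78              # -> 78.74.250.142.in-addr.arpa
echo "dig PTR $(reverse_name 8.8.8.8)"  # the equivalent dig invocation
```

Running the printed dig command against a resolver returns the PTR answer, which, as noted above, may be any string at all.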

02:22.340 --> 02:29.500
And now, a question for you, if PTR records don't have to be precise, they can contain

02:29.500 --> 02:37.100
anything, an arbitrary string, you can't count on these records to contain what you need,

02:37.100 --> 02:38.540
why do they even exist?

02:38.540 --> 02:47.340
For mail servers, mostly, for sending emails, and also for network infrastructure

02:47.340 --> 02:48.340
observability.

02:48.340 --> 02:56.420
So when you run something like mtr, which is like traceroute, you will also get these names from

02:56.420 --> 03:00.140
reverse DNS records.

03:00.140 --> 03:09.820
And the idea is to take the whole IPv4 space, which contains 2^32, slightly less than

03:09.820 --> 03:17.100
4 billion IP addresses, and just do reverse DNS requests for all of them.

03:17.100 --> 03:19.780
So how to do it?

03:19.780 --> 03:28.140
There are a lot of tools ready for that, like massdns, or there is a library, GNU adns.

03:28.140 --> 03:34.900
They are fast; the readme of massdns says, if you run this tool, you will

03:34.900 --> 03:38.660
get all the results in less than one hour.

03:38.660 --> 03:47.420
The problem is, if you do it from home, most likely your internet will break, or your

03:47.420 --> 03:53.780
internet provider will break your internet, and you will have to call them.

03:53.780 --> 04:02.460
If you do it from a hosting provider, from AWS, from DigitalOcean, whatever, it is also

04:02.460 --> 04:06.940
not a good idea, they don't like it.

04:06.940 --> 04:16.140
At least, even if they don't restrict it, it would be good to communicate in advance.

04:16.140 --> 04:20.940
If you contact them and explain your use case, explain what exactly you will do, they will

04:20.940 --> 04:23.580
not worry.

04:23.580 --> 04:28.700
But I am an introvert, I don't like communicating with anyone.

04:28.700 --> 04:33.660
So I wanted to find another way to do it.

04:34.660 --> 04:39.460
Yes, so from home, not a good idea; from a data center, not a good idea, my friend was

04:39.460 --> 04:45.700
blocked, because I asked him to do it, and he was blocked, not me, from the cloud, also

04:45.700 --> 04:47.700
not a good idea.

04:47.700 --> 04:52.140
So, let's take a look at how DNS can work.

04:52.140 --> 04:54.940
What protocols exist?

04:54.940 --> 05:01.540
It can work over UDP, port 53: it is unreliable and not secure. It can work over

05:01.540 --> 05:09.940
TCP: not secure, but more reliable. There is DNS over TLS: reliable, secure, but

05:09.940 --> 05:17.060
heavyweight. And there is also an interesting protocol: DNS over HTTPS.

05:17.060 --> 05:29.260
It is just an HTTPS API, where you connect and request a record, and it will give you the result

05:29.260 --> 05:37.300
in one of several formats, including the original binary format, or JSON for convenience.

05:37.300 --> 05:43.700
And one of those is the Cloudflare DNS over HTTPS server, there is a Google DNS over HTTPS

05:43.700 --> 05:50.420
server, and probably many more. And it is interesting that you can even use it without

05:50.420 --> 05:51.420
a domain name.

05:51.420 --> 05:59.940
If you use something like cloudflare-dns.com, you have to resolve this name in advance using

05:59.940 --> 06:04.100
a different method, like writing it in the hosts file.

06:04.100 --> 06:10.060
But you can just write https://1.1.1.1, which is the address of this DNS

06:10.060 --> 06:13.860
server, and it looks quite nice.

06:13.860 --> 06:23.100
There is also DNSCrypt, over TCP and over UDP, and it is actually a mess, because

06:23.100 --> 06:28.860
even for encrypted DNS, you can use more than three different protocols.

06:28.860 --> 06:37.540
I decided to check if I can use Cloudflare DNS over HTTPS.

06:37.540 --> 06:43.020
So the API, if I use the most convenient one, with JSON, looks like this: you specify

06:43.020 --> 06:49.420
your request, the type of record you want to get, and you will get JSON with all

06:49.420 --> 06:52.340
this information.

06:52.340 --> 07:01.740
It can be used over HTTP/2, even HTTP/3, and HTTP/1.1, so everything works, and with HTTP/2,

07:01.740 --> 07:12.180
I can use pipelining with multiple requests, and it also works; you can use it from

07:12.180 --> 07:13.180
curl.
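
As a sketch of what that curl request looks like (the endpoint and the accept header are Cloudflare's documented JSON API; the helper name is mine):

```shell
# Build the DNS-over-HTTPS JSON API URL for a given name and record type.
doh_url() {
  echo "https://cloudflare-dns.com/dns-query?name=$1&type=$2"
}

# A PTR query for 8.8.8.8; the accept header selects the JSON answer format.
echo "curl -s -H 'accept: application/dns-json' '$(doh_url 8.8.8.8.in-addr.arpa PTR)'"
```

Google's resolver exposes a very similar JSON endpoint, so the same shape of request works there too.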

07:13.180 --> 07:20.660
I carefully read the documentation to check: if I just do several billion requests in one

07:20.660 --> 07:21.660
day,

07:21.660 --> 07:24.780
Will I violate their terms of service?

07:24.780 --> 07:31.260
And apparently I just did not find any mention of it. And I thought: Cloudflare is such a

07:31.260 --> 07:40.060
big infrastructure provider that if I do just one billion requests more, no one will notice.

07:40.060 --> 07:46.380
I suppose they are processing trillions of requests every day, if not hundreds of trillions.

07:46.380 --> 07:53.140
So I did it. I also prepared a table for the results in my database, my favorite database,

07:53.140 --> 08:00.740
which is ClickHouse, the best analytical database, and this table is simply a DateTime

08:00.740 --> 08:04.580
and JSON with a String data type.
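
A minimal sketch of such a table (the table and column names are my assumptions; the talk only specifies a DateTime plus the raw JSON stored as a String):

```shell
# Print the assumed DDL; in practice this would be piped to clickhouse-client.
ddl="
CREATE TABLE dns_raw
(
    time DateTime DEFAULT now(),
    data String
)
ENGINE = MergeTree
ORDER BY time
"
echo "$ddl"
```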

08:04.580 --> 08:11.140
And then I wrote a simple shell script.

08:11.140 --> 08:14.020
Let's read this simple shell script.

08:14.020 --> 08:20.900
So first of all, there is a SQL query inside, and the SQL query will skip all the reserved

08:20.900 --> 08:31.300
IPv4 addresses, like 127.anything, link-local, etc.

08:31.300 --> 08:39.900
And it will generate ranges from the first three octets.

08:39.900 --> 08:46.260
And then I will take these ranges, parallelized by the last number, the last

08:46.260 --> 08:53.340
octet, and I will parallelize using curl, and I will generate a command that will

08:53.340 --> 09:05.740
query these 256 addresses with a single HTTPS request using curl.

09:05.740 --> 09:11.820
And I pipe it to bash, because I just generated this command, and then I pipe it into

09:11.820 --> 09:17.940
clickhouse-client, finally, to insert into the database with another SQL query,

09:17.940 --> 09:21.500
INSERT INTO.
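
A rough sketch of that generator (the endpoint and the reserved-range check are simplified assumptions; the real script derives the ranges with a SQL query): skip reserved first octets, then emit one curl command per block, using curl's [0-255] URL globbing so 256 PTR names go over a single connection.

```shell
# Return success if the first octet belongs to an obviously reserved range
# (loopback, RFC 1918 10/8, multicast and above); a simplification of the
# SQL filter described in the talk.
is_reserved() {
  case "$1" in
    0|10|127|22[4-9]|2[3-5][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Emit one curl command per given first octet; [0-255] is curl URL
# globbing, so each command queries 256 PTR names over one connection.
gen_commands() {
  for a in "$@"; do
    is_reserved "$a" && continue
    echo "curl -s -H 'accept: application/dns-json'" \
         "'https://cloudflare-dns.com/dns-query?type=PTR&name=[0-255].0.0.$a.in-addr.arpa'"
  done
}

gen_commands 8 10 127 192   # prints commands for 8 and 192 only
```

The real script then pipes this generated command list to bash, and the JSON answers into clickhouse-client for the INSERT.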

09:21.500 --> 09:30.820
But it is just one line of shell script, and you can see it. Okay, so what happened?

09:30.820 --> 09:38.460
I ran it and left it for a day, I went to sleep, and then I woke up, and actually,

09:38.460 --> 09:42.700
you know, nothing actually happened.

09:42.700 --> 09:48.860
So I did it from a single machine, I did not parallelize across many machines, and I tried to

09:48.860 --> 09:57.620
be gentle, without using too much parallelism. The question is: how long did it take?

09:57.620 --> 10:09.780
A day? Okay, any other guesses? Two weeks? A couple of hours? Two months? No, actually it took about

10:09.780 --> 10:22.100
ten days. So it was not fast, due to these HTTPS requests, but at least no one complained.

10:22.100 --> 10:29.820
I was a little bit paranoid about whether anyone complained, so yeah, around ten days. Then I went

10:29.820 --> 10:36.620
to this service, named GreyNoise, and it represents an interesting kind

10:36.620 --> 10:45.540
of observability for the internet; it is named an internet telescope. What is an internet telescope?

10:45.540 --> 10:58.580
This is a bunch of servers across the world that have some ranges of IP addresses; the ranges

10:58.580 --> 11:05.260
could be quite big, and they just listen to anything that comes to these servers.

11:05.260 --> 11:14.820
They open every port, they log every UDP packet, and they just listen, listen and collect

11:14.820 --> 11:22.420
this data. And if someone does a massive scan of the internet, these internet telescopes

11:22.420 --> 11:31.940
are likely to catch this noise, hence the name GreyNoise. And I found

11:31.940 --> 11:40.580
that it detected a DNS over HTTPS scanner, so probably me. But then I checked the

11:40.820 --> 11:46.060
information, and apparently it is not me; this just happens all the time, and

11:46.060 --> 11:51.460
I don't have to worry. And even if I was detected, it does not mean that I'm a bad guy.

11:51.460 --> 12:00.180
I'm not a bad guy, trust me. Actually, I did not have to do this scan, because there

12:00.180 --> 12:07.540
are open datasets of historical DNS scans. One of them was available

12:07.620 --> 12:14.980
from Project Sonar, from the company Rapid7; unfortunately, they no longer host

12:14.980 --> 12:23.700
this dataset, or at least no longer offer it for download. But at least I did an interesting

12:23.700 --> 12:33.060
exercise. So let's take a look at this dataset: 3.69 billion records. It is not 2^32,

12:34.020 --> 12:41.540
because some addresses did not resolve. The dataset size is 13 gigabytes, and this is not even

12:41.540 --> 12:52.020
parsed; the compression ratio is 65. The raw dataset, if we take just these JSONs, is almost

12:52.020 --> 13:03.620
a terabyte. And to parse it, I used this simple SQL query; let's not read it: regular expressions,

13:03.620 --> 13:13.020
JSON, etc. And here is the parsed dataset: time, status, flags, IP address, and domain.

13:13.100 --> 13:22.860
Let's take a look inside: the columns compress quite well, and the parsed dataset is

13:22.860 --> 13:33.420
even smaller, not 10 gigabytes but 5. So I can look at it, and it looks plausible. But now, a question for you:

13:34.380 --> 13:43.740
do you know what is the most popular TLD, top-level domain? You say .com? How many people will

13:43.740 --> 13:59.980
say .com? .arpa, maybe? How many people will say .net? Let's take a look, this is easy:

14:00.060 --> 14:08.860
.net, .com, and .arpa... either I just removed it, or probably it is not that popular;

14:08.860 --> 14:16.060
I'm not sure why it is not here, it should be here; we will take a look. So .net is the most popular.
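
The aggregation behind that answer can be sketched in plain shell over a few made-up PTR names (the real version is a SQL GROUP BY over the whole parsed table):

```shell
# Count trailing labels (TLDs) of the names on stdin, most frequent first.
top_tlds() {
  awk -F. '{print $NF}' | sort | uniq -c | sort -rn
}

printf '%s\n' \
  dsl-pool-1.example.net \
  static-2.example.net \
  crawler-3.example.com \
  | top_tlds
```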

14:16.060 --> 14:23.340
Now the question for you: what is the most popular second-level subdomain, or first-level, depending

14:23.340 --> 14:44.460
on how you count. Co.uk? Co.jp? Any other options? .net? Maybe. Let's take a look.

14:44.460 --> 14:53.660
I think you are all wrong. The second, for some reason, is comcast.net, and the

14:53.660 --> 15:11.020
first is bbtec; do you know what it is? But probably you know. Okay, okay, now the next thing I did

15:11.100 --> 15:16.540
was to try to find all the reverse DNS records containing a couple of keywords, the name of my favorite

15:16.540 --> 15:25.500
database, and I found a lot of them, many also with company names. And what I did:

15:25.500 --> 15:37.740
I sent this list to our sales team. Actually, I could use a different tool. What is it? What is it?

15:38.380 --> 15:47.260
So it shows 23,000 ClickHouse servers visible on the internet, probably because they are

15:47.260 --> 15:54.780
not secured, like it was with DeepSeek, or probably for some other reason, like they are secured,

15:54.780 --> 16:04.540
but handshakes still reach them. Currently it is 33,000. And I want to get more from this data,

16:04.540 --> 16:13.020
which is why I need visualization. And the most obvious example is to make this map. This picture is

16:13.020 --> 16:23.260
from 2006, from xkcd, representing the whole range of IPv4 using a space-filling curve,

16:23.260 --> 16:30.460
so the curve does not go like a scan line, but like a curve that goes like this, filling all

16:30.540 --> 16:39.100
the space. It is difficult to say, but better to visualize. And another example, also not from

16:39.100 --> 16:49.980
me, from 2018: almost exactly what I want, but I want it bigger and better. So I have

16:50.460 --> 17:00.700
made another one-line shell script, it is this: I will draw a picture with SQL, and this SQL query

17:01.340 --> 17:10.460
will calculate the first significant subdomain; for abc.co.uk, abc will be the first significant subdomain;

17:11.100 --> 17:19.420
then calculate a hash, cityHash64, then from this hash extract 3 bytes, RGB,

17:22.300 --> 17:28.700
and then use another function, mortonDecode, which represents a space-filling curve,

17:30.300 --> 17:40.380
and it will generate these pixels, and I will write them as text in the format named

17:40.380 --> 17:53.820
portable anymap, PNM, and after all of this, I converted it to PNG, and it looks nice.
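
To show what the decode step contributes, here is the bit-deinterleaving of a Morton (Z-order) code sketched as a shell function (the even/odd bit convention is my assumption; ClickHouse also offers hilbertDecode for the smoother curve xkcd used):

```shell
# Decode a Morton (Z-order) code into x and y: even-position bits of the
# code form x, odd-position bits form y, so nearby codes map to nearby
# pixels on the 2D plane.
morton_decode() {
  code=$1; x=0; y=0; i=0
  while [ "$code" -gt 0 ]; do
    x=$(( x | ((code & 1) << i) ))         # bits 0, 2, 4, ... -> x
    y=$(( y | (((code >> 1) & 1) << i) ))  # bits 1, 3, 5, ... -> y
    code=$(( code >> 2 ))
    i=$(( i + 1 ))
  done
  echo "$x $y"
}

morton_decode 14   # 0b1110 -> x=2, y=3, printed as "2 3"
```

With an IPv4 address as the code, this turns the linear 32-bit space into a 65536 x 65536 grid where adjacent networks stay visually close.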

17:53.820 --> 18:01.580
Here is the picture, like I wanted, and we will dive deep into this picture in a moment. But now,

18:02.540 --> 18:15.660
yeah, this is even more; it is zoomed in, but I want it zoomed out, so I generated it at 4K resolution

18:15.660 --> 18:21.340
using that script. Actually, I want the full picture, and the full picture will be 4 billion pixels,

18:22.220 --> 18:34.940
65K by 65K, so not 4K but 65K. And 65K displays don't exist at this moment, but I can do it

18:34.940 --> 18:47.180
interactively. Yes, we can generate a 4-gigapixel PNG, and I actually tried it, but when I tried to

18:47.180 --> 18:55.100
open it, most of the viewers were failing; actually, some were not failing but were just that slow,

18:56.060 --> 19:03.020
but most were failing. And then I decided to generate an interactive page, an HTML page,

19:04.060 --> 19:11.340
like Google Maps, allowing zooming in with a lot of tiles. And there are different tools for that,

19:11.340 --> 19:18.860
like OpenSeadragon and Leaflet. Leaflet is more used for maps, and OpenSeadragon is more used for

19:18.860 --> 19:28.780
something like scans of museum pictures, drawings. And it appeared to be not too difficult, so I

19:29.740 --> 19:37.020
have written another script to generate tiles at different zoom levels, and here are

19:37.980 --> 19:48.300
several lines: a loop over zoom levels, a loop over the x coordinate of a tile, over the y coordinate of a

19:48.300 --> 19:56.460
tile; the file name will be like here, and here is the same SQL query to generate a picture and

19:56.460 --> 20:06.940
convert it to PNG. Done, done, done. I left it for a day, and it finished. Now I have a bunch of tiles,
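
The loop structure can be sketched like this (slippy-map-style tile paths are my assumption; the SQL rendering step is replaced by the echoed file name):

```shell
# Enumerate tile file names for zoom levels 0..max: level z has 2^z x 2^z tiles.
tile_names() {
  for z in $(seq 0 "$1"); do
    n=$(( 1 << z ))
    for x in $(seq 0 $(( n - 1 ))); do
      for y in $(seq 0 $(( n - 1 ))); do
        echo "tiles/$z/$x/$y.png"
      done
    done
  done
}

tile_names 2 | wc -l   # 1 + 4 + 16 = 21 tiles
```

In the real script, each iteration runs the rendering SQL query for that tile's coordinate window and converts the PNM output to PNG.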

20:06.940 --> 20:16.140
and I want to arrange them into a service. And the service is here; let's take a look. So here it is,

20:16.140 --> 20:25.260
here is our map: we can zoom in, we can zoom even more, let me try. What will happen if we zoom in

20:25.260 --> 20:35.340
to this... something. Do you know what it is? Could it be... try to guess.

20:36.060 --> 20:47.100
Comcast? Let's take a look. No, it's Amazon. But what is interesting: if I zoom even more

20:47.100 --> 20:56.700
inside this violet thing, I can see a lot of different dots, and the question is: why, inside

20:56.700 --> 21:07.420
Amazon, do I have these different dots with a reverse name defined specifically? And the answer is:

21:09.660 --> 21:20.300
yeah, for sending emails, for sending emails. Okay, let's zoom back. And it is actually so beautiful,

21:20.300 --> 21:36.780
I spent a whole day just looking at this picture. Now let's go back and see. So, every click also

21:36.780 --> 21:42.860
generates a SQL query to this table, and from JavaScript it is easy to do: just use a fetch

21:42.860 --> 21:52.300
request with the SELECT query inside the POST body, HTTP POST. And it works because I have the primary

21:52.300 --> 22:01.020
key in the table; it is a point query, it is fast. And to make a service, I created a user,
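
The point query issued on each click can be sketched as an equivalent curl command (the host name, table, columns, and user here are placeholders; the page itself sends the same POST body via fetch()):

```shell
# Build the SELECT the page sends; the IP literal is the clicked pixel.
build_query() {
  echo "SELECT domain FROM dns WHERE ip = toIPv4('$1')"
}

# The ClickHouse HTTP interface accepts the query as the POST body.
echo "curl -s 'https://clickhouse.example.com:8443/?user=website' --data-binary \"$(build_query 8.8.8.8)\""
```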

22:01.020 --> 22:09.660
I set up limits, quotas, read-only access, and this HTML page from JavaScript directly queries

22:09.740 --> 22:19.340
this service. Takeaways: please don't be afraid; you can use ClickHouse for data analytics,

22:19.340 --> 22:25.820
you can use ClickHouse for DNS analytics, and you will find something new, you will have a lot of fun,

22:25.820 --> 22:31.420
maybe you will find a creative way to test and break something. The source code is available; it is

22:31.420 --> 22:38.300
really simple: an HTML page with JavaScript, and that's it. Ah, by the way, we have a dinner today,

22:38.300 --> 22:49.580
but unfortunately it is full, so please don't look at this one. Thank you!

