WEBVTT

00:00.000 --> 00:07.720
That's something I shouldn't have.

00:07.720 --> 00:10.320
So hello, everyone.

00:10.320 --> 00:16.400
Today's talk is a bit peculiar, I would say. I have a lot of stuff to go through, so I'll

00:16.400 --> 00:19.280
go super quickly over who I am.

00:19.280 --> 00:25.000
I'm Federico, I work at Red Hat in the Kubernetes networking area, specifically on

00:25.000 --> 00:29.400
telco-core use cases, and I maintain the MetalLB project, so if it doesn't work,

00:29.400 --> 00:31.320
you can blame me.

00:31.320 --> 00:35.840
This talk is about my journey into EVPN.

00:35.840 --> 00:43.240
I knew nothing about EVPN; I was asked to find a way to have EVPN termination at the Kubernetes

00:43.240 --> 00:49.760
node level, so this talk is mostly about my prototype and how I ended up having something

00:49.760 --> 00:53.160
more or less working.

00:53.160 --> 00:57.920
Now, does somebody need a refresher on EVPN and VXLAN?

00:57.920 --> 01:01.440
OK, I see a few hands, I will be super quick.

01:01.440 --> 01:03.560
Let's start with VXLAN.

01:03.560 --> 01:12.720
VXLAN is more or less L2 packets encapsulated into UDP, so in a regular spine and

01:12.720 --> 01:19.840
leaf topology, you might have the leaf doing the VXLAN encapsulation, so your

01:19.840 --> 01:28.720
host is connected to the leaf via a VLAN interface; the leaf does the encapsulation, takes

01:28.720 --> 01:36.000
the packet, wraps it in a UDP packet, sends it to a tunnel endpoint, what's

01:36.000 --> 01:41.480
called the VTEP, on the other side of the fabric; the packet lands there, reaches the target,

01:41.480 --> 01:47.920
reaches the destination, life is good. But in a big fabric, who is telling us which VTEP

01:47.920 --> 01:51.960
to target, and that's EVPN.

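NOTE
As a minimal sketch of the plumbing just described, a VXLAN device can be created with iproute2 (the VNI, port, and addresses here are illustrative):
    # create a VXLAN device with VNI 100, encapsulating over eth0 (default UDP port 4789)
    ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.0.0.1 dev eth0 nolearning
    ip link set vxlan100 up
    # L2 frames entering vxlan100 get wrapped in UDP and sent to the remote VTEP
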
01:51.960 --> 01:59.680
So EVPN is a BGP extension that allows us to spread the reachability information of MAC addresses,

01:59.680 --> 02:05.200
via type 2 routes, and also the reachability of networks, via type 5 routes.

02:05.200 --> 02:10.640
We can have a broader network connected or maybe a Kubernetes cluster, it's an association

02:10.640 --> 02:14.480
between an L3 domain and a VTEP.

02:14.480 --> 02:20.600
And of course, in the same fabric, you can have multiple overlays, defined by different

02:20.600 --> 02:30.440
what's called VNIs. So, for context, what is our destination, what do

02:30.440 --> 02:39.440
I want to reach? This is a regular Kubernetes cluster (I'm afraid to touch anything),

02:39.440 --> 02:46.600
this is a regular Kubernetes cluster: we have some BGP-speaking components connected

02:46.600 --> 02:56.360
to the fabric; they talk BGP through the VLAN interfaces, and these BGP routes get eventually

02:56.360 --> 03:04.280
translated into EVPN routes, and we have a mapping between a VLAN and a VRF,

03:04.280 --> 03:09.240
and then when we send the traffic there, it gets encapsulated.

03:09.240 --> 03:16.200
What my goal is, what I tried to do, is to have this concept, the router,

03:16.200 --> 03:21.920
brought down to node level, in a separate network namespace, talking to the host via

03:21.920 --> 03:28.000
veth legs that are doing the same thing that the VLAN interfaces were doing, so we can

03:28.000 --> 03:34.000
exchange routes to the underlay through those veth interfaces, and whenever we send

03:34.000 --> 03:41.120
traffic to those veth interfaces, the traffic gets encapsulated.

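NOTE
A minimal sketch of that layout with iproute2 (the namespace and interface names are illustrative):
    # a separate network namespace hosting the node-level router
    ip netns add evpn-router
    # a veth pair: one leg stays on the host, the other goes to the router namespace
    ip link add veth-host type veth peer name veth-router
    ip link set veth-router netns evpn-router
    ip link set veth-host up
    ip netns exec evpn-router ip link set veth-router up
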
03:41.120 --> 03:46.960
So, we take the interface connecting the node to the fabric, we move it inside the network

03:46.960 --> 03:53.440
namespace, or at least this is what I will attempt to do; the routes are advertised

03:53.440 --> 04:02.680
by our BGP-speaking components, Calico, MetalLB and others, and we'll be able to learn

04:02.680 --> 04:07.720
and advertise routes and when we send traffic, the traffic will be encapsulated.

04:07.720 --> 04:12.160
What is the advantage of this approach? The advantage is that whenever we want to add

04:12.160 --> 04:20.040
a new overlay, we won't have to go to the fabric and reconfigure it, for example.

04:20.040 --> 04:25.680
And to make it more interesting, I wanted to try to do that on my laptop, because it's easier than

04:25.680 --> 04:29.280
going to find a data center.

04:29.280 --> 04:36.640
One note: I kept track of all the progress, so I have a repo with the very steps that

04:36.640 --> 04:43.880
I'm going to present, and on my blog I have various posts about the topic, so it should

04:43.880 --> 04:47.560
be a reference afterwards.

04:47.560 --> 04:52.400
What do I need to do that on my laptop?

04:52.400 --> 04:58.360
I need a router with VXLAN support, and my favorite Linux open source implementation, since

04:58.360 --> 05:08.200
I've been maintaining MetalLB, is FRR. FRR is, obviously, my favorite BGP and other protocol

05:08.200 --> 05:09.840
implementations.

05:09.840 --> 05:20.240
FRR supports EVPN with a caveat: FRR takes care of the control plane, but it wants

05:20.240 --> 05:26.880
us to configure the host in such a way that the data plane is implemented by the Linux

05:26.880 --> 05:27.880
kernel.

05:27.880 --> 05:37.560
So, FRR is very opinionated in what it wants: for each VNI, it needs a Linux VRF, a Linux

05:37.560 --> 05:42.720
bridge, and a VXLAN interface connected to it, and that's basically it.

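NOTE
A minimal sketch of the per-VNI layout FRR expects, assuming iproute2 (VNI 100, VRF "red", and the table number are illustrative):
    # the Linux VRF
    ip link add red type vrf table 1100 && ip link set red up
    # the Linux bridge, enslaved to the VRF
    ip link add br100 type bridge && ip link set br100 master red && ip link set br100 up
    # the VXLAN interface, enslaved to the bridge
    ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.0.0.1 nolearning
    ip link set vxlan100 master br100 && ip link set vxlan100 up
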
05:42.720 --> 05:51.040
Then there is the FRR configuration, well documented on the project site: we need to declare

05:51.040 --> 05:57.040
that we want to talk EVPN with the external router and we need to spread around the

05:57.040 --> 06:03.880
reachability of our VTEP, the virtual tunnel endpoint, so that when the other side is sending

06:03.880 --> 06:06.400
UDP packets, it's able to reach us.

06:06.400 --> 06:11.440
And then for each VRF we need to declare which routes we want to advertise.

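NOTE
A minimal sketch of such an FRR configuration (AS numbers, neighbor address, and VRF name are illustrative; see the FRR EVPN docs for the full picture):
    cat >> /etc/frr/frr.conf <<'EOF'
    router bgp 64512
     neighbor 192.168.1.2 remote-as external
     address-family l2vpn evpn
      neighbor 192.168.1.2 activate
      advertise-all-vni
     exit-address-family
    !
    router bgp 64512 vrf red
     address-family ipv4 unicast
      redistribute connected
     exit-address-family
     address-family l2vpn evpn
      advertise ipv4 unicast
     exit-address-family
    EOF
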
06:11.440 --> 06:19.080
So this is the first part; the second thing that we need is a network lab running on my laptop.

06:19.080 --> 06:24.080
And this is where containerlab comes into play. Containerlab is a very, very nice project

06:24.080 --> 06:30.120
that I discovered; it allows us to create a network topology based on container images and

06:30.120 --> 06:37.400
veth pairs, so it's slightly different from using Docker Compose or regular Docker networking,

06:37.400 --> 06:40.760
because you don't have a bridge there, you have a link, so it's more similar to what

06:40.760 --> 06:43.720
you have in a lab.

06:43.760 --> 06:52.240
This was my first example, I just wanted to make EVPN work, it took a while, but I managed

06:52.240 --> 07:02.080
to. Again, I have the blog post, from early last year, and my repo with this example,

07:02.080 --> 07:08.160
and this is how a topology looks in containerlab: you define the nodes and you define

07:08.160 --> 07:13.000
the links; basically you say which nodes you want to connect, and those are going to become

07:13.000 --> 07:21.000
veth pairs. With FRR it's slightly different, because you can't override the entrypoint,

07:21.000 --> 07:28.280
so what I ended up doing was passing a setup.sh file to each node and calling it after

07:28.280 --> 07:34.640
the topology was started. Containerlab supports many, many kinds of images

07:34.640 --> 07:42.000
that you can run, commercial and non-commercial, in your topology, but I just used FRR.

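NOTE
A sketch of what such a containerlab topology might look like (file layout, image tag, and names are illustrative, not the exact files from the repo):
    cat > topo.clab.yml <<'EOF'
    name: evpn
    topology:
      nodes:
        leaf1:
          kind: linux
          image: quay.io/frrouting/frr:9.1.0
          binds:
            - leaf1/setup.sh:/setup.sh
        leaf2:
          kind: linux
          image: quay.io/frrouting/frr:9.1.0
          binds:
            - leaf2/setup.sh:/setup.sh
      links:
        - endpoints: ["leaf1:eth1", "leaf2:eth1"]
    EOF
    sudo containerlab deploy -t topo.clab.yml
    # the FRR entrypoint can't be overridden, so run the setup after the topology is up
    docker exec clab-evpn-leaf1 /setup.sh
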
07:42.000 --> 07:49.000
So we run the setup on each node, and the setup looks like this: what we need is to create

07:49.000 --> 07:58.000
the host layout that FRR wants in order to have EVPN work, so we create a VRF, we

07:58.000 --> 08:04.000
enslave the interface connecting the leaf to the host into the Linux VRF, and then we create

08:04.240 --> 08:13.480
the VXLAN interface, the bridge and so on. And again, this was the first example; my leaf

08:13.480 --> 08:22.000
is now ready to be connected, so it will share across the EVPN fabric the

08:22.000 --> 08:29.400
reachability of the hosts connected to it, as type 5 routes. So if I inspect the routing

08:29.480 --> 08:36.200
table inside the leaf, this is what we get: type 5 routes, the local network and the local veth.

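NOTE
A sketch of how that state can be inspected (the VRF name is illustrative):
    # EVPN routes known to FRR, including the type 5 (IP prefix) ones
    vtysh -c 'show bgp l2vpn evpn route'
    # the kernel routing table of the VRF, populated from those routes
    ip route show vrf red
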
08:36.200 --> 08:46.200
And when you ping, it goes through the leaf, it gets encapsulated, and it gets

08:46.200 --> 08:53.200
to the other side. Seeing this ping work was pretty exciting when I tried it, I know it's

08:53.200 --> 09:00.760
just a ping, but still. So again, as references I have a demo video, I have a blog post

09:00.760 --> 09:08.040
and a repo. Next step: we want to plug in a Kubernetes cluster, still on my laptop,

09:08.040 --> 09:14.000
still in the topology. So what I did was to use kind; kind is the go-to project to use

09:14.000 --> 09:21.080
for development, CI and whatnot, and containerlab has a nice way to have some

09:21.080 --> 09:27.560
nodes of the topology, which are actually kind nodes, and this is how you do that,

09:27.560 --> 09:34.240
basically you need to declare it in the topology, I had to do some configuration inside the

09:34.240 --> 09:41.280
node, and that was it. Reminder, this is our ultimate destination: a container, veth

09:41.280 --> 09:48.960
pairs, a Linux VRF, a BGP session with the host, and a connection to the fabric, just a reminder,

09:48.960 --> 09:55.680
so I'm going to show what the setup of a kind node looks like, basically here, this is

09:55.680 --> 10:00.640
what I'm running inside the container, and what I'm doing is using Docker in Docker to spin

10:00.640 --> 10:07.600
up another FRR instance inside the container; I take the namespace, I create the

10:07.600 --> 10:15.000
veth pair, I put one leg of the veth inside the namespace, and then I take the interface used

10:15.000 --> 10:21.280
to connect to the router for the underlay connection, and I put it inside the namespace,

10:21.280 --> 10:28.960
and then I do all the setup required for FRR.

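NOTE
A sketch of that interface juggling, assuming Docker and iproute2 (container and interface names are illustrative):
    # find the network namespace of the nested FRR container via its PID
    pid=$(docker inspect -f '{{.State.Pid}}' frr)
    # create the veth pair and hand one leg to the FRR namespace
    ip link add veth-host type veth peer name veth-frr
    ip link set veth-frr netns "$pid"
    # move the underlay-facing interface into the FRR namespace as well
    ip link set eth1 netns "$pid"
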
10:28.960 --> 10:37.880
Then, as an example, I can plug in MetalLB, my very own project; it can then advertise a LoadBalancer service via

10:37.880 --> 10:46.480
BGP. So what I'm going to see in the router is that the local router is learning the

10:46.480 --> 10:55.340
BGP route through the veth as a BGP route, but it's also going to convert it into a type

10:55.340 --> 11:01.720
5 route with the local endpoint as the destination. So if I go to the other leaf, I'm going

11:01.720 --> 11:08.640
to see that route, and I'll be able to reach the service from one host connected to the

11:08.640 --> 11:16.000
other side of the fabric, through VXLAN, to the node, to MetalLB. Again, I don't have a,

11:16.000 --> 11:20.640
I have a demo, but I didn't have time, so I have the recording, the repo that you can

11:20.640 --> 11:29.040
tinker with, and a blog post explaining all this layout. And then I moved further:

11:29.040 --> 11:34.280
before, it was a container running on the host; now I wanted to have it as a pod, because

11:34.280 --> 11:41.040
it's Kubernetes, and the lifecycle is easier to manage with pods, so I tried to tinker with things

11:41.040 --> 11:48.520
to see if I was able to do that. And again, the destination is this, now we have two pods.

11:48.520 --> 11:55.320
One is a classical pod: it has its own network namespace, runs FRR plus some other things,

11:55.320 --> 12:00.720
an entrypoint to do the setup; and then I have another pod, host network, with a lot

12:00.720 --> 12:06.200
of privileges, a lot of sockets mapped in order to mess with the host. So the other pod,

12:06.200 --> 12:12.680
what the other pod does is take the interface and move it to the pod, to the target

12:12.680 --> 12:19.960
pod, creates the veth, and puts a leg into the namespace, and more or less it works.

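NOTE
A sketch of what such a privileged, host-network "mover" pod could look like (illustrative, not the actual manifest from the project):
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: interface-mover
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: mover
        image: nicolaka/netshoot   # any image shipping iproute2
        command: ["sleep", "infinity"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: runtime-sock       # container runtime socket, to resolve the target pod
          mountPath: /run/containerd/containerd.sock
      volumes:
      - name: runtime-sock
        hostPath:
          path: /run/containerd/containerd.sock
    EOF
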
12:19.960 --> 12:25.960
So the main difference is (I don't know if he's around) that I copied a lot of the logic

12:25.960 --> 12:33.960
from his Multus dynamic networks controller. So at runtime, we use the container runtime to find

12:33.960 --> 12:38.960
the right target, and the rest is more or less the same. So this is the main

12:38.960 --> 12:47.160
difference from before. And this is the high level; I won't demo it, but I have the demo in

12:47.160 --> 12:54.160
my references, and here we have FRR on the node, and it is able to exchange routes with

12:54.160 --> 13:03.160
this router in a pod. And this was around November, when I started to say, maybe I can

13:03.160 --> 13:11.160
try to write some real code and have a proper Kubernetes controller so that it can be dynamic.

13:11.160 --> 13:20.160
So things went on a bit, and I have this project. I don't know if it's going to become

13:20.160 --> 13:28.160
a real project or just a POC, but I built a repo for it. It's the same architecture as before.

13:28.160 --> 13:36.360
We have the FRR pod, we have a controller that has a CRD-based API, and all the logic

13:36.360 --> 13:41.360
now is in the controller, because it was easier to have the host configuration and

13:41.360 --> 13:48.360
FRR configuration in the same place. So the lifecycle is: the controller takes the interface

13:48.360 --> 13:55.360
and moves it into the namespace, it creates all the interfaces required by FRR, creates the

13:55.360 --> 14:03.360
veth pair, provides an FRR configuration, sends a signal, and asks FRR to reload it.

14:03.360 --> 14:10.360
And this is the high level of having it as a pod. The API is pretty simple, again, subject

14:10.360 --> 14:18.360
to change. I need a selector, because I need different data for each node. I need to

14:18.360 --> 14:24.360
tell the thing what is the interface that I want to move, and I have the session with

14:24.360 --> 14:32.360
the external router. On the other side, for each VNI, what I want to define is the session with

14:32.360 --> 14:39.360
the BGP-speaking component on the host.

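NOTE
A sketch of what an instance of such a CRD might look like; the group, kind, and every field name below are hypothetical, invented for illustration, since the actual API is still subject to change:
    cat <<'EOF' | kubectl apply -f -
    apiVersion: evpn.example.com/v1alpha1   # hypothetical group/version
    kind: EVPNConfig                        # hypothetical kind
    metadata:
      name: node-a
    spec:
      nodeSelector:                         # different data for each node
        kubernetes.io/hostname: node-a
      underlayInterface: eth1               # the interface to move
      externalNeighbor:                     # the session with the external router
        address: 192.168.1.2
        asn: 64512
      vnis:
      - vni: 100
        hostNeighbor:                       # the session with the BGP speaker on the host
          address: 192.169.10.1
          asn: 64515
    EOF
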
14:39.360 --> 14:45.360
It's basically the evolution of the previous examples, but in a Kubernetes fashion, and the beauty of it is that it can interact with

14:45.360 --> 14:53.360
any BGP-enabled component running on the host. It can be, again, whatever is able to

14:53.360 --> 15:00.360
talk BGP. And one nice thing is that the API of the router side is the same for all

15:00.360 --> 15:05.360
the nodes. So you don't have one session per node with a different API, but you just have

15:05.360 --> 15:10.360
a MetalLB BGPPeer, for example, and it works. Still very much a work in progress, don't

15:10.360 --> 15:14.360
blame me if it doesn't work. Come back

15:14.360 --> 15:21.360
to the repo in a couple of months, and I hope it's going to work flawlessly, more or less.

15:21.360 --> 15:27.360
I have a demo. Basically, I took two nodes, connected to the topology, running

15:27.360 --> 15:43.360
Calico. Calico is connected to the router via the veth. Why Calico? Because MetalLB

15:43.360 --> 15:47.360
would have been too easy; I know the ins and outs of it. I wanted to take something that

15:47.360 --> 15:52.360
is BGP-speaking, but that I didn't know anything about, to show that it was working.

15:52.360 --> 15:58.360
The way Calico works is that it talks BGP, using Bird, with the router. Each node is

15:58.360 --> 16:07.360
announcing its node's pod CIDR to the router via BGP. Those get translated into EVPN

16:07.360 --> 16:12.360
routes across the fabric. The other node learns how to reach the pod CIDR of the previous

16:12.360 --> 16:21.360
one. We send a packet, it gets encapsulated and decapsulated. Now, I have a real demo. I have a recording,

16:21.360 --> 16:29.360
but I also have a shell prepared. I will try my luck with the live demo.

16:29.360 --> 16:42.360
Here we go. These are the Docker containers, the topology. Two of them are nodes of the kind

16:42.360 --> 16:54.360
cluster. I have Calico that wants to peer with the IP of the local router, but it's not

16:54.360 --> 17:02.360
established, because I didn't configure the thing yet. If I look inside the FRR container

17:02.360 --> 17:09.360
in the pod, it doesn't have the external interfaces. I don't have routes in there.

17:09.360 --> 17:15.360
Now, I have two workloads, one running on one node, one running on the other. I will try to

17:15.360 --> 17:23.360
ping, and the ping doesn't work. Now, I'm going to apply the configuration. I'm going to

17:23.360 --> 17:31.360
wait for Calico to tell me that the BGP status is established. Now, the session is established

17:32.360 --> 17:39.360
on one node, but it shouldn't matter. Now, if I inspect the FRR container in the pod,

17:39.360 --> 17:46.360
I see all the stuff that FRR needs. So, I see the VRF. I see the Linux bridge.

17:46.360 --> 17:54.360
It is enslaved into the VRF. I see the VXLAN, and I see the leg, the veth leg, connecting

17:55.360 --> 18:01.360
this thing to the host, and I see the uplink to the external router.

18:02.360 --> 18:08.360
Now, I can see that on one node, I have a route targeting the pod

18:08.360 --> 18:15.360
CIDR of the other node via the IP of the router, through the veth.

18:15.360 --> 18:21.360
Again, workload pods. I will try to ping from one to the other. Now, ping works.

18:21.360 --> 18:27.360
I try to ping from another host on the other side of the topology and ping works.

18:27.360 --> 18:33.360
If I try to tcpdump while I ping, my neck is hurting.

18:33.360 --> 18:42.360
So, I see the packet coming from the host, the ping, the ICMP request,

18:42.360 --> 18:48.360
then we are going to send it out because we have our route to the destination

18:48.360 --> 18:54.360
through the Linux bridge, then we see it leaving toward the fabric as a VXLAN packet,

18:54.360 --> 19:00.360
and then we see the reply coming back as a VXLAN packet, and then getting decapsulated and

19:00.360 --> 19:08.360
sent to the host again where it finds its way to the pod.

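NOTE
A sketch of that capture (interface and namespace names are illustrative; VXLAN uses UDP port 4789 by default):
    # on the fabric-facing interface: the encapsulated VXLAN/UDP packets
    ip netns exec evpn-router tcpdump -ni eth1 udp port 4789
    # on the veth toward the host: the inner ICMP request and reply
    tcpdump -ni veth-host icmp
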
19:08.360 --> 19:15.360
So, that was it for the demo. It worked. I am pretty happy.

19:15.360 --> 19:21.360
And then I can skip the recorded one.

19:21.360 --> 19:28.360
So, where do we go from here? First of all, I need to make it work flawlessly.

19:28.360 --> 19:33.360
I need to test a few things, but it's a nice prototype.

19:33.360 --> 19:40.360
We want to enable L2 EVPNs. So, we have to have a way to expose some way to connect

19:40.360 --> 19:47.360
an extended L2 domain on the host, possibly with an extra veth, or, I don't know yet.

19:47.360 --> 19:53.360
We need to sacrifice one interface; we could use VLANs to avoid having to dedicate the main interface.

19:53.360 --> 19:59.360
But, what if we want EVPN connectivity at day zero? Because it's a chicken-and-egg problem.

19:59.360 --> 20:05.360
If EVPN is all we have for our node, we will need to be able to reach the API server.

20:05.360 --> 20:10.360
So, we need something that is alive before Kubernetes.

20:10.360 --> 20:14.360
So, I did some experimentation, running podman pods

20:14.360 --> 20:17.360
as systemd units. There was a talk yesterday in the containers devroom,

20:17.360 --> 20:23.360
but I'm not sold yet on the idea. I still have to do some experiments.

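NOTE
A minimal sketch of running FRR as a systemd unit via a Podman Quadlet file (paths and image are illustrative):
    cat > /etc/containers/systemd/frr.container <<'EOF'
    [Container]
    Image=quay.io/frrouting/frr:9.1.0
    Network=host
    # FRR needs broad network privileges
    PodmanArgs=--privileged
    [Install]
    WantedBy=multi-user.target
    EOF
    # Quadlet generates frr.service from the unit above
    systemctl daemon-reload && systemctl start frr.service
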
20:23.360 --> 20:27.360
Maybe a faster data path: right now, this is kernel-based.

20:27.360 --> 20:33.360
We had a great talk by Michael earlier today.

20:33.360 --> 20:38.360
Maybe we can join forces and have a DPDK-based data path,

20:38.360 --> 20:43.360
or maybe this is going to end up being a POC and no one will use it.

20:43.360 --> 20:48.360
Wrapping up, I think I talked about a few things:

20:48.360 --> 20:53.360
how I went from knowing nothing about EVPN to having something that more or less works,

20:53.360 --> 20:58.360
using stuff that was all happening on my laptop, which made

20:58.360 --> 21:03.360
my productivity, like, super, super quick compared to having to deal with real routers,

21:03.360 --> 21:08.360
because every time I was missing something, I was just dropping everything and recreating it.

21:08.360 --> 21:12.360
So, that was it. Containerlab, if you are into networking,

21:12.360 --> 21:15.360
is a very, very, very nice project.

21:15.360 --> 21:22.360
A few resources, and then I'm done: FRRouting, my most favourite routing protocol implementation;

21:23.360 --> 21:30.360
containerlab; I have this repo and my personal blog, and the repo of the project;

21:30.360 --> 21:35.360
and the idea comes from my discussion with the Deutsche Telekom folks last year here at FOSDEM.

21:35.360 --> 21:40.360
They have a nice implementation; I think it's probably going to be converged at some point.

21:40.360 --> 21:45.360
They are pros, but the idea of running EVPN termination on the node comes from that.

21:45.360 --> 21:50.360
With that, I finished right on time. If you have any questions...

21:50.360 --> 22:00.360
Questions, anyone?

22:00.360 --> 22:19.360
Hi. Thanks for sharing your work. I look forward to reading through what you've put together,

22:19.360 --> 22:22.360
because I've personally been trying to solve...

22:22.360 --> 22:25.360
Can you raise your voice or put the microphone closer?

22:25.360 --> 22:29.360
I've been trying to solve this problem myself, and I was wondering,

22:29.360 --> 22:40.360
so from your example, we have MetalLB; MetalLB advertised the IP addresses via BGP.

22:40.360 --> 22:47.360
So, for ingress into the Kubernetes cluster, that just basically works with the way MetalLB works.

22:47.360 --> 22:53.360
Because, in that very specific example, I had to add static routes.

22:53.360 --> 22:59.360
But with FRR-K8s now, the MetalLB backend, you can also learn the routes on the node,

22:59.360 --> 23:04.360
so you can have the symmetric path for free, more or less.

23:04.360 --> 23:06.360
Yes.

23:06.360 --> 23:10.360
Good catch because there was a hole in that.

23:10.360 --> 23:17.360
So, I have a second question, then, have you considered how you might implement the same sort of logic for egress?

23:17.360 --> 23:22.360
If I have a pod in Kubernetes, I want to be able to talk to some other...

23:22.360 --> 23:25.360
But that is what my Calico example did, basically.

23:25.360 --> 23:31.360
So, the pod is part of the fabric, so the host was able to ping it, but not only that,

23:31.360 --> 23:35.360
the reply was getting back to the host on the other side,

23:35.360 --> 23:41.360
so that was all done with EVPN and VXLAN.

23:41.360 --> 23:42.360
Okay, I missed that.

23:42.360 --> 23:43.360
Thank you.

23:54.360 --> 24:01.360
First of all, thanks for your presentation, because it was mind-blowing from an architectural perspective, so many thanks.

24:01.360 --> 24:08.360
My question is, have you tried to play with the interconnection with the CNI? Or, to say it in another way:

24:08.360 --> 24:15.360
now we have a mechanism to connect the host, with some kind of VRF, to the networks.

24:15.360 --> 24:28.360
My question is, have you tried to connect the network of Kubernetes with multiple networks,

24:28.360 --> 24:33.360
putting the VRF on the CNI itself and connecting the networks?

24:33.360 --> 24:42.360
Like, VRF... CNIs are not equipped for that today, so it's easier to have that kind of exposure at node level.

24:42.360 --> 24:48.360
But of course, you would need to have orchestration with the CNI to be able to...

24:48.360 --> 24:52.360
The CNI would need to be VRF-aware, basically.

24:52.360 --> 24:53.360
Okay, thank you.

24:58.360 --> 25:05.360
Thank you.

