WEBVTT

00:00.000 --> 00:06.000
Thank you.

00:06.000 --> 00:08.000
Everyone can hear me?

00:08.000 --> 00:10.000
Yeah, okay.

00:10.000 --> 00:22.000
Hey, so hi everyone. I'm here to talk about a new project, which is called Grout.

00:22.000 --> 00:28.000
So I was supposed to present this with a colleague, Christophe, but he couldn't make it.

00:28.000 --> 00:31.000
So I will still talk about him.

00:31.000 --> 00:37.000
So he's a software architect at Red Hat working on NFV and Telco.

00:37.000 --> 00:46.000
And I work with him on the integration for OpenStack and the NFV

00:46.000 --> 00:51.000
Telco use case.

00:51.000 --> 00:56.000
So this is a quick overview of what I will talk about.

00:56.000 --> 01:01.000
First question would be why did we start a new project?

01:01.000 --> 01:05.000
Then what exactly do we want to do with it?

01:05.000 --> 01:11.000
And how did we manage to get this running in the beginning?

01:11.000 --> 01:18.000
I will then mention a few things about what we are planning in the near future.

01:18.000 --> 01:25.000
And give just a quick overview of the performance we have today.

01:25.000 --> 01:31.000
So, working on OpenStack for Telco operators.

01:31.000 --> 01:39.000
So we had a lot of requests both from our customers and also internally at Red Hat.

01:39.000 --> 01:50.000
How can we validate that an OpenStack cluster deployment will hold under a load of traffic

01:50.000 --> 01:57.000
when you have actual virtual machines with complex network processing happening?

01:57.000 --> 02:06.000
And until now, all we had was DPDK's testpmd, which is, I don't know if you're all familiar with that,

02:06.000 --> 02:11.000
but it's just literally a software wire that you can use.

02:11.000 --> 02:15.000
So it's just meant to test drivers at the beginning.

02:15.000 --> 02:24.000
But we eventually used that because it was the fastest way we had to forward packets from one place to another.

02:24.000 --> 02:30.000
Also we started discussing with the DPDK community two years ago.

02:30.000 --> 02:39.000
And there was some discussion about how we can make DPDK more visible.

02:39.000 --> 02:51.000
Because everybody thinks we're just writing drivers. So how can we show the world what can be done with DPDK, with a sort of flagship project or something like that?

02:51.000 --> 03:01.000
So this is where we said, why don't we build a vendor-independent solution?

03:01.000 --> 03:10.000
So the vendor-independent part was important, because when we certify that our platform will work,

03:10.000 --> 03:19.000
we cannot say, yes, we tested it with vendor XYZ's VNF, but we didn't test with ABC's.

03:19.000 --> 03:24.000
So it was hard to do. We needed something completely neutral.

03:24.000 --> 03:32.000
We also wanted to use DPDK as a first-class citizen.

03:32.000 --> 03:38.000
We didn't want to reinvent anything from scratch.

03:38.000 --> 03:48.000
There were also a few requirements: we wanted it to be programmable over an API.

03:48.000 --> 04:03.000
And for the first minimum viable product, we wanted to have IPv4 and IPv6 forwarding in multiple VRFs.

04:03.000 --> 04:11.000
So here, how did we get this done, up until today?

04:11.000 --> 04:20.000
We took a leap of faith: in DPDK, since approximately two or three years ago,

04:20.000 --> 04:24.000
there is a new library, which is called rte_graph.

04:24.000 --> 04:30.000
And it's been inspired by the node library from VPP.

04:30.000 --> 04:43.000
And basically what it gives you is a framework to do packet processing with packet vectors that you can pass from node to node.

04:43.000 --> 04:52.000
And it provided a strong foundation for something that we wanted to build with mostly DPDK libraries.
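
NOTE
As a reference, a minimal sketch of how a node looks with DPDK's
rte_graph library; the node and edge names here are invented for
illustration and are not grout's actual code.
    #include <rte_graph.h>
    #include <rte_graph_worker.h>
    /* Called with a vector of packets; forward it to the next node. */
    static uint16_t
    demo_process(struct rte_graph *graph, struct rte_node *node,
                 void **objs, uint16_t nb_objs)
    {
        /* Move the whole packet vector to next edge 0 ("pkt_drop"). */
        rte_node_next_stream_move(graph, node, 0);
        return nb_objs;
    }
    static struct rte_node_register demo_node = {
        .name = "demo",
        .process = demo_process,
        .nb_edges = 1,
        .next_nodes = { "pkt_drop" },
    };
    RTE_NODE_REGISTER(demo_node);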

04:52.000 --> 05:00.000
But then we had to ask ourselves a lot of questions, like how many RFCs we needed to read,

05:00.000 --> 05:11.000
and how many times we needed to read the Linux and BSD stacks to understand why they do this at this place and not at another.

05:11.000 --> 05:16.000
We also made a very strong decision.

05:16.000 --> 05:25.000
It was to make sure that every node in the graph was only taking care of its specific OSI layer.

05:25.000 --> 05:37.000
So we didn't want the IP output node to be writing Ethernet headers, that's just an example, but we didn't want to take any shortcuts.

05:37.000 --> 05:42.000
So this is the example I took here.

05:42.000 --> 05:48.000
So in the graph, we have a hardware Ethernet port receive node.

05:48.000 --> 05:56.000
So we get the packets from all ports and then send them on; for this example, we assume these are IP packets.

05:56.000 --> 06:04.000
We send them to Ethernet input, which will pop the Ethernet header, and possibly the VLAN header if there is one.

06:04.000 --> 06:14.000
It then sends the packet to IP input if it is IPv4; if it's not, it will send it to either IPv6 input or ARP input or something else.
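
NOTE
A hedged sketch of what this classification step could look like in an
rte_graph process callback; the edge names and layout are assumptions
for illustration, not grout's actual code (VLAN handling omitted).
    #include <rte_byteorder.h>
    #include <rte_ether.h>
    #include <rte_graph_worker.h>
    #include <rte_mbuf.h>
    enum { NEXT_IP_INPUT, NEXT_IP6_INPUT, NEXT_ARP_INPUT, NEXT_DROP };
    static uint16_t
    eth_classify(struct rte_graph *graph, struct rte_node *node,
                 void **objs, uint16_t nb_objs)
    {
        for (uint16_t i = 0; i < nb_objs; i++) {
            struct rte_mbuf *m = objs[i];
            const struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(m, const struct rte_ether_hdr *);
            rte_edge_t next;
            switch (rte_be_to_cpu_16(eth->ether_type)) {
            case RTE_ETHER_TYPE_IPV4: next = NEXT_IP_INPUT; break;
            case RTE_ETHER_TYPE_IPV6: next = NEXT_IP6_INPUT; break;
            case RTE_ETHER_TYPE_ARP:  next = NEXT_ARP_INPUT; break;
            default:                  next = NEXT_DROP; break;
            }
            rte_pktmbuf_adj(m, sizeof(*eth)); /* pop the L2 header */
            rte_node_enqueue_x1(graph, node, next, m);
        }
        return nb_objs;
    }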

06:14.000 --> 06:26.000
And we'll use the receive interface, which was attached as metadata to each packet, to perform a route lookup in the correct VRF.

06:26.000 --> 06:33.000
And then, if we need to forward it, we decrement the TTL and recalculate the checksum.
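
NOTE
For reference, that IPv4 forwarding step boils down to something like
this sketch; a full checksum recompute is shown for clarity, where an
incremental RFC 1624 update would be cheaper.
    #include <rte_ip.h>
    /* Returns -1 if the TTL would expire; such packets should go to
     * an error node (ICMP time exceeded) instead of being forwarded. */
    static inline int ipv4_forward_prep(struct rte_ipv4_hdr *ip)
    {
        if (ip->time_to_live <= 1)
            return -1;
        ip->time_to_live--;
        ip->hdr_checksum = 0;
        ip->hdr_checksum = rte_ipv4_cksum(ip);
        return 0;
    }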

06:33.000 --> 06:41.000
And then, from the next hop we got from the route lookup, we can determine that we can send that packet to that output interface,

06:41.000 --> 06:46.000
using that Ethernet destination address.

06:47.000 --> 06:57.000
And then we can follow that: from Ethernet output, we can push a new Ethernet header and write the correct destination MAC address.

06:57.000 --> 07:02.000
And finally, actually send the packet on the wire.

07:02.000 --> 07:11.000
So this was the simplified view. So now this is the current graph we have.

07:11.000 --> 07:15.000
And I highlighted the path I just described.

07:15.000 --> 07:25.000
So this is "simplified" in parentheses, because there are a lot of nodes which are hidden, which are error nodes.

07:25.000 --> 07:40.000
Like when we receive a bad checksum, or the TTL expires, or whatever, the packets will be sent to an error node, which is hidden here, because otherwise you wouldn't see anything.

07:40.000 --> 07:51.000
But what it gives us is that every time one packet passes through a node, we can get a statistic, like on this IP output node.

07:51.000 --> 07:55.000
Every time you see one packet going through, you will get statistics.

07:55.000 --> 08:08.000
And thanks to that, over the API, we can see, like, for example, we received four ARP requests, which were not for us, so we just dropped them.

08:08.000 --> 08:15.000
And you can see that in the stats. So that's what the graph gives us.
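
NOTE
This is not how grout exposes statistics over its own API, but for
reference, rte_graph ships a cluster stats helper that dumps per-node
call/packet/cycle counters; the "worker_*" graph name pattern below is
an assumption.
    #include <stdio.h>
    #include <rte_graph.h>
    #include <rte_memory.h>
    static void dump_graph_stats(void)
    {
        static const char *patterns[] = { "worker_*" };
        struct rte_graph_cluster_stats_param p = {
            .socket_id = SOCKET_ID_ANY,
            .fn = NULL,          /* NULL selects the built-in dump */
            .f = stdout,
            .nb_graph_patterns = 1,
            .graph_patterns = patterns,
        };
        struct rte_graph_cluster_stats *st;
        st = rte_graph_cluster_stats_create(&p);
        if (st == NULL)
            return;
        rte_graph_cluster_stats_get(st, 0); /* print current counters */
        rte_graph_cluster_stats_destroy(st);
    }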

08:15.000 --> 08:25.000
So for the high-level design: we have a graph, which is an opaque object using the rte_graph library.

08:25.000 --> 08:34.000
And then we pass a copy of the graph to each of the data plane threads that we use.

08:34.000 --> 08:45.000
And they do busy polling; it's basically DPDK, but in a little less aggressive way.

08:45.000 --> 09:03.000
When we have empty polls, that is, when busy polling returns zero packets, we have an incremental sleep mechanism, based on the size of the queues, the number of ports, and the speed of the connection.

09:03.000 --> 09:12.000
And from that, we can determine the maximum sleep period we can achieve without dropping any packets.
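
NOTE
A hedged sketch of such an incremental-sleep poll loop; this is not
grout's actual code, poll_all_rx_queues() is a hypothetical helper,
and max_sleep_us would be derived from queue sizes, port count and
link speed (e.g. a 512-descriptor queue at 14.88 Mpps fills in about
34 microseconds).
    #include <unistd.h>
    static unsigned int max_sleep_us; /* computed at configuration time */
    static unsigned int poll_all_rx_queues(void); /* hypothetical */
    static void rx_loop(void)
    {
        unsigned int sleep_us = 0;
        for (;;) {
            if (poll_all_rx_queues() > 0) {
                sleep_us = 0; /* traffic: back to pure busy polling */
            } else {
                if (sleep_us < max_sleep_us)
                    sleep_us++; /* idle: back off a little more */
                if (sleep_us > 0)
                    usleep(sleep_us);
            }
        }
    }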

09:12.000 --> 09:29.000
And then the control plane part is managed through a libevent loop, which maintains a bunch of structures: interfaces, next hops, routing tables and such.
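
NOTE
A minimal sketch of the shape of a libevent-based control loop,
assuming a socket fd carrying API requests; this is illustrative, not
grout's actual control plane code.
    #include <event2/event.h>
    static void on_api_request(evutil_socket_t fd, short what, void *arg)
    {
        /* Parse the API message, update interfaces, next hops,
         * routing tables, then send a reply. */
        (void)fd; (void)what; (void)arg;
    }
    int run_control_plane(int api_sock)
    {
        struct event_base *base = event_base_new();
        struct event *ev = event_new(base, api_sock,
                                     EV_READ | EV_PERSIST,
                                     on_api_request, NULL);
        event_add(ev, NULL);              /* no timeout */
        return event_base_dispatch(base); /* blocks, runs callbacks */
    }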

09:29.000 --> 09:38.000
So we have a modular framework where you can, for example, not compile IPv6 if you want.

09:38.000 --> 09:45.000
Everything is just bundled in an IPv6 module, so you can actually remove it.

09:45.000 --> 09:58.000
And so each module needs to come with its own set of nodes, with entry points into the subgraph, which can then be used at runtime to plug it at the correct location in the graph.
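
NOTE
A hypothetical sketch of what such module registration can look like;
the names are invented and this is not grout's actual API. Each module
declares its graph entry node and registers itself from a constructor.
    #include <sys/queue.h>
    struct module {
        const char *name;
        const char *entry_node; /* where it plugs into the graph */
        STAILQ_ENTRY(module) next;
    };
    static STAILQ_HEAD(, module) modules =
        STAILQ_HEAD_INITIALIZER(modules);
    #define MODULE_REGISTER(mod)                                   \
        __attribute__((constructor)) static void reg_##mod(void) { \
            STAILQ_INSERT_TAIL(&modules, &mod, next);              \
        }
    /* For example, in the IPv6 module's own file: */
    static struct module ip6_module = {
        .name = "ipv6",
        .entry_node = "ip6_input",
    };
    MODULE_REGISTER(ip6_module);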

09:58.000 --> 10:16.000
And every packet also carries its own metadata: for example, if a packet was received by Ethernet input, it will carry the information about the receive interface.

10:16.000 --> 10:35.000
And so for IP, on the control plane side, we have route tables, next hops, plus some API handlers, and we have a built-in CLI, which is just for manual interventions and stuff like that.

10:35.000 --> 10:46.000
So, where we are currently: we presented the first milestone of the project in September.

10:46.000 --> 11:04.000
And we had IPv4/IPv6 forwarding, VLAN, and IP tunnels, which we implemented just to reassure ourselves that the graph design was correct, that we could cycle back into the graph without having

11:04.000 --> 11:16.000
crazy issues. And also, we got the project accepted into the DPDK.org organization.

11:16.000 --> 11:23.000
And since then, so we actually fixed a lot of bugs.

11:23.000 --> 11:35.000
We've added loopback interfaces, which are a way to terminate connections, TCP and UDP, for routing daemons.

11:35.000 --> 11:43.000
So for example, you can run FRR: zebra, bgpd. We're currently working on that; it's not done yet.

11:43.000 --> 11:57.000
And so grout will create a loopback, which is a TUN interface, and when it receives a packet which is local, it will just send it to the loopback.

11:57.000 --> 12:03.000
So Linux will be able to process and terminate the connections.
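
NOTE
For context, creating a TUN interface on Linux is standard kernel API;
a minimal sketch (the interface name is caller-chosen):
    #include <fcntl.h>
    #include <linux/if_tun.h>
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    /* Returns an fd from which raw IP packets can be read()/written(). */
    int tun_open(const char *name)
    {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0)
            return -1;
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI; /* L3 only, no extra header */
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }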

12:03.000 --> 12:10.000
It's also available in Fedora, soon; I don't know exactly when Fedora 42 will be released.

12:10.000 --> 12:25.000
We have a packet tracing framework, which is mainly for debugging when you're developing a new feature and you don't understand where your packet is going.

12:25.000 --> 12:40.000
We had our first external contributor, who added a simple ICMPv6 stack, and also notifications.

12:40.000 --> 12:53.000
So this is just a short example, but you can see that we can trace packets, and every time a packet goes through a node, it can record some information.

12:53.000 --> 13:06.000
And so, for every packet, you will see that it went through every step, and you can understand why it entered that node, or which node caused the drop.
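
NOTE
A hypothetical sketch of the idea behind per-packet tracing; the
structure is invented, not grout's implementation. Each node appends
an entry that can be dumped when the packet is dropped or transmitted.
    #include <stdint.h>
    #define TRACE_DEPTH 16
    struct pkt_trace {
        uint8_t count;
        const char *nodes[TRACE_DEPTH]; /* visited nodes, in order */
    };
    static inline void pkt_trace_add(struct pkt_trace *t, const char *node)
    {
        if (t->count < TRACE_DEPTH)
            t->nodes[t->count++] = node;
    }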

13:06.000 --> 13:11.000
So what are we working on?

13:11.000 --> 13:16.000
As I said, we're working on FRR integration.

13:16.000 --> 13:36.000
So this is still work in progress, with the idea being to exchange routes through bgpd, and have zebra configure grout, but without going through the Linux kernel: a direct connection from zebra to grout.

13:36.000 --> 13:47.000
And so Christophe is actually working on that, and I am working on multipath routing.

13:47.000 --> 13:59.000
This requires some redesign of what we have, so we're working on that now.

13:59.000 --> 14:11.000
We don't have a roadmap, because we're two people working on the project semi full time.

14:11.000 --> 14:35.000
One of the first ideas we had: for every slow path operation, we currently need to kick the packet out of the data plane graph and pass it to a control plane thread to be processed. Like when you receive an ARP reply or something, you need to update your next hop tables.

14:35.000 --> 14:55.000
But then you lose the graph benefits, the statistics and stuff like that. So what we would like to have is a dispatch mode, which is currently supported by rte_graph, but you can only do that with polling mode for all threads.

14:55.000 --> 15:15.000
So for the control plane, the slow path, we would like to have something less aggressive, in order to pass packets from one thread to another and be able to do slow path operations while still preserving the graph aspect.
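
NOTE
One plausible shape for that thread-to-thread handoff, sketched with
an rte_ring; this is an assumption for illustration, not grout's
chosen design.
    #include <rte_mbuf.h>
    #include <rte_ring.h>
    /* Created once at init, e.g. rte_ring_create("slow", 1024, 0, 0). */
    static struct rte_ring *slow_ring;
    /* Data plane side: never block; drop if the ring is full. */
    static inline void slow_path_kick(struct rte_mbuf *m)
    {
        if (rte_ring_enqueue(slow_ring, m) != 0)
            rte_pktmbuf_free(m);
    }
    /* Control plane side: drain a burst and process each packet. */
    static void slow_path_drain(void)
    {
        void *objs[32];
        unsigned int n = rte_ring_dequeue_burst(slow_ring, objs, 32, NULL);
        for (unsigned int i = 0; i < n; i++) {
            /* e.g. parse an ARP reply and update next hop tables */
            rte_pktmbuf_free(objs[i]);
        }
    }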

15:15.000 --> 15:26.000
And another thing we're planning to do is to actually optimize the code, because currently it's just not fast.

15:26.000 --> 15:42.000
We haven't made any optimizations yet. And there are a few tests that we need to do: once the FRR integration is done, we need to actually test whether our control plane will hold.

15:42.000 --> 15:52.000
And we're also open to suggestions: if anyone is interested in the project, please share your ideas.

15:52.000 --> 16:01.000
Oh yeah, so that was the thing I talked about: how do we keep the slow path in the same graph?

16:01.000 --> 16:15.000
So I will not spend time on that. And to conclude: as I said, we didn't spend any time trying to make it fast.

16:15.000 --> 16:28.000
So obviously, this is a comparison of testpmd, which is just a software wire, to our IPv4 forwarding on one CPU.

16:28.000 --> 16:38.000
So I'm hoping we can make it better, and that's what I have.

16:38.000 --> 16:53.000
So if you're interested in the project, please go ahead and test it and provide some feedback. We're on GitHub, and we also have a Slack channel in the DPDK project organization.

16:53.000 --> 16:56.000
And thank you.

