WEBVTT

00:00.000 --> 00:13.000
I actually started at 35, because that's one of the slides, but I don't want to keep you

00:13.000 --> 00:18.000
waiting, so we decided maybe I'll cut it off and edit the video later.

00:18.000 --> 00:21.000
Mine is actually a very short lightning talk today.

00:21.000 --> 00:27.000
Luca was on stage earlier talking about PacketFest in Zürich, and I can't give it enough advertisement,

00:27.000 --> 00:29.000
and you should come there if you like this type of thing.

00:29.000 --> 00:34.000
There will be a longer session talking about sFlow and Vector Packet Processing there too.

00:34.000 --> 00:39.000
In case you didn't notice, there are four talks back to back on VPP today.

00:39.000 --> 00:44.000
I'm talk number three of four, and after me are two gentlemen who also have excellent stuff

00:44.000 --> 00:45.000
to share.

00:45.000 --> 00:50.000
So I'm going to keep the boilerplate VPP stuff to a minimum today.

00:50.000 --> 00:55.000
My name is Pim. I work at a small company called IPng Networks.

00:55.000 --> 01:00.000
I've been in the RIPE community for a while, and IPng started in the pandemic because I was kind of bored,

01:00.000 --> 01:04.000
and we work on DPDK and VPP applications.

01:04.000 --> 01:10.000
We run a ring in Europe that is, I think, the number three best-connected ISP in Switzerland,

01:10.000 --> 01:14.000
which I always thought was kind of funny because it's kind of a basement ISP.

01:14.000 --> 01:20.000
I wanted to talk about sFlow today. It's a method for collecting traffic in switched and

01:20.000 --> 01:25.000
routed networks that uses statistical sampling, not flow probing of the entire flow.

01:25.000 --> 01:34.000
Originally RFC 3176, it has been superseded and is now at version 5 on the sFlow.org website.

01:34.000 --> 01:41.000
What it kind of does is it uses the data plane, typically ASICs, right, to copy forward one in N packets,

01:41.000 --> 01:45.000
where N is a large number, like 1,000 or 10,000 or a million.

01:45.000 --> 01:50.000
And it copies the first couple of bytes from those packets into a little buffer.

01:50.000 --> 01:57.000
It adds things like ingress and egress interfaces, MAC addresses, sampling parameters, all that type of stuff.

01:57.000 --> 02:01.000
And if you want, also extra metadata like which AS path, which AS number,

02:01.000 --> 02:06.000
which types of things we know about this packet as we see it in transit while sampling it.

02:06.000 --> 02:11.000
It also periodically reads interface counters, like how many multicast and unicast packets

02:11.000 --> 02:14.000
did we see, how many discards did we see, and so on.

02:14.000 --> 02:18.000
And then it stuffs these all together in datagrams,

02:18.000 --> 02:25.000
mostly UDP datagrams, and forwards these batches of samples plus the stats, like the counters,

02:25.000 --> 02:29.000
into a central place called a collector.

02:29.000 --> 02:35.000
And because a lot of this heavy lifting is done by the ASIC to do the copying and so on,

02:35.000 --> 02:42.000
it's easy to get tens of thousands of these agents in the field, like switches and routers,

02:42.000 --> 02:46.000
to talk to one or two central collectors, right?

02:46.000 --> 02:50.000
That can then get a holistic view over where the network is at.

02:50.000 --> 02:54.000
So VPP, you know, we talked about it a little bit, it's an open source data plane,

02:54.000 --> 03:00.000
similar to Grout, that provides super, super fast networking using DPDK,

03:00.000 --> 03:07.000
RDMA, Virtio, VMXNET3 if you're on VMware, AVF, AF_PACKET, all sorts of inputs really. It can easily

03:07.000 --> 03:14.000
do hundreds of millions, if not a billion packets per second, ask my buddy Vlad about that later.

03:14.000 --> 03:20.000
And we've seen on GCP as well, 100 gigabits is in the cards, like for a single VM.

03:20.000 --> 03:23.000
And it runs on commodity hardware, there's no binary blobs and things like that,

03:23.000 --> 03:29.000
so you can just download the packages from fd.io and install them on Debian or even Red Hat if you'd like.

03:29.000 --> 03:34.000
So the lineup today: we talked about fd.io, and Mohamed talked earlier about the GCP story

03:34.000 --> 03:42.000
and about Maglev, I'm going to talk a little bit about sFlow, and the two gentlemen after me are going to talk about TLS.

03:42.000 --> 03:47.000
So I just wanted to really quickly go back into that graph and talk about it a little bit.

03:47.000 --> 03:53.000
So a packet enters VPP, a directed acyclic graph, starting from an input node,

03:53.000 --> 04:00.000
for example, DPDK, which we talked about before, or RDMA, or Virtio; these are all in gray.

04:00.000 --> 04:07.000
And then it goes on to layer two; all these packets go into these nodes, well, all the Ethernet packets anyway.

04:07.000 --> 04:14.000
And then we sort out what type of Ethernet packet this is, IPv4 or IPv6, and hand it off to sort of vertical stacks,

04:14.000 --> 04:21.000
if you will, of graph nodes for IPv4, IPv6, MPLS, that type of stuff.

04:21.000 --> 04:26.000
And then it ends, hopefully, in an output node; typically we've figured out what to do with the packet,

04:26.000 --> 04:31.000
we're going to route it to this other destination, so we'll bring it back to Ethernet output,

04:31.000 --> 04:36.000
and then select an interface, put a MAC address in the Ethernet frame, and forward the thing on its merry way.

04:36.000 --> 04:40.000
That would be an output node in green, but then we also have drops.

04:40.000 --> 04:46.000
And actually we had an interesting question about where did that 0.004, you know, percent of drops come from.

04:46.000 --> 04:51.000
I would say, do not underestimate the importance of monitoring drops in your network.

04:51.000 --> 04:57.000
Like people talk about what made it through all the time, but we typically don't really know what happened when the packet died.

04:57.000 --> 05:02.000
Was it because of an ACL, was it because of input congestion, like we got

05:02.000 --> 05:08.000
descheduled for a microsecond by the kernel, or is it output congestion because there was no TX queue in the hardware

05:08.000 --> 05:13.000
that was willing to accept our packet, like we don't know, so it's actually really important to get visibility,

05:13.000 --> 05:17.000
and sFlow would be able to do that, with a little asterisk.

05:17.000 --> 05:25.000
So what we've now done is we take that graph in the middle, and we try to insert a new node that takes all these packets,

05:25.000 --> 05:29.000
does something with them, and then typically moves them on to the next graph node, right?

05:29.000 --> 05:33.000
So we insert this sFlow node into what we call the input arc.

05:33.000 --> 05:40.000
So that means every time a device had a set of packets that it was willing to give to VPP, we took them,

05:40.000 --> 05:44.000
and we looked at them, and we sampled maybe 1 in N, and then we moved them into,

05:44.000 --> 05:49.000
ostensibly, the next node, which would be Ethernet input in the vast majority of cases.

05:49.000 --> 05:53.000
And we're planning to do this as well on the green path here, the output node,

05:53.000 --> 05:56.000
so we can do egress sampling, that's actually pretty straightforward.

05:56.000 --> 06:02.000
And a little bit more tricky, at least in VPP, is drop monitoring because there is no easy mechanism for us

06:02.000 --> 06:07.000
to insert this node into that part of the graph that drops packets.

06:07.000 --> 06:12.000
So I'd like to talk to the VPP devs about that one a little bit.

06:12.000 --> 06:19.000
So then when you do this, that bubble that I inserted into the graph is called the sFlow worker node.

06:19.000 --> 06:23.000
And really what it does is it takes all the packets from the input and just moves them on to the output,

06:23.000 --> 06:26.000
which is kind of a little bit wasteful in a way.

06:26.000 --> 06:32.000
But one in N of these we can copy, and what we do is we create a set of FIFOs that are lock-free

06:32.000 --> 06:36.000
and multi-threaded and wait-free, and we just put the packet in there.

06:36.000 --> 06:40.000
And if the FIFO is full, we'll drop them.

06:40.000 --> 06:43.000
That's one way for us to make sure we don't overload the rest of the system.

06:43.000 --> 06:48.000
But typically, there's enough space in these first-in-first-out buffers for us in the worker

06:48.000 --> 06:55.000
to just add to the top of it, and then on the bottom side, we'll consume them somewhere else.

06:55.000 --> 07:00.000
So it's actually nice because this FIFO allows us to really just do what we do well in the data plane,

07:00.000 --> 07:06.000
which is shuffle packets around, and make all the rest of the more complicated stuff someone else's problem.

07:06.000 --> 07:13.000
And after messing around with it a little bit, Neil McKee from InMon, who did most of the implementation,

07:13.000 --> 07:17.000
got us down to nine CPU cycles for every packet we touch.

07:17.000 --> 07:21.000
And about 17 cycles, if we decide that we actually want to sample it.

07:21.000 --> 07:25.000
That's grabbing all the extra information and putting it in the FIFO, and so on.

07:25.000 --> 07:30.000
And just for reference, end-to-end, on at least this machine that I was testing on,

07:30.000 --> 07:32.000
a layer two cross-connect,

07:32.000 --> 07:35.000
ethernet-input to ethernet-output, is about 144 CPU cycles.

07:35.000 --> 07:38.000
So we're adding nine and a half or ten to that.

07:38.000 --> 07:39.000
It's not trivial.

07:39.000 --> 07:44.000
But IPv4 is about 211, and MPLS is about 219 when we do the load test.

07:44.000 --> 07:49.000
So it's not that terrible to add ten cycles, but obviously, ten cycles is ten cycles, right,

07:49.000 --> 07:52.000
that you could otherwise spend on doing something else.
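
NOTE
A rough back-of-the-envelope using the per-packet costs quoted above, just to put the overhead in perspective:
  9 / 144 cycles ≈ 6% extra on the layer two cross-connect path
  9 / 211 cycles ≈ 4% extra on the IPv4 path
  9 / 219 cycles ≈ 4% extra on the MPLS path
So the pass-through cost of the sFlow node is on the order of a few percent of the forwarding budget on this machine.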

07:53.000 --> 07:59.000
Anyway, we put all this stuff in the FIFO, and we move on so that the data plane workers can stay fast.

07:59.000 --> 08:05.000
And then every now and again, we wake up this other thread in main that says, hey, is there anything in these FIFOs

08:05.000 --> 08:06.000
that we should be aware of?

08:06.000 --> 08:09.000
And we just drain those FIFOs, right?

08:09.000 --> 08:13.000
And what it does, this main task, is it grabs these packet counters every now and again,

08:13.000 --> 08:15.000
every ten seconds or so.

08:15.000 --> 08:18.000
And it grabs all of these samples from the FIFOs.

08:18.000 --> 08:24.000
And it puts them in a Linux kernel construct called a PSAMPLE channel, which is a netlink mechanism,

08:24.000 --> 08:26.000
and it just gives them to the kernel.

08:26.000 --> 08:30.000
And then whoever would like to subscribe to those things can get them out the other side.

08:30.000 --> 08:32.000
This is another handoff point, right?

08:32.000 --> 08:37.000
So main in VPP is grabbing all the stuff from the FIFOs and the counters,

08:37.000 --> 08:43.000
and every now and again, emitting PSAMPLE messages into the kernel.

08:43.000 --> 08:46.000
That's the arrow in dark gray at the bottom there.

08:46.000 --> 08:51.000
And then there's a third component to this architecture called the host sFlow daemon, hsflowd,

08:51.000 --> 08:56.000
which is already existing code that most everyone in the sFlow world will have come across.

08:56.000 --> 09:00.000
And it then subscribes to netlink and gets all these packets back out.

09:00.000 --> 09:07.000
It takes the samples and so on, puts them into a UDP packet, and sends them to those collectors in the middle.

09:07.000 --> 09:08.000
Right?

09:08.000 --> 09:10.000
So that's the multi stage.

09:10.000 --> 09:14.000
The data plane stays super fast, the main thread just collates from the data plane,

09:14.000 --> 09:20.000
sends them to netlink, and then hsflowd will copy them to your collector.

09:20.000 --> 09:22.000
All right, so how do you configure this?

09:22.000 --> 09:25.000
I have this lab set up that I wanted to show.

09:25.000 --> 09:29.000
It's one machine called hungry hungry hippo because it always likes packets.

09:29.000 --> 09:35.000
And then a TRex load tester from Cisco, both running on a rather old Dell R730.

09:35.000 --> 09:37.000
Then I have two loops.

09:37.000 --> 09:42.000
One at the top really is a layer two cross-connect that is sFlow-enabled in VPP.

09:42.000 --> 09:47.000
We're sampling here, but also an IPv4 and MPLS loop that is sFlow-enabled and sampling here.

09:47.000 --> 09:51.000
And then in green and red down below, the exact same thing, but without sFlow.

09:51.000 --> 09:54.000
So we can see what is the impact on performance.

09:54.000 --> 09:58.000
I'm not going to go over the entire config, but if you were to run this on your own machines,

09:58.000 --> 10:01.000
download VPP, run it in Docker as we saw before.

10:01.000 --> 10:07.000
You could type these types of things and you would bring up that IPv4 or MPLS,

10:07.000 --> 10:13.000
receiving a label, label number 16, and pushing it out this other interface with label 17 imposed;

10:13.000 --> 10:17.000
that would be an MPLS P router in principle.

10:17.000 --> 10:21.000
Or a layer two cross-connect, which is really just taking the frames in on one interface,

10:21.000 --> 10:23.000
and copying them directly out the other interface.

10:23.000 --> 10:26.000
It's the cheapest thing that we maybe could do.
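
NOTE
A minimal sketch of the kind of VPP CLI configuration the slide refers to. The interface names are placeholders and the exact MPLS CLI syntax can differ between VPP releases, so treat this as illustrative rather than copy-paste:
  (MPLS P-router: swap incoming label 16 for outgoing label 17 towards a next-hop)
  mpls table add 0
  set interface mpls TenGigabitEthernet3/0/2 enable
  set interface mpls TenGigabitEthernet3/0/3 enable
  mpls local-label add 16 eos via 192.0.2.2 TenGigabitEthernet3/0/3 out-labels 17
  (Layer two cross-connect: frames in on one port go straight out the other)
  set interface l2 xconnect TenGigabitEthernet3/0/0 TenGigabitEthernet3/0/1
  set interface l2 xconnect TenGigabitEthernet3/0/1 TenGigabitEthernet3/0/0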

10:26.000 --> 10:30.000
To turn on sFlow, there are a couple of things that are defaults, and I put them here.

10:30.000 --> 10:33.000
The sampling rate would be 1 in 10,000 packets.

10:33.000 --> 10:38.000
The polling interval is, I think, 20 seconds by default, but I said, poll the interface.

10:38.000 --> 10:39.000
That's every five seconds.

10:39.000 --> 10:45.000
And then copy 128 bytes from each of the Ethernet frames you do sample into the PSAMPLE.

10:45.000 --> 10:49.000
And turn it on on these four interfaces up top there, right?

10:49.000 --> 10:53.000
And that means the four interfaces down below have sFlow disabled.
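
NOTE
The defaults and overrides described above would look roughly like this in the VPP CLI; the interface names are placeholders and option spellings may vary with the sFlow plugin version:
  sflow sampling-rate 10000
  sflow polling-interval 5
  sflow header-bytes 128
  sflow enable TenGigabitEthernet3/0/0
  sflow enable TenGigabitEthernet3/0/1
  sflow enable TenGigabitEthernet3/0/2
  sflow enable TenGigabitEthernet3/0/3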

10:53.000 --> 10:56.000
And then hsflowd is pretty simple.

10:56.000 --> 10:59.000
There's a collector where we're sending our samples to.

10:59.000 --> 11:03.000
And then we subscribe to PSAMPLE group equals 1.

11:03.000 --> 11:07.000
And we enable the VPP module, which is in hsflowd,

11:07.000 --> 11:11.000
which knows how to sort of take that stuff from PSAMPLE.

11:11.000 --> 11:13.000
We restart hsflowd.

11:13.000 --> 11:16.000
And pretty immediately we see all these samples coming in.

11:16.000 --> 11:21.000
And if we grep for the counters, we also see the counters come by every now and again.
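
NOTE
A minimal hsflowd.conf sketch matching the description above; the collector address is a placeholder and the exact module options may differ between host-sflow releases:
  sflow {
    collector { ip = 192.0.2.10 udpport = 6343 }
    psample { group = 1 }
    vpp { }
  }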

11:21.000 --> 11:25.000
So, operator notes, if you were to start using this stuff.

11:25.000 --> 11:30.000
VPP often has a data plane network namespace where it puts its interfaces.

11:30.000 --> 11:37.000
And collectors can be created either in the default namespace or in the namespace with the namespace option.

11:37.000 --> 11:42.000
And for Linux control plane, a very powerful feature in VPP,

11:42.000 --> 11:52.000
we will by default use Linux control plane interface IDs from the Linux side to sort of be more seamless with respect to our pollers.

11:53.000 --> 12:00.000
But if LCP is not loaded at all, or there exists no interface pair for this thing,

12:00.000 --> 12:03.000
or we literally tell hsflowd not to use them,

12:03.000 --> 12:07.000
then we will use the VPP representation of the interface.

12:07.000 --> 12:11.000
If you're curious as to why this is a big topic, come see me later.
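
NOTE
A hypothetical sketch of the two operator knobs just mentioned. The option names shown here (namespace on the collector, osIndex on the vpp module) are assumptions; check your host-sflow documentation for the exact spelling:
  sflow {
    collector { ip = 192.0.2.10 udpport = 6343 namespace = dataplane }
    psample { group = 1 }
    # osIndex = off would mean: report VPP interface indices rather than Linux (LCP) ones
    vpp { osIndex = off }
  }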

12:11.000 --> 12:14.000
So Act 3 is about performance.

12:14.000 --> 12:16.000
That's also the last thing that I'll talk about.

12:16.000 --> 12:21.000
I'm taking one of these 2012 Dells that has like 88 CPUs or something.

12:21.000 --> 12:25.000
It's not insanely expensive or powerful, a run-of-the-mill thing.

12:25.000 --> 12:33.000
And then I turned on a load test here with these pairs that I'll show: port zero is sending to port one and port one is sending to port zero.

12:33.000 --> 12:37.000
And they're doing IPv4 and MPLS with sFlow turned on.

12:37.000 --> 12:41.000
And then the layer two cross-connect, which is considerably cheaper, I guess.

12:41.000 --> 12:48.000
And then IPv4 and MPLS with sFlow turned off, and finally the layer two cross-connect, also without sFlow.

12:48.000 --> 12:53.000
And it's kind of nice: when you do 80 gigabits or 47 million packets per second,

12:53.000 --> 12:57.000
there's absolutely no difference in performance with sFlow turned on or off.

12:57.000 --> 12:59.000
They're both doing 40 gigs of L1.

12:59.000 --> 13:02.000
They're both doing 23.6 million packets per second.

13:02.000 --> 13:08.000
Except the left-hand side, with sFlow turned on, is also emitting 1 in 10,000 packets.

13:08.000 --> 13:14.000
This is easily 2,200, 2,300 packets per second that we're sampling.
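
NOTE
Sanity check on that figure: 23.6 million packets per second sampled at 1 in 10,000 works out to roughly 2,360 samples per second per direction, in line with the couple of thousand samples per second quoted here.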

13:14.000 --> 13:18.000
So I was curious to see what is actually the limit of this thing.

13:18.000 --> 13:24.000
So this table here shows, first, if you take a 10 gigabit interface with 64-byte packets,

13:24.000 --> 13:26.000
which is the smallest we're allowed to send.

13:26.000 --> 13:29.000
You'll get 14.88 million packets per second.

13:29.000 --> 13:33.000
And if I do that with IPv4 turned on, VPP does a lot more work.

13:33.000 --> 13:36.000
Of course, and it'll do about 10.8, 10.9.

13:36.000 --> 13:39.000
And with MPLS turned on, it's 10.1.

13:39.000 --> 13:40.000
That's just baseline.

13:40.000 --> 13:43.000
And then as soon as you turn on any sampling whatsoever,

13:43.000 --> 13:45.000
we insert this node in the graph, remember?

13:45.000 --> 13:48.000
And what it's doing is it's copying all the packets to the next node again,

13:48.000 --> 13:50.000
which takes 10 CPU cycles.

13:50.000 --> 13:53.000
So there's an immediate regression when you turn this thing on at all.

13:53.000 --> 13:59.000
So we go from 14.88 to 14.3 million packets per second when it is turned on

13:59.000 --> 14:04.000
with an artificially large sampling interval, like 1 in 1 million or so.

14:04.000 --> 14:06.000
But then it's not actually too bad.

14:06.000 --> 14:11.000
If you ratchet it up and go to 1 in 10,000, 1 in 1,000, even 1 in 100,

14:11.000 --> 14:16.000
which your vendor would not recommend you do on hardware most of the time.

14:16.000 --> 14:18.000
The further regression is not that bad.

14:18.000 --> 14:24.000
So for layer 2 cross-connect, we're going from 14.3 to 14.15 million packets per second,

14:24.000 --> 14:26.000
while sampling 1 in 100 of them.

14:26.000 --> 14:30.000
And that's where we also see that we start dropping these samples,

14:30.000 --> 14:35.000
selectively, and by choice, in the data plane, because otherwise we would be overloading the rest of the system.

14:35.000 --> 14:39.000
And so by design, we have this FIFO that is of limited size, and when it gets overfull,

14:39.000 --> 14:41.000
we just drop them from the tail.

14:41.000 --> 14:46.000
And that's what you see here in that last column down below: 1.8 million dropped samples.

14:46.000 --> 14:50.000
If you build a larger FIFO, you would get more throughput, I think.

14:50.000 --> 14:54.000
But it's really nice to see that this quad-loop overhead that we saw before,

14:54.000 --> 14:59.000
really is just in moving the packets around, which we can do in 9 CPU cycles per packet,

14:59.000 --> 15:04.000
which is pretty great if you think about it, but it's still an overhead.

15:04.000 --> 15:07.000
And now I want to show some interoperability slides.

15:07.000 --> 15:11.000
Again, this may be how you might show this stuff on the other side.

15:11.000 --> 15:16.000
InMon has a tool called sFlow-RT, where you can just point your sFlow agents at it,

15:16.000 --> 15:19.000
and here's a screenshot from that.

15:19.000 --> 15:22.000
And this is Akvorado, that's one of my personal favorites,

15:22.000 --> 15:24.000
actually a fabulous piece of software.

15:25.000 --> 15:31.000
But if you're here, I would like to talk to you about counter samples and ifNames

15:31.000 --> 15:33.000
in the Akvorado inlet.

15:33.000 --> 15:36.000
We also use ntopng, with many thanks to Luca,

15:36.000 --> 15:40.000
who personally went and added the sFlow collector to the open source version of it

15:40.000 --> 15:42.000
as we talked about this project.

15:42.000 --> 15:48.000
And you can see all the graphs really nicely line up in ntopng as well.

15:48.000 --> 15:51.000
Thanks to Vlad for this beautiful picture.

15:51.000 --> 15:55.000
These are like really big machines converting like two kilowatts of power into heat,

15:55.000 --> 16:01.000
after they forwarded a terabit plus of traffic through the middle machine there on one VPP instance.

16:01.000 --> 16:02.000
Okay, thank you.

16:02.000 --> 16:05.000
Sorry, there is no time for questions, so we move to the next talk.

16:05.000 --> 16:12.000
Thank you.

