WEBVTT

00:00.000 --> 00:11.600
Hello, welcome everyone, thank you for attending. I'm Fernando, I'm a software engineer

00:11.600 --> 00:16.440
at Red Hat, and today I'm going to talk about how we perform link aggregation with balance-slb,

00:16.440 --> 00:21.880
SLB or source load balancing, in kernel space with NetworkManager.

00:21.880 --> 00:27.760
So, just a quick introduction. You are probably familiar with it, but it's always good to remember

00:27.760 --> 00:29.840
what link aggregation is.

00:29.840 --> 00:36.160
In essence, it is a technique used in networking to bundle multiple interfaces into a single

00:36.160 --> 00:38.880
device that acts as a single link.

00:38.880 --> 00:44.760
Usually you have ports that are physical links, or they might be VLANs if you have

00:44.760 --> 00:50.560
a virtual network. In this case, for example, you have two NICs and you have a bond, and

00:50.560 --> 00:57.640
you could use that bond as an endpoint for your application. This can be used to have

00:57.640 --> 01:05.200
a fault-tolerant environment, doing load balancing and some other operations.

01:05.200 --> 01:10.160
And usually these links are connected to a switch or they are connected to a different

01:10.160 --> 01:12.880
ISP or things like that.

01:12.880 --> 01:17.640
So this is usually called bonding in Linux, because that is the driver implementation that

01:17.640 --> 01:23.440
we have. There's also team, but it's mainly bonding.

01:23.440 --> 01:28.800
Then you have multiple modes and options, which are what define how it's going

01:28.800 --> 01:34.840
to operate and what kind of behavior you're going to get from the bond. We are not going

01:34.840 --> 01:40.960
to focus on the options, because there are a lot of them and for our talk they are

01:40.960 --> 01:45.120
not very important, but we are going to focus on the mode.

01:45.120 --> 01:56.720
So yes, we have seven modes officially inside the kernel. The first one is

01:56.720 --> 02:04.280
balance-rr, round-robin. As it sounds, it just does round-robin, so it goes sequentially, picking

02:04.280 --> 02:05.640
every port.

02:05.640 --> 02:10.280
Then you have active-backup. This is pretty straightforward: you have one port that is active, and

02:10.280 --> 02:16.600
when there is a failover it switches to another of the configured ports.

02:16.600 --> 02:22.960
Then you have balance-xor, which is a very important one for us because it has a transmit

02:22.960 --> 02:28.840
hash policy. In essence it performs load balancing, but it does so based on a hash

02:28.840 --> 02:35.720
policy that you can define as an option, and the default policy is basically the source

02:35.720 --> 02:41.960
MAC address XOR the destination MAC address XOR the packet type ID. With this you get an

02:41.960 --> 02:48.440
identifier of the connection, and you make sure that a connection that matches

02:48.440 --> 02:53.240
the hash goes the same way, through the same link, all the time.
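
NOTE
Editor's sketch, not shown in the talk: a simplified Python model of the default
policy described above (source MAC XOR destination MAC XOR packet type ID).
Using only the last MAC octet and reducing the hash modulo the port count are
simplifying assumptions for illustration; the MAC addresses and port names are made up.
```python
def mac_bytes(mac):
    # Parse "aa:bb:cc:dd:ee:ff" into six raw bytes.
    return bytes(int(part, 16) for part in mac.split(":"))
def layer2_port(src_mac, dst_mac, ethertype, port_count):
    # Simplified: last octet of each MAC XOR the packet type ID, modulo ports.
    h = mac_bytes(src_mac)[5] ^ mac_bytes(dst_mac)[5] ^ ethertype
    return h % port_count
ports = ["eth1", "eth2"]  # hypothetical port names
first = layer2_port("52:54:00:aa:bb:01", "52:54:00:cc:dd:10", 0x0800, len(ports))
again = layer2_port("52:54:00:aa:bb:01", "52:54:00:cc:dd:10", 0x0800, len(ports))
print(ports[first], ports[again])
```
The same source, destination and packet type always map to the same port, which is the property the speaker relies on.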

02:53.240 --> 02:59.440
Then you have broadcast. It's pretty obvious: it's a broadcast, so it transmits all the packets

02:59.440 --> 03:02.160
through all the ports all the time.

03:02.160 --> 03:12.520
Then you have 802.3ad. This is a standard; it's dynamic link aggregation,

03:12.520 --> 03:18.000
and in order to operate it needs support on the switch side, on the remote switch; otherwise

03:18.000 --> 03:20.480
it's not going to work.

03:20.480 --> 03:27.320
In essence it's dynamic link aggregation based on the duplex and speed settings of the

03:27.320 --> 03:34.760
NICs, and other algorithms and settings that can be configured on the switch side.

03:34.760 --> 03:39.560
Then we have balance-tlb, which is basically adaptive transmit load balancing:

03:39.560 --> 03:45.440
based on the transmit load that each NIC has, it's going to measure it

03:45.440 --> 03:57.600
and try to have an even load on the different ports. And then we have balance-alb, adaptive

03:57.600 --> 04:03.800
load balancing, which is basically the same as balance-tlb, plus it also adapts

04:03.800 --> 04:08.880
the receive side for IPv4 traffic.
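
NOTE
Editor's sketch, not from the talk: the seven bonding modes just listed, as a
small Python lookup table. The mode numbers follow the kernel bonding
documentation; the describe() helper is a hypothetical name for illustration.
```python
# Mode numbers and names of the Linux bonding driver.
BOND_MODES = {
    0: "balance-rr",
    1: "active-backup",
    2: "balance-xor",
    3: "broadcast",
    4: "802.3ad",
    5: "balance-tlb",
    6: "balance-alb",
}
# Modes that only work if the remote switch cooperates through LACP.
NEEDS_SWITCH_LACP = {"802.3ad"}
def describe(mode):
    name = BOND_MODES[mode]
    suffix = " (needs LACP on the switch)" if name in NEEDS_SWITCH_LACP else ""
    return f"mode {mode}: {name}{suffix}"
for number in sorted(BOND_MODES):
    print(describe(number))
```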

04:08.880 --> 04:16.360
That's a quick introduction to link aggregation, and now I want to do another important

04:16.360 --> 04:21.720
introduction, which is OVS. I'm pretty sure that you have heard about OVS in this room

04:21.720 --> 04:29.160
a lot during the day, but what is Open vSwitch? Open vSwitch is a software

04:29.160 --> 04:36.960
switch implementation which is usually used as a switch in VM environments. You

04:36.960 --> 04:42.360
usually have a lot of VMs and you have Open vSwitch connecting all of them and doing

04:42.360 --> 04:49.920
other operations like bonding, which is also supported, but in this case it is OVS

04:49.920 --> 04:57.560
bonding. OVS bonding only supports three modes. First, LACP: LACP is a protocol that

04:57.560 --> 05:05.640
allows the switch to cooperate when you are performing bonding or link aggregation, so it requires

05:05.640 --> 05:14.440
LACP support on the remote switch. Once the negotiation is done, there is

05:14.440 --> 05:21.880
no special treatment for the packets. Then we have active-backup, same as the kernel one: it's

05:21.880 --> 05:28.800
one port up, and when there is a failover another port comes up. And then we have SLB, which is the

05:28.840 --> 05:34.840
important one for us today. As you can notice, in essence the first two modes are supported

05:34.840 --> 05:43.560
in the Linux implementation, but not SLB. What is SLB? SLB is load balancing

05:43.560 --> 05:49.240
without the remote switch's knowledge or cooperation, and it's basically using

05:49.240 --> 05:56.520
hashing like balance-xor, but this hash is based on the source MAC address plus the

05:56.520 --> 06:04.560
VLAN. This is usually used in spine-leaf situations, where the

06:04.560 --> 06:10.640
leaf has multiple VMs inside and you want to perform load

06:10.640 --> 06:18.520
balancing between the VMs inside every machine. So what is our goal? As you could

06:18.520 --> 06:24.240
see, OVS supports this but the Linux kernel doesn't, so we thought that NetworkManager

06:24.240 --> 06:31.880
could help here and find a way to implement this in kernel space. This

06:31.880 --> 06:36.240
can be useful for situations where you have a system where you don't want to use

06:36.240 --> 06:40.440
OVS, or OVS is not available, or maybe you already implemented everything with

06:40.440 --> 06:46.560
Linux bridges and Linux bonding, and you are using some options that are not

06:46.560 --> 06:53.880
supported in OVS, and in essence you don't want to move everything to OVS. The first

06:53.880 --> 07:00.520
thing that we needed to do was, of course, take a look at the driver in the kernel, because

07:00.520 --> 07:04.400
we needed some changes. Initially we thought that balance-xor was the ideal

07:04.400 --> 07:09.560
mode for us, because it was already doing

07:09.560 --> 07:15.560
load balancing based on a transmit hash, so it was ideal, but there was no hashing policy

07:15.560 --> 07:21.760
that matched what SLB is doing. So we asked for help and we got some kernel people

07:21.760 --> 07:29.640
working on it, and they introduced a new xmit_hash_policy option, which is

07:29.640 --> 07:36.320
vlan+srcmac. What it does is: you get the VLAN ID XOR

07:36.320 --> 07:43.040
the source MAC vendor XOR the source MAC dev. If we go to the code, this is

07:43.040 --> 07:50.720
indeed quite neat. I hope you can see it, sorry for the code, but in essence what it

07:50.720 --> 07:55.720
does is that it gets the source MAC vendor, it gets the source MAC dev, and checks if

07:55.720 --> 08:01.560
there is a VLAN ID. If there is no VLAN ID, it just XORs them; if there is a

08:01.560 --> 08:06.360
VLAN ID, it gets the VLAN ID and XORs all of them together. Just that, that's the

08:06.360 --> 08:11.000
policy. The rest of the implementation for this in the kernel bonding driver was just the
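
NOTE
Editor's sketch, not the kernel code shown on the slide: a Python rendering of
the vlan+srcmac hash as the speaker describes it (VLAN ID XOR source MAC vendor
XOR source MAC dev). Splitting the MAC into its first and last three octets and
the pick_port() helper with its modulo step are assumptions for illustration.
```python
def vlan_srcmac_hash(src_mac, vlan_id=None):
    octets = [int(part, 16) for part in src_mac.split(":")]
    # First three octets: vendor part; last three octets: device part.
    vendor = int.from_bytes(bytes(octets[:3]), "big")
    dev = int.from_bytes(bytes(octets[3:]), "big")
    if vlan_id is None:
        # Untagged traffic: only the two MAC halves are combined.
        return vendor ^ dev
    return vlan_id ^ vendor ^ dev
def pick_port(src_mac, vlan_id, port_count):
    # Assumption: the bond reduces the hash modulo the number of usable ports.
    return vlan_srcmac_hash(src_mac, vlan_id) % port_count
print(hex(vlan_srcmac_hash("52:54:00:12:34:56")))
print(hex(vlan_srcmac_hash("52:54:00:12:34:56", 100)))
```
Because only the source MAC and VLAN feed the hash, every packet from one VM on one VLAN leaves through the same port, regardless of destination.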

08:11.000 --> 08:15.200
netlink bits to be able to configure it, and sysfs, and all the

08:15.200 --> 08:21.480
configuration options. So you would think we already achieved

08:21.480 --> 08:27.640
balance-slb, and this is what we thought, but no, that's not the case. We noticed

08:27.640 --> 08:35.040
that OVS did additional work, solving some challenges that we found, and we

08:35.040 --> 08:41.760
took a long look at how they solved it, and it was impressive. So we started

08:41.760 --> 08:45.760
to think about how we could do that, and we thought that

08:45.760 --> 08:50.160
netfilter's nftables is the right candidate to help us here. So let's take a look

08:50.160 --> 08:56.000
at the challenges that we found. First of all, there are two situations where

08:56.000 --> 09:00.000
we can get packet duplication. The first one is when you get a broadcast or

09:00.000 --> 09:05.280
multicast from the remote switch. In essence, the remote switch receives from a

09:05.280 --> 09:12.800
non-SLB port a broadcast or multicast packet, and therefore the switch forwards

09:12.800 --> 09:17.600
it to all the ports in the SLB bond, and then in your bond you have

09:17.600 --> 09:23.800
duplicated packets, because all of those are going to reach the

09:23.800 --> 09:28.840
bond. To solve that, nftables was a good fit because we could

09:28.840 --> 09:33.400
redirect everything to the primary port. You can check if the packet is a broadcast

09:33.400 --> 09:39.400
or multicast, check if the connection is already hashed,

09:39.400 --> 09:42.520
and then

09:42.520 --> 09:46.680
send it directly to the primary port, and if it is not the primary port we can

09:46.680 --> 09:51.240
drop the packet. The other situation is a broadcast or multicast

09:51.240 --> 09:57.480
from the bond port: the remote switch receives a broadcast or

09:57.480 --> 10:04.760
multicast packet from an SLB port and forwards it to all the other links in

10:04.760 --> 10:09.880
the SLB bond, so you get the same packet back and you get duplication. Again we apply

10:09.880 --> 10:15.720
the same technique with nftables: we check if we hashed the VLAN ID

10:15.720 --> 10:20.760
and the source MAC, and if we did and it is not the primary port we can drop it, and

10:20.760 --> 10:26.840
this is it, in essence. The first thing that we have is that we use two

10:26.840 --> 10:33.480
nftables sets, the MAC sets: depending on whether we have a VLAN ID or we don't have

10:33.480 --> 10:38.840
a VLAN ID, we use two different sets. The first set is a combination of the source

10:38.840 --> 10:44.920
address plus the VLAN ID, and the second set is just the source address. Then we have

10:44.920 --> 10:50.200
three chains. The first chain is in essence just updating

10:50.200 --> 10:58.280
the sets that we defined above, nothing else. Of the other two chains, one of

10:58.280 --> 11:03.880
them is dropping looped packets and the other one is dropping

11:03.880 --> 11:17.240
broadcast packets. If we look at the details, we check if the set already contains the

11:17.240 --> 11:26.280
combination and then we do the drop, and on the other

11:26.280 --> 11:31.640
one we do basically the same: we check if it is a broadcast or a multicast and then we

11:32.200 --> 11:40.200
drop it. If you take a look at it, the first one, the one that is updating the sets, is on

11:40.200 --> 11:46.120
egress, because we want to update when the interface is transmitting, and the other ones are on

11:46.120 --> 11:52.600
ingress, so we check when the packets are arriving. Then we found one more problem:
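
NOTE
Editor's sketch, not the nftables rule set from the slides: a toy Python model
of the two sets and three chains described above. The class and method names
are hypothetical, entries never expire here, and applying the looped-packet
check before the broadcast/multicast check is an assumption.
```python
class SlbFilter:
    """Toy model: two MAC sets, one egress chain, two ingress chains."""
    def __init__(self, ports, primary):
        self.ports = set(ports)
        self.primary = primary
        self.mac_vlan_seen = set()  # (source MAC, VLAN ID) pairs the bond sent
        self.mac_seen = set()       # source MACs the bond sent untagged
    def egress(self, src_mac, vlan_id=None):
        # Chain 1 (egress): only record what the bond transmits.
        if vlan_id is None:
            self.mac_seen.add(src_mac)
        else:
            self.mac_vlan_seen.add((src_mac, vlan_id))
    def ingress(self, port, src_mac, vlan_id=None, multi_or_broadcast=False):
        # Chain 2 (ingress): drop packets that come back with our own source.
        if vlan_id is None:
            looped = src_mac in self.mac_seen
        else:
            looped = (src_mac, vlan_id) in self.mac_vlan_seen
        if looped:
            return "drop"
        # Chain 3 (ingress): broadcast/multicast is kept on the primary only.
        if multi_or_broadcast and port != self.primary:
            return "drop"
        return "accept"
slb = SlbFilter(["eth1", "eth2"], primary="eth1")
slb.egress("52:54:00:12:34:56", vlan_id=100)
print(slb.ingress("eth2", "52:54:00:12:34:56", 100, True))
print(slb.ingress("eth1", "52:54:00:99:99:99", None, True))
```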

11:52.600 --> 11:59.720
IGMP and MLD. This is usually used with top-of-rack switches. We noticed that

11:59.720 --> 12:04.520
top-of-rack switches may prune the multicast tree, and they announce the

12:04.520 --> 12:10.840
different multicast groups and so on. We noticed that there is a problem, because if the

12:10.840 --> 12:17.400
vlan+srcmac hashing happens and the pruned multicast is sent to a bond member where the

12:17.400 --> 12:26.360
RX filter will drop it, we will lose the pruned traffic. Therefore we noticed that only

12:26.360 --> 12:35.880
the primary bond member should handle this. Again nftables: we created a rule set, it's a chain

12:35.880 --> 12:44.920
that basically hooks on egress and checks the IGMP type;

12:44.920 --> 12:52.280
it checks if it is a membership report and then sends it to the primary port. In this case, for

12:52.280 --> 12:59.720
example, we have this rule for eth2, which is not the primary, but eth1 is, so we forward the

12:59.720 --> 13:08.040
packet there. And the same for ICMPv6: we check if we have an MLD listener report and then we do a forward.

13:08.040 --> 13:14.040
By the way, the counters that you might see there are because I was running this in debug:

13:14.360 --> 13:21.800
NetworkManager in debug adds some counters, so it's easy for us to debug any bug that we might

13:21.800 --> 13:29.720
find. And we found one more problem, which was a little bit troublesome, and if you're familiar with

13:29.720 --> 13:37.400
link aggregation you're probably also familiar with this. In a spine-leaf

13:38.120 --> 13:45.080
architecture, when there is a link failover and you are doing load balancing, it's

13:45.080 --> 13:50.600
very probable that the switch is not going to update the MAC table on time, not until there is an

13:50.600 --> 13:58.600
outgoing connection from the bond, so there is some time where you lose the

13:58.600 --> 14:04.840
connectivity between the switch and the bond for that specific port and the traffic is not

14:04.840 --> 14:12.680
redirected. We found that OVS already fixes this. I'm wondering why we don't do this in the kernel;

14:12.680 --> 14:19.000
I think I need to talk to some people, because this should be doable in the kernel and it would work

14:19.000 --> 14:27.240
for all the bond modes. By the way, this obviously happens only if you don't have

14:27.320 --> 14:34.040
LACP, because if you have LACP the switch will coordinate and this won't happen. So we solved

14:34.040 --> 14:40.920
all of this by reading the Linux bridge FDB table and sending a reverse ARP on the link

14:40.920 --> 14:49.080
failover event, and with that we were updating the MAC tables on the switches.
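
NOTE
Editor's sketch, not NetworkManager's code: what a reverse ARP announcement for
one MAC address from the bridge FDB could look like on the wire, built in
Python. The field layout follows the standard RARP frame format; the talk does
not show the exact frame NetworkManager sends, so treat the contents as an
assumption. Sending it would need a raw socket and privileges, which is left out.
```python
import struct
def build_rarp(mac):
    hw = bytes(int(part, 16) for part in mac.split(":"))
    broadcast = b"\xff" * 6
    no_ip = b"\x00" * 4
    # Ethernet header: broadcast destination, our MAC as source, RARP EtherType.
    eth_header = broadcast + hw + struct.pack("!H", 0x8035)
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, op=3 (reverse request).
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 3)
    body += hw + no_ip + hw + no_ip
    return eth_header + body
frame = build_rarp("52:54:00:12:34:56")  # made-up MAC learned from the FDB
print(len(frame), frame.hex())
```
A switch that sees this frame on the surviving link learns that the source MAC is now reachable through that port.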

14:49.080 --> 14:53.720
In essence, what happens is that you usually have this architecture with different

14:53.720 --> 14:59.480
VLANs in the bridge connecting to the VMs, for example; it can be veths, it doesn't matter.

14:59.880 --> 15:03.320
Then you have the bond and the three NICs that are connected to different switches

15:04.200 --> 15:08.920
that may or may not talk to each other, and then you have something on the other side; in this case I

15:08.920 --> 15:15.080
defined the same architecture, but it could be something else. If there is a connection, a stream

15:15.080 --> 15:22.920
going from link three through switch one to link one, sorry, from bond one to link one,

15:23.640 --> 15:32.440
and link one fails for whatever reason, the bond switches over, but switch two does not know that it can

15:32.440 --> 15:41.480
reach the MAC address of the bond through switch two, and therefore the connection is not

15:41.480 --> 15:47.720
redirected. It's a little bit cumbersome, but it happens, and we got bug reports from people

15:47.720 --> 15:54.760
that were experiencing this. We initially said yes, this is supposed to happen, it's a known

15:54.760 --> 16:01.160
problem in bonding, and they replied: it doesn't happen in OVS. We investigated, and OVS fixes it, so

16:01.960 --> 16:08.440
kudos to OVS. So we fixed it. And how did we do all of this? We integrated everything with

16:08.440 --> 16:14.920
NetworkManager: we introduced a new bond option in NetworkManager which is called balance-slb. In

16:14.920 --> 16:21.080
essence you just need to configure balance-xor, set the transmit hash policy to vlan+srcmac,

16:21.080 --> 16:28.280
and then set this option, balance-slb=1. When balance-slb=1 is set in the configuration,

16:28.280 --> 16:34.520
NetworkManager does two big things. Apart from configuring the bond, which is nothing special,

16:34.600 --> 16:41.480
it also monitors the link or carrier status of the different ports and, as

16:41.480 --> 16:47.080
needed, it modifies the nftables rule set that I showed before. So if, for example, you

16:47.080 --> 16:51.400
have your primary and you're redirecting the IGMP or the MLD to the primary,

16:51.400 --> 16:57.160
and the primary goes down for whatever reason, the rule set is going to change immediately

16:57.240 --> 17:08.040
to redirect the queries to a new primary. And then basically this is an nmcli command that you can use:

17:08.040 --> 17:12.600
it's nmcli connection add type bond, con-name whatever, ifname whatever,

17:12.600 --> 17:20.120
and the bond options mode=balance-xor, xmit_hash_policy=vlan+srcmac, balance-slb=1,

17:20.120 --> 17:25.640
and with that you should be all set. Of course you need to configure the ports on the other side

17:25.640 --> 17:33.080
as you might want, but this is the bond configuration, for simplicity. I don't know if you are

17:33.080 --> 17:40.360
aware of what nmstate is, but in essence it is a command-line tool, with an

17:40.360 --> 17:46.920
accompanying library, that allows you to perform host networking configuration in a declarative way.

17:46.920 --> 17:53.320
This is also integrated with nmstate, which uses NetworkManager as a backend, so this is the

17:53.320 --> 17:59.320
configuration that you will apply, and with this you will have everything set. Of course, notice that

17:59.320 --> 18:05.000
here the controller is br0, and br0 should have been configured before; maybe you don't have

18:05.000 --> 18:13.080
a controller, or maybe the controller has another name, or whatever. Bonus point: while we were doing this,

18:13.080 --> 18:20.760
we noticed that when updating the nftables rule set, sometimes we need to make sure that we

18:20.760 --> 18:29.080
clean up the whole table, and for that we did an nftables delete, trying to delete the table for

18:29.080 --> 18:35.080
every interface that we have. We were getting an error and we didn't understand why, until we found

18:35.080 --> 18:42.040
out that nftables delete fails when the object doesn't exist, and that's a problem when you are

18:42.040 --> 18:49.160
scripting. Our way to fix this was that we implemented in nftables a new destroy

18:49.160 --> 18:57.080
operation that avoids doing add and delete all the time to work around this delete problem. So if you

18:57.080 --> 19:03.720
are not sure that an object exists in nftables, now you can do nft destroy whatever and

19:03.720 --> 19:09.880
it's not going to fail. For example, you have here nft destroy table ip missingtable:

19:10.440 --> 19:17.640
it's going to return a zero error code. If the table exists it's going to delete it; if it doesn't

19:17.640 --> 19:23.320
exist it's not going to do anything at all. So yeah, that's an extra contribution that we were able
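
NOTE
Editor's sketch, not nftables code: a toy Python model of the difference the
speaker describes between delete (fails when the object is missing) and
destroy (succeeds either way). The Ruleset class is a hypothetical stand-in.
```python
class Ruleset:
    """Toy stand-in for a collection of tables keyed by (family, name)."""
    def __init__(self):
        self.tables = set()
    def add(self, family, name):
        self.tables.add((family, name))
    def delete(self, family, name):
        # Mirrors delete: an error if the object does not exist.
        if (family, name) not in self.tables:
            raise KeyError(f"no such table: {family} {name}")
        self.tables.remove((family, name))
    def destroy(self, family, name):
        # Mirrors destroy: remove if present, succeed in both cases.
        self.tables.discard((family, name))
rules = Ruleset()
rules.destroy("ip", "missingtable")  # no error, nothing to remove
print(len(rules.tables))
```
With destroy, a cleanup script no longer needs the add-then-delete workaround to stay error free.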

19:23.320 --> 19:28.360
to make, and it was pretty nice that when we did it other people reached out to us saying that it was

19:28.360 --> 19:34.520
really useful and that people were looking for something like this before. It was pretty simple to implement,

19:34.520 --> 19:42.760
so that's a nice thing. Important remarks: when the bond is connected to an OVS bridge, the link failover

19:42.760 --> 19:52.600
issue will still be there. The main reason is that OVS maintains its own FDB table,

19:52.600 --> 19:59.400
and right now there is no way that we can read it from the API. It's only possible to

19:59.400 --> 20:10.520
read it with a command-line tool, ovs-appctl, but not with OVSDB JSON-RPC, so we cannot

20:11.560 --> 20:19.080
read it at this point. Also, we think that's kind of fine, because we are expecting that if you

20:20.840 --> 20:26.200
have OVS in place you use the OVS bonding, because probably it's going to work better together

20:26.200 --> 20:32.840
than mixing the Linux bond and the OVS bridge. So yeah, that's mainly a tiny important remark:

20:32.840 --> 20:38.360
the bonding itself will work, but the failover situation will be there. If you don't mind that,

20:39.240 --> 20:44.920
it's completely fine to use, but again, when using OVS I personally recommend using OVS bonding.

20:45.720 --> 20:49.880
I'm not an OVS developer or anything like that, but from experience everything works better

20:50.600 --> 20:55.160
if you use the whole suite as much as you can instead of mixing things.

20:58.040 --> 21:04.440
Yeah, thank you for attending, thank you to the organizers, thank you very much.

