WEBVTT

00:00.000 --> 00:22.200
For whatever reason, this is not full screen, but you know, all right, hey, welcome

00:22.200 --> 00:27.160
everybody to our little presentation about how to break things.

00:27.160 --> 00:29.680
My name is Marco Salbe, I'm fairly technical.

00:29.880 --> 00:32.880
Let's just skip all this, we don't have too much time.

00:32.880 --> 00:38.280
Let's go. Short agenda: why break things, and how to.

00:38.280 --> 00:42.600
Why? It's fun, and it helps you reproduce bugs.

00:42.600 --> 00:47.480
It helps you learn new software, you know, like you get a new database, you need to learn,

00:47.480 --> 00:52.080
you can start breaking it and see what it shows in the logs, and that gives you a very

00:52.080 --> 00:54.880
good experience, right?

00:54.880 --> 01:00.240
And then, I think what everybody is interested in here, it helps you test software.

01:00.240 --> 01:05.080
And you may tell me, reproduce bugs, but if the disk is full, it's not a bug, right?

01:05.080 --> 01:09.320
Thing is, what is nice to have is this, right?

01:09.320 --> 01:18.800
Proper handling of errors and elegant retry or elegant abortion that gives us information

01:18.800 --> 01:24.440
and that we know it's going to be handled in a consistent way in production.

01:24.440 --> 01:30.640
So you know, this ran out of space on disk and it says, I'm going to retry for minutes.

01:30.640 --> 01:33.840
So it knows that space can come back.

01:33.840 --> 01:39.120
So how you get to have this is, you need to have a test where the disk space is fully

01:39.120 --> 01:40.680
consumed.
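
NOTE
A minimal sketch, in Python, of the "elegant retry" behaviour described above: retry a write when the disk is full, then fail loudly.
The names write_with_retry and make_flaky_writer are hypothetical and not from the talk; the flaky writer stands in for a test where disk space is fully consumed.
```python
import errno
import time
def write_with_retry(write_once, retries=3, wait_seconds=0.0, sleep=time.sleep):
    """Call write_once(); on a disk-full error, wait and retry, then give up loudly."""
    for attempt in range(1, retries + 1):
        try:
            return write_once()
        except OSError as exc:
            # Only "No space left on device" is treated as retryable here.
            if exc.errno != errno.ENOSPC or attempt == retries:
                raise
            sleep(wait_seconds)
def make_flaky_writer(failures):
    """Hypothetical test double: raises ENOSPC `failures` times, then succeeds."""
    state = {"left": failures}
    def write_once():
        if state["left"] > 0:
            state["left"] -= 1
            raise OSError(errno.ENOSPC, "No space left on device")
        return "written"
    return write_once
```
In a real test the failure would come from a genuinely full file system rather than a test double; the retry count and wait time are illustrative.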

01:40.680 --> 01:45.720
So what we're going to be looking at is precisely this.

01:45.720 --> 01:52.240
This was a very long presentation, it was planned for 90 minutes, so I had to cut it down

01:52.240 --> 01:53.640
quite substantially.

01:53.640 --> 01:57.160
Sorry for that, I can give you all the slides later.

01:57.160 --> 02:02.720
And we're going to go through the slides, but I'm going to be doing mostly demo.

02:02.720 --> 02:05.720
So we're going to switch to terminal in a minute.

02:05.720 --> 02:14.680
So we're going to see tc qdisc, Toxiproxy, CharybdeFS, and cgroups, prlimit,

02:14.680 --> 02:19.000
taskset, sysfs, and hopefully we'll get time for strace.

02:19.000 --> 02:20.880
So let's get started.

02:20.880 --> 02:27.560
tc qdisc is the traffic control queueing discipline.

02:27.560 --> 02:36.920
And it helps you inject latency, corruption, reorder the packets in your TCP stream.

02:36.920 --> 02:42.440
So essentially, it's a really, really useful tool.

02:42.440 --> 02:48.040
It's the Swiss Army knife of network poking.

02:48.120 --> 02:52.200
So, corrupt packets, again, duplicate packets, limit transfer rate, etc.

02:52.200 --> 02:55.640
Let's go very quickly and make that demo.

02:55.640 --> 03:00.120
So I have a few, let's clear this out.

03:00.120 --> 03:03.480
Quick, just one second.

03:03.480 --> 03:07.120
I was keeping those alive.

03:07.120 --> 03:21.360
So I have some Docker machines here, Docker, C, T, Mark, one, oops, I'm already there.

03:21.360 --> 03:23.520
Oh, I was already there.

03:23.520 --> 03:37.080
Oh, that was, Jesus, I'm sorry, Docker, C, T, Mark, C, T, Mark, C, T, Mark, C,

03:38.040 --> 03:46.680
so I'm just going to do very simple, let me go copy, paste my stuff.

03:46.680 --> 03:57.960
So let's start a listener here, good Lord, why are you not copying things?

03:58.520 --> 04:04.040
Ah, the gods of demos are not with me today.

04:05.160 --> 04:06.760
Jesus, what's wrong with you?

04:09.400 --> 04:10.360
Say that again?

04:10.360 --> 04:13.080
You're already in the container.

04:13.080 --> 04:17.560
Oh, Jesus, yeah, I got nervous, I'm sorry, I'm new to this.

04:18.600 --> 04:25.800
So I have a listener, you know, and then let's set up a stream that is going to push bytes in there.

04:26.680 --> 04:29.880
And you can see it's running at a hundred megabits.

04:29.880 --> 04:35.960
And what we're going to do is find out the interface for those Dockers.

04:37.720 --> 04:41.480
And okay, that's not going to be the whole thing, perfect.

04:43.480 --> 04:49.000
And then the bridge ID is going to give us, oops, something didn't work.

04:49.080 --> 04:51.480
Okay, that's expected, Jesus, what a day.

04:54.760 --> 04:57.480
Okay, come on.

04:59.800 --> 05:00.840
I promise I practiced.

05:05.160 --> 05:07.880
Okay, that's good, and then bridge.

05:13.720 --> 05:14.920
Okay, that's good.

05:15.000 --> 05:21.400
And now I can do, get the IP address, okay.

05:21.400 --> 05:27.320
And now I can actually use that bridge ID, oops, with tc qdisc.

05:27.320 --> 05:34.600
And I will add a qdisc to that device, and I'm going to say delay, right?

05:34.600 --> 05:42.520
I'm going to add some delay between 500 microseconds to 10 milliseconds to 25% of the packets.

05:45.160 --> 05:49.800
And, yep, there you go.

05:49.800 --> 05:53.400
And, you know, my rate dropped substantially, right?

05:54.280 --> 06:01.560
And it's one megabyte, and we can show, oops, oops, oops.

06:07.160 --> 06:07.560
This one?

06:08.520 --> 06:11.800
Yeah, it's here.

06:17.080 --> 06:18.760
Ah, I'm not sure if I can do this.

06:18.760 --> 06:19.400
This, come on.

06:22.440 --> 06:23.960
Okay, yeah, sorry.

06:26.040 --> 06:31.240
Okay, so again, essentially root is the root queue.

06:31.240 --> 06:35.320
netem is the network emulator, which is the qdisc you are injecting.

06:35.880 --> 06:41.400
And then, for the netem emulator, delay is the function, and then the parameters.

06:43.400 --> 06:51.160
There are, you know, you can, again, instead of adding delay, you can

06:52.440 --> 07:00.040
retry or read, sorry, reorder or you could add bandwidth limit.

07:00.040 --> 07:07.640
So, the different commands for the same netem, which is, again, the type of network you're adding.

07:07.640 --> 07:15.320
And you can see here, it is showing me that there is a qdisc netem on the root,

07:15.960 --> 07:21.720
and it's between 400 milliseconds, 400 microseconds, and 10 milliseconds.

07:22.520 --> 07:23.800
And then we can remove it.

07:24.360 --> 07:30.360
And there we go, and it goes back to 100 megabits.
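
NOTE
A small Python sketch of the add, show and remove cycle from this demo. It only builds and prints the tc command lines; nothing is executed, and running them needs root.
In netem, the arguments after "delay" are the base delay, an optional jitter and an optional correlation. Mapping the numbers mentioned in the talk to 10ms, 500us and 25% is an assumption, and "veth1234" is a hypothetical interface name.
```python
import shlex
def netem_delay_commands(dev, delay="10ms", jitter="500us", correlation="25%"):
    """Return the add/show/delete tc commands as argv lists (nothing is executed)."""
    add = ["tc", "qdisc", "add", "dev", dev, "root", "netem", "delay", delay, jitter, correlation]
    show = ["tc", "qdisc", "show", "dev", dev]
    delete = ["tc", "qdisc", "del", "dev", dev, "root", "netem"]
    return add, show, delete
# "veth1234" is a hypothetical interface name; use the one found for the container.
for argv in netem_delay_commands("veth1234"):
    print(shlex.join(argv))
```
Each printed line can be pasted into a root shell, or the argv lists can be handed to subprocess.run.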

07:32.440 --> 07:36.040
Okay, so again, this is one way to do it.

07:36.040 --> 07:37.960
You could also do it at the host level.

07:39.640 --> 07:48.040
Let me go out of here, and then I need the V4.

07:49.800 --> 07:52.760
And these guys are using host network.

07:52.840 --> 07:55.400
So, will you see it?

07:59.400 --> 08:02.680
Oh no, I did that one for toxic proxy, I'm sorry.

08:02.680 --> 08:04.920
I thought that I did it for tc qdisc.

08:04.920 --> 08:12.680
Okay, then let's go to the next tool, which is, again, Toxiproxy, and, okay, many, many.

08:12.680 --> 08:18.840
Okay, Toxiproxy, this was developed by the guys at Shopify, and again, it also allows you to do

08:18.840 --> 08:26.360
networking errors, latency, limit bandwidth, trigger timeouts, and it has a REST API.

08:26.360 --> 08:31.480
So, you can actually reach it through curl or whatever HTTP library you have.

08:32.200 --> 08:41.000
And again, very simple, you create a configuration with a given name, oops, sorry.

08:41.640 --> 08:50.760
And then, what port the proxy will listen, and what port will your application connect to go through

08:50.760 --> 08:54.600
the proxy. And then you simply start Toxiproxy like that.

08:55.880 --> 09:05.320
Yes, I already have it running here. Let's see. Let's do it local, the host here.

09:05.480 --> 09:20.600
So, I had to run my container using net host, and I'm exposing this port, and I'm running

09:20.600 --> 09:30.680
just the Shopify, oops, so do I make this bigger, you know, using the Shopify Toxiproxy container.

09:31.640 --> 09:41.640
And then, I will set up my container here, and just set up my listener, what's wrong with you,

09:41.640 --> 09:58.520
and oops, oops, oops, already in use, okay, why? Oh, come on, oh, let's try it before.

10:01.560 --> 10:07.800
I was also testing with netserver, but it was just way complicated. Okay, so we have a listener

10:08.760 --> 10:17.880
and on my other Docker, I will do the pushing, and I'm not sure if I already have my proxy running.

10:17.880 --> 10:22.360
No, okay, I don't have my proxy running, so I have to go there and start it.

10:38.120 --> 10:48.280
Then I will have to, this will actually create the configuration, instead of having it in a JSON file,

10:48.280 --> 10:55.640
I will just have a command through the Toxiproxy CLI, and let's see how that goes.

11:02.200 --> 11:05.800
Okay, and you can see on the left that it created a new proxy,

11:06.360 --> 11:16.040
and now it's listening on the 666, on 999, sorry, so I can do this, and you can see this guy is on 666,

11:16.040 --> 11:25.000
and this guy is on 999, and it's pushing bytes through the proxy. So now we can go ahead and add what's called

11:25.160 --> 11:35.160
a toxicity, a toxic, and so again using the Toxiproxy CLI, I do toxic add, and I want to add latency,

11:36.440 --> 11:43.080
and I want to add 1000 microseconds of latency, and I'm going to give that a name,

11:46.200 --> 11:54.520
and it's going to work on the NCT stream, which if you notice is the one I used here,

11:55.480 --> 12:06.200
so it corresponds with that name, and oops, not there, here,

12:09.400 --> 12:18.760
and you can see that the rate has plummeted, and we can go ahead and then remove the toxicity,

12:19.560 --> 12:31.160
the toxic, oops, again, but I'm sorry, and you can see that the rate is at full speed again.
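
NOTE
A hedged Python sketch of the same steps through Toxiproxy's REST API instead of the CLI: create a proxy, add a latency toxic, remove it.
The requests are built but not sent. The proxy name, toxic name and ports are hypothetical, not the ones from the demo. Toxiproxy's API listens on port 8474 by default, and its latency attribute is expressed in milliseconds.
```python
import json
import urllib.request
API = "http://localhost:8474"  # Toxiproxy's default HTTP API address
def post(path, payload):
    """Build (but do not send) a JSON POST request for the Toxiproxy API."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(API + path, data=data, method="POST", headers={"Content-Type": "application/json"})
def create_proxy(name, listen, upstream):
    return post("/proxies", {"name": name, "listen": listen, "upstream": upstream, "enabled": True})
def add_latency(proxy, toxic_name, latency_ms):
    body = {"name": toxic_name, "type": "latency", "stream": "downstream", "toxicity": 1.0, "attributes": {"latency": latency_ms, "jitter": 0}}
    return post("/proxies/" + proxy + "/toxics", body)
def remove_toxic(proxy, toxic_name):
    return urllib.request.Request(API + "/proxies/" + proxy + "/toxics/" + toxic_name, method="DELETE")
# Hypothetical names and ports; pass each request to urllib.request.urlopen() to send it.
requests = [create_proxy("nc_stream", "127.0.0.1:9999", "127.0.0.1:6666"), add_latency("nc_stream", "slow", 1000), remove_toxic("nc_stream", "slow")]
for req in requests:
    print(req.get_method(), req.full_url)
```
Building the requests separately from sending them keeps the sketch safe to run without a Toxiproxy server.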

12:32.840 --> 12:43.880
Okay, next, CharybdeFS, it's hard to pronounce, has a very fancy and poetic license,

12:44.840 --> 12:54.120
and it can inject latency, errors, and it can affect specific I/O syscalls, so there is a

12:54.120 --> 13:01.320
large list of syscalls, I'll show you in a second, and this works on top of FUSE, so you will need to

13:01.320 --> 13:11.880
have FUSE installed, and let's go, let's go, let's go, let's go,

13:13.240 --> 13:21.400
so I did modprobe fuse, I created my, oops, sorry, I created my application directory,

13:22.040 --> 13:31.000
and a CharybdeFS backend directory, and then I ran CharybdeFS, and I said on application that

13:31.000 --> 13:45.320
I didn't even mount the CharybdeFS backend, and then I can do cd slash, won't application,

13:46.120 --> 13:56.360
and I would run a sysbench, I guess, no, let me just copy from here,

14:01.560 --> 14:07.960
okay, I already ran the sysbench prepare, okay, so I'm just going to execute the sysbench,

14:08.600 --> 14:21.560
and I'm inside the application database, so sysbench is going to run on the fake file system we have,

14:22.200 --> 14:27.560
and there you have, it's showing us some rate, and then I can go here, and I can say,

14:29.560 --> 14:34.680
I will do netstat, and I'm going to check it's running, and it should be running on 1990,

14:34.760 --> 14:50.280
so, and it's there, it's established, so we have a working file system, and I will,

14:52.920 --> 14:57.160
it has a set of examples, for recipes, very simple,

15:05.000 --> 15:13.320
there we go, so, and it has like, these are examples, but very easy to extend, so I'm just

15:13.320 --> 15:25.720
going to do a recipe, perhaps, delay, and we should see, why it's not working,

15:25.720 --> 15:37.560
that should be working, I'm going to write this, why is that not working,

15:42.840 --> 15:49.640
well, that's quite unexpected, this is benchmark,

15:49.640 --> 16:07.000
well, that should be fine, I don't know why it's refusing to delay it, that's very weird,

16:09.880 --> 16:16.200
and actually it should bring me back, you know, when I do recipes delay, it should bring me,

16:16.200 --> 16:25.400
it should give me my prompt back, but this is quite embarrassing, okay, I'll skip this one,

16:25.400 --> 16:34.600
I'll come back to it, if time allows, this is really cumbersome, oh, mode, oh, shift, oh, there's

16:34.600 --> 16:45.240
V, oh, I'm not quite sure, yeah, multiplication that I did, am I there, yeah, okay,

16:48.360 --> 16:51.800
let me, okay, I know, perhaps, let's just clear,

16:52.760 --> 17:08.440
no, something, so it looks like a bug in CharybdeFS that, really, it should work just like that,

17:08.440 --> 17:15.160
I'm super sorry, promise I tested all these, have no clue what's not working, let me just try

17:15.160 --> 17:30.600
one more, what if I do full, let's try this, all the recipes, who, no, that should make it fail instantly,

17:31.960 --> 17:38.600
okay, again, it does look like a bug, I'm not sure why, but I'll look it up later, let's move on

17:38.600 --> 17:46.120
because we don't have much time, okay, cgroups, you know, is what you do containers with,

17:46.120 --> 17:53.160
and you can also limit a number of resources using cgroups, and what we're going to do is,

17:53.480 --> 18:16.520
we have our C, C, cgroup and container, okay, oh, and here I created a slow disk

18:16.600 --> 18:22.200
directory, you just create, you know, like let's do it again, slow, and once you create the

18:22.200 --> 18:32.200
directory, it will create a number of virtual files that are actually an interface to the cgroup

18:32.280 --> 18:39.560
configuration, so now what we can do is, we run that sysbench again,

18:40.760 --> 18:53.480
sysbench, okay, okay, it's just not,

18:53.480 --> 19:15.480
and, okay, and now I can, um, pgrep -x sysbench, and I can do echo 5, 6, 7, 6,

19:15.480 --> 19:25.080
into our cgroup.procs, so I'm adding sysbench to this cgroup, and now I'm going to throttle,

19:26.440 --> 19:29.400
and I can do that,

19:33.800 --> 19:44.920
what is it here? So, I will, um, find the ID of the device

19:44.920 --> 19:58.280
is 259:0, and then you do, oh, come on, and this is not slow disk, it's just slow, so I'm just

19:58.280 --> 20:07.960
going to send, uh, I'm going to limit to 15 megabytes on the write bps device, so, and once you do that,

20:08.280 --> 20:12.200
oh, God, why is it not working?
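
NOTE
A minimal Python sketch of the throttling steps attempted here: put a process into a cgroup, then cap its write bandwidth on one block device.
It assumes the cgroup v1 blkio controller mounted at /sys/fs/cgroup/blkio, which the talk does not confirm. The group name, PID and device ID are illustrative, and the code only prints what would be written.
```python
import os
CGROUP_ROOT = "/sys/fs/cgroup/blkio"  # assumption: cgroup v1 blkio controller mounted here
def throttle_plan(group, pid, device, megabytes_per_second):
    """Return (path, value) pairs to write; the caller performs the writes as root."""
    base = os.path.join(CGROUP_ROOT, group)
    limit = megabytes_per_second * 1024 * 1024  # the kernel expects bytes per second
    return [
        (os.path.join(base, "cgroup.procs"), str(pid)),
        (os.path.join(base, "blkio.throttle.write_bps_device"), device + " " + str(limit)),
    ]
# "slow", 5676 and "259:0" are illustrative values, not verified against the demo machine.
for path, value in throttle_plan("slow", 5676, "259:0", 15):
    print(f"echo '{value}' > {path}")
```
The device is given as MAJOR:MINOR, followed by the limit in bytes per second.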

20:17.960 --> 20:26.760
I give you my word, I tested this a hundred times, I got bored,

20:26.760 --> 20:48.760
five, five, six, eight, five, three,

20:56.840 --> 21:09.080
okay, the gods of demos are not with me today, okay, it's not going as I expected, I give you my word,

21:11.080 --> 21:19.320
no, no, I know, I have absolutely no clue why this will not work,

21:20.200 --> 21:29.880
no, the device, it's fine, I just see it right there, and, um, no, that was,

21:31.400 --> 21:39.800
wow, okay, I don't know what to say, truly, truly embarrassed,

21:40.360 --> 21:50.520
um, oh, you say that, let me try with my original cgroup and see if that works,

21:51.880 --> 22:09.240
okay, no, not here, uh, here, and if I do, so I added the PID and now I add the latency,

22:10.680 --> 22:21.800
wow, I can fully tell you it's how it works, it's how it should work, okay, and I don't,

22:21.800 --> 22:29.960
prlimit, let's see if this one works, one minute, okay, I'm so sorry,

22:29.960 --> 22:37.560
I truly don't know why it didn't work, like, but I can tell you they work, I tested a hundred times,

22:38.520 --> 22:45.720
and I guess I left something broken in my, all my previous testing, I'll be glad to sit down

22:46.520 --> 22:54.120
outside and show you how they actually do work, I'm truly sorry for the mishap,

22:55.880 --> 23:03.800
and I will, okay, I will also answer any questions you have, but I'm truly truly sorry

23:03.880 --> 23:11.720
that it didn't go as expected, I can share the slides, I don't have a,

23:16.840 --> 23:22.920
okay, I will make sure they are there, yeah, and I will also add the notes for all the examples,

23:22.920 --> 23:26.200
I am truly, truly apologetic, but

