WEBVTT

00:00.000 --> 00:16.120
Hello everybody, thank you very much for being here today. I am Dominik Tornow, I am the founder

00:16.120 --> 00:25.480
and CEO of Resonate, where we focus on a distributed async await programming model

00:25.480 --> 00:38.620
to make distributed applications dead simple. I am also the author of Think Distributed

00:38.620 --> 00:48.700
Systems, which is currently part of the Manning Publications Early Access Program. The book focuses

00:48.700 --> 00:56.420
on dependable mental models to empower you to reason confidently about distributed systems. I spend

00:56.420 --> 01:04.900
most of my time on distributed systems and distributed systems are known for their complexity.

01:04.900 --> 01:12.780
We all just know that distributed systems are really, really complex. But what, at its core,

01:12.780 --> 01:20.380
makes distributed systems so complex? There are two reasons, two fundamental sources

01:20.380 --> 01:29.460
of complexity. The first is non-deterministic partial order introduced by concurrency.

01:29.460 --> 01:36.660
In a concurrent system, we do not know what happens next. And the second is non-deterministic

01:36.660 --> 01:43.140
partial failure introduced by distribution. In a distributed system we simply do not know

01:43.140 --> 01:50.580
what will fail next. So as an aside, feel free to take pictures throughout the presentation,

01:50.580 --> 01:57.300
and I have marked some slides as an aha moment to highlight what I think are the most

01:57.300 --> 02:04.700
insightful slides of the presentation. This is one of my favorites. Now let's consider a simple

02:04.700 --> 02:13.740
very abstract example to develop an intuition. We have two processes, P and Q, each consisting

02:13.740 --> 02:22.700
of three atomic actions, three steps. So first let's introduce sequentiality. The defining

02:22.700 --> 02:30.680
characteristic of sequentiality is total order. A sequential system

02:30.680 --> 02:39.560
executes one step at a time in a predetermined sequence. So consequently the sequential

02:39.560 --> 02:47.800
composition of P and Q leads to exactly one possible execution. There is no room for alternative

02:47.800 --> 02:55.520
behavior. But then we introduce concurrency and the defining characteristic of concurrency

02:55.600 --> 03:04.160
is non-deterministic partial order. So consequently the concurrent composition of P and Q already

03:04.160 --> 03:12.960
leads to 20 distinct possible executions. And then lastly, let's introduce distribution,

03:14.000 --> 03:21.120
and the defining characteristic of distribution is non-deterministic partial failure. So consequently

03:21.200 --> 03:30.800
the distributed composition of P and Q leads to 69 distinct possible executions. And in a real

03:30.800 --> 03:38.800
system some of these executions are very common while others are exceedingly rare. So for example

03:38.800 --> 03:45.840
imagine a scenario where a server accepts a connection at the exact moment a configuration reload is triggered,

03:46.800 --> 03:54.480
a highly unlikely sequence of events. So now let's do a simple thought experiment given that we

03:54.480 --> 04:03.840
have 69 distinct traces. So let's assume 10% of these traces are the happy path. They occur 90%

04:03.840 --> 04:12.720
of the time. And 90% of these traces are edge cases. They only occur 10% of the time. It turns out

04:12.800 --> 04:20.240
if we non-deterministically pick a trace according to this probability distribution one thousand times

04:21.040 --> 04:32.720
the chances of missing at least one trace are almost 100%. So these rare traces provide the ideal

04:32.720 --> 04:40.800
breeding ground for the dreaded heisenbug. The term heisenbug is inspired by the Heisenberg

04:40.880 --> 04:49.040
uncertainty principle from quantum mechanics. Heisenbugs are elusive. The rarity of the traces

04:49.040 --> 04:56.240
that hide them makes them really hard to produce. And on the off chance we actually produce one,

04:57.920 --> 05:05.840
it's almost impossible to reproduce it. And now take a moment to imagine the behavior space

05:05.920 --> 05:12.480
of a real-world distributed system with dozens or even hundreds of processes, where each of these

05:12.480 --> 05:20.720
processes has dozens or hundreds of steps. So the number of possible executions grows combinatorially

05:20.720 --> 05:28.400
and that phenomenon is aptly called state space explosion. But concurrency and distribution

05:28.400 --> 05:35.760
they don't just expand the behavior space. They make reasoning about a system almost impossible.
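
NOTE
A minimal Python sketch of the numbers above, assuming the slide's model of two three-step processes: the 20 concurrent executions are the interleavings of P and Q, and a back-of-the-envelope calculation (approximating traces as independent) shows why 1,000 random runs will almost certainly miss at least one rare trace. The 69 distributed executions additionally count partial failures and are not enumerated here.

NOTE
from itertools import combinations
# Two processes P and Q, three atomic steps each. A concurrent
# execution is an interleaving that preserves each process's own
# order: choose which 3 of the 6 global slots belong to P.
interleavings = list(combinations(range(6), 3))
print(len(interleavings))                # 20, as on the slide
# Thought experiment: 69 traces; ~7 happy-path traces share 90% of
# the probability mass, ~62 edge-case traces share the other 10%.
p_edge = 0.10 / 62                       # one specific edge-case trace
p_missed = (1 - p_edge) ** 1000          # that trace missed in 1,000 runs
p_miss_any = 1 - (1 - p_missed) ** 62    # at least one trace missed (approx.)
print(p_miss_any)                        # ~0.999999, i.e. almost 100%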

05:36.400 --> 05:43.120
Concurrency and distribution introduce daunting uncertainty. And this uncertainty is the true

05:43.120 --> 05:51.520
challenge we face when we design distributed systems. So there is a direct correlation between

05:51.520 --> 05:58.480
certainty and confidence. When I am certain about my application, I am confident in my

05:58.480 --> 06:04.480
application's correctness. So making changes, adding new features, or even deploying to production

06:04.560 --> 06:13.920
on a Friday afternoon: no problem. With deterministic simulation testing, you have a key component

06:13.920 --> 06:22.560
to achieving this level of certainty and confidence by methodically exploring the behavior space

06:22.560 --> 06:29.200
scaling across space and time. So throughout the rest of this presentation we build a foundational

06:29.280 --> 06:34.480
understanding of deterministic simulation testing step by step. And then we build a dependable

06:34.480 --> 06:39.600
mental model so you can all decide with confidence if deterministic simulation testing is the

06:39.600 --> 06:46.800
right approach for you. So let's begin by grounding our discussion with a mental model of

06:46.800 --> 06:53.760
software systems. For our discussion we will think of a software system as being partitioned

06:53.760 --> 07:00.400
into two distinct entities: the environment and the application. That is the highest-level

07:00.400 --> 07:06.080
system model, one that is universally applicable. What's in? What's out? The environment represents

07:06.080 --> 07:12.720
everything that is not the object that we are developing and testing and the application represents

07:12.720 --> 07:19.840
everything that is the object that we are developing and testing. Now our goal is to test the

07:19.840 --> 07:28.080
application. To achieve this, the application under testing conditions must closely resemble

07:28.080 --> 07:34.880
the application under production conditions. Any deviation must be non-material, otherwise we

07:34.880 --> 07:42.240
risk testing something fundamentally different. However, our goal is not to test the environment.

07:42.240 --> 07:47.600
The environment is only a means to facilitate the test. So therefore the environment under testing

07:48.480 --> 07:53.520
conditions does not have to resemble the environment under production conditions at all. We are

07:53.520 --> 08:02.400
free to replace or completely mock any component in the environment as we see fit. Now the question is:

08:02.400 --> 08:08.240
how do we delineate between the environment and the application? There is no universal answer;

08:08.240 --> 08:13.840
where you delineate depends entirely on your requirements. So here we see an example of a web

08:13.840 --> 08:19.120
application composed of three distinct components: a client on the left, a web server in the

08:19.120 --> 08:27.280
center, and a database server on the right. A reasonable choice is to assign the client to the environment

08:27.280 --> 08:33.040
while treating the web server and the database server as part of the application. The focus is on testing

08:33.040 --> 08:40.000
the web server and database server in composition. But a reasonable alternative is to assign both

08:40.000 --> 08:45.040
the client and the database server to the environment and treat the web server as the application.

08:45.600 --> 08:53.600
The focus then is on testing the web server in isolation. Now that we have a mental model

08:53.600 --> 08:59.440
of software systems, when is a software system correct, and how do we establish the correctness

08:59.440 --> 09:06.240
of a software system? Correctness is not absolute; correctness is relative to a specification.

09:06.960 --> 09:13.120
For example, think of your parents. I bet most of you had, at least I had, one parent who was

09:13.120 --> 09:20.080
strict and one parent who was more lenient. So for the very same behavior, my behavior, one parent, usually

09:20.080 --> 09:27.440
my mother, would scold me for it, whereas my dad would cut me some slack. So the same

09:27.440 --> 09:34.720
implementation can be considered correct or incorrect depending entirely on the specification that we

09:34.800 --> 09:43.840
choose to apply. Now, as Leslie Lamport explores in his foundational work Specifying Systems,

09:43.840 --> 09:50.400
software systems can be understood as state machines. Right. A state machine is a remarkably

09:50.400 --> 09:58.160
simple but very powerful reasoning tool. A state machine consists of three components: a set of states,

09:58.880 --> 10:06.080
a set of initial states, usually just one, and a next-state relation that determines how we transition

10:06.080 --> 10:12.960
from one state to the next. When we execute a state machine, the state machine generates a trace,

10:13.680 --> 10:19.120
A trace is the sequence of states that represents the system's behavior over time.

10:20.480 --> 10:26.880
Traces are at the heart of how we reason about software systems. Each trace represents a unique

10:26.960 --> 10:34.880
execution of the system. Now, a state machine is non-deterministic if it presents a choice.

10:34.880 --> 10:42.160
Right. Or in more sinister terms, if it presents some ambiguity. So in this example from S1,

10:42.880 --> 10:51.040
we may transition to S2 or to S3. This is reflected in the set of possible traces given the

10:51.120 --> 10:58.080
initial state of S1. In this example from S1, we may generate two distinct traces,

10:58.080 --> 11:05.680
S1 to S2 or S1 to S3. So given the same initial state, we may witness different traces.

11:07.520 --> 11:11.680
A state machine is deterministic if the next-state relation is a function,

11:11.680 --> 11:16.880
that is, if the state machine does not present any choice. In a deterministic state machine,

11:16.880 --> 11:25.200
there is no ambiguity, just a single predictable path forward. And this is also reflected in the

11:25.200 --> 11:35.200
possible traces given the initial state of S1. So in this example, from S1, there is only one

11:35.200 --> 11:41.200
trace that we can possibly generate, that is, from S1 to S2 to S3. So given the same initial state,

11:41.760 --> 11:44.640
we will always witness the same trace.
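
NOTE
A tiny Python sketch of the vocabulary just introduced, using only what the talk defines: a set-valued next-state relation presents a choice (non-deterministic), a singleton-valued relation, that is, a function, presents none (deterministic), and a trace is the sequence of states an execution visits.

NOTE
# Non-deterministic: from S1 we may go to S2 or to S3.
relation = {"S1": {"S2", "S3"}}
# Deterministic: singleton successor sets, i.e. a function.
function = {"S1": {"S2"}, "S2": {"S3"}}
def traces(next_states, state):
    """All traces (sequences of states) starting from `state`."""
    successors = next_states.get(state, set())
    if not successors:
        return [[state]]
    return [[state] + t for s in sorted(successors)
                        for t in traces(next_states, s)]
print(traces(relation, "S1"))   # [['S1', 'S2'], ['S1', 'S3']]
print(traces(function, "S1"))   # [['S1', 'S2', 'S3']], a single trace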

11:48.640 --> 11:53.600
Or coming back to the specification and implementation relationship that we discussed earlier,

11:53.600 --> 12:00.720
in effect, a specification is a set of traces. These traces represent all the valid behaviors

12:00.720 --> 12:09.840
allowed by the specification. So an implementation is correct with respect to a specification

12:09.840 --> 12:16.160
if every possible trace of the implementation is also a trace of the specification.

12:17.840 --> 12:22.640
And consequently, an implementation is incorrect with respect to the specification

12:22.640 --> 12:29.440
if some traces of the implementation are not traces of the specification.
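
NOTE
In code, the correctness relation just stated is plain set inclusion; a sketch with hypothetical traces, modeled as tuples of states.

NOTE
spec_traces = {("S1", "S2"), ("S1", "S3")}   # all behaviors the spec allows
impl_traces = {("S1", "S2")}                 # behaviors the implementation exhibits
print(impl_traces <= spec_traces)            # True: correct w.r.t. the spec
print(impl_traces - spec_traces)             # set(): no violating traces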

12:30.160 --> 12:36.480
So how can we establish correctness? There are two possibilities. There is software verification

12:36.560 --> 12:44.480
and there is software testing. Software verification aims to prove that all traces of the

12:44.480 --> 12:50.960
implementation are within the specification. Software testing, on the other hand,

12:50.960 --> 12:57.040
aims to find some traces outside of the specification.

12:57.280 --> 13:05.600
So in summary: if the verification asserts the system is correct,

13:07.440 --> 13:12.960
the verification asserts there is no correctness violation, that no correctness violation

13:12.960 --> 13:20.160
will ever occur. If the test asserts the system is correct, the test merely asserts

13:20.160 --> 13:25.360
that no correctness violation has yet occurred. We may find one later.

13:26.080 --> 13:33.760
So testing is essentially a search problem, where the goal is to find traces of the

13:33.760 --> 13:40.560
implementation that violate the specification, that violate correctness.

13:44.080 --> 13:49.600
Now how can we increase the confidence in a system's correctness?

13:50.240 --> 14:01.280
By increasing test coverage. And test coverage has two key dimensions: space and time.

14:02.720 --> 14:12.240
So to scale across space, we extend our test by extending the number of distinct traces

14:12.960 --> 14:23.040
covered by the test. And to scale across time, we extend our test by increasing the length

14:24.080 --> 14:33.680
of distinct traces. However, extending the number of distinct traces is difficult.

14:34.640 --> 14:41.520
Remember our initial example and our thought experiment: randomly stimulating the system

14:42.240 --> 14:49.840
will just produce more of the same. If we run the test a thousand times,

14:50.960 --> 14:58.000
about 900 times, we will just test the happy path. And we will omit a lot of the edge cases

14:58.000 --> 15:05.280
that are the ones that hide the bugs, especially the heisenbugs. Now additionally,

15:05.280 --> 15:14.800
extending the length of distinct traces is also difficult. Many systems have actual timing

15:14.800 --> 15:23.200
constraints. There are timeouts around crash detection, for example. There are timeouts around

15:23.200 --> 15:31.200
message loss detection. We allow the other side to respond within 10 seconds, within 20 seconds.

15:31.680 --> 15:40.720
Or, for example, the time intervals driving rate-limiting algorithms. There are time intervals of

15:40.720 --> 15:49.200
1 minute, 5 minutes, 10 minutes. So increasing the length of a trace increases actual testing time.

15:53.200 --> 15:59.680
Now deterministic simulation testing enables you to scale across space.

16:01.280 --> 16:08.800
By controlling the scheduling and the failure injection, and therefore increasing the number

16:08.800 --> 16:18.720
of distinct traces. Instead of blindly picking from the happy path, we are intentionally picking

16:19.360 --> 16:28.640
from the edge cases. And on top of that, deterministic simulation testing enables you to scale

16:28.720 --> 16:36.160
across time by virtualizing time. And therefore when we increase the length of the trace,

16:36.160 --> 16:43.440
we do not actually increase the wall-clock time it takes to run the trace.
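
NOTE
A minimal sketch of time virtualization, assuming a single-threaded simulator; the names are illustrative and not taken from any system mentioned in the talk. Instead of sleeping, the event loop keeps a virtual clock and jumps straight to the next due timer, so the 10-second, 1-minute, or 10-minute intervals above cost essentially no wall-clock time to test.

NOTE
import heapq
class SimClock:
    """Event loop with virtual time: waiting costs no wall-clock time."""
    def __init__(self):
        self.now = 0.0      # virtual seconds since the simulation started
        self.timers = []    # heap of (deadline, seq, callback)
        self.seq = 0
    def call_later(self, delay, callback):
        heapq.heappush(self.timers, (self.now + delay, self.seq, callback))
        self.seq += 1
    def run(self):
        while self.timers:
            deadline, _, callback = heapq.heappop(self.timers)
            self.now = deadline     # jump straight to the next deadline
            callback()
clock = SimClock()
clock.call_later(600.0, lambda: print(f"timeout fired at t={clock.now}s"))
clock.run()   # the 10-minute timeout completes instantly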

16:45.280 --> 16:53.440
And lastly, and this is one of the key achievements of deterministic simulation testing, and also

16:53.760 --> 17:00.720
its defining characteristic: deterministic simulation testing enables you to reproduce a trace.

17:01.280 --> 17:07.920
By ensuring that for every single starting state, there is only one possible trace

17:07.920 --> 17:15.440
through the system. So once you find a bug, you can reproduce that trace by putting the system in

17:15.440 --> 17:34.000
the same starting state. Now, to be methodical and to ensure reproducibility, we construct the system,

17:34.000 --> 17:40.080
that is, we construct both the environment and the application, so that the behavior of the system

17:40.080 --> 17:48.560
only depends on the initial state of the environment and the initial state of the application.
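
NOTE
A sketch of the construction just described, under the common assumption that every otherwise non-deterministic decision is drawn from a single seeded PRNG, so the seed is the initial state: same seed in, same trace out.

NOTE
import random
def simulate(seed):
    """One run whose behavior depends only on its initial state, the seed:
    scheduling and failure injection both draw from the same seeded PRNG."""
    rng = random.Random(seed)
    trace = []
    for step in range(5):
        actor = rng.choice(["P", "Q"])    # scheduling decision
        failed = rng.random() < 0.2       # failure injection
        trace.append((step, actor, "crash" if failed else "ok"))
    return trace
assert simulate(42) == simulate(42)   # same starting state, same trace
# A failing run can be reported as its seed; replaying the seed
# replays the exact trace that exposed the bug.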

17:52.640 --> 18:02.000
Now the core idea is to substitute the environment with a simulator. Remember, we are allowed to replace

18:02.000 --> 18:08.720
any component of the environment as we see fit. Deterministic simulation testing repeatedly

18:08.800 --> 18:17.200
executes an application in a simulated environment under changing initial conditions, monitoring

18:17.200 --> 18:23.920
that the correctness constraints are maintained across all executions. And in practice,

18:23.920 --> 18:29.040
that means constructing the system to control otherwise non-deterministic factors.

18:29.680 --> 18:36.640
Like for example, incoming requests, but also crashes, message losses, and of course time.
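
NOTE
A sketch of the substitution idea with hypothetical interfaces; none of these names come from the talk. The application is written against an environment abstraction, and the simulated implementation controls the otherwise non-deterministic factors, incoming requests, message loss, and time, from one seeded source.

NOTE
import random
class Environment:
    """Everything the application touches goes through this interface."""
    def now(self): raise NotImplementedError                  # time
    def send(self, node, message): raise NotImplementedError  # network
class SimulatedEnvironment(Environment):
    """The simulator: controls time, message loss, and incoming requests."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.clock = 0.0
        self.inboxes = {}
    def now(self):
        return self.clock                   # virtual, not wall-clock, time
    def send(self, node, message):
        if self.rng.random() < 0.1:         # injected message loss
            return
        self.inboxes.setdefault(node, []).append(message)
    def next_request(self):                 # simulator-driven input
        return self.rng.choice(["GET /a", "GET /b", "POST /c"])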

18:38.960 --> 18:44.480
So on a high level, when implementing this simulator, we have to think about

18:44.480 --> 18:51.920
two core interactions: requests from the environment to the application, and requests from the

18:51.920 --> 19:00.320
application to the environment. The first represents the application's input and output. Think, for example, of

19:00.320 --> 19:09.840
the web client in our initial example. And the second represents the application's side effects,

19:09.840 --> 19:15.600
like talking to the network or reading and writing to the file system. Here, think of the

19:17.280 --> 19:25.360
database server in the initial example. Now, by controlling scheduling and failure injection in the

19:25.360 --> 19:32.480
simulated environment, given a deterministic application, we can deterministically generate

19:32.480 --> 19:42.960
an arbitrary number of distinct traces of an arbitrary length. Now, who is doing deterministic

19:42.960 --> 19:51.360
simulation testing today? Deterministic simulation testing is still fairly new and it is also

19:51.440 --> 19:57.280
still fairly involved, because making sure that your application itself is deterministic and

19:57.280 --> 20:05.040
then building a deterministic environment for it is quite an investment. But a few companies

20:05.600 --> 20:10.480
have adopted deterministic simulation testing. So, for example, FoundationDB, which is widely

20:10.560 --> 20:20.240
regarded as the trailblazer in deterministic simulation testing. TigerBeetle, widely

20:20.240 --> 20:30.240
regarded as the one that spread the word about deterministic simulation testing. And then we have

20:30.240 --> 20:36.240
others like, for example, Polar Signals, Turso, or my company Resonate. Each project, though, took

20:36.400 --> 20:42.160
a different approach in designing and developing its application and designing and developing its

20:42.160 --> 20:49.760
simulator. So there is very little common ground. There are no ready-made libraries.

20:50.960 --> 20:57.840
But additionally, there is Antithesis. That is the first company that offers deterministic simulation

20:57.920 --> 21:07.440
testing as a service by implementing their own deterministic hypervisor. Now, if you're curious

21:07.440 --> 21:13.360
about deterministic simulation testing and would like to dive deeper into deterministic simulation

21:13.360 --> 21:21.600
testing, the nuances, and also how to structure a simulator: Joran, the CEO of TigerBeetle, and

21:21.600 --> 21:31.520
I will host a webinar this Thursday discussing deterministic simulation testing in depth. So if you're

21:31.520 --> 21:41.120
interested, please join us. And so as a conclusion, deterministic simulation testing requires a

21:41.120 --> 21:48.880
real investment. However, the payoff, based on personal experience, is significant.

21:49.760 --> 21:55.760
Deterministic simulation testing enables you to methodically test the behavior space of your system

21:56.480 --> 22:05.600
by scaling across space and time, making tests and traces reproducible. So then you get to

22:05.600 --> 22:11.840
squash the heisenbugs. And with that, thank you very much for

22:12.640 --> 22:18.320
attending my talk. I just want to let you know that Manning Publications offers a discount on my

22:18.880 --> 22:25.840
book for a limited time, if you are interested. And once again, thank you very much, and I

22:25.840 --> 22:31.840
hope you enjoy the rest of the conference.

22:39.840 --> 22:41.200
Yes, please.

22:41.280 --> 22:47.840
So at some point in your conclusion, how does this relate to Lamport's work on TLA+?

22:47.840 --> 22:57.040
Yes, it actually connects with model checking. That is a very nice question. So

22:58.080 --> 23:03.520
the question, for anybody who couldn't hear, was about TLA+, the temporal logic of actions

23:03.520 --> 23:08.160
that is discussed in the book that I referred to, Specifying Systems. And

23:09.040 --> 23:16.000
it's a language to specify systems and it comes with a tool for model checking.

23:16.640 --> 23:23.840
So it can also methodically test the behavior space of your model. But that is where the difference

23:23.840 --> 23:30.400
lies. In TLA+, you specify your system, but it is a model. It's not the implementation of your

23:30.400 --> 23:36.640
system. So there you model check, and that's basically a deterministic simulation test: you model

23:36.960 --> 23:42.160
check the model. With deterministic simulation testing, you actually model check, if you

23:42.160 --> 23:51.360
will, the implementation. So that is where I think one can draw a good

23:51.360 --> 23:56.400
line, a good mental model for that. Yes, please.

24:06.880 --> 24:18.800
Mm-hmm. So for model checking, I am not an expert in implementing model checkers,

24:18.800 --> 24:24.080
but for model checking, as far as I know, they have a very methodical exploration of the state space.

24:24.080 --> 24:28.560
And on top of that, what they can do is basically calculate what's

24:28.560 --> 24:34.800
I believe called symmetry sets, where this is basically equivalent to that, therefore I can completely

24:34.960 --> 24:44.640
skip it. So that is something that, for example, for the simulator that we implement, we do not

24:44.640 --> 24:53.360
actually implement a search algorithm. Instead, what we fall back on is basically

24:53.360 --> 25:00.560
the small world slash small data hypothesis: that the most interesting side effects come about

25:00.560 --> 25:06.080
when you have a lot of need for coordination and contention in your system. So therefore, instead of

25:06.080 --> 25:12.320
firing off 10 requests over a thousand accounts, for example, we fire off

25:12.320 --> 25:17.520
a thousand requests over 10 accounts, in order to make sure that there is a lot of conflict in the

25:17.520 --> 25:25.680
system. And with that, we were able to find a lot of the really interesting bugs, even without a

25:25.760 --> 25:32.720
guided search. But I know that Antithesis, though I'm not an expert in Antithesis,

25:32.720 --> 25:39.040
Antithesis has a few APIs that you can include in your application to actually guide their search.

25:39.760 --> 25:47.360
So with that, you can more methodically find the interesting traces.
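
NOTE
A sketch of the small-world bias described in the answer, with illustrative names: concentrate many operations on few keys so that conflicting interleavings, the interesting traces, become common.

NOTE
import random
rng = random.Random(7)
accounts = range(10)                  # few accounts ...
requests = [("transfer", rng.choice(accounts), rng.choice(accounts), 1)
            for _ in range(1000)]     # ... many requests: heavy contention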

25:49.600 --> 25:50.240
Thanks a lot.

25:50.240 --> 25:50.960
Thank you.

