WEBVTT

00:00.000 --> 00:25.000
All right.

00:25.000 --> 00:54.000
All right, folks.

00:54.000 --> 00:58.000
I don't have a lot of time today.

00:58.000 --> 01:03.000
Initially, I thought this was going to be 45 minutes, but it turns out it's only 25.

01:03.000 --> 01:08.000
And so I tried to cram in a lot of the technical details.

01:08.000 --> 01:11.000
And I might be missing a lot of things.

01:11.000 --> 01:13.000
So I'm rushing here.

01:13.000 --> 01:18.000
If there's any question, feel free to ask me in the hallway after,

01:18.000 --> 01:20.000
In case there's no time.

01:20.000 --> 01:23.000
First of all, about me, my name is Sun.

01:23.000 --> 01:25.000
You just call me Sun, that's fine.

01:25.000 --> 01:27.000
I'm from Vietnam originally.

01:27.000 --> 01:29.000
I'm currently based in the Netherlands.

01:29.000 --> 01:31.000
I have been living there for the last five years.

01:31.000 --> 01:34.000
I contribute to GitLab and Gitaly.

01:34.000 --> 01:41.000
I'm quite a version control nerd, and I'm into build tools and developer tooling in general.

01:41.000 --> 01:49.000
If there's anything you like about version control, build tools, or developer productivity, come talk to me.

01:49.000 --> 01:55.000
I'm currently working as a solutions engineer; the company with the logo right there is paying for my trip.

01:55.000 --> 01:58.000
So there's that.

01:58.000 --> 02:04.000
And in the past, I worked for Booking.com and Lazada, part of Alibaba.

02:04.000 --> 02:06.000
Some excuses.

02:06.000 --> 02:11.000
I've recently had to travel to six, seven cities in the last week.

02:11.000 --> 02:13.000
I'm extremely jet-lagged.

02:13.000 --> 02:16.000
These slides were finished at six a.m. this morning.

02:16.000 --> 02:18.000
So there may be a lot of mistakes.

02:18.000 --> 02:20.000
Please contact me.

02:20.000 --> 02:24.000
If you spot anything or if you have any questions.

02:24.000 --> 02:28.000
I can be reached on various different channels.

02:28.000 --> 02:31.000
And yeah, what are we talking about today?

02:31.000 --> 02:34.000
We're talking about, first of all, CI challenges at scale.

02:34.000 --> 02:36.000
I've been working for big companies.

02:36.000 --> 02:39.000
Recently, I've been talking to a lot of big customers.

02:39.000 --> 02:45.000
Customers who are among the top 10 biggest companies in the world, even.

02:45.000 --> 02:50.000
And yeah, CI is definitely not a solved problem at scale.

02:50.000 --> 02:52.000
There are a lot of pain points.

02:52.000 --> 02:54.000
And that's what we are talking about today.

02:54.000 --> 03:00.000
One of the solutions coming out of all those challenges is artifact-first and hermetic builds,

03:00.000 --> 03:05.000
using those modern, newer build tools.

03:05.000 --> 03:07.000
And I would be touching on that.

03:07.000 --> 03:10.000
And using that as a background,

03:10.000 --> 03:13.000
we can talk about how distributed build systems often work,

03:13.000 --> 03:19.000
and the set of Remote Execution APIs, which is the open source project I'm here to talk about.

03:19.000 --> 03:21.000
And yeah, that's open source.

03:21.000 --> 03:27.000
Even though I'm working for a startup built on all these APIs, the API itself is open source;

03:27.000 --> 03:31.000
you are free to use it.

03:31.000 --> 03:33.000
Okay, let's define scale.

03:33.000 --> 03:34.000
What is scale?

03:34.000 --> 03:38.000
First of all, a lot of employees, right?

03:38.000 --> 03:42.000
Whether you're a 90-person small company or

03:42.000 --> 03:47.000
you're dealing with 20,000 engineers, right?

03:47.000 --> 03:53.000
The slower your build, the more time your employees are going to spend waiting.

03:53.000 --> 03:55.000
And those are working hours.

03:55.000 --> 03:57.000
You're paying salary on top of that.

03:57.000 --> 03:59.000
That's costing you money.

03:59.000 --> 04:01.000
That's a problem, right?

04:01.000 --> 04:06.000
A slow build is definitely a problem when your company has a lot of employees.

04:06.000 --> 04:11.000
Having a big-code issue, that's not just a synonym for monorepos.

04:11.000 --> 04:15.000
Even though with monorepos, you see that problem happen a lot more often,

04:15.000 --> 04:20.000
because the tooling hits its limits a lot faster when you use a monorepo.

04:20.000 --> 04:23.000
But it's also a problem for multi-repo setups as well.

04:23.000 --> 04:29.000
For example, one of my previous employers was Alibaba, with tens of thousands of engineers

04:29.000 --> 04:36.000
contributing to multiple, sometimes hundreds of repos at a time for certain projects.

04:36.000 --> 04:39.000
And it's slow, right?

04:39.000 --> 04:42.000
Your build is going to be slow if you have a lot of code.

04:42.000 --> 04:44.000
That's just very obvious.

04:44.000 --> 04:51.000
Number three might be not so obvious, but depending on the business that your company is in,

04:51.000 --> 04:55.000
you might have different requirements for risk tolerance.

04:55.000 --> 05:01.000
For example, if you're making robotic applications for hospitals, right?

05:01.000 --> 05:08.000
You might have slightly lower risk tolerance compared to an e-commerce company,

05:08.000 --> 05:13.000
because one mistake could cost a human life, compared to just, you know,

05:13.000 --> 05:15.000
a couple hundred thousand dollars.

05:15.000 --> 05:21.000
So, yeah, it depends on your risk tolerance, and as the stakes increase, right?

05:21.000 --> 05:25.000
For example, you're making more money over time, right?

05:25.000 --> 05:28.000
You get more revenue.

05:28.000 --> 05:33.000
One mistake could cost you a lot more money, or if you're in the business of, for example,

05:33.000 --> 05:37.000
self-driving vehicles, hospital equipment, or even aerospace, right?

05:37.000 --> 05:41.000
Those can be quite costly as well, but in terms of human lives.

05:41.000 --> 05:46.000
So to improve on that, you have to do a lot more testing, security testing,

05:46.000 --> 05:47.000
shifting left.

05:47.000 --> 05:54.000
That means you have to do a lot more testing, and all of that increases the number of tests,

05:54.000 --> 05:57.000
meaning that you have a compute problem, right?

05:58.000 --> 06:01.000
More tests equals more compute.

06:01.000 --> 06:09.000
You can see on this scale: you can start off with a very simple pipeline that has a single job

06:09.000 --> 06:13.000
that would do you just fine, initially.

06:13.000 --> 06:18.000
But over time, you're going to see yourself evolve into a multi-stage pipeline,

06:18.000 --> 06:24.000
because, you know, you have more compute, you have to distribute your compute into multiple steps

06:24.000 --> 06:28.000
which is what a very traditional CI system deals with.

06:28.000 --> 06:35.000
But even after you've distributed it, you still have to pay the bill for that compute, right?

06:35.000 --> 06:41.000
And how do you save money when you have a distributed compute problem?

06:41.000 --> 06:43.000
You cache, right?

06:43.000 --> 06:49.000
Selective testing, caching. And how do you cache things accurately?

06:49.000 --> 06:57.000
And how do you select which node, which step to run, which step not to run?

06:57.000 --> 06:58.000
Right?

06:58.000 --> 07:03.000
Eventually, that usually comes down to defining some conditions, right?

07:03.000 --> 07:05.000
When to run, when not to run.

07:05.000 --> 07:09.000
For example, if I changed this external dependency, then run my build,

07:09.000 --> 07:12.000
but if I didn't change it, then don't run it, right?

07:12.000 --> 07:18.000
And all of those conditionals will often be encoded into your repository, your code, your CI pipeline,

07:18.000 --> 07:23.000
and the final state is a graph, right?

07:23.000 --> 07:31.000
This is usually a DAG; I have seen more than a DAG, but a DAG is usually reasonable enough.
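To make the selective-execution idea concrete, here is a minimal Python sketch; the graph, the file names, and the `affected` helper are all made up for illustration, not taken from any real tool:

```python
# Hypothetical build DAG: node -> direct dependencies.
deps = {
    "lib_a": [],
    "lib_b": [],
    "app": ["lib_a", "lib_b"],
    "app_tests": ["app"],
    "docs": [],
}
# Which node consumes which source file (also made up).
owners = {"a.c": "lib_a", "b.c": "lib_b", "README.md": "docs"}

def affected(changed_files):
    """Return every node that transitively depends on a changed file."""
    dirty = {owners[f] for f in changed_files if f in owners}
    grew = True
    while grew:  # propagate dirtiness up the graph until a fixed point
        grew = False
        for node, ds in deps.items():
            if node not in dirty and any(d in dirty for d in ds):
                dirty.add(node)
                grew = True
    return dirty

print(sorted(affected({"a.c"})))  # ['app', 'app_tests', 'lib_a']
```

Everything outside that set (here, `lib_b` and `docs`) can be skipped or served straight from cache.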

07:31.000 --> 07:37.000
And overall, with this, you know, your compute keeps growing and growing over time.

07:37.000 --> 07:41.000
So, yeah, finally, caching, right?

07:41.000 --> 07:45.000
But when you have a cache, you have the cache invalidation problem, right?

07:45.000 --> 07:48.000
Cache invalidation is really hard.

07:48.000 --> 07:56.000
It's just one of the toughest kinds of problems that you have in computer science in general,

07:56.000 --> 08:03.000
right up there with naming, of course, but we don't talk about naming here.

08:03.000 --> 08:06.000
You have a diff coming in, right?

08:06.000 --> 08:09.000
Preferably, you would want to know that,

08:09.000 --> 08:14.000
hey, how does this affect my build graph over here, right?

08:14.000 --> 08:18.000
Some of the nodes that are not affected should not be run, right?

08:18.000 --> 08:24.000
You can either eliminate them by caching and reusing the previous results,

08:24.000 --> 08:30.000
or you have some sort of smart querying engine so you can filter out the execution

08:30.000 --> 08:36.000
and only execute the nodes that are relevant to your build.

08:36.000 --> 08:44.000
But even with caching at scale, you will still have a compute problem, right?

08:44.000 --> 08:48.000
You know, a company might have a lot of users.

08:48.000 --> 08:54.000
Actually, your users could be running tests on different base commits, right?

08:54.000 --> 09:02.000
For example, I could be testing on top of a base commit that was on top of master just five minutes ago,

09:02.000 --> 09:10.000
but my coworker could create their branch from a base commit that is a while old, right?

09:10.000 --> 09:15.000
With a different base, you have a different graph, a different build graph underneath, right?

09:15.000 --> 09:25.000
And with that, it means that the total number of graphs that your build system needs to compute increases over time, right?

09:25.000 --> 09:31.000
All these nodes cannot be deduplicated, and it would be nice to be able to deduplicate them somehow, right?

09:31.000 --> 09:36.000
Otherwise, you are just throwing your money away on this compute problem.

09:36.000 --> 09:45.000
And yeah, we are a CI provider, and even with customers with 90% cache hit rates,

09:45.000 --> 09:52.000
we still see a huge amount of compute required to serve the CI system.

09:52.000 --> 09:56.000
So yeah, to recap, this is the definition of scale.

09:56.000 --> 10:02.000
This is what we'll try to address with the build-tool solution today.

10:02.000 --> 10:04.000
First of all, you have the big-code problem.

10:04.000 --> 10:11.000
Then, when you try to address the compute needs that big code introduces,

10:11.000 --> 10:14.000
you're going to run into the cache invalidation problem, right?

10:14.000 --> 10:19.000
And the cache invalidation problem, you want to solve with dependency tracking,

10:19.000 --> 10:22.000
by creating a DAG in general.

10:22.000 --> 10:28.000
And even at the end of the day, with caching and dependency tracking in place,

10:28.000 --> 10:34.000
you will still run into a fundamental compute problem that is really hard to solve.

10:34.000 --> 10:39.000
The solution I'm presenting today doesn't claim to solve all of these, right?

10:39.000 --> 10:43.000
But it does a really good job at solving a lot of these.

10:43.000 --> 10:51.000
Even at the biggest scale that we see today, with the top companies in the world.

10:51.000 --> 10:56.000
So this is the solution I'm talking about, which is modern build tools, right?

10:56.000 --> 11:02.000
I mean, what's better at constructing a graph than build tools?

11:02.000 --> 11:08.000
Even GitHub Actions today is just a glorified distributed build tool, if you think about it.

11:08.000 --> 11:18.000
Because at the end of the day, all it does is take your source code, construct a graph, and then execute the graph, right?

11:18.000 --> 11:24.000
Here I'm talking specifically about Bazel, which originated from Google;

11:24.000 --> 11:30.000
Buck originated from Meta, Facebook; and Pants originated from Twitter, right?

11:30.000 --> 11:37.000
There are several other build tools that are really nice; they are getting there, but not quite.

11:37.000 --> 11:47.000
The reason why we like Bazel, Buck, and Pants the most is that, first of all, they were among the first who were able to get a lot of the caching story right,

11:47.000 --> 11:59.000
a lot of the distributed build, correctly. And finally, with Bazel and Buck, they are very mature in terms of the build telemetry that is built into the tool, right?

11:59.000 --> 12:09.000
And you kind of need that at scale, but I'm not going to talk about build telemetry today because I needed to remove all that to fit in the 25 minutes.

12:09.000 --> 12:14.000
Anyway, so what is an artifact-first build tool?

12:15.000 --> 12:21.000
So when you think of a build tool, you often think of Makefiles, I would assume, right?

12:21.000 --> 12:29.000
And with a Makefile, if you squint hard enough, you can think of it as an artifact-first build tool, but it's not quite.

12:29.000 --> 12:34.000
And here I want to try to highlight the main differences.

12:34.000 --> 12:50.000
And I think the main difference is that, when you think about it, the number one goal of an artifact-first build tool is that it's there to produce the artifacts, right?

12:50.000 --> 12:53.000
To run the tasks that produce the artifacts.

12:53.000 --> 13:04.000
So if the artifact that you have on disk already exists, then the task should not need to be rerun.

13:04.000 --> 13:17.000
And that definition provides you with a fundamental way of caching things in your build system, right?

13:17.000 --> 13:29.000
And I had another slide to talk about non-artifact-first build tools later, but the fundamental thing is that if the artifact does exist, then don't run the task, right?

13:29.000 --> 13:32.000
That saves you a lot of compute.

13:32.000 --> 13:40.000
But yeah, in terms of side effects: for example, a lot of people use Makefiles to do releases, right?

13:40.000 --> 13:43.000
A release doesn't really produce any artifacts, right?

13:43.000 --> 13:48.000
It just calls an API and gets it over with, or a deployment.

13:48.000 --> 13:51.000
Those are what we often call side effects.

13:51.000 --> 13:57.000
And with an artifact-first build tool, because the artifact-first build tool does not handle these side effects,

13:57.000 --> 14:00.000
these side effects will be handled by a separate system.

14:00.000 --> 14:08.000
So with a setup that uses an artifact-first build tool, you have a segregation of concerns, right?

14:08.000 --> 14:12.000
The artifact-first build tool is there to help you produce the artifacts,

14:12.000 --> 14:16.000
and then you use something else to ship that artifact to production.

14:16.000 --> 14:20.000
And both of them can be quite mature solutions.

14:20.000 --> 14:26.000
But here today, we are mostly focusing on the build tool part.

14:26.000 --> 14:37.000
Inside an artifact-first build tool, for all the build tools that I listed earlier, which are Bazel, Buck, and Pants, you have actions, right?

14:37.000 --> 14:45.000
The action definition here is that it's a transformation from one set of inputs to a set of outputs.

14:45.000 --> 14:47.000
What can the input be?

14:47.000 --> 14:49.000
The input can be your source files.

14:49.000 --> 14:51.000
It can also be your tool, right?

14:51.000 --> 14:54.000
Your compiler, your linker.

14:54.000 --> 14:59.000
It can be the command that you use to invoke that tool.

14:59.000 --> 15:03.000
For example, you can say: GCC, go compile me this source file.

15:03.000 --> 15:06.000
That's a command, that's also an input.

15:06.000 --> 15:13.000
The environment variables going into the process that runs this action can also be an input.

15:13.000 --> 15:19.000
The platform information, the target CPU, et cetera, can also be part of the inputs.

15:19.000 --> 15:23.000
And all of these can create some outputs for you.

15:23.000 --> 15:30.000
For example, outputs can be the standard out and standard error of the action, and the build artifacts, like directories and files.

15:30.000 --> 15:33.000
All of these will get hashed, right?

15:34.000 --> 15:36.000
They will get hashed down to digests.

15:36.000 --> 15:41.000
A digest is a combination of a hash, a strong hash.

15:41.000 --> 15:44.000
It can be SHA-256 or BLAKE3.

15:44.000 --> 15:50.000
If you have really big blobs, you would prefer using BLAKE3 instead; and the size of the artifact.
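As a sketch, a digest of that kind can be computed like this in Python; SHA-256 comes from the standard library, while BLAKE3 would need a third-party package, so it is omitted here:

```python
import hashlib

def digest(blob: bytes) -> tuple[str, int]:
    """A (hash, size) pair in the spirit of a build tool's blob digest."""
    return hashlib.sha256(blob).hexdigest(), len(blob)

content = b"int main() { return 0; }\n"
d = digest(content)
# The size travels with the hash, so a scheduler can reason about how
# big an action's inputs and outputs are without fetching the blobs.
```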

15:50.000 --> 15:53.000
Why do you want to have the size here?

15:53.000 --> 15:55.000
In Git, there's no size.

15:55.000 --> 16:02.000
Well, the size helps you predict the size of the action overall, right?

16:02.000 --> 16:09.000
So later on, if you want to run this action remotely on a system that is not your own computer,

16:09.000 --> 16:15.000
these sizes will help your scheduler decide where to put the task.

16:15.000 --> 16:16.000
Right?

16:16.000 --> 16:23.000
So, all of these get hashed; all the source files will be put into different directories.

16:23.000 --> 16:26.000
Those directories are also going to get hashed, right?

16:26.000 --> 16:31.000
And all of that will become a Merkle tree.
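A toy version of that Merkle hashing might look like this in Python, with files as plain bytes and directories as dicts; this shows the idea only, not the exact wire encoding any real build tool uses:

```python
import hashlib

def tree_hash(node) -> str:
    """Hash a file (bytes) by content, a directory (dict) by its sorted
    listing of (name, child hash) pairs."""
    if isinstance(node, bytes):
        return hashlib.sha256(node).hexdigest()
    listing = "".join(
        f"{name}:{tree_hash(child)};" for name, child in sorted(node.items())
    )
    return hashlib.sha256(listing.encode()).hexdigest()

src = {"main.c": b"int main(){}", "util": {"util.c": b"void f(){}"}}
before = tree_hash(src)
src["util"]["util.c"] = b"void f(){ /* changed */ }"
after = tree_hash(src)
assert before != after  # one changed leaf changes the root hash
```

Changing one leaf changes its file hash, its parent directory's hash, and so on up to the root, which is what makes invalidation accurate.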

16:31.000 --> 16:33.000
And what does that do?

16:33.000 --> 16:38.000
First of all, it lets you cache stuff in the remote cache really well, right?

16:38.000 --> 16:41.000
Everything gets hashed, and it gets stored, oops, sorry.

16:41.000 --> 16:44.000
It gets stored in a content-addressable store.

16:44.000 --> 16:47.000
So everything gets deduplicated really well.

16:47.000 --> 16:51.000
But on top of that, you can apply hermeticity to your action.

16:51.000 --> 16:57.000
Hermeticity essentially says that, in the presence of a sandbox, your action is reproducible, right?

16:57.000 --> 17:06.000
What goes in is completely under control, and you can reproduce it reliably, right?

17:06.000 --> 17:12.000
And with the source identity that I just mentioned, changes can be accurately identified,

17:12.000 --> 17:17.000
because everything is computed into a Merkle tree, and if you invalidate one source file,

17:17.000 --> 17:24.000
if you change one source file, the hash of that file will change, the hash of the entire action will change,

17:24.000 --> 17:31.000
and the action can be invalidated, and the cache can be invalidated correctly.

17:31.000 --> 17:37.000
I'm rushing here a bit, because I just got a sign that I only have a few minutes left.

17:37.000 --> 17:41.000
But yeah, this is how you cache the action.

17:41.000 --> 17:45.000
You put all the input files, the command, the platform into one file.

17:45.000 --> 17:49.000
You hash that file; that's the final node of your Merkle tree.

17:49.000 --> 17:56.000
You use that as a cache key, and then the cache value is your outputs, right?

17:56.000 --> 18:00.000
And that would go into the action cache entry.
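A dictionary-based sketch of that action cache, with a made-up key derivation; real tools hash structured Protobuf messages rather than JSON, and the input-root digest here is a placeholder:

```python
import hashlib
import json

def action_key(input_root: str, command: list, env: dict, platform: dict) -> str:
    """Hash everything that defines the action into one cache key."""
    payload = json.dumps(
        {"inputs": input_root, "cmd": command, "env": env, "platform": platform},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

action_cache = {}  # cache key -> outputs (stdout, output digests, ...)

def run_action(key, execute):
    """Skip execution entirely on a cache hit."""
    if key not in action_cache:
        action_cache[key] = execute()
    return action_cache[key]

key = action_key("fake-merkle-root", ["gcc", "-c", "main.c"],
                 {"PATH": "/usr/bin"}, {"cpu": "x86_64"})
first = run_action(key, lambda: {"outputs": ["main.o"]})   # executes
second = run_action(key, lambda: {"outputs": ["main.o"]})  # pure cache hit
```

Any node that derives the same key, on any machine, gets the cached outputs instead of re-running the compiler; change the command, the environment, or the input root and the key, and therefore the cache entry, changes with it.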

18:00.000 --> 18:08.000
Now, the nice thing about the action cache entry is that multiple nodes in your CI system can use that action cache

18:08.000 --> 18:14.000
to avoid having to run that action over and over again, right?

18:14.000 --> 18:24.000
Hermeticity can guarantee that if an action has run on one computer, it can be reproduced,

18:24.000 --> 18:26.000
so it doesn't matter where it runs.

18:26.000 --> 18:33.000
It can run on a second computer or it can run on a computer in a data center somewhere, right?

18:33.000 --> 18:34.000
And it doesn't matter.

18:34.000 --> 18:39.000
But in practice, you know, in implementations, hermeticity is not, you know,

18:39.000 --> 18:50.000
a binary property, it's not a yes or no, but a combination of different things that can be applied depending on your requirements.

18:50.000 --> 18:56.000
Yeah, and because, you know, with hermetic actions it doesn't matter where you're running them,

18:56.000 --> 19:01.000
you can run them on the same CI nodes, or you can run them in a build farm, right?

19:01.000 --> 19:06.000
And in the case of a build farm, your CI node is going to upload the action to the scheduler,

19:06.000 --> 19:13.000
the scheduler will assign it to a worker, the worker builds your outputs and then uploads them to a remote cache.

19:13.000 --> 19:19.000
And then multiple different nodes in your CI system can benefit from that same action.

19:19.000 --> 19:26.000
That's the nice thing about distributed computing: if you have different CI systems that are issuing,

19:26.000 --> 19:32.000
that require the same action to be computed, the work can be deduplicated by the scheduler, right?

19:32.000 --> 19:39.000
And this can help with the distributed computing requirements a little bit at scale, right?

19:39.000 --> 19:46.000
And this has been proven to work at the likes of Google or Apple in general.

19:46.000 --> 19:51.000
Finally, I just want to touch real quick on the Remote Execution API.

19:51.000 --> 19:57.000
Powering all of this is a set of APIs for different components to talk to each other.

19:57.000 --> 20:00.000
On the left-hand side there, you see Bazel talking to the remote cache.

20:00.000 --> 20:02.000
It's talking to the action scheduler.

20:02.000 --> 20:05.000
Over here, you see a worker also talking to the remote cache.

20:05.000 --> 20:13.000
Behind all of that is a set of open source APIs written in Protobuf and gRPC, right?

20:13.000 --> 20:25.000
It has several different components to it, but essentially, on your left-hand side, there are different ways that different clients and workers can upload blobs to the remote cache.

20:25.000 --> 20:42.000
Right? And with the FindMissingBlobs and GetTree RPCs, they are essentially there to save you on the total amount of download and upload, or the number of round trips you need to interact with the cache.
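The round-trip saving can be sketched with a dict standing in for the remote content-addressable store; `find_missing_blobs` here only mirrors the intent of the real RPC, not its gRPC signature:

```python
import hashlib

cas = {}  # remote content-addressable store: digest -> blob

def find_missing_blobs(digests):
    """Server side: report which digests the store does not have yet."""
    return [d for d in digests if d not in cas]

def upload(blobs):
    """Client side: hash locally, ask once what is missing, upload only that."""
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in blobs}
    missing = find_missing_blobs(list(by_digest))
    for d in missing:
        cas[d] = by_digest[d]
    return len(missing)  # number of blobs that actually traveled

upload([b"main.c v1", b"util.c v1"])         # first push sends both blobs
sent = upload([b"main.c v1", b"util.c v2"])  # only the changed blob is sent
assert sent == 1
```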

20:42.000 --> 20:48.000
On the right-hand side here, as I said, the action cache is essentially just a key-value store, right?

20:48.000 --> 20:55.000
You can upload some key and value to it, or you can get a value from it using some key.

20:55.000 --> 21:02.000
Finally, you have execution, which is executed remotely on a server farm.

21:02.000 --> 21:11.000
All of these are open source, and there are a lot of different implementations of these APIs, right?

21:11.000 --> 21:18.000
These are all the build tools that have implemented some part of the Remote Execution APIs today, right?

21:18.000 --> 21:24.000
And these are all the servers that implement the server side of it, right?

21:24.000 --> 21:30.000
So the nice thing about the Remote Execution API here is that you have interop, right?

21:30.000 --> 21:39.000
There's an open specification that essentially helps you avoid vendor lock-in when you start selecting this kind of technology.

21:39.000 --> 21:49.000
Obviously, one of them is my employer, I'm plugging them, but yeah, it's supposed to work with any of these setups right here.

21:49.000 --> 21:55.000
Now, the Remote Execution API is not perfect, right? There are a lot of things that we don't do well.

21:55.000 --> 22:00.000
I'm just highlighting here a few points that came up in recent meetings.

22:00.000 --> 22:07.000
I mean, in the working group of the Remote Execution API. First of all, we have not captured Windows that well.

22:07.000 --> 22:18.000
We had Microsoft come to a BazelCon recently and give a talk about their internal build system, which actually was inspired by the Remote Execution APIs,

22:18.000 --> 22:21.000
but they use a slight deviation of it.

22:21.000 --> 22:31.000
So that goes to show that this solution does scale well, but there is some slight Microsoft-specific logic that they wanted to be in there.

22:31.000 --> 22:36.000
We know that Unreal Engine recently introduced their own remote build system,

22:36.000 --> 22:40.000
but they don't use the Remote Execution API, and we really want them to.

22:40.000 --> 22:49.000
So hopefully we can invite them to come to the table and help build the Remote Execution API to serve the Windows use case better in the future.

22:49.000 --> 22:57.000
To be fair, right now everything still works on Windows; there are just a few edge cases that might pop up here and there.

22:57.000 --> 23:01.000
We have a working group for supporting massive blobs, right?

23:01.000 --> 23:08.000
Recently, we started seeing a lot of LLMs being used, a lot of containers being built with your build tools.

23:08.000 --> 23:17.000
So massive blobs are going to require some more work to fit into the protocol.

23:17.000 --> 23:21.000
And finally, we don't have a standardization for build telemetry.

23:21.000 --> 23:27.000
Even though both Bazel and Buck, well, Buck2 in general, come with their own build telemetry,

23:27.000 --> 23:29.000
they are not standardized, right?

23:29.000 --> 23:36.000
So each of them is using their own API, their own set of events.

23:36.000 --> 23:40.000
So right now there's no standardization for it.

23:40.000 --> 23:50.000
So different server implementations are using different APIs over there, and you cannot interop between systems, and that's a little bit of a downside.

23:50.000 --> 23:59.000
We are trying to get that sorted, hopefully within the next year, but yeah, come talk to us if you're interested in this area.

23:59.000 --> 24:02.000
And yeah, that's my talk. I think that's about time.

24:02.000 --> 24:11.000
I'm sorry I'm rushing through a lot of details, but yeah, that's the gist of it.

24:11.000 --> 24:14.000
The slides are available on the conference website.

24:14.000 --> 24:22.000
And if you have any questions, fire away or reach me via email.

24:22.000 --> 24:23.000
Yes?

24:23.000 --> 24:29.000
So if I'm hearing you right, these tools come from, you know, the server-side or web world.

24:29.000 --> 24:32.000
Are they being used in the embedded systems world at all, or...

24:32.000 --> 24:36.000
That's a really good question: are these tools being used in embedded systems?

24:36.000 --> 24:38.000
And yes, the answer is yes.

24:38.000 --> 24:46.000
There have been several recent talks from Google, where they are using Bazel to build different embedded systems.

24:46.000 --> 24:52.000
Obviously, they have, like, their whole Google Home lineup, as well as Google watches and Google phones.

24:52.000 --> 24:56.000
All of that is currently being built with Bazel, or Blaze internally.

24:56.000 --> 25:02.000
Now, you have Meta, who are working on their own glasses, and everything at Meta is being built with Buck2.

25:02.000 --> 25:05.000
So the story for...

