WEBVTT

00:00.000 --> 00:18.720
Hello, my name is Marco Dacos. I'm a software engineer at Google. I was going to give

00:18.720 --> 00:26.680
this presentation with Brandon Lum. Unfortunately, he fell sick. We had originally planned

00:26.680 --> 00:32.360
for, well, Brandon loves SBOMs so much that he really wanted to give the talk. So, he did

00:32.360 --> 00:39.000
a recording of the first part, but we didn't check, we didn't realize that that's

00:39.000 --> 00:45.360
not supported here at FOSDEM. So, I'm just going to give the whole thing. So, bear with

00:45.360 --> 00:54.040
me for the first part. Let's begin. Today, we're going to talk about Google's journey

00:54.040 --> 01:04.280
with SBOM. We'll start out with how we implemented various aspects of this and then some

01:04.280 --> 01:13.880
lessons learned through this process. We will go through the SBOM lifecycle of generation,

01:13.880 --> 01:28.880
storage, retrieval, and then various applications. So, before we start off with this lifecycle,

01:28.880 --> 01:39.880
I guess first, to give some context, our initial motivation to dive deep into SBOM

01:39.880 --> 01:47.960
was a mix of responding to EO 14028 and security and license use cases as well. Before we

01:47.960 --> 01:54.040
get into that lifecycle, let's talk about some design principles that we developed. The first

01:54.040 --> 02:01.280
question we asked is, what are we looking for in an SBOM? We rallied around two properties,

02:01.280 --> 02:08.880
accuracy and trustworthiness. First, accuracy: is the dependency information in the SBOM correct?

02:09.880 --> 02:15.120
And then trustworthiness. Can we trust this SBOM and use it for important security and

02:15.120 --> 02:23.040
compliance decisions? Based on these properties, we developed a series of best practices that

02:23.040 --> 02:33.440
we'll go through in this talk. There's a link here as well to the document that goes through these

02:33.440 --> 02:41.600
design principles that we recently made public. So, feel free to check that out. Another

02:41.600 --> 02:48.960
question, clearly, that we ran into is: SPDX or CycloneDX or both? But more generically,

02:48.960 --> 02:57.800
the question is, how opinionated do we want to be throughout this process? The scope for us

02:57.880 --> 03:02.920
was very large. It spanned organizations, different products, different

03:02.920 --> 03:12.680
tech stacks. There were a lot of moving pieces as a result. We decided that less is more. That

03:12.680 --> 03:20.440
for all of the moving pieces to come together in a better way, being more opinionated would

03:20.440 --> 03:27.000
be helpful. So, we decided to only use one SBOM standard based on the experience we had

03:27.080 --> 03:43.560
at the time. We went with SPDX. So, let's get into SBOM generation and how we approached

03:43.560 --> 03:51.320
this. A question that we faced here is when and where to generate SBOMs. We can do

03:51.320 --> 03:57.480
it at the source phase, during the build, or through analysis after the fact, after the artifact has been

03:57.480 --> 04:06.680
generated. On one hand, if we look at the source, if we generate SBOMs at the source, we

04:06.680 --> 04:14.440
found that things like test and plug-in dependencies would end up in the SBOM. We found

04:14.440 --> 04:21.160
that there is perhaps ambiguous dependency resolution. And on the other side, for analysis SBOMs,

04:21.240 --> 04:31.880
builds are lossy. There could be missing context. In 2022, we did some work and showed that

04:33.720 --> 04:39.480
if we, for example, import a binary into a container image, well, as we mentioned, builds are lossy.

04:39.480 --> 04:44.920
So, the dependencies of that binary won't show up in the SBOM.

04:45.560 --> 04:58.840
This is a Goldilocks situation: at the source, SBOMs are inaccurate, and in the analysis of

04:58.840 --> 05:07.960
the artifact stage, they can be incomplete. We found that the best approach is to meet in the

05:07.960 --> 05:14.360
middle at the build time where there is a good trade-off between accuracy and completeness. We

05:14.360 --> 05:23.560
found that to know when or to really know what goes into the artifact that's being produced,

05:23.560 --> 05:32.920
the builder and the build process know best. So, we recommend that only the builders can

05:32.920 --> 05:40.600
generate SBOMs. But this is again also a nuanced question because during the build time,

05:41.240 --> 05:48.680
there can be an SBOM generated by the build tooling itself or an SBOM generated by an SCA tool.

05:50.760 --> 05:57.560
We recommend that wherever possible, the build tool should generate the SBOM.

05:59.400 --> 06:09.080
This lists some cases where we do that. For Android, we created a

06:10.040 --> 06:18.040
Gradle plugin that generates an SPDX SBOM during the build. We have a similar approach for

06:18.040 --> 06:28.680
google3, which is our monorepo. But outside of that, we use SCA tools such as Syft and OSV

06:28.760 --> 06:46.760
to generate the SBOM during the build process. So, now we are faced with SBOM storage. We

06:46.760 --> 06:51.800
have the SBOMs. So, let's just put them into a database and be done. Well, it's a little bit

06:51.880 --> 06:59.800
more complicated than that. One of the important questions here was: to be able

06:59.800 --> 07:05.320
to use SBOMs for important security and compliance decisions, to be able to get them to the

07:05.320 --> 07:10.440
government, for example, how can we create an SBOM database that we can trust?

07:10.760 --> 07:22.600
So, for this, we developed a system called SILO, the Supply Chain Integrity Log,

07:23.800 --> 07:30.920
which, contrary to its name, is supposed to break up metadata silos. For those familiar with GUAC,

07:31.960 --> 07:38.280
it serves a similar purpose. GUAC is a project in the OpenSSF. It serves a similar

07:38.360 --> 07:44.200
purpose, and it's under the same team. Or rather, it's worked on by members of SILO as well.

07:47.800 --> 07:57.160
So, SILO, for some context, collects metadata from software supply chain events that occur

07:57.160 --> 08:05.960
in Google. When builders produce artifacts, they produce build provenance, they sign it,

08:06.600 --> 08:14.360
and then they send it over to SILO. And then SILO can verify the signature,

08:15.480 --> 08:21.800
can verify that it was produced by a trusted builder, and then can ingest it, store it,

08:21.800 --> 08:31.880
and use it, and make that information usable. For SBOMs, we took a similar approach.

08:32.760 --> 08:42.040
We used in-toto, and in in-toto, a predicate type called the reference

08:42.040 --> 08:50.200
attestation, which we have now upstreamed to the attestation repository, that associates

08:50.200 --> 08:59.880
an SBOM, or in this case, SBOMs, with a software artifact. And so here, when builders produce

08:59.960 --> 09:09.400
SBOMs, they generate and sign an in-toto attestation and send it over to SILO, and SILO

09:10.200 --> 09:18.760
verifies this to ensure the integrity of the SBOM. And this also supports only accepting SBOMs

09:18.760 --> 09:37.320
from trusted builders. So now, in theory, we have a table mapping artifacts to SBOMs and

09:37.320 --> 09:51.880
retrieval should be easy, but famous last words. So what we want is to have a simple input

09:51.880 --> 10:02.520
output of looking up an SBOM by an artifact, such as a URI or a digest, but what we got was

10:02.520 --> 10:10.600
nothing when we did this operation. Why is that? So the context is that we're working in a

10:10.600 --> 10:18.440
supply chain. It's a graph and it has to be reasoned about as such. When we searched for the SBOM

10:18.440 --> 10:27.320
of gcr.io/abcd, we didn't get anything because no SBOM was produced for that artifact,

10:27.400 --> 10:33.640
because that last stage when that artifact was produced was, say, a promotion process.

10:35.080 --> 10:43.160
And that's not a build. So there is no SBOM generated there. So what we can do is go back

10:43.160 --> 10:52.520
in the graph. We can look one level lower. And here, in this case, we did not find the SBOM.

10:52.760 --> 11:02.200
And then we go even further back and it turns out that this artifact in staging is a multi-arch

11:02.200 --> 11:14.440
image that was assembled from different container images itself. And here, we found SBOMs.

11:14.440 --> 11:21.160
Great. But we can go even further back. All the way back through the chain to collect all the

11:21.240 --> 11:29.560
SBOMs that were in some way associated with the production of this final artifact.

11:30.920 --> 11:36.440
So now, when we look up the SBOMs of an artifact, we get a collection of SBOMs.

11:37.880 --> 11:45.960
Awesome. Here are some edge cases that we ran into throughout this process.

11:45.960 --> 11:55.720
Well, going back a second, what we did was implement a transitive

11:55.720 --> 12:02.280
search through the software supply chain graph whenever looking for SBOMs to collect all of these.

12:03.800 --> 12:09.560
And here's some edge cases that we ran into through this process. I won't go into all of them here.

12:09.560 --> 12:15.240
But in general, the practice here, the principle here is that to get good quality SBOMs,

12:16.440 --> 12:23.400
that are accurate and complete, we try to compose them to arrive at completeness.
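
NOTE The transitive search described in this part of the talk can be sketched roughly as below. This is a minimal illustration, not Google's implementation. Only the name gcr.io/abcd comes from the talk; the PROVENANCE and SBOMS tables, the other artifact names, the file names, and collect_sboms are all hypothetical.
```python
from collections import deque
# Hypothetical provenance edges: artifact -> the inputs it was produced from.
PROVENANCE = {
    "gcr.io/abcd": ["staging/abcd"],  # promotion step, not a build
    "staging/abcd": ["staging/abcd-amd64", "staging/abcd-arm64"],  # multi-arch assembly
    "staging/abcd-amd64": ["base-image"],
    "staging/abcd-arm64": ["base-image"],
}
# Hypothetical table of SBOMs that builders attached to the artifacts they built.
SBOMS = {
    "staging/abcd-amd64": "sbom-amd64.spdx.json",
    "staging/abcd-arm64": "sbom-arm64.spdx.json",
    "base-image": "sbom-base.spdx.json",
}
def collect_sboms(artifact, provenance, sboms):
    """Breadth-first walk back through provenance, gathering every SBOM found."""
    found, seen, queue = [], {artifact}, deque([artifact])
    while queue:
        current = queue.popleft()
        if current in sboms:
            found.append(sboms[current])
        for parent in provenance.get(current, []):
            if parent not in seen:  # visit shared inputs only once
                seen.add(parent)
                queue.append(parent)
    return found
print(collect_sboms("gcr.io/abcd", PROVENANCE, SBOMS))
```
A direct lookup of gcr.io/abcd in SBOMS finds nothing, while the walk returns the SBOMs of everything that fed into it, which is the composition idea the talk describes.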

12:29.400 --> 12:35.400
So we've left out a little bit of detail here. Where does this graph come from?

12:36.120 --> 12:43.320
Is this an idealized state or is it real? So let's go back to the build, which we mentioned earlier.

12:43.960 --> 12:50.680
When the builder produces an artifact, it creates build provenance, not only describing

12:50.680 --> 12:54.680
how the build was conducted, but also the dependencies and the materials that went into the build.

12:57.400 --> 13:02.520
So this provenance can be used to construct the graph.

13:04.120 --> 13:10.440
And it can be used to glue together these lost pieces of SBOMs, creating more complete SBOMs

13:10.440 --> 13:16.760
by composing them together. And this is a blog post, partly by Brandon, that goes into this in

13:16.760 --> 13:21.240
more detail about how SLSA and SBOMs work together.

13:27.640 --> 13:35.640
Another challenge we ran into here is that requests for SBOMs are often not at the artifact granularity.

13:36.600 --> 13:42.680
Somebody will ask for SBOMs for a product, say Pixel OS. And translating this

13:45.480 --> 13:48.840
to something that can be interpreted with the mechanisms that we developed

13:49.960 --> 13:56.120
is challenging. Maintaining software inventories is hard. And in these cases we relied on the expertise of

13:56.120 --> 14:03.800
product teams. Part of why this is challenging is that different

14:03.800 --> 14:08.840
products can be composed of artifacts in varied ways.

14:15.320 --> 14:22.680
Okay, so we've responded to the EO, but compliance isn't fun, so let's talk about some other

14:22.680 --> 14:32.280
things that we've done with SBOMs. SBOM blobs are cool, but what they contain is even better.

14:32.360 --> 14:38.040
We used them to develop a dependency inventory across the organization.

14:38.920 --> 14:45.640
This involved parsing and storing SBOM data at scale. For now, we only store

14:46.680 --> 14:51.640
flat lists of dependencies that are in the SBOMs. We found that the graph structure that can be

14:51.640 --> 15:02.040
encoded in SBOM relationships was useful, but that it wasn't mature enough to really be

15:02.120 --> 15:10.120
depended on. And developing such a dependency inventory with SBOMs in a large,

15:10.120 --> 15:15.960
software-producing machine like Google is difficult, since it spans different products,

15:15.960 --> 15:21.880
tech stacks, different CI/CD flows, different SBOM generators. But there are some properties of SBOMs

15:21.880 --> 15:30.280
that really helped us succeed. This included the fact that SBOMs are a common format across

15:30.280 --> 15:37.880
ecosystems. That enabled us to reason about otherwise siloed systems through a single pane of glass.

15:39.480 --> 15:45.960
And another thing that really helped here was that SBOMs are flexible both in the generation

15:45.960 --> 15:55.000
and in the contents of them. We did, as we talked about, put forth recommendations on how they

15:55.000 --> 16:01.080
should be produced in Google, but within that, each ecosystem could choose to generate SBOMs

16:01.080 --> 16:07.640
in a way that produced the best SBOMs for them. But this flexibility also led to some challenges

16:07.640 --> 16:15.240
that we'll talk about in a bit. So we have this dependency inventory. We can combine it with

16:15.240 --> 16:21.160
things like threat intelligence and organizational metadata using systems like SILO and GUAC

16:22.120 --> 16:30.360
and use that for various use cases such as incident response, measuring risk from upstream,

16:30.360 --> 16:35.800
open source dependencies, and also for general business and housekeeping purposes.

16:39.640 --> 16:46.520
This inventory enables fast identification of where packages are used. This is critical for

16:46.520 --> 16:57.560
incident response. Think of this as a Log4Shell or XZ type of capability. A quote from

16:57.560 --> 17:01.000
a product team is that they were able to figure out that they weren't affected by an incident

17:01.000 --> 17:09.560
within 10 minutes using SBOMs. What's shown on the screen here is a sample dashboard that

17:09.560 --> 17:17.960
can search across this inventory. As you can see, searching for a single dependency

17:20.360 --> 17:28.600
searches through various siloed systems. That's part of the value that SBOMs provide.
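
NOTE A dependency inventory built from flat SBOM dependency lists can be sketched as an inverted index from package identifier to artifacts. This is a minimal sketch under assumptions: the artifact names, the package URLs in SBOM_DEPENDENCIES, and the functions build_inventory and affected_artifacts are hypothetical and do not reflect the dashboard shown in the talk.
```python
from collections import defaultdict
# Hypothetical flat dependency lists, as parsed from the SBOM of each artifact.
SBOM_DEPENDENCIES = {
    "android/app-a": ["pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"],
    "containers/service-b": [
        "pkg:golang/golang.org/x/text@v0.3.7",
        "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1",
    ],
    "containers/service-c": ["pkg:pypi/requests@2.31.0"],
}
def build_inventory(sbom_dependencies):
    """Invert artifact -> PURLs into PURL -> sorted list of artifacts."""
    index = defaultdict(set)
    for artifact, purls in sbom_dependencies.items():
        for purl in purls:
            index[purl].add(artifact)
    return {purl: sorted(artifacts) for purl, artifacts in index.items()}
def affected_artifacts(inventory, purl_prefix):
    """Return every artifact that uses any version of the given package."""
    hits = set()
    for purl, artifacts in inventory.items():
        if purl.startswith(purl_prefix + "@"):
            hits.update(artifacts)
    return sorted(hits)
INVENTORY = build_inventory(SBOM_DEPENDENCIES)
print(affected_artifacts(INVENTORY, "pkg:maven/org.apache.logging.log4j/log4j-core"))
```
One query answers "where is this package used" across Android and container artifacts alike, because every ecosystem reports its dependencies in the same format.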

17:30.520 --> 17:34.760
This slide also calls out the importance of point of contact information or

17:34.760 --> 17:43.320
attribution of artifacts to artifacts and artifacts to products and to teams. This organizational

17:43.320 --> 17:51.000
metadata is important because otherwise the actionability of this inventory is limited and

17:51.000 --> 18:03.960
it creates toil when responding to incidents. Five minutes. It's not until 1030.

18:05.400 --> 18:13.720
Yeah, we'll have the questions. Okay, so another thing, another use case that we

18:14.920 --> 18:22.360
applied it to is to measure the risk from our dependency usage. This graph shows

18:24.280 --> 18:29.160
a subset of the fleet, the dependencies of a subset of the fleet,

18:30.040 --> 18:37.400
mapped to the OpenSSF Scorecard scores, and this identifies a danger zone here.

18:40.840 --> 18:45.800
But this really highlights the need for additional metadata such as criticality and

18:45.800 --> 18:53.800
attribution to make this more actionable. Now, I'll run through some lessons that we learned through

18:53.880 --> 19:00.680
this process of operationalizing a dependency inventory using SBOMs. As you might have noticed,

19:00.680 --> 19:07.160
we decided to use PURLs for the identification scheme for this inventory. We decided to use PURLs

19:07.160 --> 19:16.120
because they're common in SBOMs. There are other identification schemes, and

19:17.640 --> 19:21.640
we didn't want to develop a new one. However, we quickly noticed that many SBOMs or

19:21.640 --> 19:28.200
packages in SBOMs didn't have PURLs. In these cases, we generated fake PURLs with the

19:28.200 --> 19:32.120
information that was available in the SBOM, but this still had limited utility.
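
NOTE The talk does not say what the generated placeholder identifiers looked like. One way to sketch the idea, purely as an assumption, is to fall back to the "generic" package URL type and fill it with whatever name and version the SBOM entry carries. The function placeholder_purl and the sample package name are hypothetical.
```python
from urllib.parse import quote
def placeholder_purl(name, version=None):
    """Build a 'generic' PURL from whatever name and version the SBOM entry has."""
    purl = "pkg:generic/" + quote(name, safe="")  # percent-encode spaces, slashes, etc.
    if version:
        purl += "@" + quote(version, safe="")
    return purl
print(placeholder_purl("vendored libfoo", "1.0"))
```
Such an identifier keeps the package searchable inside the inventory, but it cannot be matched against a registry or a vulnerability feed, which is one reading of the limited utility mentioned in the talk.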

19:34.440 --> 19:40.920
And often this happened due to things like build-time generation of SBOMs from artifacts

19:40.920 --> 19:47.880
that contain third-party vendored code, where the proper identification metadata hadn't been

19:47.960 --> 19:58.360
propagated through. Other shortcomings that we ran into related to the SCA tools that we used.

19:58.360 --> 20:03.480
And these are really a fundamental result of how they operate. SCA tools attempt to

20:03.480 --> 20:08.600
read files on disk and match them to packages in a registry based only on what the file

20:08.600 --> 20:14.440
on disk says, without access to the registry. Understandably, this can lead to some

20:14.440 --> 20:23.560
issues. The problem is that different

20:23.560 --> 20:28.760
packages can have the same names or very similar manifest files within or outside of the

20:28.760 --> 20:35.240
same ecosystem. And the scanner has to choose one. For example, Unity, which is a game engine,

20:35.240 --> 20:45.960
has the same package.json file as npm. And in this case npm PURLs were generated for

20:46.600 --> 20:55.000
something that wasn't npm at all. Other cases include private packages. This is an example.

20:55.000 --> 21:00.840
This is shown on the screen. This is a manifest file of a private package in a repository that

21:00.920 --> 21:09.800
was not published to npm. But an npm PURL was generated for it. And by the PURL spec,

21:09.800 --> 21:14.920
this PURL indicates that this package is on the registry. But in reality, it's not.
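
NOTE The misidentification described here can be reproduced with a deliberately naive scanner that treats every package.json as an npm package and never consults the registry. This is an illustrative sketch, not the behavior of any specific SCA tool. The function naive_purl_from_manifest and both sample manifests are hypothetical.
```python
import json
def naive_purl_from_manifest(manifest_text):
    """Mimic a scanner that assumes every package.json describes an npm package."""
    manifest = json.loads(manifest_text)
    return f"pkg:npm/{manifest['name']}@{manifest['version']}"
# Hypothetical Unity package manifest; Unity packages also ship a package.json.
unity_manifest = '{"name": "com.example.unity-tools", "version": "1.2.0", "unity": "2021.3"}'
# Hypothetical private package that was never published to the npm registry.
private_manifest = '{"name": "internal-build-helpers", "version": "0.0.1", "private": true}'
print(naive_purl_from_manifest(unity_manifest))
print(naive_purl_from_manifest(private_manifest))
```
Both results claim the npm ecosystem even though neither package exists on the npm registry. The file on disk alone does not carry enough information to tell these cases apart.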

21:17.240 --> 21:22.440
So why is this important? Why do I flag this? It's because some other people not with the best

21:22.440 --> 21:29.880
intentions also realized this. For a lot of these packages, such as the private packages, for example,

21:31.720 --> 21:40.600
malware was submitted to the registry under that package name. The name was squatted. And then

21:42.040 --> 21:49.080
this creates a lot of toil because whenever somebody uses a project that contains one of these

21:49.080 --> 21:56.760
say, private packages and runs a scanner on it, it will flag it as malware

21:56.760 --> 22:02.920
when in reality it isn't. This is an example of a thread from a project that contains such a package and it

22:02.920 --> 22:08.360
receives a lot of complaints. Understandably, the maintainers didn't sympathize.

22:12.920 --> 22:20.120
So we also ran into some challenges with identifiers. I'm going to go through this section quickly

22:20.360 --> 22:30.600
because I don't have much time left. But the SCA problems that I mentioned are

22:30.600 --> 22:36.280
compounded by a lack of expressivity in the identifiers that we used. In general, identification is hard.

22:37.560 --> 22:44.040
Even if we could identify that a package is private, for example, PURL doesn't support creating an

22:44.040 --> 22:50.680
identifier for it. SCA tools could extract more information such as hashes, but how to do this

22:50.680 --> 22:57.960
isn't standardized. And how to compare packages with that say supplemental information isn't

22:57.960 --> 23:07.240
standardized either. So I'd like to also just call out a reference that I think a project that

23:07.240 --> 23:11.720
I think is making a lot of progress in this space: GUAC, which we mentioned earlier.

23:12.680 --> 23:19.880
GUAC is looking at a strategy of disambiguating and correlating identifiers

23:21.800 --> 23:28.120
to help solve this issue. So these issues all fall under

23:28.760 --> 23:36.760
SBOM quality. Other issues have also highlighted this.

23:36.840 --> 23:41.560
The importance of SBOMs has meant that this year we're really going to focus on SBOM quality,

23:42.280 --> 23:46.840
both on syntactic elements and on semantics such as completeness and accuracy.

23:49.480 --> 23:55.560
This is an example of SBOM quality related to accuracy, but I don't think I have time to

23:55.560 --> 24:04.840
go into that here. So what now? We went from almost zero SBOMs to four million SBOMs a week,

24:04.920 --> 24:11.960
totaling over 200 million SBOMs. Security and compliance teams are using SBOMs to triage

24:14.600 --> 24:21.400
security and compliance issues. SBOMs are part of the security posture of several organizations

24:22.440 --> 24:29.000
and we've been through a few SBOM tools along the way.

24:29.240 --> 24:35.400
This is just a list of some things that we're going to, that we're looking forward to,

24:35.400 --> 24:40.680
that we're focusing on this year. And yeah, that's just on the screen.

24:41.960 --> 24:45.960
Yeah, so I'm ready to take any questions, if anybody has any.

24:47.160 --> 24:54.680
Which tool do you use for recognising vulnerabilities, for example?

24:55.640 --> 24:59.640
Is that completely different? Is that from the SBOMs?

25:07.640 --> 25:17.960
Oh, yes, sorry. So the question is, what do we use to detect vulnerabilities in

25:17.960 --> 25:24.360
SBOMs, and is that in scope of SBOMs or is it out of scope? Is that

25:24.360 --> 25:26.840
done separately? And in this case, it's out of scope.

25:29.880 --> 25:36.280
I also have a question. Yes. Since you mentioned the supply chain, do you reuse parts of the SBOMs

25:36.280 --> 25:43.960
or which one results in the end SBOM that is being used as a stand-alone unit?

25:44.760 --> 25:49.000
And as it forms, the intermediate SBOMs that are being...

25:49.960 --> 25:56.280
If I understand your question correctly, we use provenance to, sorry.

25:57.000 --> 26:01.640
The question is, do we, I mean, just very,

26:02.920 --> 26:12.040
explicitly, do we merge SBOMs together or do we compose them? We compose them with the mechanism that we mentioned.

26:13.320 --> 26:16.520
Yes. You've got a big data set. Really good.

26:17.160 --> 26:22.680
How are you contributing the data issues you're finding upstream to make the metadata better for everybody else?

26:24.280 --> 26:32.440
Well, a good question. Yes. We have a big data set. How are we contributing this back to

26:32.440 --> 26:41.960
upstream to help the community? This year we're looking at generating, or developing,

26:41.960 --> 26:45.880
an SBOM quality library. This is something that we may be able to open source,

26:47.480 --> 26:53.480
and OSV-SCALIBR will be a part of the SBOM quality work that we look at, and this is open source as well.

26:57.560 --> 26:59.560
So you're not, you know, thank you very much.

27:11.960 --> 27:20.600
Actually, the co-presenter, Brandon, was online on Matrix and was answering questions there.

