WEBVTT

00:00.000 --> 00:16.120
Okay, we're a little behind time, but we'll catch up. I'll be very brief and very quick.

00:16.120 --> 00:24.320
Okay, so this is something that I've been working on for a very long

00:24.320 --> 00:31.200
time. In case you don't know me, I'm Alexis, and I've been working in the

00:31.200 --> 00:44.280
SBOM space for many years. This is a preview of a big effort that we're

00:44.280 --> 00:54.280
doing at work, and I think part of it will be interesting for more people.

00:54.760 --> 01:03.160
In case you don't know by now, SPDX talks about three different levels:

01:03.160 --> 01:09.480
packages, files, and, inside files, snippets. This is

01:09.480 --> 01:17.080
a hierarchy. The package, which is the largest of the three, is defined, in both

01:17.080 --> 01:23.880
SPDX version 2 and SPDX version 3, as

01:23.960 --> 01:31.640
any unit of content that can be associated with the distribution of software.

01:31.640 --> 01:40.120
So it's about software being distributed, and this is a package. In our work,

01:40.120 --> 01:46.440
we found a problem with this, because it is about something being distributed, a very concrete

01:46.440 --> 01:53.560
thing being distributed, but sometimes we want to refer to something more general. We want

01:53.640 --> 02:05.320
to talk about OpenSSL in general, or zlib, or Log4j, if you were working two Christmases

02:05.320 --> 02:16.520
ago, or Llama, or something like that. In our current terminology, in SPDX,

02:16.520 --> 02:23.160
there's no way to refer to OpenSSL in general; you have to refer to one specific

02:24.120 --> 02:33.880
package, and I'll explain that right after. So we introduced a new concept that we just

02:33.880 --> 02:41.640
called a component. When I say we introduced it: we have an internal system that we call

02:41.640 --> 02:47.080
Tosca, which is the acronym of "open source component aggregate", only because they could

02:47.080 --> 02:55.080
not find another word for me. It's a system that handles open source, essentially, and

02:55.080 --> 03:01.400
SBOMs and all this stuff. Essentially, we had to introduce the new idea of a component.

03:01.400 --> 03:08.920
The component is some abstract thing that refers to a specific software

03:08.920 --> 03:16.520
component, such as OpenSSL. It might refer to a specific version of OpenSSL or not,

03:16.840 --> 03:23.960
but remember, a package is very specific. When you say a

03:23.960 --> 03:33.160
package, it's OpenSSL version 3.1.0 as distributed by Ubuntu. This is a package,

03:34.600 --> 03:41.240
because OpenSSL in another version as distributed by Ubuntu will be a different package,

03:41.240 --> 03:51.080
and OpenSSL in exactly the same version as distributed by Debian, for example, is yet another package.

03:51.080 --> 03:57.720
So each package is a component with a specific version from a specific supplier,

03:58.280 --> 04:04.440
and when we're talking about suppliers, a supplier is the entity that provides the software.

04:04.440 --> 04:09.080
If we're talking about deb packages, you can get them from Debian, for example,

04:09.160 --> 04:14.760
or Ubuntu; these are suppliers. If you get Python packages, you usually get them from

04:14.760 --> 04:20.520
PyPI; Java, you get from Maven Central; npm for Node, all this stuff. Rust

04:20.520 --> 04:28.680
you get from crates.io, and in Go, unfortunately, there is no such thing; we'll have that

04:28.680 --> 04:35.640
issue afterwards. But besides the actual distributors of packages, there are the

04:35.720 --> 04:40.840
projects themselves. OpenSSL also publishes its own source, so it is also a supplier,

04:40.840 --> 04:48.680
and they publish it on openssl.org. So our model, our ontology, if we want to call it in

04:48.680 --> 05:04.280
nice words from SPDX: we have the package, and the package is a package of a component

05:05.800 --> 05:11.720
and is supplied by a supplier, and they all belong to an ecosystem;

05:11.720 --> 05:17.560
this is an internal thing. Our ontology is much larger: we have licenses,

05:17.560 --> 05:27.880
we have attribution text, we have other stuff. But the whole idea is that a package,

05:28.760 --> 05:34.360
repeating the same words again, is one specific version by one specific supplier.

05:35.000 --> 05:40.920
But we wanted to have the component part, because the component is something abstract.

05:41.800 --> 05:51.720
If I want to refer to something covering, for example, two different versions of OpenSSL,

05:51.720 --> 05:58.120
I can use the component abstraction. A couple of examples,

06:00.920 --> 06:06.040
curl for example: you see there are four different types of boxes;

06:06.760 --> 06:12.760
they are labeled on the right-hand side. And there are four different types of

06:12.760 --> 06:19.880
arrows: "is licensed under", "is distributed by", "is instance of", and "is package of".

06:19.960 --> 06:24.760
The box types are license, supplier, component, and package. The package is the

06:24.760 --> 06:30.360
rectangle, and the abstract thing is the ellipse; that's the component. So we're talking

06:30.360 --> 06:38.840
about the curl component in the center, and there are two different curl versions,

06:38.840 --> 06:43.400
but they are again abstract, because I'm talking in general about

06:43.400 --> 06:53.080
curl 8.9. Then, if you get a specific version of curl from a specific supplier,

06:53.080 --> 07:01.720
then this is a package. And then you have the suppliers and all this stuff.
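
NOTE
The model described so far can be sketched in Python. This is only an illustration, not the speaker's internal system: the class names, field names, and the effective_license helper are assumptions made for this sketch, and the version strings are placeholders. A package is a versioned component from one supplier; a license can sit on any level and is looked up from the most specific level upwards.
```python
from dataclasses import dataclass
from typing import Optional
@dataclass(frozen=True)
class Supplier:
    name: str  # e.g. a distribution or the upstream project itself
@dataclass
class Component:
    name: str
    version: Optional[str] = None  # None means "the component in general"
    instance_of: Optional["Component"] = None  # versioned node points to the abstract one
    license: Optional[str] = None  # set only on the level where it holds
@dataclass
class Package:
    component: Component  # "is package of"
    supplier: Supplier  # "is distributed by"
    license: Optional[str] = None  # package-specific override, if any
def effective_license(pkg: Package) -> Optional[str]:
    """Return the license from the most specific level that defines one."""
    if pkg.license is not None:
        return pkg.license
    node = pkg.component
    while node is not None:
        if node.license is not None:
            return node.license
        node = node.instance_of
    return None
# Hypothetical data mirroring the examples in the talk.
ubuntu, debian = Supplier("Ubuntu"), Supplier("Debian")
curl = Component("curl", license="curl")
curl_8 = Component("curl", version="8.x", instance_of=curl)
openssl = Component("openssl")  # no single license for all versions
openssl_3 = Component("openssl", "3.x", openssl, "Apache-2.0")
openssl_1 = Component("openssl", "1.x", openssl, "OpenSSL")
packages = [Package(curl_8, ubuntu), Package(curl_8, debian), Package(openssl_3, ubuntu), Package(openssl_1, debian)]
licenses = [effective_license(p) for p in packages]
```
In this sketch both curl packages resolve to the license stored once on the abstract curl node, while each OpenSSL package picks up the license stored on its versioned node.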

07:01.720 --> 07:08.280
The nice thing about that is that, if you look at the center, there is a

07:08.360 --> 07:17.400
dotted line from curl to curl: one curl is the abstract component in the center, and the

07:17.400 --> 07:25.640
one in the lower part of the screen is the license. So I can associate the license with the

07:25.640 --> 07:33.240
abstract component. I can save the information that curl is under the curl

07:33.320 --> 07:40.440
license (I should have chosen another example, to be more interesting), and you

07:40.440 --> 07:48.600
can save it here and not repeat this information for each of the four different packages:

07:48.600 --> 07:57.640
that curl 7.8 by Ubuntu is under curl,

07:57.720 --> 08:04.920
that curl 8 by Ubuntu is under curl, and so on for all these different versions. If we go to

08:04.920 --> 08:11.800
OpenSSL, it's a little more complex. You have different versions of OpenSSL

08:11.800 --> 08:17.960
from different suppliers, and, depending on the version, the license information

08:17.960 --> 08:24.920
might change: OpenSSL 3 is under Apache, but OpenSSL 1.something is under

08:24.920 --> 08:31.000
the OpenSSL license. So, again, you're building your graph, and you can

08:31.000 --> 08:36.920
put your information wherever you want, and you still have your suppliers and the license

08:36.920 --> 08:44.600
information nodes and stuff. So the main idea I want to present here is the introduction

08:44.600 --> 08:53.240
of this abstract component thing. The advantage that we found in using this is that you can

08:53.320 --> 09:01.880
refer to something in general. Like I said before, I want to say something

09:01.880 --> 09:08.760
about OpenSSL in general, or curl in general, and I can refer to that independent

09:08.760 --> 09:16.120
of the specific version and independent of the actual supplier who gives me the actual bits

09:16.200 --> 09:24.680
of a specific package. The typical information that we used it for, and that's where we started,

09:24.680 --> 09:33.080
was license information, but obviously any information that is pertinent

09:33.080 --> 09:42.200
to an abstract version of a component can be associated with it. As I said, we're

09:42.280 --> 09:53.720
very short on time, so, some real-world data. We've been using this abstraction; I mean, we developed

09:53.720 --> 10:04.600
it internally, designed it and developed it. We've now been working on it for not two years,

10:04.600 --> 10:12.120
but a little less than two years. We keep updating the model; it was not, in the beginning,

10:13.000 --> 10:18.840
the same. I'm talking about the abstraction, the ontology, the data model.

10:19.400 --> 10:24.600
We had different software implementations; we started from a proof of concept, a prototype,

10:24.600 --> 10:29.640
a pilot, and every implementation, depending on the team that did it,

10:30.280 --> 10:35.080
did something different. But the software part is not the interesting part; it's the

10:36.040 --> 10:42.280
data model part. We also do continuous data collection, so we collect

10:43.560 --> 10:53.080
data for a lot of open source packages. For example, you get packages

10:53.080 --> 11:00.760
from the suppliers, and there are many variations in how to get the data from suppliers.

11:01.720 --> 11:06.440
Do they publish a list of components or not? Does the list

11:06.440 --> 11:11.640
include only the latest version, or does it have all the versions? Do you get license information,

11:11.640 --> 11:17.000
for example? Do you get a source link? Do you get other metadata? This depends completely

11:17.000 --> 11:26.440
on the ecosystem. I'm not sure if it's clear, but when I'm asking whether the supplier

11:26.920 --> 11:34.280
provides the list of components: let's say the supplier is Ubuntu, as we said.

11:34.280 --> 11:41.080
Does Ubuntu provide the list of all its components? The answer there is: it depends,

11:41.960 --> 11:47.960
on how you get it, and if the question was about Debian, the answer is yes, I can get

11:48.040 --> 11:56.680
the whole list of packages in Debian. If you're talking about PyPI, yes, it does.

11:56.680 --> 12:04.040
If you're talking about Rust, crates.io does publish a list of packages, but not all the versions;

12:04.040 --> 12:10.440
it publishes the latest version of each one. So, depending on what you want,

12:10.440 --> 12:23.640
the answer varies a little. So, some real data statistics: for four different suppliers,

12:23.640 --> 12:32.040
we have a list of how many packages we handle and how many components

12:32.040 --> 12:39.000
(remember, components are the abstract idea), and what the savings are if you put the information at

12:39.000 --> 12:45.240
the component level instead of the package level. You see, for Debian, where we collect

12:45.240 --> 12:57.480
the information once per month, we have 234,000 packages in our data,

12:57.480 --> 13:05.320
but these correspond to only 54,000 components. So you save 77% if you keep the information

13:06.120 --> 13:12.680
at the component level. Then, for Ubuntu, we only care about LTS;

13:12.680 --> 13:19.000
we do it every three months, and we get about 60%. I'm going to be reading all the numbers here.

13:20.600 --> 13:28.840
For Rust, we collect everything once a month; we have 6,000 things, and you save 74%

13:29.480 --> 13:36.440
if you keep the information at the component level instead of the package level,

13:36.440 --> 13:45.240
and then PyPI is the wonderful thing. PyPI you can query and get the list of

13:45.240 --> 13:52.520
everything available there. You get all the versions; it's not easy to remove versions.

13:52.520 --> 13:59.640
These are the data from, I think, a couple of weeks ago, or maybe a week ago; a couple of weeks, probably.

13:59.640 --> 14:08.520
So I can get 13 million packages from PyPI, but it's only 600,000 components.

14:08.520 --> 14:15.000
So I'm saving 95% of the data if I can talk about

14:15.080 --> 14:23.800
the requests package in general, instead of having requests in all the different versions

14:23.800 --> 14:29.000
that are available. When we're talking about SBOM information,

14:29.000 --> 14:34.200
we obviously refer to a specific package, because you have to say: hey,

14:34.200 --> 14:40.840
this is a specific component, a specific version from a specific supplier. But there is a

14:40.920 --> 14:49.320
way to organize your information on a more abstract level. That's what I've got, to save us time.
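
NOTE
The savings figures quoted above can be re-derived from the package and component counts. A small Python check follows; the function name savings is an assumption for this sketch, and the counts are the rounded numbers mentioned in the talk. It assumes the stored information is identical for every package of a component, which is the best case.
```python
def savings(packages: int, components: int) -> float:
    """Fraction of records saved by storing data once per component instead of once per package."""
    return 1 - components / packages
debian = savings(234_000, 54_000)
pypi = savings(13_000_000, 600_000)
print(f"Debian: {debian:.0%}, PyPI: {pypi:.0%}")
```
With these inputs the result rounds to 77% for Debian and 95% for PyPI, matching the figures given in the talk.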

14:49.320 --> 14:57.880
That's all. We have maybe a couple of minutes for questions. Yes, please.

14:57.960 --> 15:05.560
[Audience] But projects can change license between releases, and for me the licensing

15:05.560 --> 15:11.960
information, because of this, is connected to the package and not to the component. [Speaker] Okay, sorry for

15:13.240 --> 15:21.240
cutting you short. Repeating: the observation is that you cannot always abstract everything,

15:21.240 --> 15:27.080
because some information is version-dependent; for example, the package may change

15:27.160 --> 15:33.400
license. This is obviously correct, and that's why, for example, in the OpenSSL example,

15:33.400 --> 15:39.240
the license information is not on the abstract component; there is another node for OpenSSL 3 that

15:39.240 --> 15:48.280
has the information. So the license, you put it there. What we actually do is, when we

15:48.280 --> 15:53.560
collect information, and you always collect the information at the package level,

15:53.560 --> 15:59.400
so you get a new version, you get the information, and if it's the same, then you abstract

15:59.400 --> 16:06.520
it upwards in the tree. And there are always pieces of information like

16:06.520 --> 16:13.080
the software hash ID, where we compute the content hash of everything; it's obviously specific

16:13.080 --> 16:20.920
to packages; there is no way that you can abstract this. Different versions may have different

16:20.920 --> 16:28.040
information, so you have to know where you put it. Sorry, we're running out of

16:28.040 --> 16:36.120
time. Yes? [Audience] Licenses can change, the hash is relevant for the packages, and also

16:36.120 --> 16:42.360
vulnerabilities are relevant for packages. If I'm understanding correctly, you say you could

16:42.360 --> 16:48.120
save time, but we still need to process every package to determine the differences?

16:49.000 --> 16:54.760
[Speaker] Right. The question was: but we still have to process everything.

16:54.760 --> 17:01.000
Right: the saving there is a saving of storage, essentially. In order to find out

17:01.000 --> 17:07.560
whether something applies or not, you have to check it. But, for example,

17:07.560 --> 17:13.000
you have to keep all the copyright holders for everything; instead of keeping them

17:13.000 --> 17:20.280
for 13 million packages, you may keep them only for 600,000 components if they are the same. Okay, sorry, we're really

17:20.280 --> 17:25.720
out of time. Next one. Thank you very much.

