WEBVTT

00:00.000 --> 00:29.920
You can reach me by email or on Mastodon. I'm the one here on the floor, but what I present is

00:29.920 --> 00:35.920
a multi-year effort by several people. So, for example, [names unclear],

00:35.920 --> 00:44.920
Stefano's team and so on. So, who knows Guix? Who knows Software

00:44.920 --> 00:51.920
Heritage? Okay, so for the people who don't know Software Heritage, there was a

00:51.920 --> 00:57.920
talk in the Open Research devroom a couple of years ago, so you can go watch it and you

00:57.920 --> 01:03.920
will have all the details about Software Heritage. You also have, in the Open

01:03.920 --> 01:09.920
Research devroom, a presentation a couple of years ago about Guix, and you will have more

01:09.920 --> 01:19.920
details about that. So today, in short, I present Software Heritage and Guix. So what's the

01:19.920 --> 01:25.920
problem? The problem is that software is dual. As humans we work with source code, but the machine

01:25.920 --> 01:31.920
runs binaries. So we need something that transforms the source code into a binary. So,

01:31.920 --> 01:37.920
okay, we have the hello world program, so everybody can read it and so on, and this program

01:37.920 --> 01:42.920
is transformed by something like a compiler or an interpreter or whatever into a

01:42.920 --> 01:48.920
binary. And the machine runs the binary, but you cannot read the binary as a human.

01:48.920 --> 01:54.920
You cannot know what happens. I mean, or you are some kind of alien. In general we read,

01:54.920 --> 02:00.920
we audit and we verify the source code. So we need two things. We need one thing to

02:00.920 --> 02:05.920
take care of the source code, and this is Software Heritage, and we need something that

02:05.920 --> 02:11.920
takes care of the transformation, and this is Guix. And if we have both, both

02:11.920 --> 02:16.920
Guix and Software Heritage, so we take care of the source code and of the

02:16.920 --> 02:21.920
transformation, we can have long-term transparency: being able to know exactly

02:21.920 --> 02:27.920
what happens from the source code to the binary. So this is the main message.

02:27.920 --> 02:32.920
So a typical scenario, if we are in academia, is: you have Alice, for example.

02:32.920 --> 02:39.920
She publishes a result in 2022. The source code is publicly available and so on, and

02:39.920 --> 02:47.920
the software is identified as version 0.9. Okay, nice. A couple of years later,

02:47.920 --> 02:53.920
for example in 2025, Blake needs to take a look at this paper and the

02:53.920 --> 02:59.920
publication and so on. And now it's really easy to install version 1.2,

02:59.920 --> 03:06.920
but it's much more complicated to install version 0.9. And even if you

03:06.920 --> 03:12.920
succeed in installing this old version 0.9, the results are not the same.

03:12.920 --> 03:18.920
The results are different, and the question is why. And the answer is: because

03:18.920 --> 03:22.920
Blake doesn't control all the sources of variation of the computational

03:22.920 --> 03:27.920
environment. And to control the sources of variation, you need to

03:27.920 --> 03:32.920
answer these questions: what is the source code? What are the tools that you need to

03:32.920 --> 03:37.920
build this source code? What are the tools you need to run the binaries that you built?

03:37.920 --> 03:44.920
And you answer this for each tool among the dependencies. So the questions are not new,

03:44.920 --> 03:51.920
but answering these questions means controlling the sources of variation. So how do

03:51.920 --> 03:55.920
we answer these questions? Because it's not new: we have all the package managers,

03:55.920 --> 04:02.920
APT and so on and so on, and this is where Guix comes... I don't know, sorry.

04:02.920 --> 04:08.920
What does the package manager do? The package manager in fact controls a

04:08.920 --> 04:15.920
graph. In this graph, all the nodes are source code, and all the nodes describe

04:15.920 --> 04:21.920
the build-time tools, how you compile it and so on, and you also have the dependencies.

04:21.920 --> 04:27.920
So, for example, the first package, I think you won't be able to read it, but it's

04:27.920 --> 04:31.920
Harmony, and this package depends on other packages and so on, so that's why you have

04:31.920 --> 04:36.920
the graph. And in fact, if you want to install, if Alice says, I need the

04:36.920 --> 04:43.920
package Harmony, in fact the graph is almost 500 nodes, so it's really big. You

04:43.920 --> 04:50.920
cannot describe that by hand, and the package manager does the job for you. So that's where
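
NOTE A minimal sketch of the graph a package manager has to walk, assuming a toy dependency table.
The package names and edges below are made up for illustration and are not taken from Guix.
It only shows why asking for one package pulls in every node reachable from it.
```python
# Toy dependency graph: package -> direct dependencies (hypothetical names).
GRAPH = {
    "analysis-tool": ["stats-lib", "plot-lib"],
    "plot-lib": ["stats-lib", "compiler"],
    "stats-lib": ["compiler", "zlib"],
    "compiler": [],
    "zlib": [],
}
def closure(package, graph):
    """Return every node reachable from `package`, the package itself included."""
    seen = set()
    stack = [package]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen
print(sorted(closure("analysis-tool", GRAPH)))
```
With a real package collection the same traversal yields hundreds of nodes, which is why nobody writes it by hand.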

04:50.920 --> 04:56.920
Guix comes in. Guix manages this graph. So it's software that has been developed for

04:56.920 --> 05:03.920
more than ten years now, so the community is starting to be nice, and you can

05:03.920 --> 05:08.920
use it as a package manager or as a complete system. And it's based on the same

05:08.920 --> 05:15.920
principles as Nix, so it's a functional software deployment model. And the main

05:15.920 --> 05:22.920
topic, I mean the main point, is that it's transparent and you can verify the whole graph. So

05:22.920 --> 05:27.920
it's really a perfect tool for software deployment in a reproducible research workflow,

05:27.920 --> 05:35.920
especially in an open research context. So, okay, if we go back to the

05:35.920 --> 05:42.920
example, Alice says, okay, I use these tools at this version, at version

05:42.920 --> 05:50.920
0.9, and we need to fix a graph. And if we fix this graph, we can

05:50.920 --> 05:56.920
redeploy it later. But to fix this graph, saying 0.9 is not enough,

05:56.920 --> 06:01.920
because where do you describe the dependencies, the versions of the dependencies,

06:01.920 --> 06:07.920
the compilation options of the dependencies and so on and so on? And in Guix,

06:07.920 --> 06:14.920
in fact, we describe a state. So one state is, for example, eb34ff1.

06:14.920 --> 06:20.920
This state, this revision, fixes the whole collection of packages, all the options,

06:20.920 --> 06:25.920
everything. And if you know this state, this revision, you can really

06:25.920 --> 06:30.920
deploy the exact same graph. When Alice says "I use Guix at this revision",

06:30.920 --> 06:35.920
Blake knows everything and can deploy the exact same environment with the same

06:35.920 --> 06:46.920
dependencies, the same options, everything is the same. So we have a nice command line

06:46.920 --> 06:51.920
where you say "guix time-machine", you specify the revision, and you jump

06:51.920 --> 06:57.920
to this revision to install the tools that you want. So if Alice says, I use these tools

06:57.920 --> 07:03.920
at this specific revision, then Blake, later or whenever, wherever, so on another

07:03.920 --> 07:10.920
machine, at another point in time, Blake is able to run the exact same software

07:10.920 --> 07:16.920
environment with this command line. So this is really cool, but there are assumptions.
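
NOTE A minimal sketch of the command line described here, assuming the usual form of guix time-machine
(a pinned commit, a double dash, then the Guix command to run at that revision). The revision string and
the package name are placeholders, not values from the talk; the code only builds the command and does not run it.
```python
import shlex
def time_machine_command(commit, package):
    """Build the argv for running `guix shell PACKAGE` as of Guix revision `commit`."""
    return ["guix", "time-machine", f"--commit={commit}", "--", "shell", package]
# Placeholder revision and package name, for illustration only.
argv = time_machine_command("REVISION", "hello")
print(shlex.join(argv))
```
Publishing the revision next to the results is what lets someone else rebuild the same environment later.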

07:16.920 --> 07:22.920
There are two assumptions. The first one, which is the second on the slide, is that all the

07:22.920 --> 07:28.920
builds must be deterministic. This is not always the case, and thanks to the

07:28.920 --> 07:35.920
Reproducible Builds effort, the progress has been incredible. So we are,

07:35.920 --> 07:42.920
it's almost... we are doing a really good job in this area. And the other point

07:42.920 --> 07:48.920
is that all the software needs to be publicly available. So it means that we need

07:48.920 --> 07:56.920
almost 500 pieces of source code to be able to redeploy later. So this is the issue, in fact.

07:56.920 --> 08:04.920
So if, for example, now, more or less, you take a look at the source code from 2022,

08:04.920 --> 08:10.920
so you say, okay, now I take a look at the location of the source code

08:10.920 --> 08:16.920
that I registered, where I know that the source code was on this server two or three

08:16.920 --> 08:23.920
years ago. And if you do that, you see that almost 4% are missing.

08:23.920 --> 08:31.920
So in 2022 there was source code at some place, and three years later

08:31.920 --> 08:38.920
this source code has disappeared. So this is link rot. And so, okay, it's an issue,

08:38.920 --> 08:48.920
because if, in this 3.6%, we have something that is missing from the original

08:48.920 --> 08:55.920
URL, you cannot build it, so you are done. But it's worse than that, because this package

08:55.920 --> 09:02.920
can have dependents, I mean, other packages can depend on this package. So for example,

09:02.920 --> 09:09.920
OpenJDK is a package that has a lot of packages depending on it. So if you lose

09:09.920 --> 09:17.920
OpenJDK, you don't lose only OpenJDK, you lose the 184 packages which depend on OpenJDK.
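
NOTE A minimal sketch of how one missing source propagates to dependents, assuming a toy table of
direct dependencies. The names and edges are illustrative only; the figure of 184 dependents quoted in the
talk comes from the real Guix graph, not from this example.
```python
# Toy data: package -> direct dependencies (hypothetical names and edges).
DEPENDS_ON = {
    "openjdk": [],
    "java-lib": ["openjdk"],
    "java-app": ["java-lib"],
    "zlib": [],
    "c-tool": ["zlib"],
}
def lost_packages(missing, depends_on):
    """Return all packages that can no longer be built once the source of `missing` is gone."""
    lost = {missing}
    changed = True
    while changed:
        changed = False
        for package, deps in depends_on.items():
            if package not in lost and any(dep in lost for dep in deps):
                lost.add(package)
                changed = True
    return lost
print(sorted(lost_packages("openjdk", DEPENDS_ON)))
```
The loss spreads upward through the graph, so the share of unbuildable packages is larger than the share of missing sources.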

09:17.920 --> 09:28.920
So it's more than the 3.6%. And if you try this on older, older

09:28.920 --> 09:34.920
and older revisions, it becomes worse and worse. So, what we are doing currently in science:

09:34.920 --> 09:46.920
for example, I mean, 2019 is not so old, but you see you have almost 10% that is just lost.

09:46.920 --> 09:53.920
So science currently is building on sand, and that's why we need Software Heritage.

09:53.920 --> 09:59.920
Because we have this problem, link rot: projects are created by a postdoc or

09:59.920 --> 10:06.920
[unclear] and so on, and the source code just disappears from the internet. Or worse, you have

10:06.920 --> 10:11.920
some kind of big company that just closes the service, for example Gitorious, Google Code,

10:11.920 --> 10:17.920
Bitbucket and so on. So when you have a website that disappears, where do you go? You go to the

10:17.920 --> 10:23.920
Internet Archive, for example. But where do you go if you have a linked repository that disappears?

10:23.920 --> 10:28.920
Where do you go? You go to Software Heritage. So Software Heritage collects, preserves

10:28.920 --> 10:34.920
and shares all the publicly available source code, and currently, to my knowledge, it is

10:34.920 --> 10:41.920
the largest publicly available archive of source code. So they ingest a lot, a lot,

10:41.920 --> 10:49.920
a lot of stuff. Cool. So you ask me: cool, but you just said that services fail, so why

10:49.920 --> 10:57.920
would Software Heritage not fail? I mean, it's the same story. And I don't know, maybe they

10:57.920 --> 11:03.920
might, but the guarantee is that it's a good project, and a strong

11:03.920 --> 11:09.920
project, because it's international and non-profit. So for example you have UNESCO,

11:09.920 --> 11:14.920
you have a lot of companies that share the vision, that put in money, that sponsor, and

11:15.920 --> 11:21.920
so it's widely supported. Cool. So what is good with Software Heritage? With Software Heritage

11:21.920 --> 11:29.920
you have this identifier. With this identifier, you don't identify, for example, by a label,

11:29.920 --> 11:35.920
but you identify by the content. So it's content-addressed, like a digest, a checksum.

11:35.920 --> 11:41.920
And when you know that, you know exactly how to refer to the object that you want. So for example

11:41.920 --> 11:47.920
this is a file, but you can also use the same system for snapshots, releases, revisions,

11:47.920 --> 11:53.920
directories, contents and so on and so on. So here you have the address where you know exactly

11:53.920 --> 12:03.920
the revision, the file, how it was when it was archived. And this is an

12:04.920 --> 12:08.920
emerging standard, so you have a specification and so on. You can go to

12:08.920 --> 12:16.920
swhid.org and you have information. So, okay, the point is we have a content
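
NOTE A minimal sketch of a content-addressed identifier, assuming the SWHID form for file contents
(prefix swh:1:cnt: followed by a SHA-1 computed the way Git hashes a blob). The function name is
hypothetical; identifiers for directories, revisions, releases and snapshots use other object types and are not covered here.
```python
import hashlib
def swhid_for_content(data):
    """Compute the SWHID of a file content: a SHA-1 over a Git-style blob header plus the bytes."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
# The identifier depends only on the bytes, never on a file name or a URL.
print(swhid_for_content(b""))
```
Two people holding the same bytes compute the same identifier, which is what makes the reference independent of where the file is hosted.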

12:16.920 --> 12:21.920
addressed server. So this is the first, I mean, this is the point with Software Heritage: we have a long

12:21.920 --> 12:28.920
term content-addressed server. So now back to Guix. In Guix, for the origin, so to fetch

12:28.920 --> 12:35.920
the source code, we have different methods: tarball, Git, Mercurial and so on and so on, and

12:35.920 --> 12:41.920
we have a location, and we ensure that the content is exactly what we expect to have,

12:41.920 --> 12:48.920
thanks to the checksum, the hash. So source code in Guix is also essentially content

12:48.920 --> 12:57.920
addressed. So if the location is no longer available, or there is

12:57.920 --> 13:02.920
something wrong, in fact, what can you do? Blake can work around it

13:02.920 --> 13:08.920
if a copy is available elsewhere. So for example, elsewhere: if there is a new URL, you

13:08.920 --> 13:15.920
can use "guix download", or you can use a content-addressed server. So the Guix

13:15.920 --> 13:21.920
project or the Nix project have such servers for these files, and also the Software

13:22.920 --> 13:31.920
Heritage initiative. So this is the architecture. In Guix, we save
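
NOTE A minimal sketch of the fallback idea, assuming in-memory stand-ins for the upstream URL, a mirror
and an archive. The function and source names are hypothetical and this is not Guix's actual download code;
it only shows that a known checksum lets you accept a copy from anywhere.
```python
import hashlib
def fetch_verified(expected_sha256, sources):
    """Try each source in order; return the first content whose SHA-256 matches."""
    for name, fetch in sources:
        try:
            data = fetch()
        except OSError:
            continue  # source unreachable, try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return name, data
    raise LookupError("no source provided the expected content")
def upstream_gone():
    raise OSError("404: upstream URL has disappeared")
CONTENT = b"source tarball bytes"
SOURCES = [
    ("upstream", upstream_gone),
    ("tampered-mirror", lambda: b"something else"),
    ("archive", lambda: CONTENT),
]
print(fetch_verified(hashlib.sha256(CONTENT).hexdigest(), SOURCES)[0])
```
Because the check is on the content, the location of the copy does not need to be trusted.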

13:31.920 --> 13:39.920
software in Software Heritage in two ways. The first one is: we list all the

13:39.920 --> 13:44.920
source code we have in Guix, all the packages we have in Guix, and we feed Software

13:44.920 --> 13:49.920
Heritage with that. And a user can also say, okay, I have this package of my own and I want to

13:49.920 --> 13:55.920
save it to Software Heritage, so you send a request with this command: "guix

13:55.920 --> 14:02.920
lint -c archival". Okay. And when the upstream disappears, in fact the user

14:02.920 --> 14:09.920
sends a query to Software Heritage and gets the content back,

14:09.920 --> 14:13.920
you use it through the Vault, and the user has the data and can build and so on. So what's the

14:13.920 --> 14:20.920
point with this Disarchive part? The point is that, in fact, the

14:20.920 --> 14:28.920
tarball is a really complicated problem. Because, for example, in Guix, for example

14:28.920 --> 14:38.920
with the software Harmony, we have this hash, 1-D if 1-D f7, but if you

14:38.920 --> 14:43.920
decompress and recompress with one level of compression or you decompress and

14:43.920 --> 14:47.920
recompress with another level of compression, you get different hashes. So this is

14:47.920 --> 14:53.920
how all content-addressed systems work. So in fact you need a way to extract
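
NOTE A minimal sketch of the problem, assuming gzip as the compressor and made-up data in place of a real
tarball. It compresses the same bytes at two levels and compares the hashes of the compressed files
with the decompressed contents.
```python
import gzip
import hashlib
# Made-up payload standing in for a tar stream.
DATA = b"".join(b"line %d of some source file\n" % i for i in range(2000))
# mtime=0 keeps the gzip header timestamp fixed, so only the level differs.
fast = gzip.compress(DATA, compresslevel=1, mtime=0)
best = gzip.compress(DATA, compresslevel=9, mtime=0)
same_hash = hashlib.sha256(fast).hexdigest() == hashlib.sha256(best).hexdigest()
same_content = gzip.decompress(fast) == gzip.decompress(best)
print("compressed hashes equal:", same_hash)
print("decompressed contents equal:", same_content)
```
The content is identical but the compressed files are not, so a checksum recorded over the compressed tarball cannot be matched without knowing how it was compressed.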

14:53.920 --> 14:59.920
some information, like the level of compression and so on. And so,

14:59.920 --> 15:04.920
this is a well-known one, it's the xkcd "Standards" problem, you know: you

15:04.920 --> 15:13.920
have 14 standards, and in the end you have 15. So this is what Disarchive fixes.

15:13.920 --> 15:18.920
Disarchive creates this bridge between all the different standards and extracts the

15:18.920 --> 15:26.920
meta-information that you need to recreate the exact same tarball later.

15:27.920 --> 15:32.920
So in fact, what do we do? We have a tarball. We disassemble

15:32.920 --> 15:36.920
the tarball, we extract the metadata, for example the compression level, the

15:36.920 --> 15:41.920
timestamps of the files and so on, and this goes to the Disarchive database,

15:41.920 --> 15:46.920
and the content itself, the pure content, is archived in Software

15:46.920 --> 15:51.920
Heritage. Then when we need it, we ask Software Heritage to give us the

15:52.920 --> 15:57.920
content, we ask Disarchive to give us the metadata, and we rebuild the exact same

15:57.920 --> 16:06.920
compressed tarball. Okay, so this is what I just said. So does it work? Yes,
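
NOTE A toy sketch of the disassemble and reassemble round trip, assuming gzip only and a metadata record
made of a compression level plus a content hash. The function names and the metadata layout are
hypothetical; Disarchive's real format and algorithm are richer and are not reproduced here.
```python
import gzip
import hashlib
def disassemble(compressed):
    """Split a gzip file into (metadata, content); metadata records how to recompress it."""
    content = gzip.decompress(compressed)
    for level in range(1, 10):
        if gzip.compress(content, compresslevel=level, mtime=0) == compressed:
            return {"level": level, "sha256": hashlib.sha256(content).hexdigest()}, content
    raise ValueError("could not find parameters that reproduce this file")
def reassemble(metadata, content):
    """Rebuild the compressed file from archived content plus the stored metadata."""
    if hashlib.sha256(content).hexdigest() != metadata["sha256"]:
        raise ValueError("content does not match the metadata")
    return gzip.compress(content, compresslevel=metadata["level"], mtime=0)
# Stand-in for an upstream tarball; the payload is made up.
ORIGINAL = gzip.compress(b"pretend this is a tar stream\n" * 500, compresslevel=6, mtime=0)
META, CONTENT = disassemble(ORIGINAL)
print(reassemble(META, CONTENT) == ORIGINAL)
```
The small metadata record and the raw content can be stored in different places and still yield a byte-identical file, so the original checksum keeps matching.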

16:06.920 --> 16:14.920
more or less, or it's improving. So over the years you see that we are improving

16:15.920 --> 16:23.920
the situation. So "undetermined" means we don't know: maybe it's stored

16:23.920 --> 16:28.920
and we are able to use it, or maybe not; it's just that we are not able to

16:28.920 --> 16:36.920
determine yet if it works or not. So we are working to fix that. But the

16:36.920 --> 16:42.920
interesting part is that everything is decreasing; we are more or less improving

16:42.920 --> 16:55.920
the situation. So the idea is to say, okay, I have source code from 2019 and I

16:55.920 --> 17:02.920
want to install the exact same version, and with this command line, "guix

17:02.920 --> 17:07.920
time-machine --commit" blah blah blah, you can recreate the exact same

17:07.920 --> 17:12.920
version of the computational environment, whatever happens to

17:12.920 --> 17:17.920
GitHub, GitLab, whatever, blah blah blah. And this, I mean, we are improving the

17:17.920 --> 17:22.920
situation and we are fixing, I mean, yeah, what I said, what is not working

17:22.920 --> 17:29.920
currently. So, conclusion: cite and reference source code using Software

17:30.920 --> 17:38.920
Heritage identifiers and, yes, use Guix. And yeah, thank you.

17:47.920 --> 17:49.920
Questions? Yeah, questions.

17:49.920 --> 17:58.920
yeah

18:19.920 --> 18:25.160
So, I repeat the question. The question is: when we want to

18:25.160 --> 18:29.360
reproduce something, does the hardware have an impact on the reproducibility of the

18:29.360 --> 18:38.800
artifact? So, yes. And, I mean, the answer is: I don't know. So, my

18:38.800 --> 18:43.240
opinion is: do we want to rebuild everything? This is the first

18:43.240 --> 18:50.440
question. And do we want to pay the cost in, for example, energy, carbon

18:50.440 --> 18:54.040
emissions and so on? Do we want to rebuild everything? So that's the first question. And if

18:54.040 --> 18:58.640
we don't want to rebuild everything, what would we need to be able to

18:58.640 --> 19:03.320
verify that everything is correct, in fact? And this is where Guix, I mean,

19:03.320 --> 19:09.720
provides a really good answer, because you can trace all the dependencies,

19:09.720 --> 19:19.400
everything, and check everything. So my answer is: for the binary reproducibility, I

19:19.400 --> 19:25.520
don't know, and this is something to do. And on the other hand, I think that we also

19:25.520 --> 19:34.240
need to improve the verification part, by being able to audit without rebuilding everything.

19:34.240 --> 19:42.800
Does that answer the question? Yeah. Yeah, because it's a huge amount of data,

19:42.800 --> 19:49.040
where is it stored, and have you been in contact with national libraries, because it's a

19:49.040 --> 19:57.360
storage problem? So, Software Heritage: the main storage is in France.

19:57.360 --> 20:05.360
There is a mirror in Italy, and probably some other mirrors will pop up soon all

20:05.360 --> 20:11.640
around the world. So that's the answer for the first part. And the other one is: what about the

20:11.640 --> 20:20.880
relationship with national libraries and so forth? So, Software Heritage, there is

20:20.880 --> 20:27.100
the agreement with UNESCO, for example. So they try to take care of all this,

20:27.100 --> 20:31.380
I mean, national libraries and so on, archivists.

20:31.380 --> 20:38.380
All these people doing this kind of job, trying to take care

20:38.380 --> 20:43.020
of the long-term artefacts of human knowledge,

20:43.020 --> 20:46.060
the human knowledge, they try

20:46.060 --> 20:49.860
to be in touch with everybody.

20:49.860 --> 20:52.220
So I don't know exactly for these countries,

20:52.220 --> 20:53.660
country by country, but for example,

20:53.660 --> 20:58.660
in France, there are connections.

20:58.660 --> 21:00.660
Yeah, Nikola?

21:00.660 --> 21:01.660
Just one comment.

21:01.660 --> 21:04.660
Software Heritage has a booth tomorrow, and yes.

21:04.660 --> 21:07.660
Or if you have more questions about Software Heritage,

21:07.660 --> 21:08.660
come and see us tomorrow.

21:08.660 --> 21:10.660
Excellent.

21:10.660 --> 21:12.660
Is there another question?

21:12.660 --> 21:14.660
Yeah.

21:14.660 --> 21:15.660
Thanks for your talk.

21:15.660 --> 21:16.660
I have a question.

21:16.660 --> 21:20.660
If someone is packaging for Guix, as a packager,

21:20.660 --> 21:23.660
how do you make the link between the original source of the package

21:23.660 --> 21:26.660
and the Software Heritage identifier?

21:26.660 --> 21:30.660
Is it made automatically, or do you have to specify this identifier

21:30.660 --> 21:33.660
in a field of the package?

21:33.660 --> 21:34.660
No.

21:34.660 --> 21:35.660
So.

21:35.660 --> 21:36.660
Ah, yeah.

21:36.660 --> 21:42.660
So, how do we connect the Guix package definition

21:42.660 --> 21:46.660
with the location in the Guix package and Software Heritage?

21:46.660 --> 21:49.660
So this is what I skipped a bit because I was a bit late,

21:49.660 --> 21:52.660
but in Guix, you have the checksum, to say,

21:52.660 --> 21:56.660
okay, you have the checksum.

21:56.660 --> 22:02.660
And in fact, we send this checksum to Software Heritage

22:02.660 --> 22:05.660
as a request, and Software Heritage

22:05.660 --> 22:08.660
says, okay, we know this checksum, and this checksum

22:08.660 --> 22:14.660
and so on matches our identifier.
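
NOTE A minimal sketch of such a lookup request, assuming the Software Heritage web API exposes contents
under an api/1/content/ path keyed by hash type and value. The helper name is hypothetical, the code only
builds the URL without contacting the archive, and the exact lookup Guix performs may differ.
```python
import hashlib
API_ROOT = "https://archive.softwareheritage.org/api/1"
def content_lookup_url(sha256_hex):
    """Build a Software Heritage API URL that looks up a file content by its SHA-256."""
    return API_ROOT + "/content/sha256:" + sha256_hex + "/"
# Made-up bytes standing in for a source file known to the package definition.
checksum = hashlib.sha256(b"some source file\n").hexdigest()
print(content_lookup_url(checksum))
```
The package definition only needs to carry the checksum; the archive maps it to its own identifier.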

22:14.660 --> 22:17.660
There is another question.

22:18.660 --> 22:21.660
I was wondering because, as I understood,

22:21.660 --> 22:24.660
the Disarchive database is a bit special.

22:24.660 --> 22:27.660
Is it also saved in Software Heritage?

22:27.660 --> 22:28.660
Yeah.

22:28.660 --> 22:31.660
Good. Good question.

22:31.660 --> 22:34.660
So the question is,

22:34.660 --> 22:37.660
okay, your story is really nice,

22:37.660 --> 22:39.660
but what about this Disarchive in the picture?

22:39.660 --> 22:42.660
Is it in Software Heritage or not and so on?

22:42.660 --> 22:44.660
So the answer is not yet,

22:44.660 --> 22:49.660
but there is work in progress to try to have a mechanism

22:49.660 --> 22:55.660
to be able to rescue this database.

22:55.660 --> 22:56.660
Yeah.

22:56.660 --> 22:58.660
This is a work in progress.

22:58.660 --> 23:02.660
And that's the goal: we are discussing with the Software Heritage team

23:02.660 --> 23:05.660
to see what can be done in this direction.

23:05.660 --> 23:06.660
Next, thank you.

23:06.660 --> 23:07.660
Thank you.

23:07.660 --> 23:09.660
Thank you.

