WEBVTT

00:00.000 --> 00:12.800
Okay, let's start. Thanks for coming to my talk. I really appreciate that. So, my name

00:12.800 --> 00:20.560
is Jean-Marie Verdun. I'm French. You can probably hear that from my very strong accent. I do

00:20.560 --> 00:26.080
currently live in the US. I work for HPE within the advanced engineering group. The role

00:26.080 --> 00:30.720
of my group is to work on the technology we might be needing within the next three to five

00:30.720 --> 00:36.800
years to integrate into our product. So, I'm part of the team who is designing the next generation

00:36.800 --> 00:42.320
of the ProLiant servers. How familiar are you with ProLiant? Do you know at least the name? So,

00:42.320 --> 00:50.080
this is our x86 server brand. Okay. I live in the San Francisco Bay Area. So, my apologies

00:50.080 --> 00:54.800
if I look a little bit blurry or not awake. It's currently at your clock in the morning to my

00:54.800 --> 01:03.200
body. I wouldn't say it's odd. My brain doesn't really know why it is. So, I'm involved in

01:03.200 --> 01:08.160
firmware development and I'm also involved in supporting new chip architectures. So, I was part

01:08.160 --> 01:14.960
of the team which designed Arm-based systems within HPE. So, we have in our product catalogue

01:14.960 --> 01:21.040
multiple chip architectures from Intel and AMD, but also Ampere and NVIDIA. And Ampere and

01:21.040 --> 01:26.240
NVIDIA are pretty new to the catalogue. And we faced a couple of challenges when we tried to

01:26.240 --> 01:32.560
port our firmware stack to these architectures, which triggered some research activity within

01:32.560 --> 01:38.880
the company and within engineering. So, the common issue that we faced is we have a common

01:38.880 --> 01:45.680
firmware stack between our Intel and AMD platforms. And we wanted to offer the exact same firmware

01:45.840 --> 01:54.160
stack to our customers for Arm. And due to the very strong legacy included in this

01:54.160 --> 02:01.440
firmware stack, the port has been pretty complex and difficult to fulfil. So, what we think is

02:02.320 --> 02:09.040
we are currently struggling because the firmware technologies are currently aging. And all the

02:09.040 --> 02:15.120
concepts of the firmware, at least within servers, are pretty old. So, I don't know if you are

02:15.120 --> 02:22.080
familiar with UEFI, but UEFI is currently about 25 years old. So, which in the tech industry

02:22.080 --> 02:28.240
makes it very old. And it was designed at a time when systems were very much less

02:28.320 --> 02:39.360
complicated than today. And security threats were also much smaller. And this has been a very

02:39.360 --> 02:43.680
good technology during the past 20 years, but we think that we need to work on something else

02:43.680 --> 02:50.800
for the next stage. The other thing is the BMC side. Are you familiar with baseboard management

02:50.800 --> 02:57.360
controllers? So, raise your hand, those who know at least what it is. So, the BMC is in

02:57.360 --> 03:04.080
charge of managing the server health. So, this is roughly a system within a system. So, that is the

03:04.080 --> 03:10.320
first thing which starts when you put power into a server. It will provide remote management

03:10.320 --> 03:15.120
capability like turning on and off the machines, but it will also be monitoring the system

03:15.120 --> 03:23.760
status just to cool down the machines by spinning up the fans and recording any issues or any

03:23.760 --> 03:29.360
hardware errors which might be happening on the host side, which could be, for example, some

03:29.360 --> 03:35.680
DIMM errors, which happen pretty frequently in some cases. And this is used to debug the systems.

03:35.680 --> 03:41.360
So, that BMC is running a firmware. And this firmware is getting closer and closer to a

03:41.360 --> 03:48.480
full operating system. So, which is mainly based currently on Linux. And that's an embedded Linux

03:48.560 --> 03:54.400
distribution. It's getting bloated too; it's getting harder and harder to maintain because of the

03:54.400 --> 04:01.120
complexity of the software stack. And we think that there might be some challenges to address in the future.

04:01.920 --> 04:07.680
So, in parallel, we envision that there will be a need for more and more chip architectures

04:07.680 --> 04:15.680
with the upcoming AI requirements. And also the innovation around RISC-V, Arm, x86,

04:16.640 --> 04:24.080
GPU computing, FPGA: all of these things need to be integrated within servers. And we started to

04:24.080 --> 04:30.800
look at how to do that at the hardware level. So, during the past 20 years, we designed systems

04:30.800 --> 04:35.520
through a monolithic approach, which means that we had an engineering team which was focused

04:35.520 --> 04:41.920
on designing a specific product and delivering it to the market. So, this ended up being a massive

04:41.920 --> 04:47.360
motherboard where you have all the features on that massive motherboard. So, that doesn't scale

04:47.360 --> 04:53.760
in a world where you have to support multiple chip architectures. And we created within the

04:53.760 --> 05:01.280
Open Compute Project Foundation a program just to build modular servers. So, when we speak about

05:01.280 --> 05:06.480
modular servers, there are a couple of challenges to discuss with the other industry partners,

05:06.560 --> 05:13.840
which means how to interconnect all the pieces together and avoid incompatibilities.

05:14.400 --> 05:19.280
So, this is not uncommon in the IT industry. So, we are used to doing that through the PCI Express

05:19.280 --> 05:24.400
consortium. For example, if you plug in a PCI Express card within an HP machine or

05:24.400 --> 05:29.920
third systems or any other vendors, it works. So, there's no miracle behind that. There has been a

05:29.920 --> 05:35.120
lot of work between all the vendors and all the system architects just to make it work and

05:35.120 --> 05:41.040
ensure that everything is working smoothly. So, what we are foreseeing is that the

05:41.040 --> 05:46.960
future is going to be modular servers. Whatever happens. And when we think about modular servers,

05:46.960 --> 05:51.680
it means that the combinations of systems are going to be way more complicated than within

05:51.680 --> 05:56.960
the non-modular, monolithic world. It means that the firmware stack will first have to discover

05:56.960 --> 06:03.360
what's inside the machine. So, right now when we start a firmware on the servers, it knows exactly

06:03.440 --> 06:08.640
what to expect in the machine. So, within the next couple of years, the firmware will have to

06:08.640 --> 06:15.760
discover what's inside the systems and try to initialize it properly without any issues.

06:16.320 --> 06:22.000
So, that approach of modular machines is not compatible with the current firmware technology that we have

06:22.000 --> 06:28.160
and we're currently reviewing what needs to be done. So, on the left-hand side, you have

06:28.160 --> 06:34.480
all the current behavior of the firmware stack, which is static, designed for monolithic machines,

06:35.280 --> 06:44.320
and it's also coming with another pain, mainly the update processes of devices. So, which means

06:44.320 --> 06:52.560
that you have different update processes for an NVMe drive, for a graphics card, a CPU, or a BMC.

06:53.520 --> 06:59.760
So, that also doesn't scale in a world where you have modularity. So, what we think is,

06:59.760 --> 07:06.560
we would be needing to put in place some front points where the hardware is going to get the firmware

07:06.560 --> 07:12.320
from and the firmware is going to become more and more volatile. So, we believe that it is not going

07:12.320 --> 07:17.840
to stay in the machine anymore; it is going to be somewhere on the network, and the only thing

07:17.920 --> 07:22.800
that the machine is able to do is to boot from the network and retrieve its firmware from the

07:22.800 --> 07:30.080
network. And based on that, we might be able to have a highly adaptive firmware stack,

07:30.640 --> 07:37.360
which means that we've got a big database of firmware, which is sitting somewhere in the data

07:37.360 --> 07:41.920
centers or on the internet, and the systems are going to connect to that database.

07:42.960 --> 07:47.120
The idea is not to become proprietary around firmware distribution or whatever. So,

07:47.200 --> 07:52.480
all of this work is done in an open source program and it is performed within the

07:52.480 --> 08:01.280
Open Compute Project Foundation. So, it is fully public. So, what we are proposing for the next generation

08:01.280 --> 08:09.760
of systems and upcoming platforms is to cut the firmware into separate parts. So, the left-hand

08:09.760 --> 08:16.400
side is something that, if you are familiar with the ROM, is like the PEI stage. So, roughly we start

08:16.400 --> 08:21.840
the low-level hardware and put it into a stage where it is going to boot from a networking stack.

08:22.480 --> 08:27.760
So, you are going to tell me, but there is no networking stack on NVMe drives.

08:28.720 --> 08:33.840
So, we are working at the hardware level to add some Ethernet interfaces to NVMe drives,

08:33.840 --> 08:39.360
image, which is signed and validated through the security stack of the U-Boot bootloader.

08:39.520 --> 08:48.080
And then launched from memory; it is retrieved through the network by using the TFTP protocol

08:48.080 --> 08:54.480
or the HTTP protocol. And during the boot process of that specific Linux kernel image,

08:54.480 --> 08:59.920
we are attaching a block storage device which is coming from the network by using the iSCSI

09:00.000 --> 09:09.040
protocol. So, everything is loaded from the network and we are just breaking the storage barriers

09:09.040 --> 09:13.920
and the firmware is totally volatile. So, if you unplug the machine from the data center and you

09:13.920 --> 09:18.560
try to plug it somewhere else, it won't be able to boot as long as you do not have the

09:18.560 --> 09:24.720
right-hand stack. So, this is also increasing the security if you are handing over a system

09:24.720 --> 09:31.600
from one user to different end users. And we are also able through that process to integrate

09:31.600 --> 09:36.480
some package manager at the firmware level. So, if you have to run a firmware update,

09:36.480 --> 09:41.520
you don't need to reflash the whole machine and you are not going to break the systems

09:41.520 --> 09:47.840
because you are updating the firmware remotely in a storage area which is in read-write

09:48.720 --> 09:56.720
operation mode. So, we are working, as I said, on implementing that technology at the NVMe

09:56.720 --> 10:03.200
level. So, what we do is we still have a boot block within every NVMe device.

10:04.240 --> 10:09.200
I don't know if you are familiar with NVMe device architecture, but these are really amazing devices.

10:09.920 --> 10:18.320
So, data center grade NVMe devices currently come with a multi-core Arm 64-bit chip

10:18.320 --> 10:27.680
in the controller. Most of them come with 4 to 16GB of RAM. So, roughly you can run

10:27.680 --> 10:33.120
Linux locally on every NVMe device that you are running within your data center. It's really

10:33.120 --> 10:37.680
amazing when you think about it. But this is also coming with all the complexity.

10:38.400 --> 10:45.120
So, that is why we think that we need to be able to keep this firmware up to date. It's also for

10:45.120 --> 10:50.320
security purposes because the firmware which is running into the NVMe device is becoming more

10:50.320 --> 10:56.080
and more complex. And what we have done is reusing the process done for the

10:56.080 --> 11:03.440
BMC. So, this is exactly the same complex system which is replicated into different devices.

11:03.680 --> 11:10.080
And we use the BMC as a software firmware database. So, we boot the BMC from the network.

11:10.080 --> 11:18.160
The drive is starting up and is issuing a DHCP request to the BMC. And it's getting the

11:18.160 --> 11:25.760
firmware image associated with that drive into its own memory. And then you can start the device.

11:26.320 --> 11:32.000
So, roughly that whole process will ensure that every time you reboot the host, it will

11:32.000 --> 11:38.320
gather the latest and greatest firmware targeting that specific device. And we can apply that

11:38.320 --> 11:45.040
boot process to all the PCIe devices that we have known up to now. So, one of the key

11:45.040 --> 11:51.200
advantages is that during the discovery process, the clients are just going to issue DHCP requests

11:51.840 --> 11:58.400
with an incoming device descriptor. And that device descriptor is used to retrieve the right

11:58.400 --> 12:03.920
firmware image for the target. And this is giving us the opportunity to be way more modular

12:03.920 --> 12:10.400
at the firmware level. So, we are not pushing the firmware. We are just delivering firmware based

12:10.400 --> 12:21.520
on incoming requests from the systems. So, I have a small demo. I hope it works. So, we have

12:21.520 --> 12:26.720
what we call a firmware database manager, which is roughly a top-of-rack switch, which has

12:26.720 --> 12:33.680
been modified to deliver firmware content to servers. And I have a server which is

12:33.680 --> 12:40.320
based in Houston that I'm currently turning on. I hope it is going to turn on. And you should

12:40.320 --> 12:49.840
be able to see on that screen. I don't know why it's not full screen. Yeah, it should be full screen.

12:49.920 --> 12:56.000
So, on the left-hand side, you see the, why don't you see the screen? Because I need to

12:56.640 --> 13:02.800
exit the presentation mode. So, on the left-hand side, you see the BMC, which is starting up.

13:03.600 --> 13:10.480
So, right now, it has retrieved an IP address from the switch. And it is receiving the firmware image

13:10.480 --> 13:17.600
from the switch through the TFTP protocol. So, as soon as the firmware has been retrieved, it is validating

13:17.600 --> 13:24.960
the FIT image, which has been encrypted and signed. So, the private key has been validated.

13:24.960 --> 13:31.520
And that's a chain of trust from the silicon root of trust. The initramfs is also signed. So,

13:31.520 --> 13:38.960
we are validating the initramfs image, which contains the iSCSI mount protocol, which is used

13:38.960 --> 13:46.800
to mount the root file system from the BMC. All of that is stateless in some way. So, if we turn

13:46.800 --> 13:52.720
off the power, the firmware is no longer inside the machine. So, the only thing which sits inside

13:52.720 --> 14:01.120
the platform is the bootloader, which is U-Boot, which can be held in it through the, um, the, the,

14:01.120 --> 14:07.680
the internet stack. So, the Linux kernel is booting and we will be seeing that it is issuing

14:07.680 --> 14:15.200
the DHCP request. So, this is just before mounting the root file system. And it is going to

14:15.200 --> 14:20.160
mount the root file system just after it has received the IP address. So, this is what we see there.

14:20.160 --> 14:25.040
There is an iSCSI mount, which has been issued for a device which is about 5.2 megabytes.

14:25.040 --> 14:31.600
So, this is also coming with one core advantage. So, there is no space limitation. So, you can

14:31.600 --> 14:37.600
envision having a firmware with 10 gigabytes of storage or hundreds of gigabytes of storage. You are

14:37.600 --> 14:45.440
going to tell me, but why do we need that? Um, I would, I would tell you, we don't know yet

14:45.440 --> 14:52.080
why we may be needing that. But what we know is that the SPI NOR is currently

14:52.080 --> 14:58.080
a pain in the ass, because we need to use compressed file systems just to fit on the 32

14:58.080 --> 15:05.440
meg or 64 meg or 120 meg devices all the software that we need to turn on complex devices.

15:05.680 --> 15:16.320
So, with this kind of approach, we can definitely remove that limitation. So, you can have

15:16.320 --> 15:21.040
one gig, 10 gig or even smaller if needed, but there is no space limitation.

15:22.160 --> 15:28.480
Let's wait for the boot to finish. And I'm currently done with the presentation.

15:29.440 --> 15:35.920
I don't know if you have any questions about this concept. So, let's start with the

15:35.920 --> 15:38.480
questions. Who wants to start? I'll see you too.

15:38.800 --> 15:41.440
OK.

15:48.080 --> 15:48.160
Yes.

16:08.480 --> 16:15.480
I would say that's the same thing when dealing with security.

16:15.480 --> 16:18.480
If something goes wrong anywhere, you're just screwed.

16:18.480 --> 16:24.480
So we can secure, I excuse me.

16:24.480 --> 16:28.480
If something goes wrong with the TFTP servers,

16:28.480 --> 16:31.480
what would happen from the security standpoint.

16:31.480 --> 16:33.480
So we are using TFTP today,

16:33.480 --> 16:37.480
the main goal is to use HTTPS as the protocol.

16:38.480 --> 16:41.480
With encryption, and our next generation of BMC

16:41.480 --> 16:45.480
is currently including what we call the secure

16:45.480 --> 16:48.480
enclave, where we can securely store

16:48.480 --> 16:53.480
some certificates just to initiate the link.

16:53.480 --> 16:58.480
And what we envision is that we will be using this kind of approach on any device.

17:02.480 --> 17:05.480
The gentleman at the back, he was the first one to raise his hand.

17:06.480 --> 17:12.480
You understood that correctly, yes.

17:23.480 --> 17:27.480
Yeah, it needs the bootloader of the firmware.

17:27.480 --> 17:32.480
But when we look at the firmware complexity of a server,

17:32.480 --> 17:38.480
the bootloader is nothing compared to everything else around that firmware stack.

17:38.480 --> 17:43.480
So in each device, what we are working on is having a bootloader,

17:43.480 --> 17:47.480
which is just turning on the basic hardware,

17:47.480 --> 17:50.480
and issuing the TFTP or HTTPS request.

17:50.480 --> 17:55.480
So what we are trying to do is to get a very minimalistic firmware

17:55.480 --> 17:58.480
built into the device,

17:58.480 --> 18:01.480
and get all the complex software stack from the outside.

18:21.480 --> 18:25.480
Okay, so the question is about the PCI Express specification,

18:25.480 --> 18:30.480
which sets a minimum time just to turn on the devices.

18:30.480 --> 18:33.480
I would say, what we do right now,

18:33.480 --> 18:35.480
being part of advanced engineering,

18:35.480 --> 18:37.480
is trying to prove that it could work,

18:37.480 --> 18:40.480
and if it works, specifications are made to evolve.

18:40.480 --> 18:43.480
So, and if it makes sense to everybody,

18:43.480 --> 18:46.480
we just get back to the PCI Express specification,

18:46.480 --> 18:48.480
and then let's make that evolve.

18:48.480 --> 18:52.480
And right now, we have all the technology at the BMC level

18:52.480 --> 18:56.480
to keep in reset state the CPU complex

18:56.480 --> 18:58.480
and the PCI Express root complex,

18:58.480 --> 19:00.480
just to wait for all the PCI Express devices

19:00.480 --> 19:04.480
to receive their firmware before releasing the PCIe root complex.

19:04.480 --> 19:06.480
So we can do a lot of things.

19:08.480 --> 19:10.480
So we are out of time.

19:10.480 --> 19:12.480
I'll be there during the next two days,

19:12.480 --> 19:14.480
so if you have questions, feel free.

19:14.480 --> 19:15.480
Thank you.

19:22.480 --> 19:24.480
Thank you.

