WEBVTT

00:00.000 --> 00:11.280
Hello everyone, I hope the audio works, and if you're in the last rows and can't hear, speak up, okay?

00:11.280 --> 00:17.080
So welcome to "Measure What You Manage: Transparent Energy Consumption of Cloud Infrastructure".

00:17.080 --> 00:22.560
So in the following 10 minutes, I will really quickly try to take you on our journey of

00:22.560 --> 00:28.440
measuring and understanding energy consumption, focused on cloud environments.

00:28.440 --> 00:34.400
And yeah, I'll cover some challenges that we faced while mapping software workloads to

00:34.400 --> 00:37.400
physical resource usage.

00:37.400 --> 00:43.880
And we know that today the cloud is responsible for up to 4 percent of the world's carbon emissions,

00:43.880 --> 00:49.040
which is a lot, and we also know that this will grow even more in the near future.

00:49.040 --> 00:54.400
So normally I would ask: do we have any public or private cloud providers with us today?

00:54.400 --> 01:00.480
And if so, I would like to ask if you know the ecological footprint of your cloud environment

01:00.480 --> 01:06.080
for example, for the last year, and we are not only talking about energy consumption.

01:06.080 --> 01:12.840
So yeah, normally, if anyone said they already do this, my reply would be

01:12.840 --> 01:17.280
wow, let's talk after this presentation, but if not, why not?

01:17.280 --> 01:21.360
And isn't it something that your consumers and also the consumers of your consumers would

01:21.400 --> 01:23.400
probably like to know?

01:23.400 --> 01:31.920
Yeah, and to make it even more concrete: in 2027, global AI demand could be responsible for

01:31.920 --> 01:40.000
4.2 to 6.6 billion cubic meters of water withdrawal.

01:40.000 --> 01:46.640
And that's 4 to 6 times Denmark's total annual water withdrawal, which is a lot.

01:46.640 --> 01:51.120
I would argue that it's really important to determine the whole environmental impact

01:51.120 --> 01:55.600
of distributed digital systems and the software running on them.

01:55.600 --> 02:01.600
So when we look at the impacts, let's look holistically at the environmental

02:01.600 --> 02:03.400
impacts that we face.

02:03.400 --> 02:08.040
So there are various approaches and attempts to really systematically map these potential

02:08.040 --> 02:09.040
impacts.

02:09.040 --> 02:13.440
For example, the computer-related impacts, which include the embodied and the operational ones,

02:13.520 --> 02:20.960
which we, or the session before, talked about, and the immediate application impacts;

02:20.960 --> 02:25.960
these can be optimized, things like transportation and electricity costs,

02:25.960 --> 02:29.920
but more importantly, we also have the system level impact, right?

02:29.920 --> 02:36.240
So the rebound effects, for example, which increase the GHG emissions.

02:36.240 --> 02:42.040
So the solution could actually be a life cycle assessment, which enables

02:42.040 --> 02:47.400
us to actually look at the environmental impact of these digital services and record them

02:47.400 --> 02:50.400
over the entire lifecycle.

02:50.400 --> 02:54.720
Environmental impact categories: since I only have two minutes, I will quickly let you

02:54.720 --> 02:55.720
read through them.

02:55.720 --> 03:01.960
So it's about the energy demand, global warming potential, the resource costs, water

03:01.960 --> 03:08.200
consumption, electronic waste as well, and the pollutant effects.

03:08.200 --> 03:10.480
Coming to our research project, ECO:DIGIT.

03:10.480 --> 03:16.360
It was funded as part of the GreenTech Innovation Competition of the German Federal

03:16.360 --> 03:18.920
Ministry for Economic Affairs and Climate Action.

03:18.920 --> 03:23.360
Here's a website as well with all of the information.

03:23.360 --> 03:31.480
And the goal of our research project is, as said, to assess the environmental

03:31.480 --> 03:37.760
impact of distributed systems across cloud, edge, and on-premises environments.

03:37.760 --> 03:43.040
I will talk mostly about cloud, but it's a bigger project, because we are a consortium

03:43.040 --> 03:47.800
of partners from science, research, and industry backgrounds, so, among others,

03:47.800 --> 03:53.880
the Institute for Applied Ecology and the German Informatics Society

03:53.880 --> 03:54.880
are part of it.

03:54.880 --> 04:00.000
And me as well: we have the Open Source Business Alliance, a non-profit that

04:00.000 --> 04:06.800
operates a network of companies and organizations developing, building, and using open

04:06.840 --> 04:07.800
source software.

04:07.800 --> 04:16.040
So, a quick look into what actually happens in ECO:DIGIT, which I tried to map out as a

04:16.040 --> 04:23.160
workflow, so I want to really quickly show you what we are developing at the

04:23.160 --> 04:29.960
major levels. It's called the test bench, and you can define your infrastructure with IaC;

04:29.960 --> 04:35.720
then digital twins get created of these different infrastructures, so on-premises, cloud

04:35.720 --> 04:41.720
infrastructure, etc., and a system under test can be executed on this test bench, and

04:41.720 --> 04:48.440
then we collect metrics and use the methodology proposed by the Institute for Applied Ecology.
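As a rough illustration of the workflow just described (define infrastructure as code, create a digital twin, execute a system under test, collect metrics), here is a minimal Python sketch; all class and function names are invented for illustration and are not the project's actual API.

```python
# Hypothetical sketch of the test-bench workflow: IaC-style spec,
# digital twin, test run, metric collection. Names and values are
# illustrative only.

from dataclasses import dataclass, field

@dataclass
class InfrastructureSpec:
    """IaC-style description of a target platform (cloud, edge, on-prem)."""
    name: str
    nodes: int
    vcpus_per_node: int

@dataclass
class DigitalTwin:
    spec: InfrastructureSpec
    metrics: list = field(default_factory=list)

    def run_test(self, workload: str, duration_s: int) -> None:
        # In a real test bench this would deploy the workload and sample
        # exporters; here we just record a placeholder sample.
        self.metrics.append({"workload": workload,
                             "duration_s": duration_s,
                             "avg_cpu_util": 0.65})  # placeholder value

spec = InfrastructureSpec(name="cloud-a", nodes=3, vcpus_per_node=16)
twin = DigitalTwin(spec)
twin.run_test(workload="stress-ng --cpu 8", duration_s=300)
print(twin.metrics)
```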

04:48.440 --> 04:55.200
They implemented it using so-called energy profiles and regression models to combine

04:55.200 --> 05:02.720
not only the usage phase, but also, as explained, using the previously shown environmental

05:02.720 --> 05:07.600
impact categories we focused on, the manufacturing and the disposal phases, but also

05:07.600 --> 05:14.040
the resource and energy consumption of the workload while running, so the usage phase.
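To make the idea of combining life-cycle phases concrete, here is a minimal sketch that adds an amortized share of embodied (manufacturing and disposal) emissions to usage-phase emissions; all numbers and function names are made-up placeholders, not the project's methodology or data.

```python
# Minimal sketch: embodied impact amortized over hardware lifetime,
# plus operational impact from measured energy. All figures invented.

def amortized_embodied_kgco2e(embodied_kgco2e: float,
                              lifetime_h: float,
                              runtime_h: float) -> float:
    """Share of manufacturing/disposal emissions attributed to one run."""
    return embodied_kgco2e * (runtime_h / lifetime_h)

def usage_kgco2e(energy_kwh: float, grid_intensity: float) -> float:
    """Operational emissions from measured energy (kgCO2e per kWh)."""
    return energy_kwh * grid_intensity

total = (amortized_embodied_kgco2e(embodied_kgco2e=1200.0,  # placeholder
                                   lifetime_h=4 * 8760,     # assumed 4-year lifetime
                                   runtime_h=24.0)
         + usage_kgco2e(energy_kwh=6.5,                     # placeholder measurement
                        grid_intensity=0.4))                # placeholder grid mix
print(round(total, 3))
```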

05:14.040 --> 05:20.240
Yeah, this is how we actually do it, but as said, time is limited, so maybe take a picture,

05:20.240 --> 05:25.520
it's also available online, but that's actually how we get the data, so it's actually

05:25.520 --> 05:32.320
like static energy profiles, and as a result, moving on, you could determine how

05:32.320 --> 05:39.160
an actual infrastructure composed of different platforms contributes over its complete life cycle.

05:39.160 --> 05:44.600
And for cloud, obviously, energy consumption is the most important aspect for us, I mean,

05:44.600 --> 05:50.160
life cycle as well, but getting this information is really hard.

05:50.160 --> 05:51.160
It's even harder.

05:51.160 --> 05:56.920
If you actually also want to look at the hyperscalers, access to relevant data is highly

05:56.920 --> 05:57.920
limited.

05:57.920 --> 06:05.480
I researched a lot, and there's no exposure of MSRs, but there are some solutions

06:05.480 --> 06:09.640
like the Boavizta API and how they do it, and, for example, there's also Cloud Carbon

06:09.640 --> 06:14.920
Footprint, which is also an open-source project using the billing APIs of the hyperscalers,

06:14.920 --> 06:23.160
but what's still very limited in these is the information on the

06:23.160 --> 06:27.080
management overhead, and that's substantial at hyperscalers.

06:27.080 --> 06:35.320
So yeah, to create energy profiles, we use a similar approach to the one that was first

06:35.320 --> 06:42.640
introduced by Teads and is now part of the Boavizta API. Basically,

06:42.640 --> 06:49.480
you look at the bare metal, you run stress tests, you collect certain RAPL metrics, align them

06:49.480 --> 06:53.840
with the utilization of certain processes, and then you use the instance's number of virtual

06:53.840 --> 06:59.320
CPUs as a ratio compared to the bare metal's number of virtual CPUs. That's really quickly

06:59.320 --> 07:02.160
how an energy profile is created.
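A minimal sketch of that kind of energy-profile construction, using invented sample data: fit host power against CPU utilization from a stress sweep, then scale by the instance's vCPU share. This illustrates the general idea, not the actual Teads or Boavizta implementation.

```python
# Sketch: fit a linear power model from a hypothetical stress-ng sweep
# on bare metal, then attribute power to an instance by its vCPU share.

import numpy as np

# (cpu_utilization, watts) pairs from an invented RAPL measurement sweep
utilization = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
watts       = np.array([95.0, 140.0, 180.0, 215.0, 245.0])

# Fit watts = a * utilization + b (the intercept b is the idle power)
a, b = np.polyfit(utilization, watts, deg=1)

def instance_power(util: float, instance_vcpus: int, host_vcpus: int) -> float:
    """Estimate an instance's power draw from host utilization and vCPU share."""
    host_watts = a * util + b
    return host_watts * (instance_vcpus / host_vcpus)

# A 4-vCPU instance on a 32-vCPU host at 60% utilization
print(round(instance_power(0.6, instance_vcpus=4, host_vcpus=32), 2))
```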

07:02.160 --> 07:08.920
So this is rather an estimation, and you use regression to estimate the actual usage after

07:08.920 --> 07:17.120
deploying it as a digital twin, but how nice would it be to use an open cloud, where you

07:17.120 --> 07:22.800
can actually access the host level and find out what the scaling effects are, what the

07:22.800 --> 07:27.760
baseline of the management services is, and how it contributes to the actual workload?

07:27.760 --> 07:35.080
This is what we did with the help of the Sovereign Cloud Stack; it's based on OpenStack for

07:35.080 --> 07:41.520
the infrastructure-as-a-service component. And yeah, with our cloud provider and technology

07:41.520 --> 07:46.680
partners, we built two different cloud environments, and then

07:46.680 --> 07:53.000
we use sensors, power distribution units, to get actual physical measurements.

07:53.000 --> 07:58.240
So at some point we can really see how accurate, for example, a solution is;

07:58.240 --> 08:04.120
a few of you might know Scaphandre, or maybe Kepler with MSR metrics, how accurate

08:04.120 --> 08:08.200
are they when you take the physical measurement as a baseline, maybe you need some sort

08:08.200 --> 08:12.040
of distribution factor or something like that.
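The validation idea just mentioned, treating the PDU reading as ground truth and deriving a correction factor for a software estimate, could look roughly like this; the readings are invented placeholders.

```python
# Sketch: compare a software-based energy estimate (e.g. RAPL-derived)
# against the PDU reading over the same window and derive a factor.

def correction_factor(pdu_wh: float, software_wh: float) -> float:
    """Ratio to scale software-reported energy up to the physical baseline."""
    return pdu_wh / software_wh

# Hypothetical one-hour window: RAPL covers CPU/DRAM but misses fans,
# PSU losses, NICs, etc., so the PDU reading is higher.
pdu_wh = 310.0
rapl_wh = 248.0

factor = correction_factor(pdu_wh, rapl_wh)
corrected = rapl_wh * factor
print(factor, corrected)
```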

08:12.040 --> 08:16.640
So yeah, we faced a few issues, starting off with two different environments, one is based

08:16.640 --> 08:25.240
on OCP hardware, meaning you don't really have the option of energy measurements

08:25.240 --> 08:30.440
per device; you can only measure per rack, so you need a distribution factor for how much

08:30.520 --> 08:40.040
each server or node actually contributes. So we needed to make a concept that works

08:40.040 --> 08:47.440
for different infrastructures, which can be very heterogeneous, and yeah, that was the

08:47.440 --> 08:52.440
goal, so we can really estimate the baseline of the management services, and then as

08:52.440 --> 09:00.000
well the workload running on it. I need to skip this, just time-wise, but this

09:00.000 --> 09:04.240
is, I actually took this graph because it explains really well what we are doing,

09:04.240 --> 09:10.840
and it's from the Kepler project. So, we are trying to find things out: we have a stressor

09:10.840 --> 09:16.760
which runs a different number of virtual machines with different flavors, and we can estimate

09:16.760 --> 09:22.760
it inside these machines; we have stress-ng and other tools to collect these metrics,

09:22.760 --> 09:28.920
and, quickly jumping ahead, we have the physical measurements with different

09:28.920 --> 09:35.920
solutions, where RAPL is the one we really want to examine for accuracy,

09:35.920 --> 09:40.880
so we focus on actual physical measurements and allocations, and for the

09:40.880 --> 09:44.600
software-based measurements, in the Sovereign Cloud Stack, monitoring is already

09:44.600 --> 09:49.240
implemented, so we use a lot of different exporters for all

09:49.240 --> 09:54.440
the process, container, and virtual machine levels, where we actually just use Scaphandre,

09:54.440 --> 10:01.000
which is running really nicely, but I cannot tell you with 100 percent confidence

10:01.000 --> 10:05.840
if these numbers are accurate, and that's the goal of what we are trying to do. So,

10:05.840 --> 10:11.680
then, to be on time, the next steps: we still need to work on the data isolation and

10:11.680 --> 10:16.680
the refinement of the allocation rules, and we need to integrate runtime monitoring,

10:16.680 --> 10:21.920
because, as I explained, we have the static energy profiles, which we can already create, but the

10:21.920 --> 10:26.600
question is, when you have a running, real production cloud, how would these static energy

10:26.600 --> 10:31.600
profiles change? So this is why we need it, and we're still working on it. And the

10:31.600 --> 10:35.840
reporting functionality at the moment is only implemented as Grafana dashboards, but we

10:35.840 --> 10:41.080
want to have it as a service, as a standard for the Sovereign Cloud Stack. Okay, this

10:41.080 --> 10:44.080
was 10 minutes, but very fast. Thank you, any questions?

10:51.920 --> 10:56.400
How do you measure the power that you use for cooling the data center?

10:56.400 --> 10:57.400
Say that again, sorry.

10:57.400 --> 10:58.400
Yeah.

10:58.400 --> 10:59.400
The power for cooling your data center there?

10:59.400 --> 11:00.400
Yeah.

11:00.400 --> 11:01.400
How do you measure it?

11:01.400 --> 11:02.400
Yeah.

11:02.400 --> 11:06.720
This is why we have these technology partners, because, actually, I'm

11:06.720 --> 11:12.080
not very, I don't have deep knowledge of how data centers work and what they are

11:12.080 --> 11:19.040
already measuring, and I thought they already had a great solution

11:19.040 --> 11:24.960
implemented, maybe some sort of exporter, to get this information on the cooling,

11:24.960 --> 11:29.680
allocated to this one rack, because they have hundreds of racks, different cloud platforms;

11:29.680 --> 11:32.920
they have private cloud, but also their own public cloud.

11:32.920 --> 11:36.960
And in one rack, for example, there are a few servers of different cloud environments, so how can you

11:36.960 --> 11:42.920
allocate the cooling data, which they collect, specifically to this cloud environment?

11:42.920 --> 11:48.000
That's a great question, and I'm still working with the cloud providers to completely

11:48.240 --> 11:53.600
answer it, but at the moment, it's more like we collect a bill of materials with

11:53.600 --> 11:58.560
information on it; you could say it's a static approach: they fill out

11:58.560 --> 12:04.400
how much the cooling energy costs and give us the details, and we can then use

12:04.400 --> 12:10.960
some sort of allocation: okay, you have a certain number of cloud environments, ours has this

12:10.960 --> 12:16.800
size, and then we try to allocate it. A great solution we don't have yet, but it's a really

12:16.880 --> 12:19.880
good point; it's still under research.
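One simple form the allocation described here could take is splitting a rack's cooling energy across cloud environments in proportion to their measured IT power; this is a sketch with invented numbers, not the project's actual allocation rules.

```python
# Sketch: allocate shared cooling energy proportionally to each
# environment's measured IT power draw. All values are placeholders.

def allocate_cooling(cooling_kwh: float, it_power_by_env: dict) -> dict:
    """Split cooling energy by each environment's share of IT power."""
    total = sum(it_power_by_env.values())
    return {env: cooling_kwh * p / total for env, p in it_power_by_env.items()}

shares = allocate_cooling(cooling_kwh=50.0,
                          it_power_by_env={"env-a": 1200.0,  # watts, invented
                                           "env-b": 800.0})
print(shares)
```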

12:19.880 --> 12:20.880
Thank you.

12:20.880 --> 12:21.880
Thank you.

