WEBVTT

00:00.000 --> 00:08.720
So welcome, everyone.

00:08.720 --> 00:12.960
In this talk, we will actually go through our experience

00:12.960 --> 00:15.600
in provisioning CephFS.

00:15.600 --> 00:17.040
So my name is Mattia Belluco.

00:17.040 --> 00:22.520
I work for the Scientific IT Services at ETH Zurich,

00:22.520 --> 00:27.880
which is a section of the ETH Zurich IT Services.

00:27.880 --> 00:35.000
We will go through a little bit of the context

00:35.000 --> 00:38.600
in which the CephFS provisioning started,

00:38.600 --> 00:44.040
or this idea started, how we provisioned the system,

00:44.040 --> 00:48.480
the benchmarks that we ran, and then the production phase,

00:48.480 --> 00:50.560
and also what we would do differently

00:50.560 --> 00:54.160
if we were to do everything again.

00:54.160 --> 00:55.920
So regarding the context,

00:55.920 --> 00:58.960
CephFS at ETH Zurich serves the Leonhard Med

00:58.960 --> 01:01.000
trusted research environment,

01:01.000 --> 01:05.440
it's a platform used by around five hundred researchers

01:05.440 --> 01:10.440
that stores three petabytes of secure data.

01:11.440 --> 01:14.760
And as you can see from the graph,

01:14.760 --> 01:19.160
its adoption has been quite sustained

01:19.160 --> 01:22.280
and growing from year to year.

01:22.280 --> 01:25.880
So we needed a solution that could actually accommodate this growth.

01:26.840 --> 01:30.640
What was the starting point of our CephFS journey?

01:30.640 --> 01:35.520
So we needed a storage system with a usable capacity

01:35.520 --> 01:37.360
of around two petabytes.

01:37.360 --> 01:40.640
It had to be POSIX compliant and validated,

01:40.640 --> 01:46.520
and to be reliable, scalable, affordable, reasonably performant.

01:46.520 --> 01:49.920
And the idea was to have it accessed by virtualized workers

01:49.920 --> 01:52.760
of various sizes and capabilities.

01:52.760 --> 01:56.760
This last requirement is mostly due to the security

01:56.760 --> 01:58.200
requirements of the infrastructure

01:58.200 --> 02:02.600
where CephFS would actually be placed.

02:03.920 --> 02:08.040
What do we mean by reasonably performant?

02:08.040 --> 02:10.440
We were looking for a system

02:10.440 --> 02:13.560
that has predictable performance,

02:13.560 --> 02:15.160
smooth metadata operations,

02:15.160 --> 02:20.160
because the users' experience of the system

02:20.800 --> 02:24.640
gets really bad if metadata operations,

02:24.640 --> 02:29.160
like listing directories and file manipulation, are slow

02:29.160 --> 02:34.160
or don't work at all.

02:34.160 --> 02:36.240
And also, we would like to have a system

02:36.240 --> 02:37.840
that ascribes as much as possible

02:37.840 --> 02:40.440
the performance cost posed by bad behaviors

02:40.440 --> 02:43.680
to the specific project creating such behaviors.

02:46.360 --> 02:48.160
So what did we do?

02:48.160 --> 02:49.920
So we started with a blueprint,

02:49.920 --> 02:52.920
the initial configuration being quite a standard one:

02:52.920 --> 02:56.640
16 OSD servers, distributed

02:56.640 --> 02:59.080
in four distinct racks.

02:59.080 --> 03:05.080
Each server would have 24 hard drives of 18 terabytes each,

03:05.080 --> 03:09.160
and four NVMes of two terabytes each.

03:09.160 --> 03:12.880
We would also have dedicated MON servers

03:12.880 --> 03:15.640
that would host MON and manager services,

03:15.640 --> 03:20.920
and MDS servers that would host the MDS services.

03:20.920 --> 03:23.880
already with the idea of hosting more than one MDS daemon

03:23.880 --> 03:24.600
per server.

03:24.600 --> 03:27.160
We will see why later.

03:27.160 --> 03:31.560
All these servers had the same networking connectivity

03:31.560 --> 03:36.560
based on an LACP bond with two links at 25 gigabit.
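
As a quick sanity check on these numbers (my own arithmetic, not from the talk), the usable capacity under the candidate redundancy schemes works out roughly as follows:

```python
# Rough usable-capacity arithmetic for the blueprint above (a sketch,
# not from the talk): 16 OSD servers, each with 24 x 18 TB HDDs.
servers, hdds_per_server, hdd_tb = 16, 24, 18
raw_tb = servers * hdds_per_server * hdd_tb  # 6912 TB raw

# Usable fraction of raw space for each candidate layout.
layouts = {
    "replica 3": 1 / 3,
    "EC 2+2":    2 / (2 + 2),
    "EC 4+2":    4 / (4 + 2),
}
for name, fraction in layouts.items():
    # Ignoring the nearfull/backfill headroom one would keep in practice.
    print(f"{name}: ~{raw_tb * fraction:.0f} TB usable")
```

With replica 3 this lands near the two-petabyte usable target mentioned earlier.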

03:39.880 --> 03:44.200
The guidelines that we followed for the deployment

03:44.200 --> 03:49.120
were the usual ones that are available in the community

03:49.120 --> 03:50.720
in the documentation.

03:50.720 --> 03:56.280
We're using spindles, so we provisioned fast devices

03:56.280 --> 04:02.120
for WAL and DB, so for the write-ahead log and the RocksDB.

04:02.120 --> 04:05.160
We provisioned four OSDs per NVMe device

04:05.160 --> 04:10.560
to try to get a little bit more performance out of them.

04:10.560 --> 04:14.120
We put the metadata pool on fast devices.

04:14.120 --> 04:16.800
our NVMes, the ones we saw before.

04:16.800 --> 04:19.600
And we picked a high-clocked CPU model

04:19.600 --> 04:22.680
for the servers hosting MDSs.

04:22.680 --> 04:27.520
This is due to the mostly single-threaded nature of MDS processes.

04:27.520 --> 04:30.400
So again, trying to get a little bit more performance

04:30.400 --> 04:34.640
out of the file system.
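
A hedged sketch of what these guidelines might look like as ceph-volume invocations; device paths are illustrative, and in the real deployment the NVMe devices were shared between WAL/DB and flash OSDs, which needs explicit partitioning or LV sizing that this sketch glosses over:

```python
import subprocess

# Illustrative device lists for one OSD server of the blueprint.
hdds = [f"/dev/sd{c}" for c in "bcdefghijklmnopqrstuvwxy"]  # 24 spindles
nvmes = [f"/dev/nvme{i}n1" for i in range(4)]               # 4 x 2 TB NVMe

# Spindle OSDs with the write-ahead log and RocksDB placed on flash.
subprocess.run(["ceph-volume", "lvm", "batch", "--yes",
                *hdds, "--db-devices", *nvmes], check=True)

# Four OSDs per NVMe device, to squeeze more parallelism out of flash.
subprocess.run(["ceph-volume", "lvm", "batch", "--yes",
                "--osds-per-device", "4", *nvmes], check=True)
```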

04:34.640 --> 04:38.520
And then we switched to the benchmarking phase.

04:38.520 --> 04:43.360
So as most of you know, benchmarking requires time

04:43.360 --> 04:46.360
and the benchmarking data are generally

04:46.360 --> 04:48.760
gathered and stored, but sometimes never

04:48.760 --> 04:51.560
properly analyzed, or gathered

04:51.560 --> 04:54.360
but only used for coarse-grained estimation,

04:54.360 --> 04:57.000
or in many cases, not gathered at all.

04:57.000 --> 04:59.480
So with the wisdom gathered from previous deployment

04:59.480 --> 05:02.800
experience, we decided to actually bring in some external

05:02.800 --> 05:05.200
help, in this case in the persons of

05:05.200 --> 05:09.560
StackHPC, to support the benchmarking process.

05:09.560 --> 05:12.160
Our plan was to characterize the hardware

05:12.160 --> 05:15.800
using 16 virtualized worker nodes as clients.

05:15.800 --> 05:19.920
Each virtual node, in this case, uses an entire hypervisor.

05:19.920 --> 05:22.880
And the reason for that is that we wanted something

05:22.880 --> 05:26.080
that would mimic the final layout of the system,

05:26.080 --> 05:27.760
while avoiding having

05:27.760 --> 05:31.720
collocated virtual machines on the same hypervisor.

05:31.720 --> 05:34.280
We also aimed at comparing replication,

05:34.280 --> 05:36.520
or rather, the replication strategy,

05:36.520 --> 05:39.280
with erasure coding 2+2, and erasure coding

05:39.280 --> 05:40.520
4+2 schemes.
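
The talk does not name the benchmark tool; assuming an fio-style job, the small-file random-read workload that produced the later IOPS curves might look like this, run from each virtualized client:

```python
import subprocess

# A sketch of an fio small-file random-read job (tool assumed, mount
# point and sizes illustrative), of the kind used for the IOPS curves.
subprocess.run([
    "fio",
    "--name=cephfs-randread",
    "--directory=/mnt/cephfs/bench",  # hypothetical CephFS mount
    "--rw=randread",                  # random reads: IOPS-bound
    "--bs=32k",                       # small block size
    "--size=4g",
    "--numjobs=16",                   # 16 threads per client, as in the talk
    "--time_based", "--runtime=300",
    "--group_reporting",
], check=True)
```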

05:44.600 --> 05:49.520
From the very beginning, we encountered some hardware issues.

05:49.520 --> 05:53.240
I guess the most interesting is this one.

05:53.240 --> 05:55.920
We were benchmarking one of the OSD nodes.

05:55.920 --> 06:00.160
All of a sudden, we had what appeared to be a device

06:00.160 --> 06:04.080
that performed twice as fast as the other devices,

06:04.080 --> 06:09.240
except the actual explanation was the other way

06:09.240 --> 06:10.200
around.

06:10.200 --> 06:13.960
So actually, we saw that of the four NVMe devices,

06:13.960 --> 06:16.520
one of those was actually performing as it should,

06:16.520 --> 06:21.160
and three devices were performing at half speed and half IOPS.

06:21.160 --> 06:22.840
So this node actually got excluded

06:22.840 --> 06:28.920
from the benchmark phase, and we got the NVMes replaced by the vendor.

06:28.920 --> 06:32.040
So it's always nice to have an initial characterization

06:32.040 --> 06:34.840
of the hardware, because these issues can actually

06:34.840 --> 06:42.040
lurk, and then reveal themselves at a later stage.
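
A sketch of the per-device sanity check that catches this failure mode; note the twist from the talk, where the apparently "fast" outlier was the one healthy device. The numbers below are illustrative, not the talk's data:

```python
# Benchmark each device individually and flag anything far from the
# median; here the single healthy NVMe shows up as the outlier.
measured_iops = {"nvme0n1": 410_000, "nvme1n1": 205_000,
                 "nvme2n1": 198_000, "nvme3n1": 201_000}

median = sorted(measured_iops.values())[len(measured_iops) // 2]
for dev, iops in measured_iops.items():
    if not 0.75 * median <= iops <= 1.5 * median:
        print(f"{dev}: {iops} IOPS is far from the median ({median}); "
              f"check the device before trusting cluster-level benchmarks")
```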

06:42.040 --> 06:44.840
The initial data that we gathered from the benchmarks

06:44.840 --> 06:48.400
ruled out EC 4+2.

06:48.400 --> 06:53.640
We did a rough evaluation, and EC 4+2 actually

06:53.640 --> 06:56.920
had a sizeable performance gap compared

06:56.920 --> 06:59.200
both to replication and to EC 2+2.

06:59.200 --> 07:03.680
And the economy of space that it would have actually

07:03.680 --> 07:07.760
granted us was not interesting enough to give away

07:07.760 --> 07:10.080
more performance.

07:10.080 --> 07:15.120
The first data came in, and usually, at this point,

07:15.120 --> 07:18.800
the benchmarking effort would be overtaken by other priorities.

07:18.800 --> 07:22.240
In this case, fortunately, we managed to actually

07:22.240 --> 07:25.600
go a little bit further, and obtain some data

07:25.600 --> 07:31.360
for what we thought would be the most interesting case for the file system,

07:31.360 --> 07:36.320
that is, small-file random reads.

07:36.320 --> 07:42.160
So the main goal, the main use case for this file system,

07:42.160 --> 07:46.720
was to store data to be analyzed by means of a batch

07:46.720 --> 07:50.640
system, or accessed by means of multiple batch

07:50.640 --> 07:52.160
systems.

07:52.160 --> 07:55.440
The workload, the anticipated workload would be

07:55.440 --> 08:01.280
read-mostly if not read-only, with a random read pattern.

08:01.280 --> 08:08.080
So we picked this use case as kind of the most

08:08.080 --> 08:10.880
interesting to benchmark against.

08:10.880 --> 08:16.560
And here we see the number of active clients on the x-axis,

08:16.560 --> 08:22.320
that is, compute nodes, and the attainable bandwidth; of course,

08:22.320 --> 08:26.480
the values are quite low, because they are capped by the maximum number

08:26.480 --> 08:30.240
of IOPS the file system can cope with.

08:30.320 --> 08:32.880
And in fact, if we actually switch to a view where

08:32.880 --> 08:38.560
we have the IOPS on the y-axis.

08:38.560 --> 08:43.920
We see that we reach around 80,000 IOPS with this configuration,

08:43.920 --> 08:49.440
and we're actually looking at the aggregated IOPS for a replicated pool.

08:49.440 --> 08:54.880
So in this case, we have our file system that has a flash,

08:54.880 --> 08:56.480
so NVMe-based, metadata pool,

08:56.480 --> 08:59.680
so OSDs based on NVMe devices,

08:59.680 --> 09:04.800
and then the replicated HDD data pool underneath,

09:04.800 --> 09:08.000
where the tests were run.

09:08.000 --> 09:13.040
As for the scaling, we see that we reach the maximum

09:13.040 --> 09:18.320
performance at around five active clients,

09:18.320 --> 09:24.560
each actually running 16 I/O threads.

09:24.640 --> 09:30.880
And then we decrease, somehow not linearly,

09:30.880 --> 09:34.160
but with some bumps along the way,

09:34.160 --> 09:38.960
but overall the results held up.

09:38.960 --> 09:42.240
So here you see that the maximum value is 80,000.

09:42.240 --> 09:44.960
Now, we switch to the same test, but this time

09:44.960 --> 09:48.160
with erasure coding; so again, this is the bandwidth,

09:48.160 --> 09:51.840
which is not super interesting, because it's again limited by IOPS.

09:51.840 --> 09:55.600
But if we go to IOPS, we see that now the maximum value,

09:55.600 --> 10:00.480
is a little bit more than 50,000 IOPS,

10:00.480 --> 10:05.520
and we can actually max it out already with three clients.

10:05.520 --> 10:08.560
But remember, everything here is random reads,

10:08.560 --> 10:15.760
so we are really paying for the erasure coding policy,

10:15.760 --> 10:18.080
which needs to actually read from multiple devices

10:18.080 --> 10:21.120
to provide the data back, as opposed to replication,

10:21.120 --> 10:26.560
which just needs to read from the primary OSD.
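
A back-of-the-envelope way to see why replication wins for small random reads (my own arithmetic, not the talk's; the measured 80k vs 50k gap is smaller than this naive model predicts):

```python
# A replicated pool serves a small read from the primary OSD alone,
# while EC k+m must fetch k chunks to reconstruct the object.
def backend_reads_per_client_read(k: int) -> int:
    return k  # 1 for replication (primary only), k for EC k+m

disk_iops_budget = 160_000  # hypothetical aggregate spindle IOPS
for name, k in [("replication", 1), ("EC 2+2", 2)]:
    print(f"{name}: ~{disk_iops_budget // backend_reads_per_client_read(k):,} client IOPS")
# The measured gap (~80k vs ~50k IOPS) is smaller than this naive 2x,
# since other costs are shared by both layouts.
```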

10:26.560 --> 10:30.720
This also shows that in the erasure coding case,

10:30.720 --> 10:35.920
actually the trend seems to be a little bit more regular,

10:35.920 --> 10:40.560
with the increase in the number of clients.

10:40.560 --> 10:46.160
And it also gave us hope that we would actually

10:46.160 --> 10:48.720
keep this trend and not go down

10:49.440 --> 10:52.160
very steeply after adding a few more clients.

10:55.680 --> 10:58.160
Here, I believe it may be a little bit small,

10:58.160 --> 11:02.400
so I'll try to actually describe what this graph shows.

11:04.160 --> 11:07.120
While the initial idea, or the initial goal,

11:07.120 --> 11:11.440
was to have a file system for read-mostly workloads,

11:11.440 --> 11:16.320
we wanted to also see how it behaved, mixing IO types,

11:16.320 --> 11:19.440
so read and write; read is in red here,

11:19.440 --> 11:24.640
and write is in blue.

11:24.640 --> 11:27.840
And then we have the columns,

11:27.840 --> 11:30.800
sorry, the rows, that are different block sizes.

11:30.800 --> 11:34.240
We start from 32k, then 1 megabyte, the middle one,

11:34.240 --> 11:37.280
and 4 megabyte, the bottom one.

11:37.280 --> 11:40.960
And then we actually change a few parameters,

11:40.960 --> 11:44.000
so we started with the first column,

11:44.000 --> 11:47.840
where we have all sequential read and write.

11:47.840 --> 11:52.640
And there we can see that actually the presence of writes

11:52.640 --> 11:55.360
creates quite a big effect on the file system,

11:55.360 --> 11:58.720
an effect that becomes a little less pronounced

11:58.720 --> 12:02.080
when the block size increases.

12:02.080 --> 12:04.960
And then what we see is also that

12:04.960 --> 12:07.840
switching to any of the other columns,

12:10.960 --> 12:13.760
which means actually adding some random IO

12:14.720 --> 12:18.960
into the mix, the performance would remain stable,

12:18.960 --> 12:22.880
but at a drastically lower level, in particular in the first case.

12:22.880 --> 12:29.200
So in the first case, we see that this random IO would actually

12:29.200 --> 12:33.120
decrease the performance a lot.

12:33.120 --> 12:37.120
This is actually a bandwidth graph,

12:37.120 --> 12:41.040
but the IOPS graphs look exactly the same,

12:41.040 --> 12:43.200
the same kind of behavior.

12:43.200 --> 12:48.880
As you can see, the maximum IO that we can obtain

12:48.880 --> 12:50.960
with the 4 megabyte block size, in this case,

12:50.960 --> 12:59.600
is then limited by the bandwidth and no longer by the number of IOPS.

13:05.200 --> 13:09.840
So this is where we stopped with the benchmarking part.

13:09.840 --> 13:13.760
We were quite happy: the system that we had to replace

13:14.480 --> 13:19.200
had some parameters, or rather some performance targets,

13:19.200 --> 13:23.520
that were similar to the ones we saw in the benchmarking part.

13:23.520 --> 13:25.360
So we decided to go into production.

13:25.360 --> 13:29.280
Here we're talking about three years ago,

13:29.280 --> 13:33.760
that is, we went into production with Pacific.

13:33.760 --> 13:41.520
And we started with a single active MDS,

13:41.520 --> 13:50.560
and all the default settings except for a little bit more cache for the MDS,

13:50.560 --> 13:52.880
so for the single active MDS.

13:52.880 --> 13:55.600
I believe the default was 8 at the time,

13:55.600 --> 14:01.440
and we actually gave the MDS more: we started with 30 gigabytes of cache.

14:01.520 --> 14:04.000
We actually managed to migrate more than a

14:04.000 --> 14:06.160
petabyte of data from the previous storage system,

14:06.160 --> 14:11.040
and ramp up the usage by migrating individual projects

14:11.040 --> 14:14.480
to the new system once the data migration was completed.

14:14.480 --> 14:17.840
But it soon became clear that a single MDS daemon

14:17.840 --> 14:21.440
could not cope with the amount of requests originating from the clients.

14:21.440 --> 14:31.440
So just a super brief look at the CephFS architecture.

14:31.440 --> 14:36.160
The client has a direct data path to the RADOS pool,

14:36.160 --> 14:42.560
but the metadata path is mediated by the MDS.

14:42.560 --> 14:45.360
In our case, with a single active MDS,

14:45.360 --> 14:49.200
the clients would talk with this single daemon,

14:49.200 --> 14:54.720
which would actually hold the cache and the journal

14:54.720 --> 15:03.360
for the file system, and then write the metadata to the metadata pool.

15:03.360 --> 15:06.000
When we saw that this approach was

15:06.000 --> 15:08.720
no longer feasible, we decided to actually increase the number

15:08.720 --> 15:16.240
of active MDSs, which in the case of Ceph still provides coherent caching,

15:16.240 --> 15:19.600
a coherent cache for the metadata part,

15:19.600 --> 15:26.720
and also a coherent journal, from the point of view of the client.

15:26.720 --> 15:30.560
So the initial step of our corrective measures

15:30.560 --> 15:34.480
to try and cope with the actual workload that the cluster

15:34.480 --> 15:40.320
was experiencing, was to gradually increase max_mds,

15:40.320 --> 15:44.240
and we started from one, increasing in steps,

15:44.240 --> 15:48.080
until we found an optimal value, which in our case is 7.
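
A sketch of that gradual scale-out with the standard CLI; the talk gives the starting point (1), the first jump (3) and the final value (7), so the intermediate step and the file system name are assumptions:

```python
import subprocess

# Raise max_mds one step at a time and watch the request counters
# before going further ("cephfs" is a placeholder file system name).
for max_mds in (3, 5, 7):
    subprocess.run(["ceph", "fs", "set", "cephfs", "max_mds", str(max_mds)],
                   check=True)
    input(f"max_mds={max_mds}: watch the MDS request counters, "
          "then press Enter to continue...")
```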

15:48.080 --> 15:52.880
The first increase was actually to just three,

15:52.880 --> 15:57.840
but we noticed that that brought a sudden increase

15:57.840 --> 16:01.840
in the requests-per-second counters,

16:01.840 --> 16:06.480
and initially we thought that everything was working fantastically,

16:06.480 --> 16:10.560
because before the numbers were much lower,

16:10.560 --> 16:14.080
but we later realized that it was mostly due to the MDS balancer

16:14.080 --> 16:16.800
exporting some subtrees between ranks,

16:16.800 --> 16:20.480
because some of the workloads were actually insisting

16:20.480 --> 16:24.240
on very specific directories, and therefore the balancer would actually

16:24.240 --> 16:30.400
try to export them to less busy MDSs.

16:30.400 --> 16:35.040
After that, we pinned the most active project directories

16:35.040 --> 16:40.080
to dedicated MDS daemons, and seeing that it was working quite well,

16:40.160 --> 16:43.840
we then decided on a strategy to pin them all.

16:43.840 --> 16:50.880
Namely, we decided that we pin a project directory

16:50.880 --> 16:57.120
to a dedicated MDS if it actually stores a sizeable amount of files,

16:57.120 --> 17:04.160
and the project is also quite active in analyzing those.

17:04.160 --> 17:07.360
Or, on the other hand, if we have small projects

17:07.360 --> 17:13.280
with a few files that sit quietly,

17:13.280 --> 17:18.400
we just put multiple of those in a single MDS.
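
A sketch of the pinning itself, using the ceph.dir.pin virtual extended attribute (the setfattr equivalent); paths and ranks are hypothetical:

```python
import os

# Export-pin a directory subtree to a specific MDS rank.
def pin(path: str, rank: int) -> None:
    os.setxattr(path, "ceph.dir.pin", str(rank).encode())

pin("/mnt/cephfs/projects/big-active-project", 3)  # dedicated MDS rank
for small in ("quiet-a", "quiet-b", "quiet-c"):    # small, quiet projects
    pin(f"/mnt/cephfs/projects/{small}", 6)        # share one MDS rank
```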

17:18.400 --> 17:20.800
This actually gives us also a way to understand

17:20.800 --> 17:23.760
which project is actually creating the most traffic,

17:23.760 --> 17:30.240
which is a metric we also use to track the file system status,

17:30.240 --> 17:33.440
and eventually spot potential misbehavior,

17:33.440 --> 17:40.560
or misbehavior due to, let's say, a lack of

17:40.560 --> 17:43.520
appropriate knowledge of how to use CephFS features,

17:43.520 --> 17:49.440
like still using CephFS in ways it is not meant for, or doing some other things,

17:49.440 --> 17:52.560
so that we can then ask them to do that differently.

17:52.560 --> 17:59.440
So one of the other things that we tuned was the cache of the MDS,

17:59.440 --> 18:02.720
based on the documentation of the release

18:02.720 --> 18:04.160
that we had at the time.

18:04.160 --> 18:10.320
Some of these changes were based on the documentation,

18:10.320 --> 18:13.280
on the explanations provided by the documentation.

18:13.280 --> 18:16.800
We lowered the number of capabilities

18:16.800 --> 18:21.440
a single client can hold, down from the default of one million.

18:21.440 --> 18:25.360
We have around 600 clients right now registered to the file system,

18:25.360 --> 18:30.240
and we wanted a way to keep that in check.

18:30.240 --> 18:33.680
We incrementally increased the MDS cache memory limit,

18:33.680 --> 18:39.200
and right now we have a value of 110 gigabytes.

18:39.200 --> 18:42.960
This is actually not needed for all the MDS daemons,

18:42.960 --> 18:49.440
but unfortunately our pinning strategy is,

18:49.440 --> 18:51.120
let's say, a little bit coarse-grained.

18:51.120 --> 18:56.000
So there are some MDSs that handle a lot of files,

18:56.000 --> 19:01.040
and a lot of activity, and therefore we had to have a slightly higher value there.

19:02.480 --> 19:06.640
We made some changes that helped us to understand a little bit better,

19:06.640 --> 19:12.880
and prepare the file system for the bursty use cases that arise

19:12.880 --> 19:17.200
from having a batch system insisting on the file system.

19:17.200 --> 19:20.400
Namely, we increased the cache reservation a little bit,

19:20.400 --> 19:23.840
bringing it from 5% to 10%.

19:23.840 --> 19:27.200
And we also lowered the MDS cache threshold,

19:27.200 --> 19:36.720
which is the threshold past which the monitor would actually complain

19:36.720 --> 19:39.600
about an oversized cache in the MDSs.

19:39.600 --> 19:43.680
So we wanted to make sure that we were warned early enough,

19:43.680 --> 19:47.920
if something was going wrong with our settings.

19:47.920 --> 19:51.920
What we also did was to lower the cache trim decay rate,

19:52.000 --> 19:55.440
and increase the cache trim threshold.

19:55.440 --> 19:59.440
For the decay rate, one is still the default for both

19:59.440 --> 20:04.560
Pacific and Quincy. The value that we used for the cache trim threshold,

20:04.560 --> 20:10.160
which is 312k, is actually a little bit higher than what Quincy has right now,

20:10.160 --> 20:12.800
but it's way higher than what Pacific has as default.

20:12.800 --> 20:18.720
Pacific had 64k as the default, and Quincy already has it two times higher.
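
Collected as config-set calls, the cache settings discussed here might look as follows; values the talk does not state are marked as illustrative, and the option names are the standard MDS ones, so double-check them against your release's documentation:

```python
import subprocess

# A hedged sketch of the MDS cache tuning described in the talk.
mds_cache_settings = {
    "mds_cache_memory_limit": str(110 * 1024**3),  # 110 GiB, from the talk
    "mds_cache_reservation": "0.10",               # 5% -> 10%, from the talk
    "mds_health_cache_threshold": "1.25",          # lowered to warn earlier; exact value illustrative
    "mds_cache_trim_decay_rate": "0.9",            # lowered below the default of 1; exact value illustrative
    "mds_cache_trim_threshold": str(312 * 1024),   # ~312k, from the talk
}
for option, value in mds_cache_settings.items():
    subprocess.run(["ceph", "config", "set", "mds", option, value],
                   check=True)
```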

20:19.280 --> 20:26.880
Before doing all these changes, aside from having a little less control

20:26.880 --> 20:33.760
over the cache size, we would also have the MDS complaining

20:33.760 --> 20:36.160
about being behind on trimming.

20:36.160 --> 20:40.640
These changes have actually helped a lot in that regard.

20:40.640 --> 20:47.680
And a big issue that we had was the deletion operations

20:47.680 --> 20:52.080
that some users trigger on millions of files at once,

20:52.080 --> 20:55.200
or even, because they wanted things done faster,

20:55.200 --> 20:56.720
inside Slurm jobs.

20:56.720 --> 20:59.360
So we would actually have Slurm jobs trying to clean up things,

20:59.360 --> 21:04.560
and this would really bring the MDS daemon that was serving

21:04.560 --> 21:07.040
those requests to its knees.

21:07.040 --> 21:11.840
So we decided that taking longer is fine, as long as it doesn't affect

21:11.840 --> 21:16.400
the whole cluster's stability, and we tuned for that.

21:16.400 --> 21:19.440
So we decreased the max MDS purge files to 64, and also the

21:19.440 --> 21:21.360
purge ops per PG to 0.5.

21:21.360 --> 21:25.840
It takes longer to delete, but deletions are no longer affecting the whole system.
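
The two purge throttles, as they might be applied; the option names are the standard MDS purge tunables and the values are the ones from the talk:

```python
import subprocess

# Cap how aggressively the MDS purges unlinked files, so mass deletions
# from batch jobs cannot starve the rest of the cluster.
subprocess.run(["ceph", "config", "set", "mds",
                "mds_max_purge_files", "64"], check=True)
subprocess.run(["ceph", "config", "set", "mds",
                "mds_max_purge_ops_per_pg", "0.5"], check=True)
```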

21:28.720 --> 21:33.120
Then, this is something that I found out preparing these slides:

21:33.120 --> 21:35.760
as with the values that we saw before,

21:35.760 --> 21:38.080
we actually tuned a few of the values that are

21:40.800 --> 21:44.720
the tunables for how the MDS recalls the capabilities from the clients,

21:44.720 --> 21:47.280
or when it decides to recall them from the clients.

21:47.280 --> 21:52.240
And interestingly enough, for me, our values are now in between

21:52.240 --> 21:53.680
Pacific and Quincy.

21:53.680 --> 21:56.960
So we had the value of 20k,

21:56.960 --> 22:00.880
while it was 5k in Pacific, and it's now 30k in Quincy.

22:00.880 --> 22:06.560
The max decay rate that we set was 2, which was 2.5 in Pacific,

22:06.560 --> 22:08.080
1.5 in Quincy.

22:08.080 --> 22:11.200
The decay rate here actually works the other way around:

22:11.200 --> 22:16.800
it's actually part of the formula, inside a logarithm,

22:16.800 --> 22:26.160
so the higher the decay rate, the less aggressive the recall:

22:27.360 --> 22:32.320
in the first example, the fewer capabilities are recalled per second,

22:32.320 --> 22:38.720
and in this case, the fewer caps are recalled.

22:39.680 --> 22:42.880
The threshold is again in between,

22:42.880 --> 22:46.480
so we have the value of 65k in our cluster,

22:46.480 --> 22:51.520
and the current value is 120k for Quincy.
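
Summarized as data, with the caveat that mapping the talk's numbers onto these specific option names is my assumption:

```python
# The caps-recall tuning: our values sit between the Pacific and Quincy
# defaults. Treat the option names as a sketch, not gospel.
recall_tuning = {
    # option: (Pacific default, our value, Quincy default)
    "mds_recall_max_caps":            (5_000, 20_000, 30_000),
    "mds_recall_max_decay_rate":      (2.5,   2.0,    1.5),
    # the talk doesn't give Pacific's default for the threshold
    "mds_recall_max_decay_threshold": (None,  65_000, 120_000),
}
for option, (pacific, ours, quincy) in recall_tuning.items():
    print(f"{option}: Pacific={pacific} ours={ours} Quincy={quincy}")
```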

22:51.520 --> 22:54.720
All these values, when we started with Pacific,

22:54.720 --> 22:57.120
were a little bit difficult to tune:

22:57.120 --> 22:58.880
the documentation explains what they do,

22:58.880 --> 23:03.520
but then it's difficult to actually map them to the operations going on

23:03.520 --> 23:07.920
in the file system. And there were a few posts,

23:08.000 --> 23:14.480
either on the mailing list or even in some Ceph storage product's documentation,

23:14.480 --> 23:17.760
that were advising just picking a set of these,

23:18.800 --> 23:25.120
max caps, decay thresholds, and I think also the max decay thresholds,

23:25.120 --> 23:29.520
and just doubling them until you find a stable situation.

23:30.160 --> 23:34.400
So we went a little bit that way, like halfway with this strategy:

23:34.480 --> 23:39.840
we decided that we would actually try to use a multiplier,

23:39.840 --> 23:44.800
but one that seemed sensible, so not like 10x.

23:44.800 --> 23:48.240
But we're really happy that the defaults are also actually being increased.

23:50.480 --> 23:54.960
The warning threshold is again something that we changed, and it's in between.

23:56.480 --> 24:01.840
So one thing that we learned the hard way is that in our case,

24:02.720 --> 24:05.120
having a busy system with a large cache,

24:06.320 --> 24:13.680
when an MDS daemon fails, and it can fail either because an operator needs to free up an MDS server

24:13.680 --> 24:20.240
and wants the MDS to manually fail, so you can reboot the hardware

24:20.240 --> 24:25.360
for maintenance, or because the MDS gets killed since there is a misconfiguration

24:25.360 --> 24:30.320
in the cache value and the out-of-memory killer takes it out,

24:30.480 --> 24:35.840
the new MDS assigned to the failed rank needs, in our case, more time than the default value,

24:35.840 --> 24:40.640
which I believe is 60 seconds. And after several attempts, we found out that if we wanted to

24:40.640 --> 24:47.280
be sure that the MDS could take up the rank, could join the cluster again, we had to add at least

24:47.280 --> 24:52.720
30 seconds; probably a little bit less would work as well, but we were afraid that if anything

24:52.720 --> 24:57.280
happened that should be recovered automatically by the system, with a lower

24:57.360 --> 25:03.600
value here it would instead just wait for an operator to do something about it.

25:05.280 --> 25:06.880
So, what would we do differently?

25:09.040 --> 25:20.080
So I would not set up the default data pool on spindles again.

25:21.040 --> 25:29.520
So the default data pool needs to be replicated to avoid data loss, or rather it's recommended to

25:29.520 --> 25:35.280
be replicated to avoid data loss, because of the tiny objects that are created at the root of the

25:35.280 --> 25:42.640
file system, that contain the backtrace information and hard link references. So the idea is that there

25:42.640 --> 25:50.080
is one of these objects per inode in the system, and currently we have around 560 million

25:50.080 --> 25:56.480
inodes in our system, and doing some rough calculations, it means that we have 3.5 million of

25:56.480 --> 26:02.160
these tiny objects per device. And if a device actually breaks and needs to be rebalanced,

26:02.960 --> 26:11.120
that's like 3.5 million objects to move, with the related operations, which are not so cheap

26:11.120 --> 26:19.680
in a system that is disk-based. So in our system it takes as much time to rebalance the

26:19.680 --> 26:25.040
CephFS default data pool, so the fully replicated one, as it takes to rebalance

26:25.040 --> 26:28.960
the actual data contained on the disk, so we're talking about 12 terabytes of data,

26:28.960 --> 26:35.600
as opposed to like 3.5 million tiny objects.
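
The arithmetic behind this point, redone as a sketch; it lands in the same ballpark as the talk's 3.5 million figure, with the exact number depending on pool layout and replication factor:

```python
# Rough recomputation of the backtrace-object load per OSD.
inodes = 560_000_000   # current inode count, from the talk
hdd_osds = 16 * 24     # spindle OSDs backing the default data pool
replica = 3            # the default data pool is replicated

objects_per_osd = inodes * replica / hdd_osds
print(f"~{objects_per_osd / 1e6:.1f} million tiny backtrace objects per OSD")
# Losing one OSD thus means rebalancing millions of tiny objects, which
# on spindles takes as long as moving the ~12 TB of actual data it held.
```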

26:35.600 --> 26:44.320
So the idea, if we had to do it again, would be to make this a flash pool as well, to leverage the kind of free IOPS that those devices

26:44.320 --> 26:53.760
can give. Another thing that I feel we did not put enough, let's say, effort on was to evaluate

26:55.040 --> 27:02.720
EC algorithms other than the default. There is a very nice talk from

27:02.800 --> 27:10.880
Jamie Pryde from IBM that was given at the last Cephalocon, which actually goes through

27:10.880 --> 27:20.400
a few of these algorithms, also debunking the notion that ISA-L works

27:20.400 --> 27:27.200
only on Intel CPUs; it also works on AMD CPUs. And it presented a few benchmarks that made

27:27.280 --> 27:35.440
us wish we had done that part as well, because the actual performance gain is more than

27:35.440 --> 27:41.280
2x for certain workloads. So depending on the use case, it would have been interesting to look at.
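
A sketch of the evaluation we would add in hindsight: define an erasure-code profile using the ISA-L plugin and benchmark a pool built on it against the default plugin; profile and pool names are hypothetical placeholders:

```python
import subprocess

# Create an ISA-L-backed EC 2+2 profile and a pool on top of it.
subprocess.run(["ceph", "osd", "erasure-code-profile", "set", "ec22-isa",
                "plugin=isa", "k=2", "m=2",
                "crush-failure-domain=host"], check=True)
subprocess.run(["ceph", "osd", "pool", "create", "bench-ec22-isa",
                "erasure", "ec22-isa"], check=True)
```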

27:41.280 --> 27:47.040
So this actually concludes my presentation. Are there any questions?

27:47.040 --> 28:11.120
Yeah, every benchmark that you saw, was everything on... Sorry, yes: the question was whether the graphs that I showed

28:11.120 --> 28:18.400
before, for the IOPS numbers, were on spindles or not. All the benchmarks that you saw,

28:18.400 --> 28:22.160
all the graphs that are in the presentation, are always benchmarking spindles, because

28:22.160 --> 28:24.960
in the end that's what we have to work with.

28:27.360 --> 28:33.200
I think my question was: what did you actually use to process the benchmarks, to actually

28:33.200 --> 28:40.640
look at this data? Oh, that was... to be honest, I don't remember. Sorry, the question was

28:40.720 --> 28:47.840
what we actually used to process the data from the benchmarks, and the answer is that I don't

28:47.840 --> 28:55.520
remember, I'm sorry about that. The processing was done by StackHPC, so I don't actually have

28:55.520 --> 29:03.920
the actual scripts for the data processing part. Yes.

29:04.000 --> 29:09.600
Now, we're currently in the process of designing a new Ceph cluster. We got the advice

29:09.600 --> 29:18.560
for the OSD servers to use single socket. Would you suggest using

29:18.560 --> 29:27.760
single socket? Yes, so indeed, we never really managed to benchmark the difference between

29:27.760 --> 29:35.280
dual socket and single socket, sorry. So the question was: we used dual-socket servers,

29:36.480 --> 29:41.760
and isn't it better to actually use single-socket servers for this use case?

29:43.760 --> 29:51.440
There was a test done on a dual-socket system with everything separated, with dedicated storage

29:51.520 --> 29:58.720
controllers attached to both CPUs and NICs for each CPU, but it didn't really seem to have

29:58.720 --> 30:05.920
such an advantage over the plain dual-socket mixed case. So I'm not sure if this

30:05.920 --> 30:12.240
also holds for a single-socket server, but yeah, this is something that I heard

30:12.240 --> 30:16.720
several times; but then, when we actually went to the vendor and asked for hardware

30:16.800 --> 30:24.960
that could fit the space or the budget that we had, single socket was always

30:24.960 --> 30:32.720
out of reach for us, so we never really managed to try that. So thanks everyone, and I'm a little bit...

