WEBVTT

00:00.000 --> 00:10.240
The leader node for that Raft group. So that's roughly how it works. So let's say

00:10.240 --> 00:16.760
TiKV node 2 goes down, there'll be a local election only on that region, a new leader will

00:16.760 --> 00:24.320
be elected, and then let's say you've done a BEGIN, a SELECT, an INSERT, whatever,

00:24.320 --> 00:27.920
the transaction will not be rolled back. Behind the scenes, there'll be an election and

00:27.920 --> 00:32.040
the transaction will still succeed, provided the other two nodes, the majority of replicas,

00:32.040 --> 00:37.400
are still up. The application won't even know that there's been a problem, and behind

00:37.400 --> 00:45.680
the scenes, TiKV will start moving replicas around in order to maintain

00:45.680 --> 00:52.160
data safety. So that's roughly how it works. To answer the question:

00:52.160 --> 01:03.840
the region is not a physical unit, it's more like a logical unit. It's

01:03.840 --> 01:09.680
ordered, so you will always get the same rows of the same table and the same index in the same region,

01:09.680 --> 01:18.520
and they'll be ordered. That's what a region is. I have to finish quickly, that's why.
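The ordered-region idea described above can be pictured as a contiguous slice of a sorted keyspace. A minimal Python sketch (the boundary keys and function name are made up for illustration, not TiKV's actual encoding):

```python
import bisect

# Hypothetical sketch: a "region" is a contiguous, sorted key range.
# Split points partition the keyspace; every key maps to exactly one region.
split_keys = [b"t1_r0500", b"t1_r1000", b"t2_r0000"]  # region boundaries

def region_for(key: bytes) -> int:
    # bisect_right finds the first boundary greater than the key,
    # which is the index of the region holding that key
    return bisect.bisect_right(split_keys, key)

# Rows of the same table encode to adjacent keys, so they land in the
# same (or neighbouring) regions, in sorted order.
assert region_for(b"t1_r0001") == 0
assert region_for(b"t1_r0750") == 1
assert region_for(b"t2_r0042") == 3
```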

01:18.520 --> 01:23.800
So it uses Percolator. It supports read committed and snapshot isolation; the snapshot isolation level

01:23.800 --> 01:30.040
is compatible with InnoDB's repeatable read. There are some minor subtle differences,

01:30.040 --> 01:37.400
but overall your application will just work. As I mentioned,

01:37.400 --> 01:44.160
PD is the metadata server that tracks all the regions across the cluster.

01:44.160 --> 01:49.680
Originally, TiKV was written with optimistic concurrency control, but we found that many people

01:49.680 --> 01:58.480
struggled with that. So by default it's disabled now, and it does pessimistic concurrency control,

01:58.480 --> 02:07.640
but optimistic concurrency control is still used for auto-commit. So this stuff works pretty

02:07.640 --> 02:14.680
well. What are the challenges that we face? Compaction causes a huge problem. So now

02:14.680 --> 02:20.200
let's say you have a multi-tenant system, and it's all running on the same kind

02:20.200 --> 02:27.680
of cluster, and compaction kicks in because of some application that's doing heavy

02:27.680 --> 02:37.160
writes. It affects everybody. Can I get some water? So you want to avoid things like

02:37.160 --> 02:44.960
compaction causing jitter in your database. So that's one problem that we face. Provisioning

02:44.960 --> 02:52.800
can take a long time. So let's say you add a new node. Data has to be moved around behind

02:52.800 --> 02:56.880
the scenes and that causes problems. It takes a long time to add a new node, even though

02:56.880 --> 03:02.480
it's automatic. You want it to be instant. You add a new node and instantly it should be

03:02.480 --> 03:07.880
available. That's another problem you want to solve. We have people asking us, and

03:07.880 --> 03:12.800
we're talking about 200 petabytes of data. There is no cluster that I'm aware of that has

03:12.800 --> 03:18.680
even more than six petabytes. We want to solve those kinds of problems. I don't think anybody

03:18.680 --> 03:25.800
has the answers, but people are talking about these kinds of clusters. People are consolidating

03:25.800 --> 03:30.680
workloads to reduce cost. And they don't want the applications to interfere with each other.

03:30.680 --> 03:35.160
They want to isolate them, but they want to have one cluster for all these applications.

03:35.160 --> 03:41.160
They want to reduce cost and reduce the complexity of managing these large clusters. You want to solve that

03:41.160 --> 03:48.200
problem. So in any distributed system, a trade-off usually becomes a knob. Somebody wants

03:48.200 --> 03:53.560
something, then you add another parameter to fiddle with to balance it for whatever you

03:53.560 --> 04:00.640
want to solve there. The other thing is it should be on demand. It should scale down as quickly

04:00.640 --> 04:08.360
as it scales up. One thing you can't avoid paying for is storage. But compute,

04:08.360 --> 04:13.800
you don't want to pay if you're not using it. So that's another thing you want to solve.

04:13.800 --> 04:20.880
So you want compute to be cheap. And you don't want to pay for expensive storage when you're not using

04:20.880 --> 04:28.680
it. Let's say you have like a 20-petabyte database. The hot set is only one terabyte,

04:28.680 --> 04:32.520
let's say. You don't want to be paying a premium for the entire data set. You only want the

04:32.520 --> 04:39.720
hot set. And for the cold data you want to pay as little as possible. And for companies that

04:39.720 --> 04:45.440
are running TiDB at petabyte scale, this is a big cost for

04:45.440 --> 04:54.520
them. A huge cost. So how do we solve this? This is really the bulk of the talk.

04:54.520 --> 04:59.800
So we want developers to start small. Just because the system can

04:59.800 --> 05:04.840
handle, I don't know, one petabyte, it doesn't mean that you have to set up a one-petabyte system;

05:04.840 --> 05:11.360
you should be able to start with a 100-megabyte or 10-megabyte system. Start small, but

05:11.400 --> 05:22.180
scale easily, very easily. So that's one target. You should be

05:22.180 --> 05:28.200
able to use any kind of storage you want, but ideally the cheapest. And cheapest is not

05:28.200 --> 05:35.480
just storage. Because if you look at cloud pricing, S3 is cheap to store,

05:35.480 --> 05:42.440
but it's extremely expensive to access. So that's another thing you want to fix. You want

05:42.440 --> 05:49.120
to grow with demand. It should be elastic. It should give you, whatever, the two nines, the five

05:49.120 --> 05:55.800
nines, whatever it is, of resilience. These are very difficult things to achieve at scale. And so

05:55.800 --> 06:00.120
if Amazon or Google or whoever has solved that problem for you, you want to leverage it.

06:00.120 --> 06:07.440
You don't want to create your own data center and your own deployment. You want to leverage the good

06:07.440 --> 06:12.080
stuff that they've already done for you, the hard stuff that they've done for you. Distributed

06:12.080 --> 06:16.880
systems are notoriously complex. I would say they're probably the most complex thing that

06:16.880 --> 06:21.080
I have ever worked on. That doesn't mean there aren't other complex things, but I have not worked

06:21.080 --> 06:28.040
on anything more complex than this. The other thing that we see in applications

06:28.040 --> 06:33.680
is that applications don't just talk to the database. They talk to many other services.

06:33.680 --> 06:38.720
So in the cloud, you want to use the cloud to leverage all the other services that the

06:38.720 --> 06:45.120
clouds provide. And the last point I've just mentioned: when the data is

06:45.120 --> 06:49.120
not being accessed, you just want to pay for the cheap storage. So that's the goal.

06:49.120 --> 06:56.200
So this is roughly how the costs compare on the cloud between S3 and block

06:56.200 --> 07:04.200
storage like EBS. The storage cost of object storage is low; EBS is high. Object storage

07:04.200 --> 07:12.160
like S3 is unlimited; EBS is sort of limited. The durability of S3 is very high; block storage

07:12.160 --> 07:19.600
is less. Latency is higher for S3, but it's low for block storage. How do you get the best

07:19.600 --> 07:27.440
of both? That's what the cloud-native database needs to leverage. How do you fix this problem?

07:27.440 --> 07:34.920
And the request cost is high for S3 and low for EBS. So this is

07:34.920 --> 07:39.320
really the environment in which this has to work, and you have to leverage and make use of it in the

07:39.320 --> 07:47.200
most optimal way. So TiDB has something called Serverless. It's multi-tenant, and it has

07:47.200 --> 07:52.640
disaggregated storage. So we've replaced RocksDB. In the same way, anyone can download

07:52.640 --> 07:58.520
TiKV, pull RocksDB out, and start hacking on it, trying the same thing. Hopefully

07:58.520 --> 08:05.200
I'll be able to get that across in some hand-waving way. So it uses local disk as a cache.

08:05.200 --> 08:11.640
That's the core idea. It uses S3 as its primary store. So think about it. Now imagine

08:11.680 --> 08:19.960
you have stateless compute nodes, and you have a storage layer like S3, and you have,

08:19.960 --> 08:27.960
let's say, EBS as a cache. If I fire up a compute node, I don't need to do any rebalancing,

08:27.960 --> 08:32.560
moving data around. I just connect directly to S3 and instantly it's available. So that's

08:32.560 --> 08:37.120
one huge thing for people who have to fight fires in emergencies and there's like a spike

08:37.160 --> 08:42.360
and you just go, okay, spin up. And you reduce the latency so that the user doesn't see

08:42.360 --> 08:49.360
anything. You have a pool of these hot, what do we call them? Spot nodes, spot

08:49.360 --> 08:53.880
instances. So if you were doing it yourself, you would have some kind of Kubernetes

08:53.880 --> 08:58.680
thing with a pool of spot instances and they could be all kinds of different shapes, sizes,

08:58.680 --> 09:03.880
whatever. You fire one up and just connect to S3 and it just works because S3 can handle massive

09:03.880 --> 09:11.040
scale of throughput without affecting your quality of service. It's just

09:11.040 --> 09:16.800
fantastic. So this is what you want to leverage and this is really the core idea here.

09:16.800 --> 09:21.920
So from a shared-nothing architecture, it becomes a shared-storage architecture.

09:21.920 --> 09:28.200
Simply because you use S3. Now what does EBS do? It's like any other cache in a database,

09:28.200 --> 09:34.640
like a buffer pool. So in theory, if you want to reduce the latency further, you could write

09:34.640 --> 09:40.200
your own, shared memory, thank you very much. You could write your own shared-memory

09:40.200 --> 09:47.560
distributed cache. So your storage engine models, let's say, an abstraction of a distributed

09:47.560 --> 09:52.760
file system. So when it does a write, it first writes to S3, for durability, that's guaranteed,

09:52.760 --> 09:57.360
and then it writes to the cache. It uses it like a write-through cache. And so your

09:57.360 --> 10:01.760
hotspot, your hot data will always be in your cache. So depending on how much effort you

10:01.760 --> 10:05.800
want to put in and how much latency you can tolerate: you could have EBS as a cache, you could have, like,

10:05.800 --> 10:10.920
a sophisticated distributed memory cache, it's up to you, and those could even be spot instances.
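The write-through idea just described can be sketched in a few lines of Python (hypothetical class and dict-based stand-ins for S3 and EBS, not the actual TiKV code):

```python
# Hypothetical sketch of a write-through cache: every write goes to the
# durable primary store (S3 here) first, then to the fast local cache
# (EBS here), so the hot set naturally stays cached.
class WriteThroughStore:
    def __init__(self):
        self.s3 = {}     # stand-in for the object store (durable, slow)
        self.cache = {}  # stand-in for the local EBS/memory cache (fast)

    def put(self, key, value):
        self.s3[key] = value      # durability first
        self.cache[key] = value   # then populate the cache

    def get(self, key):
        if key in self.cache:     # hot data: served locally
            return self.cache[key]
        value = self.s3[key]      # miss: fetch from S3 ...
        self.cache[key] = value   # ... and fill the cache
        return value

store = WriteThroughStore()
store.put("k1", "v1")
assert store.get("k1") == "v1"   # served from the cache
store.cache.clear()              # simulate losing the cache node
assert store.get("k1") == "v1"   # data survives, refetched from S3
```

The point of the last two lines is the correctness argument from the talk: the cache is an optimization, so losing it never loses data.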

10:10.920 --> 10:17.200
If you need to, just add some more EBS, increase your cache. This is the core idea behind

10:17.200 --> 10:23.080
it. And to achieve it, if you want to write your own, just download TiKV, rip out Rocks

10:23.080 --> 10:30.200
DB, and start working on it. It's a very simple API from TiKV to RocksDB. It's like a

10:30.200 --> 10:38.640
key-value store. It's like get, put, delete, and a few other things. One key

10:38.640 --> 10:46.680
point. So in our case, and this makes sense for anybody who wants to do it, okay,

10:46.680 --> 10:51.520
so how do you guarantee durability? One thing that we do is we put the Raft

10:51.520 --> 10:58.800
log on EBS, because you want low latency. I haven't tried S3 Express,

10:58.800 --> 11:02.880
but I don't think it has the latency you need for something like a Raft log. For

11:02.880 --> 11:11.800
that you use EBS. But a Raft log is not a large data object. It just needs to be active

11:11.800 --> 11:16.040
to make sure, like any other log, like a write-ahead log in InnoDB. You truncate it when

11:16.040 --> 11:19.520
you don't need it anymore, when you know that all these entries have been applied. So it's

11:19.520 --> 11:24.800
very cheap on storage. It just depends how much, how much of the log you want to keep.

11:24.800 --> 11:33.560
It's small, just enough to get the applier nodes to apply whatever changes you're making.

11:33.560 --> 11:41.280
It's not a big deal. So the Raft log is on EBS, and that is what gives us the guarantee

11:41.280 --> 11:47.600
of durability. So if the Raft log says yes, it's been distributed. It's an atomic

11:47.600 --> 11:54.040
commit. It's all there. It can lazily from there be applied to S3, no big deal.
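This durability flow can be sketched in Python (hypothetical class and names, not the actual TiKV code): the write is acknowledged once the Raft log entry is on the low-latency store, and the data reaches S3 lazily afterwards, after which the log can be truncated.

```python
# Hypothetical sketch: durable-on-Raft-log, lazily applied to S3.
class StorageNode:
    def __init__(self):
        self.raft_log = []  # stand-in for the Raft log on EBS
        self.s3 = {}        # stand-in for the S3-backed KV data

    def write(self, key, value):
        self.raft_log.append((key, value))  # fsync'd to EBS -> durable
        return "committed"                  # safe to ack the client now

    def lazy_apply(self):
        # background step: apply committed entries to S3; once applied,
        # the log prefix can be truncated, keeping the log small
        for key, value in self.raft_log:
            self.s3[key] = value
        self.raft_log.clear()  # truncate the applied entries

node = StorageNode()
assert node.write("a", 1) == "committed"  # acked before S3 sees the data
node.lazy_apply()
assert node.s3["a"] == 1 and node.raft_log == []
```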

11:54.040 --> 12:02.080
Okay. The data structure that TiKV uses is an LSM tree. I have not

12:02.080 --> 12:06.120
seen a B-tree stored on S3, so I won't go into that. So the LSM tree, we think, is the best

12:06.120 --> 12:13.120
for the indexing technique for this. LSM files are very easy to move around. They're immutable.
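The value of that immutability can be shown with a tiny Python sketch of compaction (a hypothetical helper, not TiKV's actual compactor): the input runs are never modified, only a brand-new merged run is produced, which is why a background spot instance can do this straight off S3.

```python
# Hypothetical sketch: compaction merges immutable sorted runs into a
# new run; the inputs are left untouched, like SST files on S3.
def compact(*runs):
    # each run is a sorted list of (key, value); later runs win on ties
    merged = {}
    for run in runs:  # apply oldest -> newest
        merged.update(dict(run))
    return sorted(merged.items())  # a brand-new immutable run

old_run = [("a", 1), ("c", 3)]
new_run = [("b", 2), ("c", 30)]  # newer value for "c"
assert compact(old_run, new_run) == [("a", 1), ("b", 2), ("c", 30)]
assert old_run == [("a", 1), ("c", 3)]  # inputs never change
```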

12:13.120 --> 12:20.120
So once it's on S3, nobody's going to change it. So it makes things so much easier. The

12:20.120 --> 12:23.720
price you pay is latency, but then people have Bloom filters and all that kind of stuff

12:23.720 --> 12:28.880
that makes it faster. But one problem they do have, and we see it in production, is

12:28.880 --> 12:36.240
jitter. And how does this architecture solve the jitter? You can have spot instances that

12:36.240 --> 12:44.280
work directly on S3. And because the data is immutable on S3, you can run as many spot instances

12:44.280 --> 12:53.120
as you like, doing all the merges and compaction in the background. It will not affect your compute

12:53.120 --> 12:59.120
nodes that are serving business requests. That solves a huge problem in LSM trees. One thing

12:59.120 --> 13:07.000
that is not in my slides but I'll mention is that in our serverless design, we have query

13:07.000 --> 13:12.360
push-down. So what we've also done is we have spot nodes for what we call the remote

13:12.360 --> 13:18.920
coprocessor. So the optimizer will look at the query, create a

13:19.000 --> 13:24.720
DAG for aggregates and index scans or table scans or whatever, and push it down to these

13:24.720 --> 13:28.720
remote coprocessors for long queries. And it figures out, okay, this query is like a big one,

13:28.720 --> 13:33.200
This is a small one. Small ones go directly to EBS, read data, send it back, process it

13:33.200 --> 13:39.000
themselves. But for larger queries, we can fire up spot instances, which are called remote

13:39.000 --> 13:46.040
coprocessors, and offload the query to those, so that the short requests are not impacted

13:46.040 --> 13:50.600
by this. And all the larger queries are done by these remote nodes. So the only thing that

13:50.600 --> 13:54.400
the remote, well, I've worked on a bit of this code, so I know it a little bit. The only thing

13:54.400 --> 14:01.080
they need to do is, because the latest data is in the mem tables inside the storage nodes,

14:01.080 --> 14:07.440
they can't get that from S3. You have to synchronize the memtable contents first. So that's

14:07.440 --> 14:11.840
a bit of latency. But because these are large queries, if there's a few milliseconds

14:12.080 --> 14:16.600
of synchronizing the memtable data, it doesn't really matter in practice. So that's

14:16.600 --> 14:22.240
the extra step it needs to do. It's designed to be scalable. So you can

14:22.240 --> 14:27.880
keep adding these remote coprocessors, you can have bigger machines with more RAM to do more

14:27.880 --> 14:32.680
processing on these remote coprocessors. And they send their aggregated results to the compute

14:32.680 --> 14:39.680
nodes. So it's designed to be really, really scalable in every aspect of

14:39.680 --> 14:46.080
it. So this is what I've been talking about. So you have the compute nodes, they're

14:46.080 --> 14:53.040
stateless. You have the storage layer. And the storage layer, theoretically,

14:53.040 --> 14:56.320
can work without a cache. The cache is just an optimization technique. It's not really a

14:56.400 --> 15:03.400
requirement for correctness. So if you look at the diagram, you have storage nodes that

15:03.400 --> 15:09.920
are writing directly to or accessing S3. And the part they keep locally is the Raft log, which

15:09.920 --> 15:17.920
I'll show in the next slide. And the resource pool, you can have Kubernetes doing

15:17.920 --> 15:25.920
this stuff. The MPP part would be your TiFlash nodes that do parallel analytics queries and

15:25.920 --> 15:30.320
stuff like that. DDL, so there's a DDL worker. We also have distributed DDL, which

15:30.320 --> 15:35.880
follows Google's F1 online schema change, and can do parallel DDL across

15:35.880 --> 15:41.320
all your nodes. It's not in the serverless one, but in our regular TiDB, I think we've

15:41.320 --> 15:48.920
added indexes on 40 terabytes in like half an hour or something. So all the nodes are working

15:48.920 --> 15:53.320
in parallel. It's really impressive stuff. So this is what the general architecture looks like

15:53.320 --> 16:01.280
at the hand-wavy level. This is the same thing, but in more detail. So in our deployment,

16:01.280 --> 16:06.680
if you want, like, a more sophisticated production platform, you need gateways and you need,

16:06.680 --> 16:10.520
like, manager servers that are managing your pool. So that's where the complexity is. I won't

16:10.520 --> 16:14.280
talk about that in detail because it's got little to do with the storage side of things. So

16:14.280 --> 16:18.400
you will need all these things. You'll need some kind of pool controller that goes and looks

16:18.400 --> 16:24.160
at the stateless pool of compute nodes. And this is operational stuff that is very specific

16:24.160 --> 16:28.600
to how people want to run it. But you will need some layer that makes it automatic and makes

16:28.600 --> 16:36.960
it easy to manage. So this is the storage part. So if you look at the storage node,

16:36.960 --> 16:42.440
TiDB at the top would be the Raft part of it, sorry, that's the interface it has to

16:42.440 --> 16:47.480
the compute nodes. And the Raft engine is on EBS. If you see the green part, those are

16:47.480 --> 16:54.440
the EBS drives, or EBS storage. And the Raft engine writes to that. So you have atomicity

16:54.440 --> 17:02.400
and consistency and durability, all of that comes from the Raft log, the writes to EBS.

17:02.400 --> 17:06.880
And there's a Raft engine on each of these nodes. And your KV engine for the data will go

17:06.880 --> 17:15.440
to S3. And the cache, which is missing from this diagram, would also use EBS as a cache.

17:15.440 --> 17:20.440
So this is what it would look like as a complete picture. So on the left is how the current

17:20.440 --> 17:27.120
TiDB that you can download from anywhere works. And it's more or less what I

17:27.120 --> 17:31.200
explained. So there are multiple instances. You have the memtables. So if you know how

17:31.200 --> 17:42.600
an LSM tree works, you have a memtable. And then once that fills up, I think RocksDB has two by default,

17:42.600 --> 17:47.640
so then it makes it an old memtable, immutable, and puts it to one side. And then

17:47.640 --> 17:52.880
when the second one fills up, it writes the immutable one to L0. Then once it goes to

17:52.880 --> 17:57.840
disk, it's immutable. And those little pink things with the dotted arrows are the ones

17:57.840 --> 18:05.720
on disk. So this is the current one. And this is the more cloud-enabled architecture. And

18:05.720 --> 18:10.440
we actually have our own version of this, but anyone can take a code and rip it out and

18:10.440 --> 18:15.000
do their own thing. So in this case, you can see the Raft engine is writing to the

18:15.000 --> 18:23.040
Raft log. That's on EBS. The memtable part is common to both. Except in this case,

18:23.040 --> 18:31.360
if you want to offload your expensive or large queries, analytic queries, to separate

18:31.360 --> 18:36.960
spot nodes, then you would need some kind of synchronization with the different memtables

18:36.960 --> 18:43.160
of the regions. And they would all go to S3. Because it's immutable, anyone can

18:43.160 --> 18:50.520
just go pick it up and read massive amounts of data. So what does this do? It reduces our cost

18:50.520 --> 18:56.680
by using the cache to reduce accesses to S3. And it reduces the cost of EBS, because you

18:56.680 --> 19:01.120
only keep the hot data there, just by design. You don't have to do anything special.
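The buffer-pool analogy here can be sketched with a tiny LRU cache (a hypothetical class; the actual cache policy used in TiKV isn't specified in the talk), where every miss stands for a paid S3 request:

```python
from collections import OrderedDict

# Hypothetical sketch: the EBS cache behaves like a buffer pool. An LRU
# policy keeps whatever is accessed most; each miss is a (costly) S3 fetch.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key, fetch_from_s3):
        if key in self.data:
            self.data.move_to_end(key)      # mark as recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1                    # each miss = a paid S3 request
        value = fetch_from_s3(key)
        self.data[key] = value
        if len(self.data) > self.capacity:  # evict the coldest entry
            self.data.popitem(last=False)
        return value

cache = LRUCache(capacity=2)
s3 = {"a": 1, "b": 2, "c": 3}
for k in ["a", "b", "a", "c", "a"]:         # "a" is the hot key
    cache.get(k, s3.__getitem__)
assert cache.hits == 2 and cache.misses == 3
assert "a" in cache.data                    # hot data stays cached
```

Growing the EBS cache is then exactly like growing a buffer pool: a larger `capacity` turns misses into hits and cuts the S3 request bill.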

19:01.120 --> 19:06.840
The more you access, whatever you're caching, LRU, whatever caching strategy you use, based

19:06.840 --> 19:13.760
on that, it works just like a regular buffer pool in a

19:13.760 --> 19:21.360
single-node engine. If you have a smaller cache, you have page misses. If you have

19:21.360 --> 19:25.080
hits, the small EBS cache works for you. If you have misses, just like a regular cache, well,

19:25.080 --> 19:29.360
then you have to fetch from S3 and there's more cost. So you increase your EBS cache,

19:29.360 --> 19:36.000
just like you're increasing the buffer pool, exactly the same. And so the only mental

19:36.000 --> 19:39.200
shift that you need to sort of think about is that rather than just memory for the

19:39.200 --> 19:46.200
cache, in this case you can also use disk. So it's like a CPU's L1, L2, L3 cache hierarchy.

19:46.200 --> 19:51.640
Except the weird thing, where the analogy breaks, is that accessing memory

19:51.640 --> 19:58.200
is not expensive, but S3 is expensive. That's it. There are latency differences,

19:58.200 --> 20:04.200
yes, but there's no extra expense; where it breaks is the cost factor: accessing S3 is extremely

20:04.200 --> 20:12.200
expensive. So that's what you reduce. So the Raft log part is exactly the same.

20:12.200 --> 20:15.800
There's no difference. So if you have your own Raft implementation, you can use that, or you can use

20:15.800 --> 20:22.800
the one we have, it's not a big deal. So the remote storage services are essentially

20:22.800 --> 20:28.400
spot instances for doing compaction, for doing backup and recovery, all of that.

20:28.400 --> 20:32.800
So one of the things that we struggled with initially, when we went to like 400, 500 terabytes,

20:32.800 --> 20:39.800
the big issue is backup and recovery. That's a huge problem. Backing up one terabyte,

20:39.800 --> 20:45.440
500 gigabytes, is not such a big deal. Try doing a one-petabyte backup. Even an incremental one

20:45.440 --> 20:50.840
will be hundreds of terabytes. How do you solve this problem? And how do you solve it

20:50.840 --> 20:56.080
without impacting the workload? By using S3. It gives you such fantastic bandwidth. They've solved those problems

20:56.080 --> 21:04.240
for you. That's why you want to leverage these facilities and reduce your own complexity

21:04.240 --> 21:09.800
in your database design. It is a lot simpler today, in my opinion, to write really

21:09.800 --> 21:15.760
scalable databases. You can go all fancy using whatever time sync and whatever other stuff

21:15.760 --> 21:24.280
that Google or AWS are offering, they have atomic clocks now, and write your own protocol to reduce

21:24.280 --> 21:31.800
latency in the database. It's not so difficult. I think that's my last slide,

21:31.800 --> 21:39.960
probably. Yes. So whatever this QR code is, if you use it, we're inviting open-source

21:39.960 --> 21:45.520
developers. We give credits to use our cloud service for free. I don't know how much.

21:45.520 --> 21:52.760
Yeah, there you go. $2,000 credit? $1,000. If you want to see this architecture in action,

21:52.760 --> 21:58.880
you can use this. If you want to write your own, just rip out RocksDB, leverage the rest of

21:58.880 --> 22:04.720
this stuff, and write your own engine. And run MinIO, because it has an S3-compatible layer.

22:04.720 --> 22:09.120
You don't have to go to Amazon to write your own engine. Just bring up MinIO and start.

22:09.120 --> 22:14.920
Write a little LSM and start using it. And that's about it. Do you have any questions?

22:15.840 --> 22:16.840
Feel free to ask.

22:33.360 --> 22:39.000
Thanks for the presentation. So you said multiple times that anyone can take the open source,

22:39.000 --> 22:47.640
like, take RocksDB out and implement the API on S3. So have you actually

22:47.640 --> 22:53.320
seen anyone trying to do it in open source? No, so this is the first time I've said it,

22:53.320 --> 22:58.480
but we've done it and I have worked on it. So I know it's not difficult. Okay, yes.

22:58.480 --> 23:11.840
Anyway, the guy who asked the question actually hacks on TiKV.

23:11.840 --> 23:19.640
Yep, please. For objects on S3, do you have an optimal size for the

23:19.800 --> 23:23.640
object? Sorry? Do you have an optimal size?

23:23.640 --> 23:33.640
Oh, so the size can vary, and there's a reason for that. So PD, one of the bottlenecks

23:33.640 --> 23:38.360
in this design, is still the metadata server that does the moving of the objects. So the bigger

23:38.360 --> 23:42.840
the objects, the less it has to manage; the smaller the objects, the bigger the overhead of

23:42.920 --> 23:50.520
managing so many objects. So you can have any size. So yeah, what we do currently

23:50.520 --> 23:55.880
is that when we know that this is a hot region, we start splitting it very aggressively.

23:57.400 --> 24:02.280
So it doesn't even wait until it has to reach a certain size to split. We'll just keep splitting

24:02.280 --> 24:04.280
it more and more.
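The splitting policy described in this answer can be sketched in Python (a hypothetical function with a made-up threshold; the real heuristic isn't given here): a hot region is split in half regardless of its size, a cold one is left alone.

```python
# Hypothetical sketch: aggressive splitting of hot regions, with no
# size threshold involved.
def maybe_split(region, write_rate, hot_threshold=1000):
    start, end = region
    if write_rate < hot_threshold:
        return [region]           # cold region: leave it alone
    mid = (start + end) // 2      # split the key range in half
    return [(start, mid), (mid, end)]

assert maybe_split((0, 100), write_rate=50) == [(0, 100)]
assert maybe_split((0, 100), write_rate=5000) == [(0, 50), (50, 100)]
```

Applied repeatedly, a hot range keeps shrinking, which spreads its load across more objects and nodes.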

24:12.920 --> 24:14.120
All right, one last question.

24:28.600 --> 24:34.360
Hi, thanks for the great talk. So my question is like, how do you handle the batching

24:34.360 --> 24:38.840
when writing or reading from S3? Because the S3 cost model is something different. Sorry.

24:39.320 --> 24:43.880
The S3 cost model is based on the number of calls rather than the storage. So how do you handle the

24:43.880 --> 24:51.800
batching of writes and reads? Is that a problem for the initial write, where we need to do the

24:51.800 --> 24:59.160
Raft log apply, right? It's really related to the previous question. So how do you find the balance?

24:59.160 --> 25:06.920
So do you want to collect the writes and write bigger chunks, or do you want to send

25:07.000 --> 25:11.000
the smaller chunks as you split them? So this is the job of the apply thread.

25:12.920 --> 25:19.960
So even though you're aggressively splitting the thing, right, let's say you've got a very

25:19.960 --> 25:25.880
heavy load, right? And it could come from many memtables. It's not coming from one, right?

25:25.880 --> 25:31.720
So you can batch different memtables together, create a bigger batch to reduce the requests.

25:31.720 --> 25:36.440
It's like, let's say, a redo log in InnoDB. If you're just going to write one little log entry at

25:36.440 --> 25:41.080
a time, it's extremely expensive. So it's like group commit. You group commit as many

25:41.080 --> 25:47.880
memtables and files as you can together. And PD knows where they are. So you can slice and dice

25:47.880 --> 25:51.720
however you like. Thank you.
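The group-commit idea in this last answer can be sketched in Python (hypothetical names, not the actual apply-thread code): flushes from many memtables are combined into a single S3 object, so the per-request cost is paid once per batch rather than once per memtable.

```python
# Hypothetical sketch: the apply thread batches flushes from many
# memtables into one object, paying the S3 request cost once per batch.
def group_commit(memtable_flushes, put_object):
    batch = []
    for flush in memtable_flushes:  # flushes from different memtables
        batch.extend(flush)
    put_object(batch)               # one (paid) S3 PUT for the whole batch

requests = []
flushes = [[("a", 1)], [("b", 2), ("c", 3)], [("d", 4)]]
group_commit(flushes, requests.append)
assert len(requests) == 1  # three memtables, one S3 request
assert requests[0] == [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
```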

