WEBVTT

00:00.000 --> 00:17.220
So, we are going to talk about Vitess, which is a different approach to creating cloud-

00:17.220 --> 00:23.600
native databases than you have seen before, and we are taking the approach of explaining how you

00:23.600 --> 00:33.480
can migrate massive databases into Vitess to flesh out the architecture of Vitess.

00:33.480 --> 00:38.240
My name is Rohit Nayak, I have been a core maintainer of Vitess for the last five years, and I work

00:38.240 --> 00:40.840
at PlanetScale along with Matthias.

00:40.840 --> 00:45.320
I am Matthias, I also work at PlanetScale, in customer engineering, so whatever he builds,

00:45.320 --> 00:50.040
I get to implement.

00:50.040 --> 00:58.800
So, Vitess was created at YouTube around 2010, when people felt the need to distribute

00:58.800 --> 01:06.100
lots of cat videos; they were running out of bandwidth, so they built Vitess on top of

01:06.100 --> 01:10.320
MySQL. So this is a very different architecture from what you have seen before: it uses

01:10.320 --> 01:14.280
MySQL as the storage layer, so they were not reinventing the storage layer. You get all

01:14.280 --> 01:18.840
the benefits of MySQL, and Vitess fills in the gaps that are needed to make it into a massively

01:18.840 --> 01:21.840
scalable database.

01:21.840 --> 01:31.200
It is a graduated CNCF project, and it is open source. I will go over, in the next slide,

01:31.200 --> 01:39.240
what makes it cloud native, since this is the cloud native room. So, Vitess was built from scratch

01:39.280 --> 01:46.520
to work on Google Borg, which is the predecessor of, and the inspiration for, Docker and

01:46.520 --> 01:54.520
Kubernetes, so it was built to be cloud native from the ground up, and we will see what that

01:54.520 --> 02:02.360
means soon. It is supposed to be massively scalable, that is a given, highly available, and

02:02.360 --> 02:07.400
one of the core principles was that it has to be MySQL compatible, so that your apps should

02:07.480 --> 02:13.680
be able to talk to it like it is a regular MySQL database. Underneath you can have several

02:13.680 --> 02:19.520
databases; they can be vertically split, meaning you spread lots of your databases across

02:19.520 --> 02:27.640
MySQL instances, or horizontally sharded, where for one table, each row will live in exactly one

02:27.640 --> 02:33.320
shard, but the rows are distributed across multiple MySQL databases. That gives the

02:33.320 --> 02:40.320
massive scale; that is one of the key things of Vitess, the horizontal sharding. And

02:40.320 --> 02:45.840
because it gives you a single connection, the entire database underneath is seen as one logical

02:45.840 --> 02:51.440
database; you still get a single connection string, so database frameworks like Rails,

02:51.440 --> 02:59.440
Laravel, or ORM tools, everything just works; WordPress, for example.

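NOTE
A minimal sketch of the MySQL compatibility just described, with hypothetical host and credentials: any stock MySQL client or ORM connects to VTGate as if it were a single MySQL server.
    # connect with the standard mysql client; VTGate speaks the MySQL protocol
    mysql --host=vtgate.example.internal --port=3306 --user=app --password
    # an ORM would use an ordinary connection string such as:
    #   mysql://app:secret@vtgate.example.internal:3306/commerce
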
02:59.440 --> 03:03.960
A lot of very large users, just to give you a flavor that it is actually running at massive scale: most of Slack's

03:03.960 --> 03:11.560
traffic runs on Vitess; HubSpot; Square's Cash App, any Americans here

03:11.560 --> 03:18.640
who use Cash App, that is entirely on Vitess; Shopify; and some gaming companies. So

03:18.640 --> 03:25.360
it runs at really high scale. PlanetScale itself, where we work, runs managed Vitess,

03:25.360 --> 03:30.800
and we have, I don't know, several hundred thousand MySQL servers across clients, and

03:30.800 --> 03:35.240
the largest customer we have has, probably, right now, 256 shards; we are going to move

03:35.240 --> 03:43.960
to 512 shards, so it is pretty massive. So, since this is the cloud native data

03:43.960 --> 03:50.080
devroom, let's see what is meant by cloud native, and why Vitess, and what are the principles

03:50.080 --> 03:55.160
that we used to make Vitess cloud native. So: microservices-oriented design. All the

03:55.160 --> 04:00.600
components, as you will see soon, the different components, they are decoupled;

04:00.600 --> 04:08.560
they communicate with each other over gRPC APIs, so you can containerize each of these within

04:08.560 --> 04:15.760
a Kubernetes container, for example. And so you can scale it vertically by giving each container

04:15.760 --> 04:21.600
more hardware if, you know, you need more performance, or horizontally:

04:21.680 --> 04:29.520
you add more of these components. So what that lets you do is you can leverage

04:29.520 --> 04:34.240
commodity hardware. I think Sunny talked about price performance, and that's really

04:34.240 --> 04:38.640
important in the cloud, because you are running these massive numbers of instances, and you need

04:38.640 --> 04:45.200
them to be replicated, so you want to cut costs. So when you are able to use commodity hardware

04:46.120 --> 04:54.400
and shard, so with horizontal sharding, you spread the cost and the database bandwidth

04:54.400 --> 05:02.600
across multiple MySQL servers, and you can leverage cloud scaling. The cloud allows you to add

05:02.600 --> 05:08.000
new instances and drop instances very fast. So when you want to upscale, for example around

05:08.080 --> 05:13.360
a Thanksgiving event or, you know, Black Friday, we have many clients who will just ramp it up

05:14.080 --> 05:18.800
two, four times, and then bring it down immediately afterwards, and Vitess allows all that

05:18.800 --> 05:25.440
very quickly; you will see how that is done. High availability: so we are using

05:25.440 --> 05:32.320
MySQL, so we get the benefits of MySQL, the replication and semi-sync, and on top of that,

05:32.320 --> 05:37.600
we have an orchestrator, which is a cluster manager that looks at all the instances that

05:37.680 --> 05:42.320
are running; there is a health check going on for everything that is running, and

05:42.320 --> 05:47.360
if, say, a primary fails and you cannot write, it will promote a replica automatically.

05:49.680 --> 05:54.800
Kubernetes integration: there is a Vitess operator, which has the CRDs for all the different

05:54.800 --> 06:00.480
Vitess components, so you can just use regular Kubernetes

06:00.560 --> 06:07.840
for scaling and self-healing: you give declarative information on how you want the cluster to be,

06:07.840 --> 06:16.320
and Kubernetes will create it for you.

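NOTE
A hedged, abridged sketch of that declarative style, using the Vitess operator's VitessCluster resource; a real spec also needs cells, images, and storage settings, and the values here are made up.
    # vitesscluster.yaml, applied with: kubectl apply -f vitesscluster.yaml
    apiVersion: planetscale.com/v2
    kind: VitessCluster
    metadata:
      name: example
    spec:
      keyspaces:
        - name: commerce
          partitionings:
            - equal:
                parts: 2              # two shards
                shardTemplate:
                  tabletPools:
                    - type: replica
                      replicas: 3     # the operator keeps three tablets per shard
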
06:16.320 --> 06:23.200
Global availability: there is a concept of availability zones, similar to AWS, so you can have your clusters spread across different zones;

06:23.280 --> 06:30.080
you can also be on different cloud providers. There are users who really need it, I mean,

06:30.080 --> 06:39.520
who want it because they really don't want downtime at all. So, I am just going to explain

06:39.520 --> 06:44.720
the architecture. We have very little time in general, but I want to give a flavor of what

06:44.800 --> 06:53.200
Vitess is, and what does all this magic, using the example of a real user. It was

06:53.200 --> 07:01.520
a massive user; why did they want to move from RDS to us, to Vitess? They started scaling,

07:01.520 --> 07:05.760
they reached the limits of their vertical scaling, they couldn't get bigger instances,

07:07.680 --> 07:10.560
and they tried manual sharding; manual sharding is hard, as anyone who has tried it knows.

07:11.440 --> 07:16.640
Their costs were getting really high, because you are using these top-of-the-line instances,

07:17.520 --> 07:24.720
and they wanted more control over when to upgrade, and some of the configuration options were not

07:24.720 --> 07:32.080
available. This is specific to one user; there are different reasons people do it, I'm just giving a flavor

07:32.080 --> 07:38.400
of it. Vitess was important because it was MySQL compatible; horizontal sharding was one of the

07:38.480 --> 07:46.400
key reasons they adopted it, because then they could use the cheaper hardware mentioned before,

07:46.400 --> 07:52.080
and the high availability features. Another feature is for schema changes, right: when you

07:52.080 --> 07:58.560
want DDLs, your tables keep changing, you are developing all the time, so consider the time it takes

08:00.080 --> 08:06.080
for a schema to be changed: you cannot run it directly on, like, a 400 terabyte database,

08:06.800 --> 08:10.960
because it will just take hours and days, and it will bring down your system.

08:10.960 --> 08:18.240
So you have systems like gh-ost and other online schema change tools, and Vitess has something called

08:18.240 --> 08:26.720
Online DDL, which is built on top of those ideas and does distributed schema changes,

08:26.720 --> 08:32.400
and that is one of the reasons why this customer switched to Vitess.

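NOTE
A hedged sketch of requesting a managed, non-blocking schema change via Vitess Online DDL; the keyspace and table names are made up.
    # run the ALTER as an online migration instead of a direct DDL
    vtctldclient ApplySchema --ddl-strategy "vitess" \
      --sql "ALTER TABLE orders ADD COLUMN note VARCHAR(255)" commerce
    # the equivalent from a SQL session:
    #   SET @@ddl_strategy='vitess';
    #   ALTER TABLE orders ADD COLUMN note VARCHAR(255);
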
08:33.360 --> 08:41.520
I am now going to move to, sort of, an animation of the steps involved in

08:41.520 --> 08:50.320
migrating to Vitess. This is the initial stage, right, where they are just using RDS. So the first thing

08:50.320 --> 08:58.640
you do is you set up a Vitess cluster; everything in orange is a Vitess component.

08:59.520 --> 09:05.680
Normally you have, like, many of them, like the VTGates. Okay, I will go into each of them step by step.

09:06.800 --> 09:13.840
So, etcd is the topology service that we use for coordinating between all the components,

09:14.800 --> 09:25.280
and MySQL, of course, right. So the initial focus is on the external keyspace that's here;

09:25.840 --> 09:31.840
a keyspace is a logical database in Vitess terminology. VTTablet is a component that

09:31.840 --> 09:41.520
controls the database, the MySQL instance. Vitess components never talk directly to MySQL;

09:41.520 --> 09:47.440
they talk via VTTablets. There are many features in VTTablet, one of which we are going to be

09:47.440 --> 09:55.200
looking at right now: how it does migration and replication. But it's also used for

09:56.160 --> 10:07.840
connection pooling, query caching, protecting against bad queries, etc. So here the database is external,

10:07.840 --> 10:13.840
right, it's RDS, so we call it an unmanaged tablet. It doesn't bring it down; when the VTTablet

10:13.840 --> 10:17.840
comes up, it doesn't look at the health of RDS, it expects it to be up. So it's not controlling

10:17.840 --> 10:23.840
RDS, but it's used so that the other components can talk to RDS. All it needs is a connection

10:23.840 --> 10:33.600
string to the RDS cluster.

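NOTE
A hedged sketch of pointing an unmanaged VTTablet at an external MySQL such as RDS; the flag names follow recent Vitess docs as I recall them, the hostnames and credentials are made up, and other required flags are omitted.
    vttablet \
      --init_keyspace legacy --init_shard 0 --init_tablet_type replica \
      --unmanaged \
      --db_host mydb.abc123.us-east-1.rds.amazonaws.com --db_port 3306 \
      --db_app_user vitess --db_app_password '...'
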
10:34.880 --> 10:41.120
Now, there are workflows within Vitess, which are launched from the control plane; this is a control plane daemon, and there are command line, UI, and gRPC

10:41.120 --> 10:47.680
APIs available. So you start something called a MoveTables workflow, which moves all the tables

10:47.760 --> 10:57.920
from RDS into Vitess.

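NOTE
A hedged sketch of starting the MoveTables workflow just described with vtctldclient; the workflow, keyspace, and table names are made up.
    vtctldclient MoveTables --workflow rds2vitess --target-keyspace commerce \
      create --source-keyspace legacy --tables "customers,orders"
    # check the state of the copy at any time
    vtctldclient MoveTables --workflow rds2vitess --target-keyspace commerce show
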
10:57.920 --> 11:04.560
Now, realize that this is happening at scale while the system is online, right; we have users running at millions of QPS and terabytes of data, so it doesn't happen,

11:04.640 --> 11:12.640
like, instantly; it takes hours, sometimes days, for the whole thing to happen.

11:14.720 --> 11:21.600
Right, so let's say, for this user, this is a very simple case, but I'm just talking about two

11:21.600 --> 11:30.880
shards, just to show that it's a sharded architecture. So you tell Vitess: take a table, which has an auto-

11:31.040 --> 11:37.520
increment ID. What Vitess does is hash the ID, and the 64-bit keyspace ID space is

11:37.520 --> 11:44.000
split into two, saying: okay, the first half goes into shard -80, covering keyspace IDs from 0 up to 0x80, and the other

11:44.000 --> 11:49.200
half is going to the other shard, 80-. A very simple, basic sharding scheme I'm explaining here, and we need

11:49.200 --> 11:56.640
to do this when the data comes in, because the original RDS is not sharded. Sometimes,

11:56.720 --> 12:02.080
like right now, we are moving something to 32 shards (not this one), so several terabytes

12:02.080 --> 12:07.040
are going into 32 different shards, so that filtering is happening.

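NOTE
A hedged sketch of the sharding scheme just described, declared as a VSchema with a hash vindex on the auto-increment id column; all names are made up.
    vtctldclient ApplyVSchema --vschema '{
      "sharded": true,
      "vindexes": { "hash": { "type": "hash" } },
      "tables": {
        "customers": { "column_vindexes": [ { "column": "id", "name": "hash" } ] }
      }
    }' commerce
    # with shards -80 and 80-, each hashed id lands in exactly one of the two
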
12:07.040 --> 12:17.040
Right, so we start the workflow, and the app is still talking to RDS; nothing

12:17.040 --> 12:24.720
is changed, but there are these workflows, processes, that are started on each of these

12:24.800 --> 12:36.560
shards, which are pulling data through this and filtering it, asking only for those

12:37.920 --> 12:45.360
rows that should reside on this shard, right. And this VTTablet makes the calls to RDS; it converts them

12:45.440 --> 12:56.960
into internal objects called VEvents, and they talk over gRPC. Now, the copy is eventually consistent;

12:56.960 --> 13:00.960
I mean, of course, it has to be consistent before we switch, but initially we do what is

13:00.960 --> 13:11.600
called a copy phase, where we do bulk copies of tables, right. So we take each table, select chunks

13:11.680 --> 13:19.120
of it using snapshots, and these are all sent to the target. Here you get the VEvents,

13:19.840 --> 13:31.040
which are then converted into SQL, and the queries are applied on the MySQL database. And at

13:31.040 --> 13:37.840
this time there is still load, right; those writes are happening, and the rows we already copied may have

13:38.160 --> 13:45.520
changed. So, MySQL has something called binlogs, I'm not sure how many of you are aware of them,

13:45.520 --> 13:53.040
but, okay, people know what binlogs are. So essentially it's a transaction log of all

13:53.040 --> 14:01.600
the writes that are happening, right, provided by MySQL itself. So we tail the binlogs,

14:01.760 --> 14:07.920
and those are also processed along with this copy phase. Once you have copied all the tables,

14:09.600 --> 14:22.240
so this happens really fast because, yeah, we copy it, then tail it, and at some point you will see

14:22.240 --> 14:30.000
that we have reached, we are very close to, where the RDS cluster is. We call it VReplication,

14:30.880 --> 14:38.400
the module we use for this. So the lag is very low, and we can now switch. So typically

14:38.400 --> 14:43.120
you can switch all traffic, or you can say: first switch just the reads. And reads are specified by,

14:44.240 --> 14:50.080
you can specify the replica tablet type for a database and it will go do the reads; or if you have a separate

14:50.080 --> 14:54.080
connection string for reads, you would only change that one to point to Vitess. This is the time

14:54.080 --> 15:01.520
when you point it to Vitess; all writes are still happening on RDS, right. Because there are these,

15:01.520 --> 15:08.240
I'm not going to go into detail because of time, but there are these, we call them routing rules,

15:08.240 --> 15:13.040
which are in the topology, and which are created when the workflow is created. And so VTGate

15:13.040 --> 15:17.840
knows, all the VTGates know (there are typically hundreds of VTGates running if you have millions of transactions),

15:18.400 --> 15:23.760
so it knows, and it only routes the reads, the selects, from here; the DMLs are going there.

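NOTE
A hedged sketch of switching only the read traffic, continuing the hypothetical workflow from above.
    # route replica and rdonly (read) traffic to the Vitess side
    vtctldclient MoveTables --workflow rds2vitess --target-keyspace commerce \
      switchtraffic --tablet-types "replica,rdonly"
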
15:24.960 --> 15:34.400
You can at any time just revert, and the replication continues; or at some point you say: fine,

15:34.400 --> 15:41.200
now I want to switch writes. So after the write switch, reads and writes are all happening on

15:41.440 --> 15:48.560
Vitess; however, we also run a reverse replication stream into RDS, so all changes are

15:48.560 --> 15:54.800
tailed again from the binlogs and sent back to RDS. This is a safety feature, so that if you want to

15:54.800 --> 15:58.640
try it out and something does not work, or you have not provisioned your Vitess enough, or whatever, you

15:58.640 --> 16:04.560
just want to try, so you want to switch back, you can easily revert everything back.

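NOTE
A hedged sketch of the write cutover and the revert path for the same hypothetical workflow; reverse replication into the source starts as part of the write switch.
    # switch primary (write) traffic to Vitess
    vtctldclient MoveTables --workflow rds2vitess --target-keyspace commerce \
      switchtraffic --tablet-types "primary"
    # changed your mind? route everything back to the source
    vtctldclient MoveTables --workflow rds2vitess --target-keyspace commerce reversetraffic
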
16:04.640 --> 16:16.720
So that is one of the key features that gives you the comfort that you can move

16:16.720 --> 16:22.880
in or out of Vitess. And we also have tools like VDiff, which does row-by-row comparisons, for snapshots

16:22.880 --> 16:28.880
at a particular point in time, and does actual comparisons to tell you that it is all copied.

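NOTE
A hedged sketch of validating the copy with VDiff for the same hypothetical workflow.
    # compare source and target row by row
    vtctldclient VDiff --workflow rds2vitess --target-keyspace commerce create
    # inspect the result of the most recent diff
    vtctldclient VDiff --workflow rds2vitess --target-keyspace commerce show last
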
16:28.880 --> 16:40.640
So, when everything is fine, you just run the workflow complete step, and

16:40.640 --> 16:45.920
then you can deprovision your RDS, and you are fully in Vitess at this point; this is the point of no return.

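NOTE
A hedged sketch of that final step for the hypothetical workflow; complete cleans up the routing rules and workflow artifacts, after which the source can be deprovisioned.
    vtctldclient MoveTables --workflow rds2vitess --target-keyspace commerce complete
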
16:46.000 --> 17:00.480
So Matthias will now talk about some actual migrations that we do, and what changes

17:00.480 --> 17:08.080
and what special things we need to do for larger customers. Thank you, Rohit. So, in the example

17:08.080 --> 17:14.240
that Rohit was showing, you only saw one RDS instance; typically that would be your writer, your primary.

17:14.880 --> 17:20.160
In many cases, when you have a very busy system, you don't want to copy from your writer, because

17:20.160 --> 17:27.040
that takes all your writes, and you want to copy from a replica. If you want to copy from a replica,

17:27.040 --> 17:32.080
what is important is that you have global transaction IDs (GTIDs) enabled in MySQL, because otherwise

17:32.080 --> 17:38.400
it is very hard to synchronize when you want to do the cutover to the new instance. With global

17:38.400 --> 17:44.480
transaction IDs, it is just waiting for the GTID that you see at the beginning,

17:44.480 --> 17:50.240
when you start the process: wait until that ends up on your target system, and then you can do

17:50.240 --> 17:58.240
the switching easily.

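NOTE
A hedged sketch of the GTID-based synchronization just described; the hostnames and the GTID set are placeholders.
    # confirm GTIDs are enabled on the source
    mysql -h source-host -e "SELECT @@global.gtid_mode"
    # on the target, block until a given source GTID set has been applied (0 = no timeout)
    mysql -h target-host -e "SELECT WAIT_FOR_EXECUTED_GTID_SET('3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5', 0)"
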
17:58.240 --> 18:03.760
Rohit also talked about VEvents; those are the events coming from the binary logs, and in order to process those into SQL, you need to have the binlog

18:03.760 --> 18:09.120
format set to ROW and the binlog row image set to FULL, so you get all the rows and

18:09.120 --> 18:16.160
all the fields in the binlog event, so you can easily convert.

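NOTE
A hedged sketch of checking the source settings just mentioned; on RDS and Aurora these are usually set through parameter groups rather than SET statements.
    mysql -h source-host -e "SELECT @@global.binlog_format, @@global.binlog_row_image"
    # expected for VReplication: binlog_format=ROW, binlog_row_image=FULL
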
18:16.160 --> 18:24.080
Rohit also quickly talked about the two phases: the copy phase of the migration and then the catch-up phase. The copy phase is

18:24.080 --> 18:29.200
basically where it runs a long-running select, like SELECT * FROM table, and starts streaming

18:29.200 --> 18:37.200
that to your other side. By default it runs for one hour, and then after one hour a catch-up phase,

18:37.200 --> 18:41.920
reading the binary logs, is started, and all the deltas that were recorded in the binary logs

18:41.920 --> 18:47.760
for the rows already copied will be applied. Any events that are targeting a row that we

18:47.760 --> 18:53.760
did not copy yet can be dropped, because we will copy those rows later, and we will copy the state

18:53.840 --> 19:01.760
they are in at that point in time. We do throttle the copy phase, because if you have a long-running

19:01.760 --> 19:06.560
select in MySQL, your history list length will start building; that's what MySQL needs to

19:06.560 --> 19:13.760
guarantee the MVCC constraints. And so, by default, if your history list length is

19:13.760 --> 19:21.360
over one million, we do not start a new copy phase; we will finish the current copy phase, but we will

19:21.440 --> 19:26.320
not start a new one until the history list length has dropped below that. Same thing if you're copying

19:26.320 --> 19:31.920
from a replica: if your source replica is more than 12 hours behind, we will not start a new copy

19:31.920 --> 19:36.080
phase; we will wait until the replication lag drops, and then we will start a new copy phase.

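NOTE
A hedged sketch of watching the two signals that drive this throttling; hostnames are placeholders.
    # InnoDB history list length on the source (compared against the one-million default)
    mysql -h source-host -e 'SHOW ENGINE INNODB STATUS\G' | grep 'History list length'
    # replication lag when copying from a replica
    mysql -h replica-host -e 'SHOW REPLICA STATUS\G' | grep 'Seconds_Behind'
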
19:38.480 --> 19:46.400
A special case: when you have Aurora MySQL from Amazon as your source database, there are no binary

19:46.400 --> 19:56.320
logs on an Aurora replica; it's just not possible to enable them. So what we do for that, if we need to

19:56.320 --> 20:01.200
copy from a very busy Aurora cluster and we cannot use the Aurora replica: we set up a secondary

20:01.200 --> 20:09.840
cluster that does binary log replication from the original cluster, and then you use the primary

20:09.840 --> 20:14.160
of that cluster, because that, again, is the only instance that has binary logs: because it

20:14.160 --> 20:21.680
replicates the binary log from your original cluster, it writes binary logs itself, and that is

20:21.680 --> 20:29.760
what we then use to connect our replica tablet to for the copying. Next, Vitess as a

20:29.760 --> 20:38.560
source: this is something we came up with for a customer that had an on-premises Vitess in their own

20:38.560 --> 20:44.240
data center, with Vitess running, and they wanted to move into the cloud, and they wanted to do this

20:44.240 --> 20:49.920
not by stopping everything on the source and starting everything in the cloud, because they were

20:49.920 --> 20:56.640
doing transactions and business, and stopping everything would be bad for business. So we came

20:56.640 --> 21:05.360
up with a shard-by-shard migration strategy. Important here is that you have the same

21:05.360 --> 21:16.240
sharding scheme for the source and target database (keyspace, sorry), because we then map them one to one,

21:16.240 --> 21:23.440
and we can then migrate them one by one. You need to have that flag, enable partial keyspace

21:23.440 --> 21:28.720
migration, on the VTGate during the migration; after the migration you can leave that one out again,

21:28.720 --> 21:33.680
because it adds a little bit of overhead at the VTGate level, because it needs to

21:33.680 --> 21:40.880
check more things. Rohit also briefly touched on routing rules; for this special case

21:42.080 --> 21:46.800
we have introduced something called shard routing rules, so we can then create a routing rule

21:46.800 --> 21:52.080
per shard. Like, you can say: shard -80, like the one in the example that Rohit

21:52.080 --> 21:57.920
showed earlier, will now start going to the cloud Vitess and not the on-premises Vitess, while the other shard still goes to the on-premises Vitess.

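NOTE
A hedged sketch of a shard-by-shard (partial keyspace) migration; the flag and subcommand names follow recent Vitess releases, but verify them against your version, and all other names are made up.
    # VTGate needs partial-keyspace routing support during the migration
    vtgate --enable-partial-keyspace-migration ...
    # move one shard at a time by restricting the workflow to a source shard
    vtctldclient MoveTables --workflow move80 --target-keyspace cloudks \
      create --source-keyspace onpremks --source-shards "-80" --all-tables
    vtctldclient MoveTables --workflow move80 --target-keyspace cloudks switchtraffic
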
21:58.880 --> 22:08.640
We executed this recently

22:08.640 --> 22:15.120
with a customer, up to 256 shards; that's not a hard limit, that's just the number of shards that they were

22:15.120 --> 22:24.000
running at that point in time. After you do the shard-by-shard migration, it's not just the complete

22:24.000 --> 22:29.360
command like normal, where you would run complete and it would clean up everything; in this shard-by-shard

22:29.360 --> 22:34.480
case you have to do some manual cleanups afterwards, just to be aware. There is documentation for this,

22:34.480 --> 22:40.080
also, on the Vitess documentation website, so if you want to read more, definitely go there.

22:42.240 --> 22:50.960
And then another type of migration that we are currently working on is multi-tenant. What

22:50.960 --> 22:57.680
do I mean by multi-tenant? That is when you have a set of tables per customer of yours,

22:57.680 --> 23:02.480
per tenant, like if you have each of your customers being a tenant in your environment:

23:04.320 --> 23:09.760
they have a set of tables, and then your business grows. You start off with maybe a thousand

23:09.760 --> 23:14.800
customers, or a thousand tenants; let's say you have a hundred tables, so you have a hundred thousand

23:14.800 --> 23:19.120
tables. If you go to a hundred thousand customers, you have ten million tables in your database,

23:19.200 --> 23:31.360
which is not so great for MySQL. So we were facing this, and then I was talking to

23:31.360 --> 23:37.120
Rohit, and we came up with this multi-tenant migration. So what we do: to each of the tables we add a

23:37.120 --> 23:44.720
unique identifier identifying the tenant; it can be an ID, like a customer ID starting at one,

23:45.280 --> 23:51.760
counting upwards. And then all your queries should include that WHERE

23:51.760 --> 23:59.360
condition, WHERE your customer ID or whatever equals the ID of the tenant. And then, in your

24:00.880 --> 24:07.360
VSchema, which is what Vitess uses to identify where the shards are,

24:08.320 --> 24:12.880
you add an identification, the multi-tenant spec, and then you say: this is the

24:13.840 --> 24:18.880
field, like your tenant ID or customer ID, and this is the type of the field. And then, to your

24:18.880 --> 24:26.560
MoveTables command, you add a tenant ID value when you create it, and at that point in time all

24:26.560 --> 24:31.840
your VReplication queries will have an extra WHERE condition, so it always filters on that tenant ID.

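NOTE
A hedged sketch of the multi-tenant setup just described; the multi-tenant VSchema spec and the --tenant-id flag follow recent (v20-era) Vitess docs as I recall them, and all names are made up.
    # declare the tenant column in the target keyspace's VSchema
    vtctldclient ApplyVSchema --vschema '{
      "multi_tenant_spec": {
        "tenant_id_column_name": "customer_id",
        "tenant_id_column_type": "INT64"
      }
    }' saas
    # migrate one tenant; VReplication then filters on customer_id = 42 everywhere
    vtctldclient MoveTables --workflow tenant42 --target-keyspace saas \
      create --source-keyspace tenant42src --all-tables --tenant-id 42
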
24:31.840 --> 24:39.200
You can then again do the cutover, forward and backwards, per tenant, without needing to do the

24:39.200 --> 24:45.440
one big bang for everything, and this can merge hundreds, or thousands, or hundreds of thousands of

24:45.440 --> 24:53.440
schemas into one sharded Vitess keyspace. And for this we also introduced a special kind of routing rules,

24:53.440 --> 24:59.520
because normally you would have routing rules per table, and that is going to

24:59.520 --> 25:04.720
explode very quickly if you have a lot of tenants and a lot of tables per tenant. So we basically

25:04.720 --> 25:11.920
added something called keyspace routing rules, so that we can route your queries per tenant back and forth into Vitess.

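NOTE
A hedged sketch of routing at the keyspace level instead of per table; the command exists in recent Vitess, but the rule body here is illustrative.
    vtctldclient ApplyKeyspaceRoutingRules --rules '{
      "rules": [
        { "from_keyspace": "tenant42src", "to_keyspace": "saas" }
      ]
    }'
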
25:11.920 --> 25:20.400
Yeah. And then, what you best do in this case is shard your Vitess

25:20.400 --> 25:26.400
cluster on that same tenant ID field, so then you can spread out your different tenants across

25:26.400 --> 25:36.160
multiple shards and not overload one database server. Again, some numbers: we ran thousands of

25:36.160 --> 25:43.200
VReplication imports, the ones that Rohit talked about earlier, in self-serve mode. So,

25:44.000 --> 25:49.920
basically, if you sign up for PlanetScale and you want to import data, a VReplication stream

25:50.000 --> 25:56.720
is what happens under the hood. As examples: we have imported, for a customer in self-serve mode,

25:56.720 --> 26:04.480
64 terabytes into 16 shards without much issue, and we do multiple millions of

26:04.480 --> 26:13.120
QPS across multiple Vitess clusters at PlanetScale. Just last weekend we were working on a

26:13.920 --> 26:19.760
multi-terabyte data set copy from an Aurora cluster into a sharded Vitess, and we

26:19.760 --> 26:27.040
observed speeds of more than one billion rows per hour copying those databases. So we copied the

26:27.040 --> 26:34.640
entire data set basically over one weekend, which was pretty cool. The multi-tenant migration that I

26:34.640 --> 26:40.640
was talking about, we're currently executing with a customer that has hundreds of thousands of tenants

26:40.640 --> 26:46.960
on multiple Aurora clusters, and the total size there is about 250 terabytes, which we are migrating into

26:47.680 --> 26:55.600
128 shards currently, and we're basically going from nine million tables per Aurora source to 90

26:55.600 --> 27:02.880
tables in a sharded Vitess. So imagine the Online DDL you would need to do to manage nine million tables per

27:02.880 --> 27:09.600
cluster; now you just need to manage 90 tables in the sharded Vitess, and Vitess takes care of the

27:09.680 --> 27:15.120
entire Online DDL. And then you have some case studies here, on the shard-by-shard migration:

27:15.120 --> 27:22.640
we posted a case study of the Cash App migration for Square; they did over 400 shards in total,

27:23.520 --> 27:30.640
shard by shard, successfully, without downtime. And additional case studies are also available on the

27:30.720 --> 27:39.680
website. And that was it. Any questions? No time for questions, unfortunately.


