WEBVTT

00:00.000 --> 00:15.960
Hi everyone. So after the previous MLIR talk I will continue on MLIR, sorry, yeah, so I will

00:15.960 --> 00:21.320
also be talking about MLIR, but also a bit about a hardware architecture, so

00:21.320 --> 00:29.640
a bit in between the two. So the title is MLIR-based tiling for the

00:29.640 --> 00:36.280
Ryzen AI NPU, which is about an NPU in a laptop, a laptop like this one:

00:36.280 --> 00:44.400
we have a CPU, an integrated GPU, and an integrated NPU. So I'll be

00:44.400 --> 00:49.400
giving an overview of what this NPU looks like in these laptops, how we

00:49.400 --> 00:54.800
target it, how we use data packing to describe high-dimensional

00:54.800 --> 00:59.320
copying through the DMAs, and then I'll be showing how you can map a GEMM in

00:59.320 --> 01:04.520
different ways onto this device and how we use MLIR to describe this

01:04.520 --> 01:11.480
parallelization and data broadcasting through the NPU. So first, an overview.

01:11.480 --> 01:18.560
The NPU is an array of cores and memory: we have a four by four array

01:18.560 --> 01:23.480
of cores, then we have a row with memory tiles, and then we have another

01:23.480 --> 01:28.280
row with connections to the outside world, to DRAM basically, and then

01:28.280 --> 01:32.360
at the bottom we have a microcontroller that programs and

01:32.360 --> 01:38.720
controls the device. There are different generations of it: you have XDNA1, so AI

01:38.720 --> 01:43.760
Engine, XDNA, NPU, it's all kind of the same thing. XDNA1 is the first generation,

01:43.760 --> 01:48.720
which has a capability of about 10 TOPS on the entire array, and then you have

01:48.720 --> 01:54.320
XDNA2, which can reach up to about 50 TOPS. These

01:54.320 --> 01:58.840
TOPS come from the vector processors, which can do, for example, 512 int8

01:58.840 --> 02:09.480
MACs on each tile, or for XDNA1, 256 int8 MACs per tile.
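
NOTE
A rough sanity check on the XDNA1 number, assuming the 4x4 array of cores
mentioned above and a clock around or slightly above 1 GHz (the clock is an
assumption here): 16 cores x 256 int8 MACs x 2 ops per MAC at ~1.2 GHz is
roughly 10 TOPS, matching the quoted array-level figure. XDNA2's higher figure
then follows from the doubled per-tile MAC count together with more tiles
and/or a higher clock.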

02:09.480 --> 02:15.080
I'll be talking mostly about XDNA1 in these slides; XDNA2 has slightly

02:15.080 --> 02:19.040
different variations, but they're very similar. So what you can do is,

02:19.040 --> 02:22.760
since you have an array, you can offload some applications to it, AI

02:22.760 --> 02:26.960
applications, for example video, audio, some content creation, some

02:26.960 --> 02:31.760
LLMs, and you can partition this array to do multiple of these applications

02:31.760 --> 02:36.600
at the same time by allocating columns, for example two columns to

02:36.600 --> 02:43.040
audio and four to content creation. Now, how do you program the device?

02:43.040 --> 02:49.120
We're basically generating three things: an initial boot PDI, core code for

02:49.120 --> 02:56.320
all of the tiles, and microcontroller code to control the array. In

02:56.320 --> 03:01.960
that boot PDI we also have things like routing, for example. So the most important

03:01.960 --> 03:05.840
part about this device, and why it can give you power-efficient

03:05.840 --> 03:11.080
computing, is basically that it is a push model, in contrast to a

03:11.080 --> 03:14.480
pull model like you typically have on a CPU, where you basically say I need this

03:14.480 --> 03:18.880
data, like a pointer dereference, you get your data, and somehow the data will

03:18.880 --> 03:22.840
come into the registers. There's no such thing here: if you want

03:22.840 --> 03:26.640
data on one of these tiles, you'll have to get it there explicitly, and that's

03:26.640 --> 03:29.920
where the power efficiency comes from, and I think that's why we will see more

03:29.920 --> 03:34.800
and more of these kinds of devices, in specialized GPUs and so on as well,

03:34.800 --> 03:39.360
because when you control where the data goes very precisely,

03:39.360 --> 03:43.280
you can get a lot of power benefits out of it, and performance

03:43.280 --> 03:49.400
for that reason as well. So how do you program it? First, the boot PDI:

03:49.400 --> 03:54.000
you need to actually program the routes, the routes from the outside

03:54.000 --> 03:57.680
world, connections through the shared memory, which is the

03:57.680 --> 04:04.480
mem tiles on the second row, into the AIE cores. And these routes can go like this:

04:04.480 --> 04:10.480
you can go vertically, up in some column, you can go down, you can

04:10.480 --> 04:13.560
have a route between the mem tiles, and between the outside world and the

04:13.560 --> 04:17.840
mem tiles, and you can configure that in different ways. For example, you can have

04:17.840 --> 04:21.080
some horizontal routes; here you're basically saying that we have some

04:21.080 --> 04:26.320
data that is shared across a row: all these cores on the same

04:26.560 --> 04:30.800
row will have the same data packet and they can forward it to each other.

04:31.360 --> 04:36.240
And this is useful because in these AI-type applications, like

04:36.240 --> 04:40.320
convolutions and matmuls, you typically have a lot of data reuse, so when you

04:40.320 --> 04:44.320
parallelize across multiple tiles, a lot of those tiles will need

04:44.320 --> 04:48.880
the same data, and instead of loading that data again and again from

04:48.880 --> 04:53.280
DDR, shared memory, or wherever, to those tiles, you can basically forward it

04:53.280 --> 04:57.040
from one tile to another so they all get the same data, and that's called

04:57.040 --> 05:02.800
data broadcasting. So you can go horizontally, you can go vertically, you can go down,

05:04.000 --> 05:08.480
you can go up, and you can do things like this, for example: you can share across

05:08.480 --> 05:12.320
different columns that are not next to each other. For example, here you have

05:12.320 --> 05:16.960
column 1 and column 3 that will basically share data and column 2 and column 4 that

05:16.960 --> 05:22.320
will share different data. And all these kinds of routing configurations are part of

05:22.320 --> 05:27.360
this boot PDI, or you can also do it later on, but basically it's a configuration that you

05:27.360 --> 05:32.880
describe to get this behavior where certain tiles will collaborate on the same

05:33.600 --> 05:41.440
data, or partially the same data of course. So then another piece of it is the core

05:41.440 --> 05:48.560
program, basically for the vector processor and scalar processor that each core has, and that's

05:48.560 --> 05:55.520
a very long instruction word (VLIW) processor, and that's typically

05:55.520 --> 06:02.480
what you would target with LLVM, for example, and you just compile and load it into each tile,

06:02.480 --> 06:06.960
and each tile will start executing that program but it will expect that the data somehow

06:06.960 --> 06:13.280
is already local on those tiles, which you achieve by using these connections that we

06:13.280 --> 06:17.920
program. So not all these tiles need to have the same code; each of those tiles could have

06:17.920 --> 06:24.400
a completely different program and that way collaborate in fancy ways with each other.

06:26.080 --> 06:32.560
And then lastly we have the microcontroller, which basically programs

06:32.560 --> 06:37.360
the data connections into the array and basically says which addresses on the DDR you're getting

06:37.360 --> 06:44.720
which data from. Looking at the core, we have a vector processor, a scalar processor, we have

06:44.720 --> 06:51.120
some local memory, 64 kilobytes, and then around that we have stream connections, so we have

06:51.120 --> 06:56.080
some horizontal and some vertical streams. In this case, for example, vertically we have six

06:56.080 --> 07:02.560
32-bit north channels and four 32-bit south channels, so basically a little bit more

07:02.880 --> 07:08.080
data bandwidth going up than going down; typically we need more data bandwidth going up.

07:08.080 --> 07:12.720
Important to remember is that this gives us around 24 gigabytes per second of bandwidth into each column.
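
NOTE
Where the per-column number comes from, assuming the stream channels run at
the array clock of about 1 GHz:
  north: 6 channels x 32 bit x 1 GHz = 6 x 4 bytes/cycle = 24 GB/s
  south: 4 channels x 32 bit x 1 GHz = 4 x 4 bytes/cycle = 16 GB/s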

07:12.720 --> 07:20.800
I will get back to that later on. We also have the memory tiles:

07:21.920 --> 07:26.880
the streams are similar, we have some north streams, some south streams, some east, some west,

07:27.520 --> 07:32.960
but now instead of a core we just have some banks of memory and some DMAs: we have

07:32.960 --> 07:40.320
six DMAs to write into it, six DMAs to read from it, and those DMAs can do some fancy high-

07:40.320 --> 07:46.400
dimensional reading with a single instruction. So basically here we have some DMA, some stream-

07:46.400 --> 07:53.120
to-memory DMA, so it writes into some bank in a strided fashion. So for example here, on

07:53.120 --> 07:59.200
the blue tiles, you see four contiguous tiles being written, then we have a jump,

07:59.200 --> 08:06.320
we write another four tiles, we have another larger jump, and we write another six or eight tiles,

08:07.120 --> 08:11.200
and this can be done with a single instruction. And in this way the DMA can

08:12.000 --> 08:16.640
basically reformat data while we're moving it, while we're reading it or while we're writing it.
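
NOTE
A sketch of the address pattern such an n-dimensional DMA instruction
generates; the concrete sizes below are illustrative, not taken from the slide:
  addr(i_0, ..., i_{n-1}) = base + sum_k i_k * stride_k,  with 0 <= i_k < size_k
For example, sizes = [4, 4] and strides = [16, 1] (in elements) writes 4
contiguous elements, jumps ahead by 16, and repeats 4 times, giving the
"write a few, jump, write a few more" pattern shown on the blue tiles.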

08:17.520 --> 08:22.400
It can reformat this data into a packed data layout which is very friendly for the vector

08:22.480 --> 08:27.200
processors, so by the time this data gets to the vector processor, the vector processor will just

08:27.200 --> 08:32.400
load a large block of data into its registers and it doesn't have to do any further data

08:32.400 --> 08:41.920
remapping, because that's already done by these DMAs. So here I show a similar thing,

08:41.920 --> 08:47.120
but how we describe it in MLIR. So basically you would expect some copy kind of

08:47.120 --> 08:52.080
construct that can move data from some memory to another memory, from DDR to shared memory,

08:52.080 --> 08:58.320
from shared memory to local memory, local to local, local to shared, and so on. And that's exactly

08:58.320 --> 09:03.440
what we do, but we also need a transpose kind of operation. And it so happens that in

09:03.440 --> 09:08.480
MLIR, when we started looking at this, there was already this construct, which is the pack operation,

09:08.480 --> 09:13.360
which is basically a copy plus a transposition, and we use this copy and transposition

09:13.360 --> 09:18.720
to basically describe these DMA operations that copy data from one memory to another.
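
NOTE
A minimal sketch of that pack operation (tensor.pack in upstream MLIR at the
time, later renamed to linalg.pack); the shapes and tile sizes are made up for
illustration, not taken from the actual pipeline:
  // Copy plus relayout: a 64x64 i8 tensor is rewritten as a 16x8 grid of
  // 4x8 inner tiles, i.e. the packed layout the vector unit wants.
  %packed = tensor.pack %src
      inner_dims_pos = [0, 1] inner_tiles = [4, 8]
      into %dst : tensor<64x64xi8> -> tensor<16x8x4x8xi8>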

09:20.160 --> 09:25.760
And then we transform those into our own dialect, into a DMA copy ND op (amdaie.dma_cpy_nd);

09:25.760 --> 09:30.000
it's basically the same thing, where we're saying we have a DMA, we have a source, we have a destination,

09:30.000 --> 09:34.160
and we have a certain way of addressing on the source and a certain high-dimensional

09:34.160 --> 09:41.920
addressing on the destination. Now, looking at a GEMM tiling problem: the question is, if we have a

09:42.000 --> 09:48.240
GEMM problem, how do we map it to this array? And there are different ways that you can do this.

09:48.240 --> 09:53.280
For example, suppose we keep the entire inner dimension that we're reducing across; for

09:53.280 --> 10:01.760
example, we divide the M and N dimensions, so A and B, into slices, but we take the

10:01.760 --> 10:09.280
entire inner dimension for each slice: A1, A2, B1, B2, and we have to compute the products

10:09.360 --> 10:15.360
of those. Now this we can do, for example, by parallelizing it, by mapping it

10:15.360 --> 10:20.240
onto four cores, and each of those cores will get an entire slice with a full

10:20.240 --> 10:25.200
inner dimension of A and a full inner dimension of B, so they can compute the actual output

10:25.200 --> 10:31.760
and then send it out. And this works if your entire slices A1, B1, and so on fit into

10:31.760 --> 10:38.720
local memory; if they don't, then you have to split that up as well. So then we get here, basically,

10:38.800 --> 10:43.440
we split it across the K dimension as well, and now these tiles will not be computing the

10:43.440 --> 10:49.120
actual output result, but they will be computing some partial output result and storing it locally,

10:49.120 --> 10:53.280
potentially sending it out or keeping it there locally, and then you will send the next

10:53.280 --> 11:00.400
batch of data. We can do this another way as well: instead of keeping it locally and

11:00.400 --> 11:05.120
sending in the next batch of data, you can forward it to the next tile. So in this case you use

11:05.680 --> 11:11.600
some streams, which I showed earlier in the slides, that go down from one core to another,

11:12.160 --> 11:17.200
and basically you can forward your partial result to the next tile, which will compute

11:17.200 --> 11:22.640
the complete output result, and then you can keep that locally to do some other operation on it, like

11:22.640 --> 11:28.240
some softmax or adding something to it, or you can send it out through the output streams.

11:29.120 --> 11:34.800
Here I'm showing the outer product again. In the rest of the slides I will just be

11:34.800 --> 11:39.440
concentrating on the outer product, not the inner product where we forward to another

11:39.440 --> 11:45.040
tile. Here basically we'll be forwarding slices and parallelizing across an array of

11:45.040 --> 11:49.600
tiles, but with no forwarding: every slice will stay local, and that core will

11:49.600 --> 11:54.000
keep repeating the matmul on those slices until it has the output result, so no sharing

11:54.080 --> 12:01.040
between tiles, though that is possible. So how we describe that in MLIR is, for example, like this:

12:01.600 --> 12:09.440
we tile the loops and eventually we end up with some inner loop, this scf.forall, which goes from 0 to 2.

12:09.440 --> 12:16.480
So basically we're describing two dimensions, we're describing a 2 by 2 array,

12:17.520 --> 12:22.160
and that is because here you have this core mapping in the y direction and the x direction; it's

12:22.160 --> 12:27.920
basically saying that we have two dimensions, one y and one x, and each goes from 0 to 2, so basically

12:27.920 --> 12:34.800
a 2 by 2 array. And then we have these DMA copy ND ops that we saw earlier, which describe

12:34.800 --> 12:40.320
copying data from these mem tiles to these AIE tiles, and depending on this affine

12:40.320 --> 12:45.520
expression, which depends on the induction variables that are mapped vertically or horizontally,

12:45.520 --> 12:50.480
it will basically say which data slice from the shared memory needs to go to which tile.

12:51.040 --> 12:55.520
And because this affine expression depends on both induction variables, each tile

12:55.520 --> 13:02.160
will basically have a different slice: one tile has A1, B1, the others A2, B2, A3, B3,

13:02.160 --> 13:07.680
and A4, B4. And if you calculate what the bandwidth is, the average bandwidth needed

13:07.680 --> 13:14.240
into each column is around 16 gigabytes per second at one gigahertz.
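
NOTE
A minimal sketch of what such an inner loop nest can look like; the shapes,
offsets, and mapping attributes are assumptions for illustration, not copied
from the real pipeline output:
  func.func @distribute_2x2(%A: memref<64x64xi8>, %B: memref<64x64xi8>) {
    scf.forall (%iy, %ix) in (2, 2) {
      // The slice offset mixes BOTH induction variables, so each core in the
      // 2x2 array receives its own A/B slice pair (A1,B1 ... A4,B4).
      %off = affine.apply affine_map<(d0, d1) -> ((d0 * 2 + d1) * 16)>(%iy, %ix)
      %a = memref.subview %A[%off, 0] [16, 64] [1, 1]
          : memref<64x64xi8> to memref<16x64xi8, strided<[64, 1], offset: ?>>
      %b = memref.subview %B[0, %off] [64, 16] [1, 1]
          : memref<64x64xi8> to memref<64x16xi8, strided<[64, 1], offset: ?>>
      // ... DMA each slice into the core's local memory and run the matmul ...
    } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
    return
  }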

13:14.240 --> 13:19.840
And if we go back to the streams that we have available, you see that we have 24 gigabytes per second available going north

13:19.920 --> 13:25.600
while needing 16, but the problem is that we were only looking at two rows, and we have

13:25.600 --> 13:29.760
another two rows on top of that. So if we double the bandwidth, for this configuration we need

13:29.760 --> 13:34.640
32 gigabytes per second of bandwidth, which we don't have available, we only have 24, so we

13:34.640 --> 13:42.160
can't run at full speed. So we can find a solution for that: we use more data broadcasting.

13:42.160 --> 13:47.840
In this example we again have the same loop, the same scf.forall, the same mapping in the

13:47.840 --> 13:52.080
y direction, but those DMA copy ND ops don't depend on an affine

13:52.080 --> 13:56.800
expression of both induction variables; they basically just depend on one induction variable, which

13:56.800 --> 14:02.560
means that A1 will be shared across the rows, B1 will be shared across the columns, B2 as well,

14:02.560 --> 14:08.800
and so you get this full outer product of A1, A2, B1, B2. And this results in

14:08.800 --> 14:14.240
around eight gigabytes per second of bandwidth needed into each column; if we double that to 16,

14:14.240 --> 14:22.000
we're well within the 24 we have available.
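
NOTE
Compared to the previous sketch, the only change is how the slice offsets use
the induction variables (%iy and %ix are the forall induction variables from
the sketch above; the tile sizes are again made up):
  // Unique slice per core: the offset mixes BOTH induction variables.
  %off = affine.apply affine_map<(d0, d1) -> ((d0 * 2 + d1) * 16)>(%iy, %ix)
  // Broadcast: A's offset uses only the row variable, so all cores in a row
  // get the same A slice; B's offset uses only the column variable, so all
  // cores in a column get the same B slice.
  %offA = affine.apply affine_map<(d0) -> (d0 * 32)>(%iy)
  %offB = affine.apply affine_map<(d0) -> (d0 * 32)>(%ix)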

14:22.000 --> 14:27.600
Another configuration here: for example, if A is small, like a vector, and B is a matrix, a vector-matrix multiplication,

14:27.600 --> 14:34.880
you can broadcast A to all the cores in the 2 by 2 array and send a different B to each core

14:34.880 --> 14:39.600
and combine those; we need 10 gigabytes per second of bandwidth, and this will still work out if we scale to

14:39.920 --> 14:48.880
4 rows. Adding dimensions: if we add another 2 columns to it, we can even more efficiently

14:48.880 --> 14:55.360
share data, resulting in only six gigabytes per second of bandwidth on average needed into

14:55.360 --> 15:00.160
each column. Important is that this is an average, because with the routing you could

15:00.160 --> 15:05.040
actually route it the wrong way, where all the bandwidth goes through a single column, and basically

15:05.040 --> 15:10.800
you end up still in trouble, but if you do this well, then on average you only need six

15:10.800 --> 15:20.480
gigabytes per second of data. Here is a similar thing where we get to the six gigabytes per second of data bandwidth that we

15:20.480 --> 15:26.320
need, but only using two dimensions. So here you saw we added another dimension to achieve

15:26.320 --> 15:32.240
this configuration; you can get a similar bandwidth using two dimensions as well by

15:32.560 --> 15:41.120
permuting the dimensions. And now we can add blocks as well, so we can add another scf.forall

15:41.120 --> 15:46.400
here, so now we have nested scf.forall loops, and this is how we

15:46.400 --> 15:52.320
tile and then eventually map it to cores and to blocks in certain directions. And here you can

15:52.320 --> 15:59.120
have, yeah, two rows and four columns, but each of those

15:59.120 --> 16:04.080
two sets of two columns behaves independently, and that's because of this block mapping.
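
NOTE
A minimal sketch of the nested scf.forall structure; the sizes and mapping
attributes are assumptions for illustration:
  // Outer forall: independent blocks of columns, each working on its own
  // chunk of the GEMM.
  scf.forall (%bx) in (2) {
    // Inner forall: the 2x2 core array inside one block of columns.
    scf.forall (%iy, %ix) in (2, 2) {
      // ... per-core slices as in the earlier sketches ...
    } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
  } {mapping = [#gpu.block<x>]}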

16:04.080 --> 16:09.440
So we're basically doing similar computations in each of those column sets, getting to, yeah,

16:09.440 --> 16:14.400
an average of eight gigabytes per second of bandwidth, so this works out. And that's useful, for example,

16:14.400 --> 16:18.080
if you want to do partitioning: if you don't know exactly how many columns you might have

16:18.080 --> 16:23.920
available, you can map these outer dimensions to blocks as well, to scale from

16:23.920 --> 16:28.960
two to four columns to eight columns and so on, and, yeah, achieve this partitioning

16:28.960 --> 16:36.720
across the device. Yeah, so to recap, we use MLIR, basically mostly the pack operation

16:37.440 --> 16:42.400
and tile-and-fuse to get this pack into the loops, and then we map this to certain

16:42.400 --> 16:46.800
horizontal and vertical dimensions to describe all these different configurations and which data

16:46.800 --> 16:52.080
can be shared across this AIE array; these result in different bandwidth requirements,

16:52.080 --> 16:57.600
use different amounts of compute, and result in different performance on this device. All of this is

16:57.600 --> 17:03.440
open source, so I put the link here if you're interested, and you can also reach out to me. So then,

17:03.440 --> 17:15.280
that's it and I'll take some questions. Thank you very much, really nice. So if you

17:15.280 --> 17:20.640
have these different possible mappings with different trade-offs in bandwidth and so on, my

17:20.640 --> 17:25.600
question is, do you currently have, or do you have any plans to build, an automatic tool to generate

17:25.600 --> 17:30.960
or to optimize those mappings, based on what the CGRA offers and so on, but that not only works with matrix

17:30.960 --> 17:39.600
multiply, but maybe with other kinds of code, any kind of MLIR? Not really, sorry. Yeah, so the question is

17:39.680 --> 17:45.520
whether we have any tools available for doing this optimization, this tuning, for example of these

17:45.520 --> 17:52.160
mappings. And we currently don't, but there's a lot of interest from people, because

17:53.200 --> 17:57.280
we have kind of not opened it up all the way; there are a lot of things, like routing, you can

17:57.280 --> 18:02.320
describe, so you could completely parameterize this and start tuning for best performance.

18:02.320 --> 18:07.440
But we are currently just working with things that we know work, so we target specific

18:07.520 --> 18:11.600
configurations. But you could totally do that, and there's a lot of interest from people to

18:11.600 --> 18:21.840
start doing that, yeah. Yeah, if I start porting my signal processing algorithms to this

18:22.800 --> 18:31.440
platform, what kind of debugging options do I have, compared to debugging signal processing

18:31.440 --> 18:39.120
algorithms for the CGRA? Out of the box it depends a lot; here in this project, which is

18:39.120 --> 18:46.880
completely open source, there are not a lot of good debugging tools yet for this. There's some

18:46.880 --> 18:52.320
work on that. Basically, if you want to get something out of this core, right,

18:53.040 --> 18:57.680
it has to go through the streams, so you basically have to program those streams and say, oh,

18:57.680 --> 19:02.880
I will add in a debug stream, and then locally I will put some events, some traces,

19:03.760 --> 19:08.960
and stream those out into some buffer and inspect them there. There's some work on that,

19:08.960 --> 19:14.080
but I think a lot more work on that is needed to make it convenient to use for any kind of

19:14.080 --> 19:24.480
design. What mechanisms do you have for the tiles to communicate with one another?

19:25.280 --> 19:30.080
What mechanisms for communication between the tiles? Yeah, so the question is what

19:30.080 --> 19:34.720
mechanisms we have to communicate between the tiles. There are different

19:35.360 --> 19:39.200
mechanisms. Once you have the streams, you can put something on the stream on one core and on another

19:39.280 --> 19:44.720
core get it from the stream, so that's one. Another way is that some of those tiles that

19:44.720 --> 19:50.000
are neighboring tiles can access each other's data memory, so basically they can just

19:51.120 --> 19:56.400
generate a DMA instruction to read some memory from the tile above, for example, so that's

19:56.400 --> 20:03.200
another way. Another way is that you send some AXI-MM requests, so basically you write some data

20:03.200 --> 20:08.320
on some of the other tiles; you can use some locks to communicate between them as well. And you can also,

20:08.320 --> 20:13.120
like I mentioned before, collaborate on a partial result: there's this

20:13.120 --> 20:18.320
cascade stream where you can put some data onto that stream and it will basically go down

20:18.320 --> 20:24.320
to the next tile, and that carries higher-precision values, like 64 bits or 48 bits, and you can

20:24.320 --> 20:28.400
specifically collaborate on these kinds of higher-precision intermediate results, like

20:28.400 --> 20:33.760
from matmuls for example, or convolutions. So there are a bunch of different ways and it depends

20:33.840 --> 20:38.480
on the use case, whether you're collaborating on a matmul, on a softmax, or a different

20:38.480 --> 20:52.720
kind of computation. Yeah, thank you very much. Thank you.

