WEBVTT

00:00.000 --> 00:13.120
It's written very small, so I'm just going to watch there, I guess.

00:13.120 --> 00:16.520
Okay, more or less, welcome, everybody.

00:16.520 --> 00:17.800
My name is Paul Stackley.

00:17.800 --> 00:21.200
Can you guys hear me on the back?

00:21.200 --> 00:26.120
I guess, how many people saw my talk yesterday?

00:26.120 --> 00:28.720
Oh, okay, quite a few, okay, nice.

00:28.720 --> 00:31.840
So you guys, we have the real picture.

00:31.840 --> 00:36.480
I'm going to give you a talk about how to write a diner X step-by-step.

00:36.480 --> 00:41.400
A little disclaimer, I'm not going to explain how to write your diner X step-by-step.

00:41.400 --> 00:46.000
I'm going to explain how I did write mine.

00:46.000 --> 00:47.000
What's a diner X?

00:47.000 --> 00:50.880
It's kind of a long in a medicine scene.

00:50.880 --> 00:52.840
It stands for dynamic reconpiler.

00:52.840 --> 00:58.360
It's basically a jit compiler, just in time compiler, for binary translation.

00:58.360 --> 01:01.760
To understand what the diner X is, you need to understand what an interpreter is.

01:01.760 --> 01:07.720
An interpreter will read instruction by instruction and simulate them.

01:07.720 --> 01:14.920
On the other hand, dynamic reconpiler will take a chunk of instructions and simulate that

01:14.920 --> 01:21.720
chunk of instructions by generating code for the host CPU.

01:21.720 --> 01:25.880
The reason why we want diner X is they're just fast.

01:25.880 --> 01:32.680
I'm not talking 50% fast, I'm talking 10 times faster.

01:32.680 --> 01:35.800
My own diner X is called life-wreck.

01:35.800 --> 01:38.120
You can click on that thing and give it a start.

01:38.120 --> 01:39.120
Thank you.

01:39.120 --> 01:46.800
It's a mix to everything diner X for PlayStation, and later, notably PCSX.

01:46.800 --> 01:51.880
I started working on it in 2014 as a site project.

01:51.880 --> 01:57.360
The reason was why I decided to work on it was because for me, diner X was just

01:57.360 --> 01:58.760
weird, just black magic.

01:58.760 --> 02:00.760
I just didn't know how they worked.

02:00.760 --> 02:02.160
I didn't understand how they worked.

02:02.160 --> 02:06.080
I'm the best way to understand something, to just do it.

02:06.080 --> 02:13.160
I wanted to do another diner X, but with something that brings something to the table.

02:13.160 --> 02:16.400
The thing I bring, it's a cross-platform diner X.

02:16.400 --> 02:22.800
If you can imagine a CPU that's not M68K, it can run on it.

02:22.800 --> 02:28.440
It uses good lighting as the code generator, and that's how it achieves the cross-platform

02:28.440 --> 02:29.440
ness.

02:29.440 --> 02:33.560
Good lighting is the code generator I was talking about yesterday.

02:33.560 --> 02:37.280
That supports all kinds of architectures.

02:37.280 --> 02:44.640
It runs on PC, you probably used it already, without knowing it, it's the diner X.

02:44.640 --> 02:51.280
In the PCSX, Libertural Core, it runs on GameFube, it runs on Dreamcasts, it runs on other platforms,

02:51.280 --> 02:55.240
on Wii U, etc.

02:55.240 --> 03:02.200
Before even considering writing a diner X, you need to have an interpreter.

03:02.200 --> 03:08.760
For various reasons, the first one being, it's important to emulate the rest of the system

03:08.800 --> 03:15.560
of having a system that works with the interpreter before trying to go into writing a diner X.

03:15.560 --> 03:22.240
It also provides a baseline that will allow you to compare the behavior of the diner X versus

03:22.240 --> 03:25.880
something that you know is working correctly.

03:25.880 --> 03:33.160
Having an interpreter will allow us, or me, to introduce the diner X, by little, and I'm

03:33.160 --> 03:36.080
going to talk a bit more on that later.

03:36.080 --> 03:41.240
In my own diner X, it's complementary to the diner X, basically anything, any piece of code

03:41.240 --> 03:49.760
that is just too hard to translate, I just don't, and I just run the interpreter on it.

03:49.760 --> 03:55.480
I'm going to introduce a few concepts, the first one being a block is a set of consecutive

03:55.480 --> 04:03.360
emulated instructions from the entry points, which I call also PC address, PC address

04:03.360 --> 04:07.120
will be an entry point of a block.

04:07.120 --> 04:11.440
An entry point is point sense if you have a gem to give an address, that given addresses

04:11.440 --> 04:17.800
the entry point of a block, and it ends at an unconditional exit point, for instance, if

04:17.800 --> 04:25.640
you have a gem or a branch somewhere else, that is not conditional, that's an unconditional

04:25.640 --> 04:29.200
exit point, and I'm stopping the block there.

04:29.200 --> 04:34.880
The reason why I'm creating blocks like that is that at compilation time, you just have

04:34.880 --> 04:42.280
no clue about what the code is going to do, for instance, if somewhere in your block you

04:42.280 --> 04:48.000
have an unconditional branch that to give an address that uses the address stored in a

04:48.000 --> 04:52.000
register, you need to know the value of that register, maybe you don't know the value

04:52.000 --> 04:55.280
of that register at that point, so you don't know where it is branching.

04:55.280 --> 05:03.080
You can't just connect this block to another block or just because you don't know where

05:03.080 --> 05:04.080
it is yet.

05:04.080 --> 05:11.400
You can't do static binary translation, many people tell me, why don't you just statically

05:11.400 --> 05:18.400
compile a playstation game for PC, that's just not possible, nobody does that.

05:18.400 --> 05:23.120
We need a block cache, and that's basically just a hash map of the blocks from their

05:23.120 --> 05:24.120
PC address.

05:24.120 --> 05:32.000
So we can easily retrieve the blocks, and the next thing we need is a dispatcher.

05:32.000 --> 05:36.200
The dispatcher, the name I gave it for it, I don't know if there's another term for it,

05:36.200 --> 05:42.920
is the main loop of the dynamic, what it's going to do is fetch the next block from its

05:42.920 --> 05:53.640
PC address, run the interpreter on it or execute it, get the address of the next PC address

05:53.640 --> 06:01.280
as written from the block, and then continue the process, until it's interrupted by an

06:01.280 --> 06:09.560
IRQ of the emulated system, until the time allocated for the CPU emulation is up.

06:09.560 --> 06:14.600
The other role of the dispatcher is to switch between what I call the emulator space and

06:14.600 --> 06:15.720
the gitspace.

06:15.720 --> 06:22.440
Gitspace would be the space of all generative code, and emulator space is everything, as is

06:22.520 --> 06:25.560
all the C code of the emulator, and everything.

06:25.560 --> 06:31.400
The thing is, in C you have rules, for instance, you're not allowed to modify the value of

06:31.400 --> 06:37.000
a register of a colleague, colleague-saved register, there are a few registers, you just

06:37.000 --> 06:42.680
can't modify, and if you modify them in your function, you need to restore the old value.

06:42.680 --> 06:49.960
Yeah, that's the emulator space, in gitspace, you can whatever you do, whatever you want to do.

06:49.960 --> 07:01.320
So what the dispatcher does is save the execution context before restore it afterwards.

07:01.320 --> 07:06.840
The first real thing in it to implement is a call to, I mean, the first thing I implemented,

07:06.840 --> 07:12.840
it was a call to the interpreter, basically you want to generate, you want to implement

07:12.840 --> 07:16.280
generators for all the obstacles of the machine, right?

07:16.280 --> 07:21.960
So what you can do is create a generator for call to the interpreter, that we just interpret

07:21.960 --> 07:27.720
one instruction and then return, and if you use that generator for all the

07:27.720 --> 07:31.960
obstacles of the machine, you already have a working diner wreck, just slow because it's

07:31.960 --> 07:38.120
calling into the interpreter every time, but it's already working, so since it's already working,

07:38.120 --> 07:43.960
the next step is to add tooling because trust me, debugging a diner wreck without tools is really

07:44.040 --> 07:51.720
not fun. For that, I wrote my own debugger, which is called the big as debugger, it's called

07:51.720 --> 07:58.440
the big as debugger because it's really not smart in the way it works, it just runs an instance

07:58.440 --> 08:03.400
of the emulator using the interpreter and an instance of the emulator using the diner wreck and

08:03.400 --> 08:11.640
just compares the two of them at every time the dispatcher loops exit, just going to run

08:11.720 --> 08:19.000
checksums and compare them. I have a very, very debug mode which runs the dispatcher just for

08:19.000 --> 08:23.720
one block and then exit checksum the whole two megabytes of RAM, all the registers are

08:23.720 --> 08:28.440
everything and then run for another block and this one runs in minutes per frame, but it allows

08:28.440 --> 08:33.640
me if there is a breakage somewhere, there's a mis compilation, it allows me to find the exact

08:33.640 --> 08:43.960
block that contains the invalid code. Once you have the tooling and everything, you can start

08:43.960 --> 08:50.920
implementing some opcode, so you generate instruction that will load from the register cache,

08:50.920 --> 08:56.200
the values that you need, that you will generate an instruction that corresponds to the

08:56.200 --> 09:01.640
instruction that you want to emulate for instance a nudge for an add, etc. And then an instruction,

09:02.520 --> 09:11.560
sorry, to save the result back into the register cache, you can keep what I do is I keep a

09:11.560 --> 09:17.560
pointer to my state structure that contains the register cache into a host register for easy access.

09:21.480 --> 09:27.400
One very important thing in a diner, I can very hard to emulate is a code invalidation.

09:28.280 --> 09:33.880
We need a way to invalidate code, for instance, the emality program might want to load

09:33.880 --> 09:39.400
new data from the room or from a disk and write it to somewhere where we thought we had code

09:40.200 --> 09:46.120
and all the blocks we created for that old code, we need to invalidate them. The very simple

09:46.120 --> 09:51.720
solution would be to check the block cache at every memory right, we do an invalidate the block

09:51.720 --> 09:58.920
that we hit. That's very slow, don't do that. The best option, you memory protect the whole run,

09:59.960 --> 10:07.240
using the smallest pages you can, every time you jump to a new PC address, you mark the memory

10:07.240 --> 10:14.120
page as read only. And if you write to read only page then you get an exception, you can switch

10:14.120 --> 10:20.280
to memory page to read right and invalidate all the blocks that are there. It was really well,

10:20.360 --> 10:24.280
it doesn't really work for me because it's tricky to implement for cross platform dynamics.

10:25.320 --> 10:31.720
So what I did was I have a code lookup table, which is the size, it's two megabytes,

10:31.720 --> 10:38.920
the size of my emulated RAM. And basically it's a map between a PC address and the host code

10:38.920 --> 10:47.400
for the block that corresponds at that PC address. So it's a very fast map and it allows me to

10:47.400 --> 10:54.600
do coding validation as well. The memory writes, the generated code that will write a memory

10:54.600 --> 11:01.400
just needs to write a null to the corresponding entry. The dispatcher when it tries to fetch

11:01.400 --> 11:10.680
the next block for that PC address is going to check the loot entry. If it's null then it gets

11:10.680 --> 11:15.480
the block from cache. If there is no block in the cache then it just creates anyone just compiles

11:15.480 --> 11:21.480
on you one. If there is a block however it's going to compute a checksum of the memory that

11:21.480 --> 11:29.080
this block covers. And if the checksum matches the one that was computed at compilation time

11:29.080 --> 11:35.160
of the block then it just resets the loot entry. It's like a force positive resets the null entry.

11:35.160 --> 11:40.440
So that next time we won't have to enter that loop again. Otherwise if it's not matching then it

11:40.440 --> 11:47.960
just regenerates the block. It tries to block and compiles on you one basically. It's not perfect

11:48.920 --> 11:55.320
because for instance if you have a right to offset zero and then jump to sorry a right to offset two

11:55.320 --> 12:01.480
and then jump to offset zero you end up with a block that doesn't really correspond to what is

12:01.800 --> 12:10.680
really in RAM. In practice it works good enough for me at least. The register allocator

12:11.720 --> 12:20.280
it allows you to use host registers for emulated registers because I was saying before

12:20.280 --> 12:26.760
load a register from the register cache emits your opcode then start it again that's low. Don't do that.

12:27.560 --> 12:34.680
If you have a register allocator you can allocate host registers for emulated registers

12:35.960 --> 12:43.880
and try to make it try to spill as many registers sorry spill as less registers as possible

12:44.760 --> 12:51.800
keeping as many emulated registers in host register as possible. I have a multi-level priority system.

12:51.800 --> 12:59.240
I want to flush these to invalidate basically the host registers that are the easiest to

12:59.240 --> 13:09.880
recreate in case I need to recreate them. A good register allocator is really important if you want to

13:09.880 --> 13:16.840
have speed. A little bit about cycle like counting we need to keep track of how many cycles

13:17.800 --> 13:26.760
the way I do it as just keep cycle counter in a host register and increment it on blocks exit

13:26.760 --> 13:34.760
depending on the exit point increment by your different value. You could ask how many cycles

13:34.760 --> 13:43.000
is a memory load. It really depends what memory targets. The PCS6 and Lighttrack they have a very

13:43.000 --> 13:50.360
smart way of delegated it's just to the answer is two instructions sorry two cycles per instruction.

13:50.360 --> 14:01.000
In practice for PlayStation it works mostly okay. Emulating memory that's a hard part and that's

14:01.000 --> 14:10.440
maybe where there is the most to gain with a dynamic. It is hard because in PlayStation

14:10.440 --> 14:14.760
for instance you have the ram at a zero you have several neurons and you have the bios at

14:14.760 --> 14:18.680
the different address and you have the scratch pad at the different address. If you want to emulate

14:18.680 --> 14:24.760
these you have one buffer like you can see for the ram you have one buffer for the bios one

14:24.760 --> 14:33.160
buffer for the scratch pad and you need to map the address between themselves. So the generated

14:33.160 --> 14:38.920
code can become really really big. One simple solution is to generate to have a fast pass

14:38.920 --> 14:45.160
under slow pass. So you generate an address check in the generated code that will check a

14:45.160 --> 14:50.040
isn't this address in ram if it's in ram then you go to a fast pass where it's just loading

14:50.040 --> 14:55.720
doing the address translation and load from ram otherwise it just jumped to C. When I call the

14:55.720 --> 15:00.520
yellow solution is just assume everything points to ram and then you get a segmentation fault you

15:00.520 --> 15:07.320
handle the segmentation fault and continue. My solution I just unconditionally jump to C

15:08.120 --> 15:13.640
then in the C code I check if it's pointing if the address is pointing to a memory or

15:13.640 --> 15:21.000
hardware IO and then I tag the output in the block and recompile the block then in the block

15:21.000 --> 15:28.840
when it's recompiled if it's points to memory then I generate a very fast code for that

15:28.840 --> 15:37.160
otherwise I just jump to C. It works just because people don't try to break it because it

15:37.240 --> 15:41.480
it's based on the assumption that something that points to memory will always point to memory

15:41.480 --> 15:45.080
and something that points to memory IO will always point to memory IO.

15:49.480 --> 15:53.560
Memory translation so I was saying that you need to translate between

15:54.280 --> 16:01.720
emulated memories to host memories you can help this a lot if you memory map the

16:02.120 --> 16:07.640
emulated buffers to where they're actually on the emulated system so in my case I memory

16:07.640 --> 16:13.320
memory map the emulated ram the buffer for the emulated ram to address zero where the

16:13.320 --> 16:20.440
page station exists expected and that means that my address translation is much shorter

16:22.840 --> 16:28.440
and it's very important to make it much shorter because smaller code is faster code.

16:29.400 --> 16:37.480
Finally, a miss optimization that's like the last 10% of the performance these are the

16:37.480 --> 16:44.760
costly ones you count maybe one year for the first person then two years for the next and for

16:44.760 --> 16:51.960
for et cetera so you have for instance red it compilers you can run the compilers one

16:51.960 --> 16:57.240
person red constant folding is you analyze the code see where the constants or if you

16:57.240 --> 17:02.680
addition I add each addition it if you add a constant with a constant it just makes another

17:02.680 --> 17:11.960
constant memory target deduction if you have a constant you know it's it points to ram and

17:11.960 --> 17:18.840
then you load let's say 16 bit or 8 bit value from ram and you add the two values you don't know

17:18.840 --> 17:23.960
exactly what the address is right but you know it's points to ram so you can actually emit

17:24.520 --> 17:32.840
fast pass for that for that memory IO. Instruction sorry combining just you see that

17:32.840 --> 17:38.040
this instruction you could do then better with a host instruction that does the two in one

17:39.160 --> 17:45.880
and define behavior optimization the PlayStation has this break of the code and the break

17:45.880 --> 17:51.480
what the break of code does is just crash the PlayStation so if you if you find it in the

17:53.480 --> 17:59.160
emulated code you know that everything in that branch is dead so you can remove the whole branch

18:01.560 --> 18:09.000
times up yeah local branch is just instead of going back to the dispatcher every time you are

18:09.080 --> 18:16.840
on the branch so you handle the block branching inside itself and target specific optimizations

18:16.840 --> 18:23.720
in my case that's inside new lightning for instance for the SH4 is just moving the instruction so

18:23.720 --> 18:30.680
that they combine together for pipeline stuff like that yeah that was as good as I could do in 20 minutes

18:30.760 --> 18:32.760
thank you

18:47.640 --> 18:49.640
no

18:50.440 --> 18:57.240
yeah the question is are any of these optimizations part of new lightning more or less right

18:58.200 --> 19:03.880
so a new lightning does a little bit of optimization but just a time if it finds

19:03.880 --> 19:10.760
and see if you set a register 0 to 0 and then you set register 0 to 0 is going to sing her

19:11.880 --> 19:18.520
I could skip one that's about the level of optimizations that lighting does everything else is done in

19:18.520 --> 19:35.400
light break and on top of it yeah so if you have forward job the question is if you have a forward

19:35.400 --> 19:47.000
jump do I need a label yes the question is if I have a forward jump I need to resolve the

19:47.080 --> 19:59.880
address before before I can continue with the block right yeah so there are two cases first case is

19:59.880 --> 20:11.000
it's not local then I just jump back to the dispatcher if it's local then I memorize the address

20:11.000 --> 20:18.200
of the branch and later when I know exactly where I'm branching to I'm going to patch that address

20:20.440 --> 20:24.760
that's that's offered by a new lightning by the code editor yeah the generator

20:26.600 --> 20:34.120
somewhat will be more occurring than others so you don't really need to translate everything

20:35.080 --> 20:42.840
because some instructions might appear once every 100 times so in your opinion in your experience

20:42.840 --> 20:50.520
what is the amount of instructions that you need to dynamic in order to gain and ask these and speed

20:50.520 --> 20:57.240
up so you cannot so you don't have to translate everything so the question is

20:58.120 --> 21:07.240
I'm trying to think of the sentence sorry the question is what instructions you absolutely have

21:07.240 --> 21:16.920
to generate code for and in in yeah memory are you that's the most important one yeah

21:16.920 --> 21:24.920
because it's the these instructions we generate maybe the most code but you need them to be as

21:25.080 --> 21:31.880
small as possible to to get good performance yeah

21:32.760 --> 21:38.280
do you always have to start awkward by awkward before moving to instruction combining

21:38.280 --> 21:43.960
just for example my space there's some of course not like loading data in between register

21:43.960 --> 21:49.000
not really anything I would assume those will start by doing the combined work first

21:49.880 --> 21:56.200
can you repeat the question and I'm sorry how many instructions might be in the other console

21:56.200 --> 22:00.520
yeah they don't really do much by themselves they like know data in the special weather yeah

22:01.320 --> 22:05.880
like I don't feel like that there's like how do you start by doing not first and then

22:05.880 --> 22:17.240
finding it later the question is how do I how to replace that how do I select

22:19.240 --> 22:30.600
how do I implement an upcode first before doing the whole part that's so in my slide four I think

22:34.360 --> 22:39.640
I have this code to interpret it basically at that point every upcode just jumps to the

22:39.640 --> 22:50.040
interpreter right and then I can one by one rewrite the code generators to admit better code

22:50.040 --> 22:57.240
that doesn't jump to the interpreter to generate you know I don't see for example they I have

22:58.280 --> 23:04.360
a similar instruction for dividing for floats at the same time if I just do the loading

23:04.520 --> 23:09.400
instruction in diner I think the rest is in there but I don't see our work you don't that's the thing

23:09.400 --> 23:15.720
no no you have to do the whole block the whole block in diner or the whole block in interpreter yeah

23:17.880 --> 23:23.560
we said that the block that starts with an entry point and as we have an additional thing

23:24.600 --> 23:29.160
what I'm going to do if they call the jams into the middle of the block you just generate the

23:30.040 --> 23:38.280
I if so the question is what why do I what do I do if the code writes itself in the middle of

23:38.280 --> 23:43.640
the block or if I jump in the middle of the block I never jump in the middle of the block

23:43.640 --> 23:52.360
because that's a new block that's the short version if you have half an hour we can do the

23:52.360 --> 23:57.480
long version but I have a plane to catch

24:12.840 --> 24:17.560
the question is did I meet some cases where the code is just way too big

24:18.200 --> 24:25.160
to weird oh I've seen weird code believe me I've seen things you people would understand

24:27.320 --> 24:33.560
do you guys know meps assembly what happens when you have a branch as a daily slot of a branch

24:35.960 --> 24:41.320
for question sorry it's going to fetch like that it's going to execute the first instruction at the

24:41.320 --> 24:48.040
address of the target yeah the target of the first branch and then jump to the address of the second

24:48.040 --> 24:53.640
branch what happens if the first instruction at the target of the first branch is a branch

24:55.480 --> 25:00.840
yeah it gets deep like that that's the kind of blocks I was talking at the beginning where I just

25:00.840 --> 25:05.160
jump to the interpreter because that's just impossible to handle in diner or yeah

25:11.320 --> 25:13.320
yeah