WEBVTT

00:00.000 --> 00:09.840
Hello, this is our talk about things that are coming together for Fortran tooling, and

00:09.840 --> 00:15.480
this talk is about how my colleague and I experienced working with Fortran.

00:15.480 --> 00:18.920
And when I'm talking about me and my colleague, well, this is Peter.

00:18.920 --> 00:23.200
We both work at a scientific computing institute in Darmstadt, Germany, and he is

00:23.200 --> 00:26.680
doing tooling for profiling and instrumentation.

00:26.680 --> 00:31.520
He is researching ways to place your measurement probes to figure out which parts

00:31.520 --> 00:36.040
of your code take how much time, making that efficient with low overhead, and he also

00:36.040 --> 00:42.120
does applied performance engineering, mostly in a space and aerospace safety context, and

00:42.120 --> 00:45.520
did a master's thesis in collaboration with ESA.

00:45.520 --> 00:50.120
And I am just doing C++ compiler stuff, but I also have fun doing that.

00:50.120 --> 00:56.320
If it does analysis, transformation, be it source to source or IR to IR, or optimizations

00:56.320 --> 01:01.880
with the limited optimisation knowledge that I have, I'm at least very much interested in it.

01:01.880 --> 01:07.040
And basically, I'm using everything that is Clang tooling or IR passes in my day-to-day

01:07.040 --> 01:08.040
business.

01:08.040 --> 01:12.200
And day-to-day business means that at our institute, we have developed a lot of different

01:12.200 --> 01:18.080
tools that do a lot of different things, and they fall into exactly those categories.

01:18.080 --> 01:24.120
So MetaCG and CaGe, so call graph generation, is what it sounds like: it generates

01:24.120 --> 01:27.080
call graphs, we will go into that a little later.

01:27.080 --> 01:31.840
But there is also Alpaca, which is a program difference analysis tool, and mini-apx is

01:31.840 --> 01:36.440
a mini-app extraction tool, or proxy-app extraction tool, which I presented last year at the

01:36.440 --> 01:38.840
HPC devroom.

01:38.840 --> 01:42.480
And there are also two other tools that I want to highlight.

01:42.480 --> 01:48.120
One is PIRA, which was developed by JP Lehr, the man with an impressive mustache, in

01:48.120 --> 01:51.160
the second to last row.

01:51.160 --> 01:55.240
And CaPI, which was developed by a colleague of mine, Sebastian Kreutzer, who also brought

01:55.240 --> 01:58.680
you XRay's capability to now instrument shared libraries.

01:58.680 --> 02:02.560
So if you make use of that, he's the guy to thank for it.

02:02.560 --> 02:10.000
And now that we have all these tools: all of those tools were developed in C++, for C++,

02:10.000 --> 02:14.680
with C and C++ idiosyncrasies in mind.

02:14.680 --> 02:19.920
When I say all the tools, for this talk those two are the ones that are most relevant.

02:19.920 --> 02:25.080
So on the right side, you see a call graph, I think most of you are pretty familiar with

02:25.080 --> 02:26.080
how it looks.

02:26.080 --> 02:31.120
It's basically how do functions call each other, and what are the possible call paths

02:31.120 --> 02:32.960
through a program.

02:32.960 --> 02:37.680
This is basically what our library MetaCG is: a meta call graph library.

02:37.680 --> 02:43.240
That is, it allows you to specify a call graph, and additionally attach arbitrary metadata

02:43.240 --> 02:47.560
to the nodes or the edges, so that you can keep track of things that are interesting

02:47.560 --> 02:53.560
to you, your use case, your program. And CaGe, the call graph generation tool, is the IR-based

02:53.560 --> 02:55.840
usage tool of this library.

02:55.840 --> 02:59.880
There's also a source code based usage tool for that.
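
NOTE
Editor's sketch: a minimal, hypothetical illustration of the idea described above, a call graph
whose nodes and edges carry arbitrary metadata. The class and method names (CallGraph, add_call,
annotate, call_paths) and the example functions are assumptions for illustration only; this is
not the MetaCG API.
```python
from collections import defaultdict
class CallGraph:
    """Toy call graph with free-form metadata on nodes and edges (not MetaCG)."""
    def __init__(self):
        self.edges = defaultdict(set)        # caller name to set of callee names
        self.node_meta = defaultdict(dict)   # function name to metadata dict
        self.edge_meta = defaultdict(dict)   # (caller, callee) to metadata dict
    def add_call(self, caller, callee, **meta):
        """Record that caller may call callee, with optional edge metadata."""
        self.edges[caller].add(callee)
        self.edges.setdefault(callee, set())
        self.edge_meta[(caller, callee)].update(meta)
    def annotate(self, function, **meta):
        """Attach arbitrary metadata to a node."""
        self.node_meta[function].update(meta)
    def call_paths(self, start, goal, _path=()):
        """Yield every acyclic call path from start to goal."""
        path = _path + (start,)
        if start == goal:
            yield path
            return
        for callee in sorted(self.edges[start]):
            if callee not in path:
                yield from self.call_paths(callee, goal, path)
# Hypothetical program structure, made up for this example.
cg = CallGraph()
cg.add_call("main", "solve", call_sites=1)
cg.add_call("main", "setup")
cg.add_call("solve", "invert_matrix", call_sites=3)
cg.add_call("setup", "invert_matrix")
cg.annotate("invert_matrix", num_statements=120)
paths = list(cg.call_paths("main", "invert_matrix"))
```
Here paths lists the two possible call paths from main to invert_matrix, and the metadata
dictionaries can hold whatever a later analysis wants to look up.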

02:59.880 --> 03:04.080
And PIRA is the performance instrumentation and refinement automation tool.

03:04.080 --> 03:10.400
This basically tries to automate his job by inserting the appropriate instrumentation

03:10.400 --> 03:16.120
and timing calls into a program, trying to minimize the overhead, but trying to maximize

03:16.120 --> 03:23.160
the amount of information you get about the timing and runtime behavior of your program.

03:23.160 --> 03:29.080
Exactly how PIRA works will be explained later by Peter, but PIRA is basically trying

03:29.080 --> 03:32.920
to figure out that these three functions that are marked in a darker gray are the ones

03:32.920 --> 03:37.240
that make up the kernels of your program, so the part of your program where most of the

03:37.240 --> 03:40.120
computation time is spent.

03:40.120 --> 03:47.160
And both of these tools are technically not dependent on C and C++.

03:47.160 --> 03:52.320
So every time we talk about our tools and what they are doing, usually in an HPC context,

03:52.320 --> 03:55.320
we're always getting asked, well, are you able to do Fortran?

03:55.320 --> 03:59.560
And usually we say, well, Fortran is a little bit different than C, and there's a different

03:59.560 --> 04:01.760
front end, and no, we can't.

04:01.760 --> 04:06.160
But let's give it a shot, right?

04:06.200 --> 04:10.200
But first things first: neither of us really knows Fortran very well.

04:10.200 --> 04:13.120
And luckily, we are not the only ones.

04:13.120 --> 04:17.320
This is a screenshot of an excerpt from the code that he found during his master's thesis working

04:17.320 --> 04:23.000
at the European Space Agency, where they just copied some Fortran code from the original source

04:23.000 --> 04:29.880
and treated it as a black box, which is basically summing up our combined Fortran experience.

04:29.880 --> 04:35.600
And second things second: neither of us has really worked with Fortran IR and MLIR, which

04:35.600 --> 04:40.840
make up a whole lot of the Fortran front end driver.

04:40.840 --> 04:45.440
But we know LLVM IR, and you can get flang-new to emit IR.

04:45.440 --> 04:50.240
So we skip all the scary things that we don't know, and we rely on the things that we actually

04:50.240 --> 04:52.000
have some experience with.

04:52.000 --> 04:59.800
So the idea was: you take flang-new, you ask it nicely to emit some IR, then you take the IR

04:59.800 --> 05:05.400
call graph generation tool that is obviously language-agnostic and will work out of the box.

05:05.400 --> 05:09.360
Then you have your analysis done, and then you can use our tools, namely PIRA, to do

05:09.360 --> 05:13.000
your analysis, and hotspot detection, and whatever, profit.

05:13.000 --> 05:16.640
There's no way this could ever go wrong.
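
NOTE
Editor's sketch: a toy illustration of why a call graph tool that works on LLVM IR is largely
language-agnostic. Once the driver has emitted textual IR (with Clang-style drivers this is
usually requested with flags such as -S -emit-llvm), direct calls look the same regardless of
the source language. The IR below is hand-written and simplified, the mangled names are only
meant to resemble Flang's scheme, and the regex scan ignores indirect calls. It is an assumption
for illustration, not how CaGe is implemented.
```python
import re
# Hand-written, simplified IR sample (an assumption, not real compiler output).
IR_TEXT = """
define void @_QQmain() {
  call void @_QPsetup()
  call void @_QPsolve()
  ret void
}
define void @_QPsolve() {
  call void @_QPinvert()
  ret void
}
declare void @_QPsetup()
declare void @_QPinvert()
"""
DEFINE_RE = re.compile(r"^define\b.*?@([\w.$]+)\(")
CALL_RE = re.compile(r"\bcall\b.*?@([\w.$]+)\(")
def direct_call_edges(ir_text):
    """Map each defined function to the set of functions it calls directly."""
    graph = {}
    current = None
    for line in ir_text.splitlines():
        line = line.strip()
        define = DEFINE_RE.match(line)
        if define:
            current = define.group(1)
            graph[current] = set()
        elif line == "}":
            current = None
        elif current is not None:
            call = CALL_RE.search(line)
            if call:
                graph[current].add(call.group(1))
    return graph
edges = direct_call_edges(IR_TEXT)
```
A real tool would use the LLVM APIs instead of regular expressions and would also have to
reason about indirect calls, which is where the language-specific trouble tends to start.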

05:16.640 --> 05:22.680
And when a colleague came to us from Aachen and asked, well, we have this Jülich KKR

05:22.680 --> 05:30.920
k-loop code that might be interesting to you for your hotspot analysis, we gave it a shot,

05:30.960 --> 05:38.400
and we just loaded the available flang-new driver that we had lying around and gave it a shot.

05:38.400 --> 05:40.360
And the program didn't even configure.

05:40.360 --> 05:46.280
We weren't even able to ask flang-new to provide IR, because during the configure step,

05:46.280 --> 05:52.280
the program checked for the compiler identification, and flang-new was too new for the build

05:52.280 --> 05:55.040
system to recognize the compiler identification.

05:55.040 --> 06:00.520
But not to worry: one afternoon, maybe two afternoons later, we had hacked flang-new's compiler

06:00.560 --> 06:06.000
identification and the appropriate flags into the build system, and we could get it to configure.

06:06.000 --> 06:07.000
Right.

06:07.000 --> 06:11.760
So, after you configure, you compile, but our compile also failed.

06:11.760 --> 06:15.200
The reason for that was an unsupported intrinsic.

06:15.200 --> 06:22.200
Now, this project started in January 2024, and we were relying on LLVM 16, because again,

06:22.200 --> 06:27.440
that was what we had lying around, and we started looking into what is this intrinsic,

06:27.480 --> 06:33.040
do we need to implement it, and some helpful person in the LLVM community had pushed a fix for

06:33.040 --> 06:37.920
that exact intrinsic two weeks earlier to the main branch.

06:37.920 --> 06:42.880
So, instead of taking what we had lying around, we compiled LLVM from source, with Flang

06:42.880 --> 06:49.280
and everything that belongs to the whole toolchain flang-new uses, and we got rid of our intrinsic

06:49.280 --> 06:52.160
error, and we got LLVM IR.

06:52.160 --> 06:56.840
This was another few afternoons to get things working, but now we have IR.

06:56.840 --> 07:03.560
We at least compile to IR, and we take our IR and move it to our tools.

07:03.560 --> 07:10.040
Well, you remember that we had to build the newest version of LLVM, and CaGe, as it was originally

07:10.040 --> 07:16.840
designed to generate call graphs for C and C++, made use of typed pointers for some of its

07:16.840 --> 07:17.840
analysis.

07:17.840 --> 07:21.960
So, mainly VTable analysis to figure out, are we actually calling something that is a function

07:21.960 --> 07:27.240
type, or are we calling into a struct, which then is usually a VTable.

07:27.240 --> 07:32.520
But flang-new just doesn't make use of VTables in the same way that C++ does, so we were able

07:32.520 --> 07:37.800
to just strip out most of our logic of our beloved call graph generation tool, which left

07:37.800 --> 07:42.120
us with a very bare bones call graph generation tool, with basically no other analysis than

07:42.120 --> 07:48.280
just providing you a bare call graph, but now you have call graphs for Fortran codes.

07:48.280 --> 07:53.640
And now that my work with LLVM IR, getting things to compile and the tooling working, was done,

07:53.640 --> 07:59.200
I was able to hand over to my colleague, so he can do the actual PIRA work and the instrumentation

07:59.200 --> 08:01.200
work.

08:01.200 --> 08:07.320
Great, all right, thank you.

08:07.320 --> 08:16.160
To give you some background first on how the measurements are working,

08:16.160 --> 08:22.200
basically: both PIRA and also some of the other measurement tools rely upon Score-P

08:22.200 --> 08:28.480
for the actual measurements, which is a profiling and tracing library coming mostly out of

08:28.480 --> 08:38.480
Jülich and Dresden, I think. Score-P relies on instrumentation for its measurements.

08:38.480 --> 08:42.720
There are a couple of options how this instrumentation can work.

08:42.800 --> 08:48.720
You can use -finstrument-functions and related flags, which gives you some, but not complete,

08:48.720 --> 08:53.680
control over what's actually being instrumented. There's also a GCC plugin that's being

08:53.680 --> 09:00.960
shipped with Score-P that kind of hooks into the compiler and inserts the measurement points.

09:00.960 --> 09:05.000
And it's not released yet, but we checked a couple of days ago, and it's by now a release

09:05.080 --> 09:13.800
candidate, so Score-P 9 will include an LLVM instrumentation plugin, finally. So it basically

09:13.800 --> 09:18.200
does a similar thing to what the GCC plugin does: it hooks into the compilation process

09:18.200 --> 09:22.200
and inserts these measurement probes.

09:22.200 --> 09:29.480
We did some debugging and figured out that it is not working with the older versions

09:29.480 --> 09:34.960
of Score-P, but the fix has been pushed, so you build Score-P from scratch,

09:35.040 --> 09:41.920
and then you try it, and then the compilation fails again, because the Open MPI

09:42.480 --> 09:48.560
Fortran 90 compiler wrapper needs to be aware of the kinds of flags that are supported

09:48.560 --> 09:52.800
by the actual compiler, and it emits flags that are not supported by flang-new.

09:54.000 --> 09:56.560
So one more diversion, basically.

09:58.320 --> 10:03.440
Yeah, well, as I've said, Open MPI needs to be aware of the compiler flags supported by the

10:03.520 --> 10:11.040
actual underlying compiler, and it didn't do that until a few versions ago. So we tried to build

10:11.840 --> 10:17.840
Open MPI 4.1.4, which didn't work, but it so happens that if you use

10:18.800 --> 10:25.920
basically one of the newest Open MPI versions, that version you can build with Flang for the

10:25.920 --> 10:33.680
Fortran parts, then it knows the right flags, and it actually manages to compile an Open

10:33.680 --> 10:39.120
MPI Fortran program. Well, we then had to recompile Score-P again from scratch, because it's

10:39.120 --> 10:44.960
kind of bound to the compiler version and Open MPI version that you use, but after that, we were

10:44.960 --> 10:52.240
actually able to run it in parallel. To give you some background here: I talked about instrumentation,

10:52.240 --> 10:57.920
and I guess most of you are aware, but instrumentation is one of the two major techniques besides

10:57.920 --> 11:04.080
sampling to generate data about a performance run. Basically, it's based on inserting measurement

11:04.080 --> 11:09.280
points into the application. There are a couple of variants of instrumentation. You can do instrumentation

11:09.280 --> 11:13.680
at different points in the process. You could do it manually, you could do it in the compiler and so on,

11:13.680 --> 11:18.960
but we're talking here about compiler instrumentation on a function level, so no loop instrumentation

11:18.960 --> 11:26.320
or anything. Instrumentation is able to produce very precise and reliable measurements, but kind

11:26.320 --> 11:32.720
of introduces the risk of very, very high overheads. So if you instrument every function in an

11:32.720 --> 11:39.360
program, you can get overheads of up to a thousand X or something, or even worse, which

11:39.920 --> 11:45.200
kind of ruins your results. So what you need to do is you need to find a balance between

11:45.200 --> 11:51.280
no instrumentation and full instrumentation with full instrumentation giving you a lot of data,

11:51.280 --> 11:57.520
but you can basically discard the data right away, and no instrumentation where, well, there's no overhead,

11:57.520 --> 12:04.480
but no useful information either. What performance analysts usually do to kind of find

12:04.480 --> 12:10.880
this balance is what we call the build, run and analyze cycle. So they start with a rough initial

12:10.960 --> 12:18.320
overview measurement, and then they iteratively work their way stepwise and deepen the instrumentation

12:18.320 --> 12:22.880
further until they get some instrumentation configuration that is useful to them that actually

12:22.880 --> 12:29.600
gives useful information about the application. And as Tim already kind of announced, the idea of

12:29.600 --> 12:37.840
PIRA is to automate this exact refinement workflow. So PIRA starts by just looking at static

12:37.840 --> 12:42.640
information in the MetaCG call graph, and then it builds some initial instrumentation,

12:42.640 --> 12:48.000
runs the program, and then combines the dynamic and static information to iteratively refine the

12:49.200 --> 12:54.640
instrumentation until it reaches some configuration that hopefully yields a useful profile.
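
NOTE
Editor's sketch: a toy version of the refinement loop described above. Start from a small
instrumentation, run, keep what turned out to be hot, descend into callees that have not been
measured yet, and stop once the selection no longer changes. The call graph, the runtimes, the
50 percent threshold and all function names are made-up assumptions; PIRA's actual heuristics
are not shown in the talk and are not reproduced here.
```python
# Hypothetical static call graph: function name to list of callees.
CALLEES = {"main": ["setup", "solve"], "setup": [], "solve": ["invert", "norm"], "invert": [], "norm": []}
# Hypothetical inclusive runtimes in seconds, standing in for a real profile run.
INCLUSIVE_SECONDS = {"main": 10.0, "setup": 0.5, "solve": 9.4, "invert": 9.0, "norm": 0.1}
def run_profile(instrumented):
    """Pretend to run the program: only instrumented functions report a time."""
    return {name: INCLUSIVE_SECONDS[name] for name in instrumented}
def refine(profile, already_measured, hot_fraction=0.5):
    """Keep hot functions and add those of their callees that were never measured."""
    total = max(profile.values())
    hot = {name for name, secs in profile.items() if secs >= hot_fraction * total}
    new_callees = {c for name in hot for c in CALLEES[name] if c not in already_measured}
    return hot | new_callees
def iterate(initial, max_rounds=10):
    """Repeat run and refine until the instrumented set stops changing."""
    instrumented = set(initial)
    measured = set()
    for _ in range(max_rounds):
        profile = run_profile(instrumented)
        measured |= instrumented
        refined = refine(profile, measured)
        if refined == instrumented:
            break
        instrumented = refined
    return instrumented
result = iterate({"main"})
```
With these made-up numbers the loop ends up instrumenting only main, solve and invert, the
chain that leads to the hypothetical hotspot, and drops the cheap setup and norm functions.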

12:56.880 --> 13:03.040
So that's what we did. And after these several afternoons of debugging, we actually got it... well,

13:03.840 --> 13:08.560
there was one more thing. So Tim already talked about it: there were some problems configuring,

13:08.560 --> 13:12.640
there were some problems building; well, of course there were some problems linking as well.

13:15.280 --> 13:21.520
So it turns out that the Jülich KKR code is kind of designed to work with the Intel MKL

13:21.520 --> 13:27.040
library, and it refused to work with OpenBLAS or something. So we kind of hacked the build system again

13:27.040 --> 13:33.520
to kind of link the MKL things into the flang-new build, which in the end worked, and we got it

13:33.520 --> 13:39.280
working, and PIRA is actually able to successfully instrument the binary and find the relevant

13:39.280 --> 13:44.880
hotspot, which in this case is some kind of matrix inversion which calls into MKL here.

13:47.680 --> 13:55.120
All right, to sum things up: as it turns out, you can use LLVM IR based tools to do

13:55.200 --> 14:01.200
Fortran tooling with flang-new now. It's a little bit hacky sometimes. So from time to time you need to move

14:01.200 --> 14:06.480
to a newer, potentially unreleased version still, but that will sort itself out over time naturally.

14:08.160 --> 14:13.040
The build system is always a bit tricky, so you might need to spend some time hacking your

14:13.040 --> 14:19.520
way around it, and for now you still might need to compile things from source as they're not released yet.

14:20.000 --> 14:26.720
But it is possible, and we demonstrated that using this Jülich mini-app.

14:27.440 --> 14:33.280
So as Tim said, right, we're not Fortran people, not even Flang developers, but basically from

14:33.280 --> 14:39.600
our perspective, which is looking in from the outside, it really looks like things are coming together

14:39.600 --> 14:43.520
for Flang and for Flang tooling as well, and I thought some people might

14:43.600 --> 14:51.760
appreciate that some things are moving forward. To conclude this talk just let me give you a quick

14:51.760 --> 15:00.480
kind of outlook into what's next for us. So as Tim said, at the moment we mostly rely on LLVM IR

15:00.480 --> 15:08.560
to do our tooling, and some things also on source code level. So we think about trying to kind of move

15:08.640 --> 15:15.920
that to Fortran as well. So flang-new does have a kind of source code level

15:17.600 --> 15:23.360
plug-in interface under development. It seems to us that it's pretty much in early stages of

15:23.360 --> 15:29.680
development at the moment and you're basically discouraged from using it as of now but as soon as

15:29.680 --> 15:35.840
that's in some mature state, we're definitely going to take a look at it. And we're always in the

15:35.840 --> 15:41.680
process of refining the heuristics behind PIRA and the other tools to kind of come up with better

15:42.800 --> 15:47.520
instrumentation configurations with less overhead. I'm actually working on a project called

15:47.520 --> 15:53.920
flip right now which is kind of tailored instrumentation to kind of generate just the information

15:53.920 --> 16:00.080
that you need to generate a good profile, and we're also looking at dynamic instrumentation,

16:00.080 --> 16:05.200
which is kind of instrumentation that does not need rebuilds in between iterations so you can just

16:05.280 --> 16:11.120
turn off and turn on the instrumentation at runtime, which is something that LLVM XRay does,

16:11.120 --> 16:16.960
and mostly our colleague Sebastian is using that in his CaPI tool.

16:18.080 --> 16:22.640
All right, thank you very much for your attention, and we're looking forward to any questions. Thank you.

16:35.280 --> 16:51.680
All right, so the question is: what's the killer application for instrumentation-based

16:53.040 --> 17:00.160
profiling against sampling-based profiling? Well, it's kind of always... so you do have both options,

17:00.160 --> 17:06.000
and there are some upsides and downsides for both. So the main thing that kind of gives

17:06.000 --> 17:11.200
instrumentation the edge over sampling in some use cases is that it's reliable: you won't miss anything.

17:12.000 --> 17:17.600
So with sampling, depending on your sampling interval of course, you can't be sure that you're

17:17.600 --> 17:24.800
not missing anything, any smaller functions, which is something you care about especially if you're

17:25.120 --> 17:29.440
tracing, right, and then you also want to see smaller functions, and that's something you can do with

17:29.440 --> 17:34.560
instrumentation. You kind of start using instrumentation to do profiling, and as soon as

17:34.560 --> 17:39.520
you like the instrumentation configuration, you use it for tracing. But apart from that, right,

17:39.520 --> 17:47.760
sampling does similar things with a different approach. Do you have any plans to look at

17:47.760 --> 18:02.160
real applications? I would suggest CP2K, which is considered a Fortran compiler killer. So the question was

18:02.160 --> 18:08.320
if we try to look into things that are not proxy or mini-apps, and CP2K was the exact application

18:08.320 --> 18:16.480
mentioned, which is known to be a Fortran compiler killer. So we are definitely interested in getting things to

18:16.480 --> 18:21.280
work for more complex codes. We looked into mixed codes like CloverLeaf, which use C and Fortran

18:21.280 --> 18:27.920
combined to do things, which also had interesting first results. But the short answer is:

18:28.640 --> 18:34.000
we are still scared of large and big Fortran codes, because if things go wrong, neither of us

18:34.000 --> 18:41.200
will really know what is going wrong. So while it would of course be interesting, there are a lot of

18:41.200 --> 18:48.800
things inside our pipeline that would then have a realistic chance of failing, and the idea is to

18:48.800 --> 18:54.640
work our way slowly up to more and more complex codes and not jump from "this is a small proxy application,

18:54.640 --> 18:59.280
a mini-app for which we have a domain expert whom we can ask nicely to help us figure out what's going

18:59.280 --> 19:03.920
on" to "this is a known compiler killer". Are we interested in having our tooling work on that?

19:03.920 --> 19:18.000
Yes, we are, but we'll give it a little time. So the question was: what about other tools like

19:18.000 --> 19:26.000
flang-tidy, compared to, like, clang-tidy? If I get this correctly, you're asking about whether we

19:26.000 --> 19:32.000
are looking into helping developers write better Fortran code with tools like

19:32.720 --> 19:38.240
flang-tidy. In my experience, which is very limited when it comes to Fortran, Fortran tooling support

19:38.240 --> 19:48.240
is very bare bones. So while there are a lot of things to do, writing a Fortran-based tool that

19:48.240 --> 19:53.920
helps you write syntax and tries to find obvious errors would, again, require us to actually know

19:53.920 --> 20:00.480
what's going on in Fortran. So sadly, again, the answer is: I really see the use case for a tool

20:00.480 --> 20:07.120
like flang-tidy, and I hope that someone with more Fortran experience and knowledge than I will

20:07.120 --> 20:12.720
have a go at it, because you can start doing it: there are compiler plugins available for the

20:12.720 --> 20:18.160
Flang compiler now. Again, the API is under heavy development and it says "do not use for production

20:18.160 --> 20:23.280
use cases" in the documentation, but you can start now and you might be ready, once the API is

20:23.600 --> 20:30.400
stable, to just release flang-tidy. Nice use case; currently not on our roadmap, sorry.

20:33.280 --> 20:35.280
Thank you very much.


