WEBVTT

00:00.000 --> 00:15.000
Okay, so our next speaker is Johan Herland, and he's going to talk about FawltyDeps: undeclared and unused dependencies.

00:15.000 --> 00:17.000
Give him a warm welcome.

00:17.000 --> 00:28.000
Thank you very much. Hello, my name is Johan Herland, and you can find me as jherland on GitHub,

00:28.000 --> 00:30.000
on Mastodon, on LinkedIn, wherever.

00:30.000 --> 00:35.000
I'm a Norwegian, but I currently live in Delft in the Netherlands, and I work as a developer

00:35.000 --> 00:41.000
productivity engineer at Tweag, which is a software consultancy within the larger Modus Create platform.

00:41.000 --> 00:46.000
At Tweag, we help our clients solve their problems with open source software, and we contribute back to the open source community

00:46.000 --> 00:53.000
whenever possible. Today, I will talk to you about FawltyDeps. It's a dependency checker for your Python code.

00:53.000 --> 01:00.000
But I want to start with some motivation for why this is such a big problem in the first place, and why we decided to build this tool.

01:00.000 --> 01:09.000
So the following looks at the problem from one particular angle, but there are many different avenues that lead to why a Python dependency checker might be a good idea.

01:09.000 --> 01:14.000
So one of the major problems in science and academia today is known as the replication crisis.

01:14.000 --> 01:20.000
A lot of scientific results are being published, but then other scientists struggle to reproduce the same results.

01:20.000 --> 01:27.000
And this is actually core to the concept of science itself, because if a result cannot be reproduced, then is it really science at all?

01:27.000 --> 01:34.000
The field of data science has a particular responsibility here, because not only has it grown into a huge discipline in its own right,

01:34.000 --> 01:37.000
but it's also become an important tool in many other scientific disciplines.

01:38.000 --> 01:43.000
And Python, being such an important tool in data science, is also affected by this.

01:43.000 --> 01:49.000
I want to briefly highlight two papers that have looked at reproducibility and Python specifically.

01:49.000 --> 01:56.000
First, we have this paper from August 2023 that looked at the reproducibility of Jupyter notebooks in the life sciences.

01:56.000 --> 02:02.000
These researchers collected a large number of notebooks via papers available on PubMed Central.

02:02.000 --> 02:07.000
And then they automated a process to run these notebooks in an attempt to reproduce the results.

02:07.000 --> 02:09.000
I'm going to jump straight to the numbers they found.

02:09.000 --> 02:12.000
And spoiler alert: they're not very good.

02:12.000 --> 02:17.000
Here is from their abstract: out of more than 22,000 Python notebooks,

02:17.000 --> 02:22.000
almost 16,000, or 70%, declared their dependencies.

02:22.000 --> 02:26.000
Around 10,000 or 46% had dependencies that could be installed.

02:26.000 --> 02:31.000
Around 1200 or 5.3% could be run without errors.

02:31.000 --> 02:37.000
Only 879 or 3.9% could actually reproduce the results in the end.

02:37.000 --> 02:41.000
So let's take a closer look at the notebooks that were run, but then failed with an error.

02:41.000 --> 02:46.000
This corresponds to the yellow bar, but subtracting those that actually ran without errors.

02:46.000 --> 02:51.000
The most commonly raised error by far is ModuleNotFoundError.

02:51.000 --> 02:55.000
And in position number three, we find the more general ImportError.

02:55.000 --> 02:58.000
Both of these are closely related to missing dependencies.
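As a quick aside, the close relationship between these two errors is baked into Python itself: ModuleNotFoundError is a subclass of ImportError. A minimal sketch (the module name here is made up and assumed not to be installed):

```python
# ModuleNotFoundError (raised when a module cannot be located) is a
# subclass of the more general ImportError, so catching ImportError
# covers both cases.
try:
    import no_such_module_hopefully  # hypothetical name, assumed absent
except ModuleNotFoundError as exc:
    print(type(exc).__name__)  # ModuleNotFoundError

assert issubclass(ModuleNotFoundError, ImportError)
```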

02:58.000 --> 03:05.000
And together, these two errors were raised by 29% of the total number of Python notebooks.

03:05.000 --> 03:10.000
In other words, if we overlay these numbers and focus on the failures at each stage,

03:10.000 --> 03:14.000
we find that 30% of notebooks fail to declare any dependencies at all.

03:14.000 --> 03:19.000
For another 24% of them, the declared dependencies fail to install.

03:19.000 --> 03:23.000
And even then, for another 29% there are errors while trying to import dependencies.

03:23.000 --> 03:29.000
So when we look at everything together, it looks like more than 80% of notebooks failed to reproduce

03:29.000 --> 03:32.000
because of missing or incorrect dependency declarations.

03:32.000 --> 03:35.000
Let's quickly move on to the second study.

03:35.000 --> 03:39.000
This paper came out a couple of months after the first, and it looks at the PyPI ecosystem,

03:39.000 --> 03:42.000
which is a very central part of the overall Python community.

03:42.000 --> 03:48.000
These authors start from the 10,000 most popular libraries on PyPI,

03:48.000 --> 03:52.000
narrowing it down to 8,282 libraries that have more than two releases.

03:52.000 --> 03:58.000
They then attempt to install each and every release of these libraries to find issues in the project configuration.

03:58.000 --> 04:01.000
These issues could be related to dependency declarations,

04:01.000 --> 04:06.000
problems that become apparent when you try to install and use the library from PyPI.

04:06.000 --> 04:09.000
I'm going to shortcut to the most relevant numbers here.

04:09.000 --> 04:15.000
They find that only 65% of PyPI libraries can be successfully installed and imported into your code.

04:15.000 --> 04:19.000
Around 5% fail to install for various other reasons.

04:19.000 --> 04:25.000
So around 30% can be installed, but may encounter errors as soon as you import them.

04:25.000 --> 04:30.000
Depending on exactly which parts you import and what your runtime environment looks like.

04:30.000 --> 04:33.000
Digging a bit into how they categorize the various failures.

04:33.000 --> 04:39.000
We can see that the vast majority of failures are again due to missing or incorrect dependency declarations.

04:39.000 --> 04:46.000
So both these papers show that missing or incorrect dependency declarations is a significant and important problem in the Python world.

04:46.000 --> 04:51.000
But what do I really mean when I talk about missing or incorrect dependency declarations?

04:51.000 --> 04:56.000
So let's quickly recap the basics of dependencies and dependency declarations in Python.

04:56.000 --> 05:00.000
This should be familiar to anybody who has worked with Python for a while.

05:00.000 --> 05:04.000
I import some third-party library like NumPy into my Python code.

05:04.000 --> 05:09.000
When Python runs this code, it will look into my runtime environment for the NumPy library.

05:09.000 --> 05:11.000
This is the dependency.

05:11.000 --> 05:14.000
In order for my code to work, NumPy has to be installed.

05:14.000 --> 05:16.000
But let's assume that I have now installed NumPy.

05:16.000 --> 05:18.000
So this is no longer a problem, right?

05:18.000 --> 05:23.000
Well, what happens when you copy my code onto your machine and try to run it?

05:23.000 --> 05:25.000
It fails because you haven't installed NumPy.

05:25.000 --> 05:27.000
This is the works-on-my-machine problem.

05:27.000 --> 05:31.000
The code we run is the same, but the runtime environment is different.

05:31.000 --> 05:36.000
And because of those differences, it works on my machine, but not on yours.

05:36.000 --> 05:41.000
The real problem here is that I fail to declare my dependency on NumPy.

05:41.000 --> 05:45.000
Now in Python, there are many different ways to declare dependencies.

05:45.000 --> 05:50.000
I could write a pyproject.toml or a setup.py to configure my project.

05:50.000 --> 05:52.000
Setup.cfg is another popular alternative.

05:52.000 --> 05:56.000
And we can even discuss whether the dependencies should be declared loosely.

05:56.000 --> 05:58.000
Like here, I'm just depending on NumPy.

05:58.000 --> 06:01.000
Or pinned to a specific version, like that.

06:01.000 --> 06:04.000
However, this talk is not about those details.

06:04.000 --> 06:08.000
So instead, I'll just go for the simplest possible declaration here.

06:08.000 --> 06:14.000
This requirements.txt file is the bare minimum that any tool would need to be able to discover our dependency

06:14.000 --> 06:16.000
and help our users to resolve it.

06:16.000 --> 06:21.000
By writing NumPy into this file, we declare that we expect to have it available when we run this project

06:21.000 --> 06:24.000
and it is a declaration of our NumPy dependency.

06:24.000 --> 06:27.000
So let's import another module.

06:27.000 --> 06:29.000
pandas, into our code.

06:29.000 --> 06:32.000
This is a new dependency, but we haven't yet declared it.

06:32.000 --> 06:35.000
So, unsurprisingly, we call this an undeclared dependency.

06:35.000 --> 06:37.000
And we can have the opposite problem as well.

06:37.000 --> 06:41.000
So let's say that we experimented with TensorFlow at some point in our code.

06:41.000 --> 06:44.000
And we did the right thing by declaring it in our requirements.txt.

06:44.000 --> 06:47.000
But then later, we ended up not using it after all,

06:47.000 --> 06:50.000
and we forgot to remove it from our dependency declaration.

06:50.000 --> 06:53.000
This is now an unused dependency.

06:53.000 --> 06:57.000
Unused dependencies do not directly affect the reproducibility of your project,

06:57.000 --> 07:01.000
and are therefore not as critical as undeclared dependencies.

07:01.000 --> 07:03.000
But we still want to avoid them.

07:03.000 --> 07:05.000
TensorFlow, for example, is several hundred megabytes,

07:05.000 --> 07:09.000
and this can end up wasting a lot of space and time for our users.

07:09.000 --> 07:10.000
So to summarize.

07:10.000 --> 07:13.000
On the left-hand side, we have our real dependencies,

07:13.000 --> 07:16.000
the set of packages that we actually depend on in our Python code.

07:16.000 --> 07:18.000
And on the right, we have our declared dependencies,

07:18.000 --> 07:21.000
the set of packages that we have listed as required,

07:21.000 --> 07:25.000
either in our project configuration or in something like requirements.txt.

07:25.000 --> 07:28.000
We can make a Venn diagram out of these:

07:28.000 --> 07:30.000
our undeclared dependencies end up on the left,

07:30.000 --> 07:32.000
these make it hard or impossible to use,

07:32.000 --> 07:34.000
and reproduce our project,

07:34.000 --> 07:37.000
and conversely, our unused dependencies end up on the right.

07:37.000 --> 07:39.000
These add unnecessary bloat to our project.

07:39.000 --> 07:42.000
Obviously, we want all our dependencies to end up in the center.

07:42.000 --> 07:47.000
We want to properly declare the dependencies we actually use and only those.
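The comparison behind this Venn diagram boils down to plain set arithmetic. A toy illustration (not FawltyDeps' actual implementation; the package sets are from the talk's example):

```python
# Toy illustration of the Venn diagram: compare the packages a project
# actually imports against the packages it declares as dependencies.
actual = {"numpy", "pandas"}          # imported in the code
declared = {"numpy", "tensorflow"}    # listed in requirements.txt

undeclared = actual - declared        # used but never declared
unused = declared - actual            # declared but never used
ok = actual & declared                # the happy center of the diagram

print(sorted(undeclared))  # ['pandas']
print(sorted(unused))      # ['tensorflow']
print(sorted(ok))          # ['numpy']
```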

07:47.000 --> 07:49.000
As I said at the start,

07:49.000 --> 07:52.000
at Tweag, we help clients across many industries to solve their problems,

07:52.000 --> 07:55.000
and we often encounter projects, with or without Python,

07:55.000 --> 07:59.000
that struggle with reproducibility in one way or another.

07:59.000 --> 08:01.000
A couple of years ago, a colleague of mine,

08:01.000 --> 08:04.000
Nora, who then worked in the data engineering group at Tweag,

08:04.000 --> 08:07.000
found herself repeatedly encountering projects that suffered

08:07.000 --> 08:08.000
from undeclared dependencies.

08:08.000 --> 08:11.000
And she wanted to automate the kind of dependency analysis

08:11.000 --> 08:13.000
that we just did in the previous slides.

08:13.000 --> 08:17.000
Maria, also then in our data engineering group,

08:17.000 --> 08:19.000
joined her, and a little later I also joined.

08:19.000 --> 08:22.000
I'm not part of the data engineering group.

08:22.000 --> 08:24.000
I'm part of the scalable builds group at Tweag.

08:24.000 --> 08:27.000
So my usual day-to-day consists more of fixing build systems,

08:27.000 --> 08:30.000
and less about data science and data engineering.

08:30.000 --> 08:35.000
But even if I usually don't work on the same kinds of projects and clients as Nora and Maria,

08:35.000 --> 08:39.000
I still often see the same kinds of issues with respect to lack of reproducibility

08:39.000 --> 08:41.000
and the works-on-my-machine problem.

08:41.000 --> 08:44.000
Along the way, we've had some more Tweag colleagues,

08:44.000 --> 08:45.000
jump in and out of the project,

08:45.000 --> 08:48.000
and today we have Jihan and Richard with us as well.

08:48.000 --> 08:50.000
Together, we have created FawltyDeps,

08:50.000 --> 08:53.000
a tool for finding undeclared and unused dependencies in Python projects.

08:53.000 --> 08:57.000
FawltyDeps is mostly a passion project that we work on

08:57.000 --> 08:59.000
in between client engagements.

08:59.000 --> 09:01.000
And yes, the name FawltyDeps is based on the

09:01.000 --> 09:05.000
Monty Python-adjacent sitcom Fawlty Towers.

09:05.000 --> 09:11.000
The point of FawltyDeps is to do exactly the kind of analysis that we have just walked through.

09:11.000 --> 09:13.000
When you run FawltyDeps on your project,

09:13.000 --> 09:16.000
FawltyDeps will find the set of packages that you import in your code

09:16.000 --> 09:19.000
and compare that to your declared dependencies.

09:19.000 --> 09:23.000
It then produces a report showing you any undeclared or unused dependencies.

09:23.000 --> 09:26.000
So you can consider FawltyDeps as a linter,

09:26.000 --> 09:28.000
not a regular linter for your code,

09:28.000 --> 09:31.000
but rather a linter for your dependencies.

09:31.000 --> 09:36.000
Here, we see FawltyDeps being run without any options on our little toy project.

09:36.000 --> 09:39.000
The output first lists the undeclared dependencies

09:39.000 --> 09:41.000
that FawltyDeps found, in this case, pandas.

09:41.000 --> 09:46.000
And below, we see the unused dependencies, in this case, TensorFlow.

09:46.000 --> 09:48.000
And if you run FawltyDeps in a large project,

09:48.000 --> 09:51.000
you can ask for a more detailed report as well.

09:51.000 --> 09:54.000
And here, we get the file and line numbers

09:54.000 --> 09:58.000
where each undeclared and unused dependency is located.

09:58.000 --> 10:01.000
So where can you use FawltyDeps?

10:01.000 --> 10:04.000
We support Python code, whether it's in a regular

10:04.000 --> 10:06.000
.py script or inside a Jupyter notebook.

10:06.000 --> 10:09.000
We support any version of Python that is not ancient.

10:09.000 --> 10:12.000
We run on POSIX systems like Linux and macOS,

10:12.000 --> 10:15.000
and we also added support for Windows.

10:15.000 --> 10:17.000
For dependency declarations,

10:17.000 --> 10:21.000
we support all the most common formats used in the Python ecosystem today.

10:21.000 --> 10:25.000
requirements.txt is probably the most commonly used format.

10:25.000 --> 10:27.000
If you're starting a new Python project today,

10:27.000 --> 10:29.000
though, you should obviously use pyproject.toml.

10:29.000 --> 10:32.000
And preferably in the way that is recently

10:32.000 --> 10:34.000
standardized in PEP 621 and related PEPs.

10:34.000 --> 10:37.000
That said, we also support the custom fields

10:37.000 --> 10:39.000
that Poetry uses in pyproject.toml.

10:39.000 --> 10:42.000
A lot of Python packages use a setup.py file

10:42.000 --> 10:43.000
to declare dependencies.

10:43.000 --> 10:45.000
And supporting this is more complicated

10:45.000 --> 10:48.000
because a setup.py can really contain

10:48.000 --> 10:50.000
arbitrarily complex Python code.

10:50.000 --> 10:52.000
And we don't want to execute this.

10:52.000 --> 10:55.000
That said, for simple setup.py files,

10:55.000 --> 10:58.000
we are currently able to parse out the declared dependencies.

10:58.000 --> 11:01.000
We also recently started supporting Conda projects

11:01.000 --> 11:03.000
and also the Pixi package manager,

11:03.000 --> 11:05.000
which tries to bridge the gap between

11:05.000 --> 11:08.000
the Conda and PyPI ecosystems.

11:08.000 --> 11:12.000
So, you can get FawltyDeps the same way that you typically consume

11:12.000 --> 11:14.000
your Python dependencies.

11:14.000 --> 11:16.000
A pip install will work, but of course,

11:16.000 --> 11:19.000
a careless pip install is often the first step towards adding

11:19.000 --> 11:21.000
an undeclared dependency in the first place.

11:21.000 --> 11:23.000
So, you should rather think about whether you want to add

11:23.000 --> 11:25.000
it to your project as a development dependency

11:25.000 --> 11:28.000
or as a standalone tool that is generally available

11:28.000 --> 11:30.000
in your user environment.

11:30.000 --> 11:34.000
If you use a modern tool like uv or pipx or Poetry or PDM

11:34.000 --> 11:37.000
or anything similar, there is typically a command available.

11:37.000 --> 11:39.000
That will do the right thing.

11:39.000 --> 11:42.000
We also specifically support running FawltyDeps as a pre-commit

11:42.000 --> 11:44.000
hook or as a GitHub action in your CI pipeline.

11:44.000 --> 11:47.000
More about this soon.

11:47.000 --> 11:51.000
I want to emphasize the configurability of FawltyDeps.

11:51.000 --> 11:53.000
You can configure everything on the command line,

11:53.000 --> 11:55.000
but also via environment variables or the [tool.fawltydeps]

11:55.000 --> 11:57.000
section in pyproject.toml.

11:57.000 --> 12:00.000
And finally, we have also a command line option

12:00.000 --> 12:03.000
to quickly create this [tool.fawltydeps] configuration

12:03.000 --> 12:07.000
based on the current command line options.

12:07.000 --> 12:10.000
Let's look at some more output options.

12:10.000 --> 12:14.000
We want to make FawltyDeps as transparent and helpful as possible

12:14.000 --> 12:17.000
so that you can understand how and why something ends up

12:17.000 --> 12:19.000
in the final report.

12:19.000 --> 12:21.000
You can run FawltyDeps with --list-sources

12:21.000 --> 12:24.000
to see which files FawltyDeps reads while looking

12:24.000 --> 12:26.000
for your actual and declared dependencies.

12:26.000 --> 12:30.000
This is often the first step when debugging FawltyDeps.

12:30.000 --> 12:33.000
You're basically asking the question: is it finding the actual code

12:33.000 --> 12:36.000
and the dependencies that I have declared?

12:36.000 --> 12:39.000
Next, you can run it with --list-imports to see which

12:39.000 --> 12:42.000
imports it finds in your code.

12:42.000 --> 12:45.000
And you can run it with --list-deps to see which dependencies

12:45.000 --> 12:47.000
it finds to have been declared.
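Conceptually, finding imports means parsing the code rather than running it. Here's a minimal sketch of that idea using Python's standard ast module (a simplification; FawltyDeps' real logic handles many more cases):

```python
import ast

def find_imports(source: str) -> set[str]:
    """Return the top-level module names imported by the given source."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import numpy.linalg as la" -> record "numpy"
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from pandas.io import json" -> record "pandas";
            # relative imports (level > 0) are not third-party dependencies
            found.add(node.module.split(".")[0])
    return found

code = "import numpy as np\nfrom pandas.io import json\n"
print(sorted(find_imports(code)))  # ['numpy', 'pandas']
```

Because this only parses the source, it works equally well on code whose dependencies aren't installed.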

12:47.000 --> 12:50.000
Finally, if you want to write some script to interact with the analysis

12:50.000 --> 12:53.000
that FawltyDeps does, we have a --json flag,

12:53.000 --> 12:57.000
and you can get machine-readable output with all the gory details.

12:57.000 --> 13:01.000
So let's move on to deploying FawltyDeps into your project workflow.

13:01.000 --> 13:04.000
Running FawltyDeps on your project once in a while is obviously a good idea,

13:04.000 --> 13:07.000
but there is extra value if you can automatically run it

13:07.000 --> 13:08.000
whenever you make changes.

13:08.000 --> 13:11.000
The pre-commit tool is one way to do this,

13:11.000 --> 13:14.000
and I'll show you how to set up FawltyDeps to run on every commit

13:14.000 --> 13:15.000
in your repository.

13:15.000 --> 13:17.000
This is mostly useful in the beginning of your project,

13:17.000 --> 13:19.000
while it's still relatively small and dependencies

13:19.000 --> 13:21.000
come and go all the time.

13:21.000 --> 13:23.000
If we are not already using the pre-commit tool,

13:23.000 --> 13:25.000
we need to install that first.

13:26.000 --> 13:28.000
The pre-commit homepage lists a few different ways.

13:28.000 --> 13:31.000
Here I'll just do a simple pip install.

13:31.000 --> 13:35.000
Next we need to tell pre-commit how to run FawltyDeps.

13:35.000 --> 13:39.000
So we create this .pre-commit-config.yaml file,

13:39.000 --> 13:41.000
and we add these lines to it.

13:41.000 --> 13:43.000
These lines tell pre-commit where to find FawltyDeps,

13:43.000 --> 13:46.000
which version to use, and which parts of FawltyDeps to run.

13:46.000 --> 13:49.000
We want to find both undeclared and unused dependencies

13:49.000 --> 13:52.000
in the code that we're trying to commit, so we put that there.

13:52.000 --> 13:55.000
And next, we activate the pre-commit hook

13:55.000 --> 13:57.000
by running pre-commit install.

13:57.000 --> 14:01.000
Pre-commit is now armed and ready to run on every commit that we do.

14:01.000 --> 14:03.000
So let's test it out.

14:03.000 --> 14:05.000
So we add an import statement to our code,

14:05.000 --> 14:07.000
and when we try to commit this,

14:07.000 --> 14:09.000
pre-commit will fail and prevent us from committing

14:09.000 --> 14:11.000
an undeclared dependency.

14:11.000 --> 14:13.000
You can see that we also get the FawltyDeps output,

14:13.000 --> 14:16.000
pointing to which dependency and where it was imported.

14:16.000 --> 14:20.000
So let's fix this by declaring the pandas dependency properly.

14:20.000 --> 14:22.000
So we add it to our requirements.txt,

14:22.000 --> 14:24.000
and then we retry the commit,

14:24.000 --> 14:26.000
and now pre-commit succeeds and our commit is done.

14:26.000 --> 14:30.000
The same interaction happens when we try to add an unused dependency.

14:30.000 --> 14:33.000
So let's add TensorFlow to the requirements.txt,

14:33.000 --> 14:36.000
but without actually using it anywhere in our code.

14:36.000 --> 14:38.000
And now when we try to commit,

14:38.000 --> 14:41.000
pre-commit fails again and FawltyDeps tells us what's up.

14:41.000 --> 14:45.000
Another way to make sure FawltyDeps runs regularly

14:45.000 --> 14:48.000
and quickly catches problems with your dependencies is to run it

14:48.000 --> 14:51.000
as a check in your CI, or continuous integration, system.

14:51.000 --> 14:54.000
With CI checks, you typically have a little more patience

14:54.000 --> 14:55.000
than in a pre-commit hook.

14:55.000 --> 14:57.000
So this is usually the better option in larger projects

14:57.000 --> 15:00.000
where analyzing the code can take a little longer.

15:00.000 --> 15:03.000
Here I show a GitHub Actions setup for FawltyDeps.

15:03.000 --> 15:05.000
If you use a different CI system,

15:05.000 --> 15:07.000
the setup should typically follow similar steps,

15:07.000 --> 15:10.000
although the syntax will probably be different.

15:10.000 --> 15:14.000
I'm also using the FawltyDeps action helper project here,

15:14.000 --> 15:17.000
which we have created to make it very easy to run FawltyDeps in your CI.

15:17.000 --> 15:19.000
But whether you use that or not,

15:19.000 --> 15:21.000
the end result should look pretty familiar to anyone

15:21.000 --> 15:24.000
who has already set up GitHub actions for a Python project.

15:24.000 --> 15:27.000
First, we set up a GitHub action by writing a file

15:27.000 --> 15:29.000
like this into our repository.

15:29.000 --> 15:32.000
We spend a couple of lines declaring what platforms

15:32.000 --> 15:34.000
we're running on and checking out our Python project

15:34.000 --> 15:37.000
and then comes the FawltyDeps part.

15:37.000 --> 15:40.000
We refer to the FawltyDeps action helper with a pinned version

15:40.000 --> 15:42.000
and then we also pass an extra option to indicate

15:42.000 --> 15:46.000
that we want to run FawltyDeps with the --detailed option.

15:46.000 --> 15:49.000
If you have configured FawltyDeps in your Python repository's

15:49.000 --> 15:51.000
pyproject.toml file, that configuration will of course

15:51.000 --> 15:55.000
also be picked up and taken into account by FawltyDeps.

15:55.000 --> 15:57.000
Once this is committed to your project,

15:57.000 --> 16:00.000
GitHub will automatically run FawltyDeps on new commits

16:00.000 --> 16:01.000
and pull requests.

16:01.000 --> 16:03.000
So let's see what this looks like when we create

16:03.000 --> 16:06.000
a pull request that adds an undeclared dependency.

16:06.000 --> 16:10.000
Here we have a pull request that adds our old friend pandas

16:10.000 --> 16:12.000
as an undeclared dependency.

16:12.000 --> 16:14.000
Once the GitHub action has run,

16:14.000 --> 16:16.000
we get an indication on the pull request overview

16:16.000 --> 16:18.000
that the FawltyDeps check is failing.

16:18.000 --> 16:20.000
And if we go to the details of the failed check

16:20.000 --> 16:23.000
and then to the summary, we can see the output of

16:23.000 --> 16:27.000
FawltyDeps pointing us to the problematic pandas import.

16:27.000 --> 16:31.000
The toy example we've used so far has helped show the basic concepts.

16:31.000 --> 16:33.000
But things are always more complicated in practice.

16:33.000 --> 16:35.000
First we will look at a part of the dependency

16:35.000 --> 16:37.000
analysis that we've glossed over so far

16:37.000 --> 16:41.000
but that is really crucial in order to produce correct results.

16:41.000 --> 16:44.000
And this is one of the cases where we believe that FawltyDeps

16:44.000 --> 16:47.000
goes the extra mile, especially compared to doing dependency

16:47.000 --> 16:49.000
analysis by hand.

16:49.000 --> 16:52.000
So let's skip the rest of the demo.

16:52.000 --> 16:54.000
And I'm going to talk a little bit about

16:54.000 --> 16:56.000
machine import names and package names in Python.

16:56.000 --> 16:59.000
It's more tricky than you would think.

16:59.000 --> 17:02.000
So far we have imported NumPy in our Python code

17:02.000 --> 17:05.000
and we've declared the NumPy package as our dependency.

17:05.000 --> 17:08.000
NumPy happens to use the same name for the package and the import.

17:08.000 --> 17:11.000
But this is not actually a strict requirement in Python.

17:11.000 --> 17:13.000
The package name you declare as a dependency

17:13.000 --> 17:17.000
does not always have to match the name that you need to import in your code.

17:17.000 --> 17:21.000
The package scikit-learn is a popular example of this.

17:21.000 --> 17:23.000
The package name that you pip install or

17:23.000 --> 17:25.000
declare as a dependency is scikit-learn.

17:25.000 --> 17:28.000
But the name that you actually import in your code is called

17:28.000 --> 17:30.000
sklearn.

17:30.000 --> 17:33.000
There are also cases where a single package provides multiple

17:33.000 --> 17:34.000
import names.

17:34.000 --> 17:37.000
So the setuptools package, for example, has long provided both

17:37.000 --> 17:41.000
the setuptools import name as well as the pkg_resources import name.

17:41.000 --> 17:45.000
In order to import either of those modules, you need to have setuptools

17:45.000 --> 17:46.000
installed.

17:46.000 --> 17:50.000
There are even more complex cases as well which I won't go into here.

17:50.000 --> 17:53.000
But it should already be clear enough that we need some kind of mapping

17:53.000 --> 17:55.000
between import names and package names.

17:55.000 --> 17:58.000
And without a good mapping, we won't be able to correctly

17:58.000 --> 18:01.000
match import names to declared dependencies and will end up

18:01.000 --> 18:05.000
reporting undeclared and unused dependencies incorrectly.

18:06.000 --> 18:09.000
Fortunately, we can always build a good mapping

18:09.000 --> 18:12.000
when we have access to your project's Python environment.

18:12.000 --> 18:17.000
Now, we don't really know how or where you install your project dependencies.

18:17.000 --> 18:20.000
It could be a virtualenv somewhere, or you could depend on a

18:20.000 --> 18:23.000
system-wide installation or something else.

18:23.000 --> 18:26.000
However, let's assume that your project dependencies have been installed

18:26.000 --> 18:28.000
somewhere.

18:40.000 --> 18:44.000
And when you run your project, Python must be able to find those dependencies

18:44.000 --> 18:46.000
otherwise your project wouldn't work at all.

18:46.000 --> 18:50.000
So as long as FawltyDeps can find the same environment where the dependencies

18:50.000 --> 18:54.000
are installed, then it can build a correct mapping.

18:54.000 --> 18:57.000
So naturally, we put a lot of effort into finding the right Python

18:57.000 --> 19:00.000
environment so that the analysis will be as correct as possible.

19:01.000 --> 19:05.000
First, we look for environments that are associated directly with the project.

19:05.000 --> 19:09.000
Without any configuration, we will search the current directory for any

19:09.000 --> 19:13.000
Python environments located inside your project directory, such as a virtualenv.

19:13.000 --> 19:18.000
And as long as we can find those, everything is fine.

19:18.000 --> 19:22.000
However, you can also pass one or more paths directly to FawltyDeps.

19:22.000 --> 19:25.000
This will limit where we look for code, dependency declarations, and also for

19:25.000 --> 19:27.000
Python environments.

19:28.000 --> 19:32.000
This is often useful in monorepo settings where your repository contains

19:32.000 --> 19:35.000
a lot of more or less independent Python projects.

19:35.000 --> 19:39.000
So you can then point to the Python project that you want to analyze.

19:39.000 --> 19:44.000
You can also use the --pyenv option to directly point FawltyDeps at your project

19:44.000 --> 19:45.000
environment.

19:45.000 --> 19:50.000
In addition to looking inside the project, we will also always consider the Python

19:50.000 --> 19:53.000
environment in which FawltyDeps happens to be installed.

19:53.000 --> 19:56.000
Thus, if you install FawltyDeps into the same environment where your project

19:56.000 --> 19:58.000
is installed, you will always get good results.

19:58.000 --> 20:04.000
Together, these two approaches give good results as long as the dependencies are

20:04.000 --> 20:05.000
installed.

20:05.000 --> 20:08.000
But there might be situations where the dependencies are not installed.

20:08.000 --> 20:10.000
So how do we deal with those?

20:10.000 --> 20:14.000
First, for really advanced users, we have a way to provide a custom mapping.

20:14.000 --> 20:18.000
Where you can tell FawltyDeps directly how package names

20:18.000 --> 20:19.000
and import names correspond.

20:19.000 --> 20:23.000
But it's usually not something that you would want to configure.

20:23.000 --> 20:26.000
But when it's given, it takes precedence.

20:26.000 --> 20:30.000
Otherwise, we have a really cool option called --install-deps.

20:30.000 --> 20:33.000
This sets up a temporary virtualenv and installs missing dependencies from

20:33.000 --> 20:36.000
PyPI, just so that FawltyDeps can get the correct mapping.

20:36.000 --> 20:42.000
And this is obviously quite expensive, so it's not enabled by default.

20:42.000 --> 20:47.000
Our final resort, when all of these fail, is something that we call the

20:47.000 --> 20:48.000
identity mapping.

20:48.000 --> 20:52.000
And this is a simple assumption that package names and import names actually correspond.

20:52.000 --> 20:58.000
So if we have no other solution, then that is sort of the final resort that we

20:58.000 --> 21:00.000
go to.

21:00.000 --> 21:05.000
And it works out correctly in many cases, but it's also sometimes incorrect.

21:05.000 --> 21:08.000
And that's why we only use it as the final resort.
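The fallback chain just described can be sketched as a simple resolution funnel. This is a hypothetical simplification, not FawltyDeps' actual code; the function names and data shapes here are made up for illustration:

```python
def identity_mapping(package_name: str) -> set[str]:
    # Final resort: assume the import name equals the package name,
    # with dashes and dots normalized to underscores as in Python
    # module naming. Often right, sometimes wrong (e.g. scikit-learn).
    return {package_name.lower().replace("-", "_").replace(".", "_")}

def resolve_imports(package_name, custom_mapping=None, environments=()):
    """Hypothetical funnel: try each mapping source in order of precedence."""
    # 1. A user-provided custom mapping always wins.
    if custom_mapping and package_name in custom_mapping:
        return custom_mapping[package_name]
    # 2. Then any installed environment that knows this package.
    for env in environments:
        if package_name in env:
            return env[package_name]
    # 3. Otherwise, fall back to the identity mapping.
    return identity_mapping(package_name)

# With an environment that knows scikit-learn, we get the right answer:
env = {"scikit-learn": {"sklearn"}}
print(resolve_imports("scikit-learn", environments=[env]))  # {'sklearn'}

# Without it, the identity mapping gives a plausible but wrong guess:
print(resolve_imports("scikit-learn"))  # {'scikit_learn'}
```

The two print calls mirror the scikit-learn walkthrough that follows: the environment-based lookup succeeds, while the identity fallback guesses wrong.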

21:08.000 --> 21:12.000
So let's quickly go over this again with an example like scikit-learn.

21:12.000 --> 21:16.000
So remember, when we declare a dependency on scikit-learn in requirements.txt,

21:16.000 --> 21:19.000
what we actually import in Python is called sklearn.

21:20.000 --> 21:23.000
FawltyDeps needs to figure out that these two names are related.

21:23.000 --> 21:27.000
Otherwise, it would report sklearn as undeclared and scikit-learn as unused.

21:27.000 --> 21:30.000
So here's how we figure this out today.

21:30.000 --> 21:34.000
We take the dependency name, scikit-learn, and we pass it to the first stage in our

21:34.000 --> 21:38.000
funnel, the custom mapping, which takes precedence over anything else.

21:38.000 --> 21:40.000
But as I said, it's usually not provided.

21:40.000 --> 21:44.000
It's for power users who want very tight control over this.

21:44.000 --> 21:49.000
So next, we look at the Python environments associated with the project.

21:49.000 --> 21:52.000
If we can find an environment where scikit-learn is installed,

21:52.000 --> 21:55.000
then we'll see that sklearn is indeed provided by this package,

21:55.000 --> 21:56.000
and we have a match.

21:56.000 --> 21:58.000
Otherwise, we move on.

21:58.000 --> 22:01.000
and look at the environment where FawltyDeps is installed,

22:01.000 --> 22:03.000
same procedure here, if we find scikit-learn, we're done.

22:03.000 --> 22:05.000
Otherwise, we're starting to run out of options.

22:05.000 --> 22:08.000
So now, we check:

22:08.000 --> 22:11.000
did we use the --install-deps option? In that case,

22:11.000 --> 22:15.000
we can auto-install from PyPI, and find the correspondence between scikit-learn and

22:15.000 --> 22:17.000
sklearn that way.

22:17.000 --> 22:20.000
And then the very final resort, if --install-deps is not used,

22:20.000 --> 22:22.000
we need to use the identity mapping.

22:22.000 --> 22:24.000
And in this case, that actually gives us a wrong answer.

22:24.000 --> 22:28.000
Because through the identity mapping, we would then assume that if you install scikit-learn,

22:28.000 --> 22:33.000
then it would provide an import called scikit-learn, which is incorrect in this case.

22:33.000 --> 22:38.000
So that's why we never want to get to that point.

22:38.000 --> 22:43.000
But as you can see, we try very hard to find the correct mapping before it comes to that.

22:43.000 --> 22:46.000
So yeah, I'll quickly go over some more complex use cases

22:46.000 --> 22:49.000
that we've added support for.

22:49.000 --> 22:54.000
We have an option to exclude parts of your project from FawltyDeps' analysis.

22:54.000 --> 23:01.000
It uses gitignore-style patterns, and you can even pass gitignore-style files to that option.
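
For illustration, such an exclusion might be configured like this (the option name is an assumption; see the FawltyDeps docs for the exact spelling):

```toml
# pyproject.toml: skip generated code and notebooks during analysis,
# using gitignore-style patterns.
[tool.fawltydeps]
exclude = ["generated/", "*.ipynb"]
```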

23:01.000 --> 23:07.000
Also, another problem that often shows up is dependencies that

23:07.000 --> 23:13.000
are real dependencies of your project, but that you never import and never intend to import.

23:13.000 --> 23:19.000
The most common ones are tools like tox, black, flake8, and even FawltyDeps itself is part of this.

23:19.000 --> 23:23.000
You can add it as a development dependency, but you never intend to actually import it.

23:23.000 --> 23:34.000
We carry a default list that takes care of most of these, but you can also use the --ignore-unused option to specify them yourself.
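
A hypothetical configuration for this might look like the following (option name assumed from the CLI flag; consult the FawltyDeps docs):

```toml
# pyproject.toml: declare dev tools that are installed but never
# imported, so FawltyDeps stops reporting them as unused.
[tool.fawltydeps]
ignore_unused = ["black", "flake8", "tox"]
```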

23:35.000 --> 23:42.000
And then, also, sometimes there are imports in your code that are either conditional or alternatives.

23:42.000 --> 23:49.000
Think of trying to import something, and if that fails, either importing something else, or doing some other fallback.

23:49.000 --> 23:54.000
Those don't always correspond to something that you want to declare as a proper dependency.

23:54.000 --> 23:59.000
So in those cases, we have an --ignore-undeclared option that you can use.
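
A typical conditional import looks like this, using tomllib/tomli as a common real-world example; a static analyzer sees two imports here, but only one of them needs to be a declared dependency in any given setup:

```python
# Conditional import with a fallback: tomllib is in the standard
# library from Python 3.11, while older interpreters need the
# third-party "tomli" package instead.
try:
    import tomllib  # Python >= 3.11
except ImportError:
    import tomli as tomllib  # backport package for Python < 3.11

data = tomllib.loads("answer = 42")
assert data["answer"] == 42
```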

23:59.000 --> 24:03.000
And then there are some things that we're still working on that are hard to solve.

24:03.000 --> 24:08.000
Then there are dynamic imports. Python is a really dynamic language.

24:08.000 --> 24:14.000
You can do stuff like this, import something that depends on some arbitrarily complex expression.

24:14.000 --> 24:19.000
Those, we don't handle yet, and it's hard to figure out exactly how we can handle that.

24:19.000 --> 24:22.000
without actually executing your code.
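
Here is a small illustration of the kind of dynamic import that defeats static analysis; the module name only exists at runtime (the SERIALIZER environment variable is a made-up example):

```python
import importlib
import os

# The import target depends on a runtime expression; a static analyzer
# cannot know which module this names without executing the code.
backend = os.environ.get("SERIALIZER", "json")
module = importlib.import_module(backend)
```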

24:23.000 --> 24:30.000
And then also, Python allows dependencies to be categorized and made optional in various different ways.

24:30.000 --> 24:35.000
Depending on your context, some of these might be considered optional, and should be skipped by FawltyDeps.

24:35.000 --> 24:41.000
We have plans to add support for telling FawltyDeps to ignore some or all such optional groups.

24:41.000 --> 24:47.000
But for now, FawltyDeps will treat all of these declared dependencies as something that should be imported in your code.

24:47.000 --> 24:54.000
So to summarize, reproducibility matters, in science generally, but in Python specifically too.

24:54.000 --> 24:59.000
The biggest hurdle to reproducibility in Python is missing or incorrect dependency declarations.

24:59.000 --> 25:05.000
And FawltyDeps helps you fix this by highlighting undeclared and unused dependencies in your project.

25:06.000 --> 25:15.000
FawltyDeps works with many kinds of Python projects and Python versions, as I've shown.

25:15.000 --> 25:19.000
And it should work out of the box on most projects.

25:19.000 --> 25:23.000
And you can automate it, that's one of the most important points here.

25:23.000 --> 25:26.000
Finally, that's it for me, really.

25:26.000 --> 25:28.000
Thanks for listening, and go check out FawltyDeps.

25:28.000 --> 25:30.000
We're on GitHub and PyPI.

25:30.000 --> 25:34.000
We're on the Tweag blog, and there you can also read about all the other things that we're active in.

25:34.000 --> 25:35.000
Yeah.

25:35.000 --> 25:39.000
If there is any time remaining, I'm happy to take questions.

25:39.000 --> 25:41.000
Thank you very much for the nice talk.

25:41.000 --> 25:44.000
Unfortunately, we don't have time for questions.

25:44.000 --> 25:49.000
So please ask any questions that you might have in the Matrix chat, or just approach Johan directly.

25:49.000 --> 25:50.000
Thank you.

