WEBVTT

00:00.000 --> 00:05.000
Can you hear me?

00:05.000 --> 00:09.000
Everybody can hear me?

00:09.000 --> 00:12.000
So hi, I'm Anton Khirnov.

00:12.000 --> 00:16.000
I am, or maybe I was, an FFmpeg developer.

00:16.000 --> 00:19.000
I work with FFlabs, or maybe I used to.

00:19.000 --> 00:25.000
And this past year, I implemented multi-view decoding in libavcodec.

00:25.000 --> 00:29.000
And in the ffmpeg CLI transcoder, too.

00:29.000 --> 00:33.000
In this talk, I will tell you: what is multi-view?

00:33.000 --> 00:35.000
Why might you care?

00:35.000 --> 00:38.000
Why might you care even if you don't care about multi-view?

00:38.000 --> 00:45.000
And some technically interesting aspects, hopefully interesting, of this work.

00:45.000 --> 00:52.000
It was sponsored by Vimeo and Meta, so thanks for making this possible.

00:52.000 --> 01:00.000
It has wide implications beyond just multi-view, but multi-view in itself is also quite interesting.

01:00.000 --> 01:03.000
So, to start, what is multi-view?

01:03.000 --> 01:06.000
I think the picture really says it all.

01:06.000 --> 01:14.000
You have two or maybe more video streams that are kind of independent, but not really.

01:14.000 --> 01:20.000
So, they are independent in the sense that you treat them as two parallel video streams,

01:20.000 --> 01:22.000
but there's a lot of redundancy.

01:22.000 --> 01:26.000
So, if you squint really, really hard, you probably can't see it from there,

01:26.000 --> 01:30.000
but they are actually not the same.

01:30.000 --> 01:31.000
They are actually different.

01:31.000 --> 01:35.000
And the canonical example of multi-view is stereoscopic 3D.

01:36.000 --> 01:44.000
So, this is the left-eye view, and this is the right-eye view.

01:44.000 --> 01:50.000
So, yeah, that's the way people generally use this,

01:50.000 --> 01:53.000
but you can do other things with it.

01:53.000 --> 01:55.000
So, now you want to code this thing.

01:55.000 --> 01:58.000
The naive way: you code two video streams.

01:58.000 --> 02:02.000
This is very simple and obvious, but your bitrate is doubled.

02:02.000 --> 02:04.000
So, you don't want that.

02:04.000 --> 02:08.000
So, what you do want is to make use of the redundancy,

02:08.000 --> 02:12.000
and somehow predict one of the images from the other,

02:12.000 --> 02:16.000
and just encode the differences.

02:16.000 --> 02:20.000
You could use some kind of hacks like, well, maybe you put them side by side,

02:20.000 --> 02:24.000
and use intra-frame prediction,

02:24.000 --> 02:28.000
or you can interleave the frames,

02:28.000 --> 02:30.000
and put them into one stream,

02:30.000 --> 02:33.000
These things are possible.

02:33.000 --> 02:35.000
People sometimes do them,

02:35.000 --> 02:37.000
I think, but it's quite hacky,

02:37.000 --> 02:40.000
and, for instance, it forces you to

02:40.000 --> 02:42.000
always decode both of them,

02:42.000 --> 02:44.000
which you don't always want.

02:44.000 --> 02:46.000
Maybe you sometimes want just one of them.

02:46.000 --> 02:51.000
So, multi-view is a set of tools to deal with that.

02:51.000 --> 02:55.000
So, the thing I actually implemented is called MV-HEVC,

02:55.000 --> 02:58.000
which is multi-view for the HEVC codec,

02:58.000 --> 03:01.000
also known as H.265.

03:01.000 --> 03:07.000
As you all know, H.265 is a successor to AVC, or H.264,

03:07.000 --> 03:10.000
which we all know and love, the best codec ever,

03:10.000 --> 03:12.000
objectively true.

03:12.000 --> 03:16.000
And in H.264, there used to be a thing which was called MVC,

03:16.000 --> 03:19.000
which was multi-view coding for H.264.

03:19.000 --> 03:22.000
I think it was used in 3D Blu-rays,

03:22.000 --> 03:25.000
and we had a longstanding feature request to implement that

03:25.000 --> 03:27.000
in libavcodec, and that never happened.

03:27.000 --> 03:31.000
For a bunch of reasons, which I will elaborate on later.

03:33.000 --> 03:35.000
But so, yeah, it existed.

03:35.000 --> 03:37.000
It was used a little bit in the wild,

03:37.000 --> 03:40.000
but it's not really supported very much.

03:40.000 --> 03:43.000
So, in HEVC, there is a similar thing,

03:43.000 --> 03:48.000
which we now call MV, and which people call MV-HEVC.

03:48.000 --> 03:53.000
And it is a way of doing exactly what I showed you

03:53.000 --> 03:58.000
on the previous slide: packing multiple semi-independent streams

03:58.000 --> 04:01.000
into a single HEVC bitstream, such that the streams

04:01.000 --> 04:04.000
can predict from each other, but otherwise you can sort of

04:04.000 --> 04:08.000
treat them as independent, which is exactly what you want.

04:08.000 --> 04:12.000
It is based on multi-layer extensions.

04:12.000 --> 04:16.000
So, I think in H.264, all of this was separate: multi-view

04:16.000 --> 04:20.000
was a separate thing, scalability was a separate thing,

04:20.000 --> 04:22.000
other dancing animated ponies.

04:22.000 --> 04:24.000
each was a separate thing.

04:24.000 --> 04:26.000
In HEVC, I think they unified it:

04:26.000 --> 04:31.000
there is sort of a general multi-layer extensions

04:31.000 --> 04:34.000
specification, and then it's specialized

04:34.000 --> 04:37.000
into multi-view, scalable coding, alpha,

04:37.000 --> 04:40.000
and some kind of depth-texture

04:40.000 --> 04:42.000
thingy, I didn't look into it.

04:42.000 --> 04:45.000
But there's a bunch of purposes

04:45.000 --> 04:47.000
it can be used for, but generally,

04:47.000 --> 04:50.000
people care about multi-view and about alpha.

04:50.000 --> 04:55.000
I think some people care about scalable, who knows.

04:55.000 --> 04:59.000
If you remember what a NAL unit header

04:59.000 --> 05:03.000
looks like, which you should, there is a field in it,

05:03.000 --> 05:05.000
which is always zero.

05:05.000 --> 05:07.000
And if it's non-zero, you scream and run away,

05:07.000 --> 05:10.000
and the point of this work is that you don't scream,

05:10.000 --> 05:14.000
you don't run away, you face it,

05:15.000 --> 05:18.000
and do something useful with it.
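
NOTE
A minimal C sketch of the field in question, assuming it is nuh_layer_id in the two-byte HEVC NAL unit header (spec bit layout, not FFmpeg API):
#include <stdint.h>
typedef struct HEVCNALHeader {
    unsigned nal_unit_type; /* 6 bits */
    unsigned nuh_layer_id;  /* 6 bits; 0 = base layer, non-zero = multi-layer */
    unsigned temporal_id;   /* nuh_temporal_id_plus1 - 1 */
} HEVCNALHeader;
static int parse_nal_header(const uint8_t buf[2], HEVCNALHeader *h)
{
    if (buf[0] & 0x80) /* forbidden_zero_bit must be 0 */
        return -1;
    h->nal_unit_type = (buf[0] >> 1) & 0x3f;
    h->nuh_layer_id  = ((buf[0] & 1) << 5) | (buf[1] >> 3);
    h->temporal_id   = (buf[1] & 0x07) - 1;
    return 0;
}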

05:18.000 --> 05:21.000
The full specification is insanely complex,

05:21.000 --> 05:24.000
because all of these things, they can be used together.

05:24.000 --> 05:28.000
And you can sort of have a multi-view, scalable stream

05:28.000 --> 05:32.000
with alpha, which has up to 63 layers,

05:32.000 --> 05:34.000
one of which is the base one.

05:34.000 --> 05:38.000
That is the one with layer ID zero, which can be

05:38.000 --> 05:41.000
decoded on its own, by a decoder that

05:41.000 --> 05:43.000
doesn't know anything about any multi-layer,

05:43.000 --> 05:46.000
just ignores everything else and decodes the base layer.

05:46.000 --> 05:49.000
But the other layers sort of predict from it,

05:49.000 --> 05:52.000
and there can be a complex dependency graph.

05:52.000 --> 05:55.000
And as far as I know nothing supports that,

05:55.000 --> 05:57.000
even the reference implementation,

05:57.000 --> 06:01.000
like there's the base one, which only does base layer,

06:01.000 --> 06:04.000
and there's like three forks of it,

06:04.000 --> 06:06.000
one of which does multi-view, another one

06:06.000 --> 06:09.000
does scalable, and another one does 3D,

06:09.000 --> 06:12.000
and I'm not sure which one does alpha,

06:12.000 --> 06:16.000
maybe somebody knows, and they are separate,

06:16.000 --> 06:19.000
and none of them can do all of these at once.

06:19.000 --> 06:21.000
But in principle, per specification,

06:21.000 --> 06:23.000
you can do all of these together,

06:23.000 --> 06:25.000
and if you look at this specification,

06:25.000 --> 06:28.000
which I highly recommend, it's just completely insane.

06:28.000 --> 06:30.000
So of course we decided not to support

06:30.000 --> 06:32.000
any of that; we only support two layers,

06:32.000 --> 06:35.000
with the second one depending on the first.

06:35.000 --> 06:40.000
Although, with alpha being interesting for people,

06:40.000 --> 06:44.000
maybe there will be a use case where you have

06:44.000 --> 06:46.000
a multi-view stream with alpha.

06:46.000 --> 06:49.000
Somebody should create it, but I don't know.

06:49.000 --> 06:53.000
The demand for this is driven by VR, I think,

06:53.000 --> 06:57.000
so there probably has to be hardware that does this,

06:57.000 --> 06:59.000
so, probably not, but it would be fun.

06:59.000 --> 07:02.000
But so far, we can do two layers,

07:02.000 --> 07:07.000
and that's it.

07:07.000 --> 07:09.000
Why do you care?

07:09.000 --> 07:12.000
So, one possibility: you care about stereoscopic 3D.

07:12.000 --> 07:17.000
You have VR glasses, an Oculus Quest,

07:17.000 --> 07:19.000
an Apple Vision Pro, one of these things,

07:19.000 --> 07:24.000
and you really like to record and watch videos on them.

07:24.000 --> 07:27.000
So that's one possibility, that's the canonical use case.

07:27.000 --> 07:30.000
You might care about alpha. Now, I didn't implement that,

07:30.000 --> 07:33.000
but I opened the door to it, and somebody else

07:33.000 --> 07:36.000
already wrote the patches.

07:36.000 --> 07:40.000
So, that will be possible soon, probably.

07:40.000 --> 07:44.000
But more generally, with multi-view decoding,

07:44.000 --> 07:47.000
the reason why it was never implemented before,

07:47.000 --> 07:50.000
or one of the reasons, and why it was hard to implement now,

07:50.000 --> 07:53.000
is that it challenges a bunch of assumptions we make internally,

07:53.000 --> 07:57.000
and also in the APIs about how video is decoded.

07:57.000 --> 08:01.000
For instance, you have a single input packet,

08:01.000 --> 08:04.000
like the coded HEVC data that you send to a decoder,

08:04.000 --> 08:07.000
and that contains all the views.

08:07.000 --> 08:09.000
So, it decodes into multiple frames.

08:09.000 --> 08:13.000
Two in our case, but we don't make that assumption in the API.

08:13.000 --> 08:15.000
So, in principle, N frames.
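
NOTE
The M-to-N packet/frame mapping is what the standard send/receive decoding API already allows; a sketch of the receive loop (standard libavcodec calls), where one multi-view packet can yield two frames:
#include <libavcodec/avcodec.h>
static int decode_packet(AVCodecContext *dec, const AVPacket *pkt, AVFrame *frame)
{
    int ret = avcodec_send_packet(dec, pkt); /* pkt == NULL enters draining mode */
    if (ret < 0)
        return ret;
    while ((ret = avcodec_receive_frame(dec, frame)) >= 0) {
        /* with MV-HEVC, frames for both views arrive through this same loop */
        av_frame_unref(frame);
    }
    return (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) ? 0 : ret;
}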

08:15.000 --> 08:20.000
So, that was not really supported in a bunch of ways before,

08:20.000 --> 08:23.000
now it is, and that has implications,

08:23.000 --> 08:26.000
so maybe it allows some things which were not possible before.

08:26.000 --> 08:29.000
And the other thing is that now you have a single decoder,

08:29.000 --> 08:33.000
which produces frames for several independent streams,

08:33.000 --> 08:36.000
which, again, has implications.

08:36.000 --> 08:39.000
It might allow some use cases which were not possible before.

08:39.000 --> 08:42.000
So, you might care even if you don't care about 3D.

08:42.000 --> 08:47.000
So, what was hard about implementing it?

08:47.000 --> 08:51.000
So, first, inside the HEVC decoder itself:

08:51.000 --> 08:55.000
libavcodec has generic code,

08:55.000 --> 08:58.000
which is codec-independent, and then below that,

08:58.000 --> 09:01.000
there is the decoder-specific stuff.

09:01.000 --> 09:05.000
So, this part is about the decoder-specific stuff.

09:05.000 --> 09:08.000
The main thing that you encounter, or the first one,

09:08.000 --> 09:12.000
is that a bunch of state that used to be per context

09:12.000 --> 09:14.000
is now per layer.

09:14.000 --> 09:17.000
So, you have your decoder context, which is a big struct,

09:17.000 --> 09:19.000
with a bunch of state in it,

09:19.000 --> 09:22.000
and now a lot of that state is per layer.

09:22.000 --> 09:25.000
So, you need to have multiple of these contexts,

09:25.000 --> 09:28.000
one for each layer we want to decode.

09:28.000 --> 09:31.000
A common approach, I don't know about your project,

09:31.000 --> 09:35.000
but a common approach that people very often do,

09:35.000 --> 09:40.000
is that you add a bunch of children,

09:40.000 --> 09:44.000
or a bunch of copies of the same struct inside it,

09:44.000 --> 09:49.000
which seems like it saves you work.

09:49.000 --> 09:52.000
So, it's because you don't have to really do anything,

09:52.000 --> 09:54.000
you just do that very simple thing,

09:54.000 --> 09:59.000
and from now on, some things are per context,

09:59.000 --> 10:01.000
and some are per layer.

10:01.000 --> 10:03.000
This is a horrible, horrible, evil,

10:03.000 --> 10:06.000
obfuscation method, which you should never ever do,

10:06.000 --> 10:08.000
and if you do that, please stop.

10:08.000 --> 10:11.000
Because immediately, when you do that,

10:11.000 --> 10:14.000
you lose the information about which fields of the struct

10:14.000 --> 10:16.000
are meaningful in the parent,

10:16.000 --> 10:18.000
and which are meaningful in the child.

10:18.000 --> 10:21.000
Now, everybody who reads the code later has to reverse

10:21.000 --> 10:26.000
engineer the struct, check all the places where

10:26.000 --> 10:28.000
some field is used,

10:28.000 --> 10:32.000
and only then you discover which is which.

10:32.000 --> 10:35.000
And you might think, oh, but if I document it,

10:35.000 --> 10:39.000
surely this will fix it, haha.

10:39.000 --> 10:41.000
Of course, nobody ever documents things,

10:41.000 --> 10:44.000
and if you do it, it will get out of date,

10:44.000 --> 10:46.000
eventually, because somebody changes the code

10:46.000 --> 10:48.000
and doesn't update their documentation.

10:48.000 --> 10:52.000
So documentation helps a little, not that much.

10:52.000 --> 10:55.000
But also, another problem is:

10:55.000 --> 10:57.000
you have a bunch of dead fields.

10:57.000 --> 10:59.000
In the parent context, and in the children,

10:59.000 --> 11:01.000
you have a bunch of fields that are just there,

11:01.000 --> 11:04.000
waste memory, waste cache, and don't do anything.

11:04.000 --> 11:07.000
And in the end, the amount of work it saves you

11:07.000 --> 11:09.000
is very little.

11:09.000 --> 11:12.000
It looks like a lot, but not really,

11:12.000 --> 11:15.000
and it's work that's very straightforward.

11:15.000 --> 11:16.000
You don't have to think about it.

11:16.000 --> 11:19.000
In the future, probably, ChatGPT will do it.

11:19.000 --> 11:22.000
So, please never ever do this.

11:22.000 --> 11:25.000
The thing you actually should do,

11:25.000 --> 11:27.000
you check all the fields.

11:27.000 --> 11:30.000
You find out which ones are actually per layer,

11:30.000 --> 11:33.000
which you have to do anyway in the end.

11:33.000 --> 11:36.000
In this approach, you just do it more systematically.

11:36.000 --> 11:38.000
And then you add a per-layer context.

11:38.000 --> 11:42.000
You move the things into that per-layer context,

11:42.000 --> 11:45.000
one by one, and hopefully you're done.
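
NOTE
A sketch of the recommended pattern, with hypothetical struct names (not the actual hevcdec types): make the per-context/per-layer split explicit instead of embedding copies of the whole context:
typedef struct LayerContext {
    struct DPB *dpb;                   /* per-layer decoded picture buffer */
    int         pic_width, pic_height;
} LayerContext;
typedef struct DecoderContext {
    const struct ParamSets *ps;        /* shared by all layers */
    int                     nb_layers;
    LayerContext            layers[2]; /* two layers supported for now */
} DecoderContext;
Every field's ownership is now visible from the declaration alone, with no dead fields duplicated in parent or children.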

11:45.000 --> 11:49.000
For HEVC, this was the majority of the work by patch volume,

11:49.000 --> 11:53.000
but it was mostly really straightforward.

11:53.000 --> 11:56.000
If your code is really crappy and entangled and spaghetti-fied,

11:56.000 --> 11:59.000
this may not be trivial, because moving one thing

11:59.000 --> 12:01.000
can depend on some other thing,

12:01.000 --> 12:03.000
which happened here to some extent,

12:03.000 --> 12:06.000
but not as much as it could have.

12:06.000 --> 12:11.000
For instance, the H.264 decoder has more history.

12:11.000 --> 12:12.000
Let's say.

12:12.000 --> 12:17.000
And doing the same thing there will be more complicated.

12:17.000 --> 12:21.000
If you feel like tackling that problem, be prepared for some pain.

12:21.000 --> 12:25.000
So, that was the biggest thing I had to do.

12:25.000 --> 12:29.000
Another thing was frame output logic.

12:29.000 --> 12:32.000
As you all know, HEVC,

12:32.000 --> 12:35.000
and also AVC, have frame reordering.

12:35.000 --> 12:38.000
So, when you decode a frame, you don't output it immediately.

12:38.000 --> 12:40.000
You put it in a decoded picture buffer,

12:40.000 --> 12:44.000
and then, maybe, depending on some conditions,

12:44.000 --> 12:46.000
you look at the decoded picture buffer,

12:46.000 --> 12:49.000
select some specific frame from it,

12:49.000 --> 12:54.000
and then you maybe output it.

12:54.000 --> 12:58.000
One factor that complicates it is that there are things

12:58.000 --> 13:00.000
which are called sequences.

13:00.000 --> 13:06.000
A sequence is basically a segment of coded video,

13:06.000 --> 13:08.000
which has the same parameters.

13:08.000 --> 13:14.000
Like a single video that was encoded in one go,

13:14.000 --> 13:15.000
for instance.

13:15.000 --> 13:18.000
And this can change over time.

13:18.000 --> 13:20.000
So, you can concatenate a bunch of videos,

13:20.000 --> 13:24.000
and you get two sequences, or multiple sequences.

13:24.000 --> 13:28.000
And whenever you switch the sequence,

13:28.000 --> 13:33.000
you have a bunch of frames buffered for output later.

13:33.000 --> 13:35.000
And so, you could be decoding one frame,

13:35.000 --> 13:38.000
a frame from one sequence,

13:38.000 --> 13:40.000
and still be outputting frames from a previous sequence,

13:40.000 --> 13:42.000
or in more pathological cases,

13:42.000 --> 13:45.000
you could be two sequences back, or 16 sequences back.

13:45.000 --> 13:48.000
Probably not 16, I think 15 is the limit.

13:48.000 --> 13:52.000
If you really want pain.

13:52.000 --> 13:57.000
So, and we had a lot of complicated logic to handle that.

13:57.000 --> 14:01.000
And now,

14:01.000 --> 14:04.000
not only do we have to handle this,

14:04.000 --> 14:06.000
but we also have two views.

14:06.000 --> 14:09.000
And when you switch sequences, you can also switch the number of views.

14:09.000 --> 14:12.000
You can switch from single view to multi view,

14:12.000 --> 14:15.000
and back, or you can switch the positions of the views,

14:15.000 --> 14:18.000
or the other properties, blah, blah, blah.

14:18.000 --> 14:21.000
So, this had to be added on top of that logic,

14:21.000 --> 14:24.000
which would be very complicated.

14:24.000 --> 14:29.000
And because I'm not smart enough to think about all that,

14:29.000 --> 14:32.000
I noticed that we actually don't have to do any of it.

14:32.000 --> 14:35.000
But all that complicated logic is only there,

14:35.000 --> 14:40.000
because we have the constraint that a single input packet

14:40.000 --> 14:42.000
has to output at most one frame.

14:42.000 --> 14:45.000
But we don't have that constraint anymore.

14:45.000 --> 14:48.000
So, I changed the logic to output multiple frames at once,

14:48.000 --> 14:51.000
which we can do, and all of that horror goes away.

14:51.000 --> 14:55.000
So, this work actually simplified a lot of things,

14:55.000 --> 14:58.000
even though it now still has to

14:58.000 --> 15:01.000
interleave frames from multiple views.

15:01.000 --> 15:04.000
Now what it does is when it encounters a sequence switch,

15:04.000 --> 15:07.000
it just flushes the decoded picture buffer completely,

15:07.000 --> 15:10.000
which we can do, which is great.
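
NOTE
A conceptual sketch (toy types, not the actual hevcdec code) of why multiple outputs per packet simplify this: on a sequence switch, just drain everything buffered instead of interleaving output across sequences:
typedef struct DPB { void *frames[16]; int nb; } DPB;
static void flush_on_sequence_switch(DPB *dpb, void (*output)(void *frame))
{
    for (int i = 0; i < dpb->nb; i++)
        output(dpb->frames[i]); /* emit all buffered frames in output order */
    dpb->nb = 0;                /* nothing carries over from the old sequence */
}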

15:10.000 --> 15:14.000
I also noticed frame threading is inefficient for multi-view.

15:14.000 --> 15:17.000
If you care, you might want to fix that.

15:17.000 --> 15:19.000
That would be welcome.

15:19.000 --> 15:24.000
Now, moving a layer up, in the decoder-generic code,

15:24.000 --> 15:29.000
there was also a bunch of issues.

15:29.000 --> 15:36.000
As I said, we have a single input packet,

15:36.000 --> 15:39.000
and we need it to produce multiple frames, which need to be output,

15:39.000 --> 15:42.000
which is fine as far as the public API is concerned,

15:42.000 --> 15:48.000
because the new API, which is 10 years old at this point,

15:48.000 --> 15:52.000
was added by the infamous wm4

15:53.000 --> 16:00.000
exactly for this: to handle arbitrary M-to-N packet-to-frame mapping.

16:00.000 --> 16:05.000
So, on the public API level, this is fine, but internally,

16:05.000 --> 16:09.000
frame threading did not support that.

16:09.000 --> 16:14.000
Frame threading was working on the old API model.

16:14.000 --> 16:19.000
So, it could only do one packet to at most one frame.

16:19.000 --> 16:21.000
So, I had to change that.

16:21.000 --> 16:27.000
I had to port frame threading to the new API, new then, old by now.

16:27.000 --> 16:33.000
Actually, I started doing that back in 2017 for my work on MVC,

16:33.000 --> 16:37.000
which I never finished.

16:37.000 --> 16:42.000
Most of the work was, in theory, done,

16:42.000 --> 16:45.000
but actually polishing it was quite complicated,

16:45.000 --> 16:49.000
because there was some unnamed decoder,

16:49.000 --> 16:53.000
which abused frame threading quite a lot.

16:53.000 --> 16:55.000
It did a bunch of things wrong.

16:55.000 --> 16:57.000
It did that thing, which I told you not to do.

16:57.000 --> 16:59.000
It did exactly this.

16:59.000 --> 17:05.000
So, I had to reverse engineer and undo that.

17:05.000 --> 17:09.000
It did that for slice threading, and just to make it readable and possible

17:09.000 --> 17:11.000
for myself to understand, I had to fix that,

17:11.000 --> 17:15.000
which incidentally made it 4% faster in single-threaded decoding,

17:15.000 --> 17:18.000
which I didn't intend, but yeah.

17:18.000 --> 17:24.000
But also, it had some hacks like generic codec independent frame threading code

17:24.000 --> 17:30.000
would have code like, if the codec is this, do something insane.

17:30.000 --> 17:35.000
And by insane I mean: reduce the number of threads by one.

17:35.000 --> 17:38.000
So, if you had two threads, it was running single-threaded.

17:38.000 --> 17:44.000
And there were also some races, found by thread sanitizer, and so on.

17:44.000 --> 17:49.000
So, in order to implement multi-view for HEVC,

17:49.000 --> 17:53.000
I had to fix this decoder, unfortunately,

17:53.000 --> 17:56.000
or maybe fortunately, if you care about it, because now it's faster,

17:56.000 --> 17:58.000
now it doesn't have any races.

17:58.000 --> 18:02.000
It's faster single-threaded, it's faster in frame-threaded mode.

18:02.000 --> 18:05.000
Now, frame threading is actually always faster than slice threading,

18:05.000 --> 18:12.000
which makes sense, it should always be faster, otherwise there's no reason to use it.

18:12.000 --> 18:14.000
So, yep.

18:14.000 --> 18:19.000
One thing that helped me a lot here is this new API

18:19.000 --> 18:22.000
we have, which is called RefStruct.

18:22.000 --> 18:25.000
It's about a year old now, I think.

18:25.000 --> 18:27.000
It was written by Andreas, thank you, Andreas.

18:27.000 --> 18:31.000
It's great, and it recently became public, so you can use it.

18:31.000 --> 18:36.000
It's an API for reference-counted structs with very little overhead,

18:36.000 --> 18:42.000
and very little boilerplate on top of it.

18:42.000 --> 18:44.000
So, it's very convenient.

18:44.000 --> 18:47.000
So, I heavily recommend it, it's great.
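
NOTE
A sketch of using RefStruct, assuming the public libavutil/refstruct.h keeps the av_refstruct_* naming of the formerly internal API (worth double-checking):
#include <libavutil/refstruct.h>
typedef struct SharedParams { int width, height; } SharedParams;
static void refstruct_example(void)
{
    SharedParams *p = av_refstruct_allocz(sizeof(*p)); /* refcount 1, zeroed */
    SharedParams *q = av_refstruct_ref(p);             /* refcount 2 */
    av_refstruct_unref(&p);                            /* p set to NULL */
    av_refstruct_unref(&q);                            /* last ref: freed */
}
The struct itself is the reference-counted object; there is no separate AVBufferRef-style header to allocate and manage.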

18:47.000 --> 18:52.000
So, I had to fix that, and then finish this patch,

18:52.000 --> 18:58.000
and now frame threading is finally able to handle multiple output frames per packet.

18:58.000 --> 19:01.000
All that for just a small thing.

19:01.000 --> 19:06.000
Another challenge or a bunch of challenges is the public API part.

19:06.000 --> 19:11.000
As I said, the output part is not problematic,

19:11.000 --> 19:15.000
because we do support multiple frames, multiple output frames per packet.

19:15.000 --> 19:17.000
We did that for a long time.

19:17.000 --> 19:20.000
I think many callers actually don't get that right.

19:20.000 --> 19:24.000
I saw an example recently that assumes that one packet produces at most one frame.

19:24.000 --> 19:29.000
So, all these callers are broken, but that's their problem, unfortunately.

19:29.000 --> 19:31.000
But it's not ours.

19:31.000 --> 19:36.000
The problem we do have, actually, is that all the multi-layer properties are per sequence.

19:36.000 --> 19:42.000
Hopefully this never happens, but it had to be implemented properly, of course.

19:42.000 --> 19:48.000
So, in principle, you have to consider the case where a multi-view video is concatenated with a single-view one,

19:48.000 --> 19:53.000
or you can have multi-view videos with different properties.

19:53.000 --> 20:02.000
So, you need to tell the caller what view IDs there are, what view positions there are, which view is right or left,

20:02.000 --> 20:04.000
and this can change dynamically.

20:04.000 --> 20:09.000
So, for that, I use the get_format callback, which is named that way for historical reasons.

20:09.000 --> 20:14.000
It actually is used currently to configure hardware acceleration,

20:14.000 --> 20:22.000
but now it's also used to negotiate multi-view properties with the caller.

20:22.000 --> 20:28.000
So, when that callback is called, the caller gets the information about the stream,

20:28.000 --> 20:34.000
and can tell us that it wants either one or both views to be decoded,

20:34.000 --> 20:42.000
or in principle as many as there are, which can be up to 63, or in the API there's no limit.

20:42.000 --> 20:44.000
INT_MAX.
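
NOTE
A sketch of what the negotiation can look like from the caller side. The get_format callback is standard libavcodec API; the "view_ids" decoder option is quoted from memory of the MV-HEVC patches and should be verified against the documentation:
#include <libavcodec/avcodec.h>
#include <libavutil/opt.h>
static enum AVPixelFormat my_get_format(AVCodecContext *avctx,
                                        const enum AVPixelFormat *fmts)
{
    /* ask for both views; assumed option name "view_ids" on the decoder */
    av_opt_set(avctx, "view_ids", "0,1", AV_OPT_SEARCH_CHILDREN);
    return fmts[0]; /* accept the first pixel format the decoder offers */
}
The callback is installed before avcodec_open2(): avctx->get_format = my_get_format;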

20:44.000 --> 20:49.000
Also, I added array-type options, because we want to export multiple view IDs,

20:49.000 --> 20:55.000
and multiple view positions, and previously we didn't have an array-type option.

20:55.000 --> 21:01.000
So, what was done was we communicated by using comma-separated strings,

21:01.000 --> 21:05.000
and parsing strings in C is great fun, everybody loves it,

21:05.000 --> 21:08.000
but because I hate fun, I took that away from you.

21:08.000 --> 21:11.000
So, yeah.
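
NOTE
A sketch of declaring an array-type AVOption, the feature that replaced the comma-separated strings; the AV_OPT_TYPE_FLAG_ARRAY / AVOptionArrayDef details are quoted from memory of the libavutil option API and should be checked against opt.h:
#include <stddef.h>
#include <limits.h>
#include <libavutil/opt.h>
typedef struct Ctx {
    const AVClass *class;
    int           *view_ids;    /* array options store the pointer... */
    unsigned       nb_view_ids; /* ...and the element count right after it */
} Ctx;
static const AVOptionArrayDef view_ids_def = { .sep = ',' };
static const AVOption options[] = {
    { "view_ids", "list of view IDs to decode", offsetof(Ctx, view_ids),
      AV_OPT_TYPE_INT | AV_OPT_TYPE_FLAG_ARRAY,
      { .arr = &view_ids_def }, INT_MIN, INT_MAX, 0 },
    { NULL },
};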

21:11.000 --> 21:17.000
And this will also be used heavily in other places, like in libavfilter,

21:17.000 --> 21:20.000
and we do it everywhere all the time.

21:20.000 --> 21:25.000
And the frames that are produced by the decoder have side data,

21:25.000 --> 21:30.000
which tells you which view it is.

21:30.000 --> 21:34.000
So, that's quite simple, and then the caller can deal with it as it likes.
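
NOTE
A sketch of reading that side data; AV_FRAME_DATA_VIEW_ID is the side-data type added for this work, and my reading is that the payload is a single int:
#include <libavutil/frame.h>
static int frame_view_id(const AVFrame *frame)
{
    const AVFrameSideData *sd =
        av_frame_get_side_data(frame, AV_FRAME_DATA_VIEW_ID);
    return sd ? *(const int *)sd->data : -1; /* -1: no view information */
}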

21:34.000 --> 21:38.000
I will skip that, because I'm actually going very slowly.

21:38.000 --> 21:43.000
That was a repeat of my last year's talk, which you can look up.

21:44.000 --> 21:49.000
There is native support for multi-view in the CLI, the transcoder tool.

21:49.000 --> 21:56.000
The original intention was to just, well, have the decoder output all the frames interleaved,

21:56.000 --> 22:02.000
and then let the user deal with it, which would be completely painful, because the users don't know anything.

22:02.000 --> 22:08.000
And so, because I do, I extended stream specifiers, which everybody loves,

22:08.000 --> 22:13.000
into view specifiers. So before, you could say, well, I want to decode the fourth video stream.

22:13.000 --> 22:18.000
Now, you can say, I want the left view of the fourth video stream, and you can pipe that to an output stream,

22:18.000 --> 22:23.000
or to a complex filter graph, typically you want to put the frames side by side,

22:23.000 --> 22:28.000
or mux them into different streams, or different files, or whatever.

22:28.000 --> 22:35.000
So, that's up to you. That can be done with the new view specifiers.
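
NOTE
An illustrative command; the exact view-specifier syntax is quoted from memory and should be checked against the ffmpeg documentation:
ffmpeg -i 3d.mov -map 0:v:0:view:0 left.mp4 -map 0:v:0:view:1 right.mp4
This would split the two views of the first video stream into separate output files; mapping a view into a complex filter graph works the same way.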

22:35.000 --> 22:40.000
One feature people might be interested in:

22:40.000 --> 22:46.000
So, now, a single decoder in the CLI can produce multiple streams.

22:46.000 --> 22:49.000
It is technically possible.

22:49.000 --> 22:54.000
This could be generalized, for instance, to support closed captions, and other features like that.

22:54.000 --> 22:57.000
You could have a video stream, which has embedded closed captions.

22:57.000 --> 23:08.000
And before, we supported extracting them by using insane hacks: you use the lavfi device pseudo-demuxer,

23:08.000 --> 23:17.000
which uses a movie video source, which opens a file, and somehow decodes the video internally inside the filter graph,

23:17.000 --> 23:25.000
and extracts the video stream and the closed captions, and then gives it back to you as a demuxer,

23:25.000 --> 23:28.000
which sort of works, but just, no.

23:28.000 --> 23:37.000
And now the CLI could do this natively; it's not implemented, but it could in principle be done straightforwardly.

23:37.000 --> 23:40.000
And other things like that.
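
NOTE
A sketch of the side-data route described above; AV_FRAME_DATA_A53_CC is the existing side-data type that carries embedded captions on decoded video frames:
#include <libavutil/frame.h>
static void inspect_captions(const AVFrame *frame)
{
    const AVFrameSideData *cc =
        av_frame_get_side_data(frame, AV_FRAME_DATA_A53_CC);
    if (cc) {
        /* cc->data / cc->size hold the raw CEA-708 bytes; a CLI-level
           decoder object could expose them as a separate subtitle stream */
    }
}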

23:40.000 --> 23:43.000
Okay, I think I'm done. Thank you.

23:43.000 --> 23:50.000
Yes, Kevin?

23:50.000 --> 23:55.000
This notion of splitting out closed captions, wouldn't that mean that a decoder...

23:55.000 --> 24:00.000
You know, right now we have video decoders, audio decoders, and subtitle decoders.

24:00.000 --> 24:06.000
You don't have a decoder that can produce both video and some kind of data.

24:06.000 --> 24:14.000
Not on the libavcodec level; on the libavcodec level, I imagine it would give you a frame with side data.

24:14.000 --> 24:16.000
Oh, sorry. I have to repeat the question.

24:16.000 --> 24:19.000
Could you repeat the question, sir?

24:19.000 --> 24:27.000
The question was, you know, that in libavcodec today, decoders are typically categorized as video or audio.

24:27.000 --> 24:28.000
Right, right, right.

24:28.000 --> 24:34.000
So the question is: in libavcodec, a decoder is a video or audio or subtitle decoder.

24:34.000 --> 24:37.000
How does that work out with closed captions?

24:37.000 --> 24:41.000
And the answer is, we don't do that in libavcodec.

24:41.000 --> 24:44.000
We do that in the CLI. That's my point.

24:44.000 --> 24:47.000
Right, so a decoder remains a video decoder.

24:47.000 --> 24:50.000
It gives you a frame, and the frame has side data with closed captions.

24:50.000 --> 24:53.000
And then the CLI pretends that it's actually two streams.

24:53.000 --> 24:56.000
It's a video stream and a subtitle stream.

24:56.000 --> 25:00.000
So the decoder is actually producing multiple output streams?

25:00.000 --> 25:05.000
The libavcodec decoder isn't. The CLI decoder object is.

25:05.000 --> 25:08.000
Those are different things.

25:08.000 --> 25:10.000
Other questions?

25:10.000 --> 25:11.000
Yep.

25:11.000 --> 25:19.000
I'm wondering if you plan to expand this multi-view support to DVD angles.

25:19.000 --> 25:22.000
What about that?

25:22.000 --> 25:29.000
So the question is, am I planning to extend this to DVD angles?

25:29.000 --> 25:33.000
I don't think it's...

25:33.000 --> 25:36.000
I am not sure how DVD angles actually work,

25:36.000 --> 25:40.000
so I am not sure that this would be applicable to them.

25:40.000 --> 25:45.000
We do have a lot of activity on DVD demuxing right now.

25:45.000 --> 25:50.000
Maybe you should ask the author of that code.

25:51.000 --> 25:53.000
Sorry.

25:53.000 --> 25:55.000
Any other questions?

25:55.000 --> 25:58.000
Yeah, Victoria.

25:58.000 --> 26:03.000
Can we do interlaced in MV-HEVC?

26:03.000 --> 26:06.000
Sadly, the windows do not open.

26:07.000 --> 26:10.000
Yep.

26:18.000 --> 26:23.000
So the question was, does hardware acceleration work with MV-HEVC?

26:23.000 --> 26:24.000
I don't know.

26:24.000 --> 26:26.000
I didn't try.

26:27.000 --> 26:30.000
Yeah, there's an encoder at least, on some hardware,

26:30.000 --> 26:35.000
so it can already encode MV-HEVC into the content.

26:35.000 --> 26:40.000
For decoding, in principle, I think the low-level hardware

26:40.000 --> 26:42.000
doesn't really care about multi-view.

26:42.000 --> 26:44.000
It only gets the reference frames.

26:44.000 --> 26:49.000
So the difference in the actual pixel-level

26:49.000 --> 26:54.000
or macroblock-level decoding is that you have more frames in your reference

26:54.000 --> 26:58.000
picture lists, or sets, or whatever it is.

26:58.000 --> 27:01.000
So if the high-level code just adds one more frame in that,

27:01.000 --> 27:03.000
the hardware doesn't care in theory.

27:03.000 --> 27:04.000
I didn't try.

27:04.000 --> 27:09.000
But I would hope that it can work.

27:12.000 --> 27:13.000
Yep.

27:13.000 --> 27:17.000
I was wondering what kind of requirements

27:17.000 --> 27:22.000
there are for multi-view?

27:22.000 --> 27:29.000
Do the two inputs need to be the same size?

27:29.000 --> 27:33.000
The same aspect ratio and layout, you know,

27:33.000 --> 27:36.000
are they always left and right, or can they be up and down,

27:36.000 --> 27:40.000
or in the corner?

27:40.000 --> 27:46.000
So the question is whether there are restrictions on dimensions

27:46.000 --> 27:52.000
and aspect ratios and positions. The answer is: kind of.

27:52.000 --> 27:57.000
So positions are just metadata, right?

27:57.000 --> 27:59.000
So it's just a field.

27:59.000 --> 28:01.000
It's actually optional.

28:01.000 --> 28:06.000
You can have a stream that doesn't tell you what the positions are.

28:06.000 --> 28:13.000
Actually, the spec doesn't mandate that it has to be somehow oriented.

28:13.000 --> 28:16.000
That it has to relate to eyes.

28:16.000 --> 28:17.000
It could be anything.

28:17.000 --> 28:22.000
The interpretation is really in some metadata, which may or may not be present.

28:22.000 --> 28:26.000
I think the allowed positions are left-right and top-bottom.

28:26.000 --> 28:30.000
I don't really remember if it could have more complex orientations.

28:30.000 --> 28:34.000
For formats and resolutions and aspect ratios, in principle,

28:34.000 --> 28:35.000
they don't have to match.

28:35.000 --> 28:42.000
And actually, the spec allows you to have different sizes for the different views.

28:42.000 --> 28:52.000
But we don't support that, because that is insane.

28:52.000 --> 28:57.000
More questions?

28:57.000 --> 28:59.000
Thank you, then.

