WEBVTT

00:00.000 --> 00:12.120
I'm happy to be the first one here, and thank you for being awake this early on a Sunday.

00:12.120 --> 00:19.480
My name is Romain Beauxis, and I'm going to present today a couple of ideas I put together

00:19.480 --> 00:28.640
after working on managing dynamic content for a streaming automation software that I developed

00:28.640 --> 00:32.760
with Samuel here years ago.

00:32.760 --> 00:38.600
A little bit about myself: I've been a software engineer for over 10 years,

00:38.600 --> 00:46.760
maybe 15 soon, and one of the things I love when I do software is to think about navigating

00:46.760 --> 00:53.080
between the higher level, the clean abstraction we can create as software engineers,

00:53.080 --> 01:00.400
and what goes on under the hood, where we actually deal with a lot of complex problems

01:00.400 --> 01:06.640
that we know how to deal with inside the box, and present something outside of the box that

01:06.640 --> 01:11.960
is neat and very usable for people and solves those problems internally.

01:11.960 --> 01:14.400
I'll give some examples of that.

01:14.400 --> 01:19.680
What I'm trying to do today, after having dealt with the hard problems, is to look back and

01:19.680 --> 01:24.320
say: what can we do to simplify this that would be useful for other people?

01:24.320 --> 01:29.160
So a little bit of context here: Liquidsoap is the software that we have. It's

01:29.160 --> 01:36.480
been created around 2004 by Samuel and David Baelde, who were students at the ENS, the theoretical

01:36.480 --> 01:43.560
computer science school. It's a language for media streaming; it's functional, statically typed

01:43.640 --> 01:49.320
with inferred types, that was all the hype, but we're not here to talk about that today.

01:49.320 --> 01:54.080
And since then, because that's been a while, it has expanded to a full-featured language with

01:54.080 --> 02:00.640
a tight integration with FFmpeg, and that's the integration that triggered what

02:00.640 --> 02:06.840
we are going to talk about today, especially because originally it was audio and now it's

02:06.840 --> 02:11.800
doing audio and video mixing and streaming, which brought another level of complexity in

02:11.800 --> 02:16.880
terms of the kind of content that we handle, and the kind of questions we have to answer

02:16.880 --> 02:19.120
when we create stuff.

02:19.120 --> 02:24.680
So I'm going to look at a script in Liquidsoap; that's the way we

02:24.680 --> 02:30.560
used to do it. Take a minute to read it, but essentially you create a playlist of

02:30.560 --> 02:36.760
audio files, you're putting a jingle, a file here that's used when the playlist

02:36.760 --> 02:41.040
is empty, we're extracting the audio, the metadata, and the track marks, we're taking

02:41.040 --> 02:46.760
a video playlist, a playlist of videos, we're extracting the video track, we're muxing here

02:46.760 --> 02:51.560
audio, video, and metadata from each source, we're adding a little text on it, and we're

02:51.560 --> 02:54.520
exporting to HLS, re-encoding.

02:54.520 --> 03:00.720
That's the way we used to do it, where everything is decoded to raw content, being PCM

03:00.720 --> 03:06.960
audio, YUV420 video, something that you can actually manipulate and add data to.

03:06.960 --> 03:11.080
But one of the layers we've added since then, since the first time we came to present here

03:11.080 --> 03:18.160
in 2020, is that now we're able to do something very interesting, which is, instead of re-encoding,

03:18.160 --> 03:26.840
we have a copy encoder that takes the raw binary encoded data straight out of the files,

03:26.840 --> 03:33.840
and basically informs every single decoder here that it should not decode and just take the

03:33.840 --> 03:42.360
packets of encoded content and repackage them for HLS. It looks like a minimal change in terms of scripting,

03:42.360 --> 03:48.000
but in terms of CPU and machine consumption, it's day and night, because you can have a 4K video

03:48.000 --> 03:54.800
here that's being processed, and re-encoding a 4K video is vastly different from just repackaging

03:54.880 --> 03:56.640
the encoded video.

03:56.640 --> 04:03.280
And so this is where the complexity arises: how do you transfer what we do in streaming to the

04:03.280 --> 04:08.280
world of directly encoded data.

04:08.280 --> 04:10.280
Whoops.

04:10.280 --> 04:15.240
Yeah, so Liquidsoap can schedule shows and playlists, can support live sources, DJs that connect at any

04:15.240 --> 04:22.440
given time, it can have multiple outputs, and basically what it means is that it has to handle

04:22.480 --> 04:27.360
dynamic encoded content, and at any time you might switch between one stream and another

04:27.360 --> 04:30.960
one, you might change your audio, you might change your video, and you still want to be able

04:30.960 --> 04:36.280
to compose all this content and create a stream that's readable and compliant with

04:36.280 --> 04:39.160
the format that you're exporting to.

04:39.160 --> 04:43.680
And that's what we're trying to solve here and I'm going to explain how we did it.

04:43.680 --> 04:48.040
It's related to other needs; I think it's not just us, and that's why I want to step

04:48.040 --> 04:51.280
back and say we've done it, but how can we help other people do it?

04:51.280 --> 04:56.280
Like in FFmpeg, if you read an RTP stream that's coming from an external encoder and

04:56.280 --> 05:01.080
the encoder restarts, currently you're not going to get something very successful, because

05:01.080 --> 05:06.520
FFmpeg can do some limited content concatenation, but you cannot really do it in a dynamic

05:06.520 --> 05:10.280
way where you suddenly change the source in the middle of the encoding.

05:10.280 --> 05:16.480
And so I'm hoping that by extracting some of the knowledge from here, we can create maybe

05:16.480 --> 05:21.440
a library of functions that can be reusable outside of our project itself.

05:21.440 --> 05:28.000
So I'm going to do a quick demo, and let's hope that it works, because that's demos for you.

05:28.000 --> 05:31.040
So I have the script, let me put that here.

05:31.040 --> 05:37.880
I have the script from, and that's not going to work, I don't know.

05:37.880 --> 05:55.480
I have the scripts from the, oh dear, I have the script from the last time, no demo,

05:55.480 --> 05:59.840
I'll do it another time, sorry, it's just the two screens are too confusing, I'll do the demo

05:59.840 --> 06:06.840
again.

06:06.840 --> 06:07.840
Back.

06:07.840 --> 06:12.960
So if we don't do a demo, I'm going to explain what streaming is, how you create a streaming

06:12.960 --> 06:13.960
loop.

06:13.960 --> 06:21.400
What you need to do is that you take a frame duration of 0.02 seconds, you start at a certain

06:21.400 --> 06:29.640
time, you use a clock to measure time, it can be your CPU, but if you're using other

06:29.640 --> 06:35.880
things like SRT input, the clock synchronization is going to be given by SRT.

06:35.880 --> 06:39.840
If you're using your sound card, it's going to control your latency, I'm not going to

06:39.840 --> 06:41.560
get into that, that's not for today.

06:41.560 --> 06:48.320
But for today, we take a Unix gettimeofday, we measure the initial time, then we run

06:48.320 --> 06:54.000
our program. The program is going to generate 0.02 seconds of content, send it to all

06:54.000 --> 07:00.240
outputs and say: done, that's our ending time. And then we have two scenarios: if the

07:00.240 --> 07:07.040
lapse of time that we have used to create and output all that is less than 0.02 seconds,

07:07.040 --> 07:11.880
we are good, we can sleep a little bit, come back later, and we can create content in real

07:11.880 --> 07:12.880
time.

07:12.880 --> 07:17.600
If we have taken more time than that, it's going to be a problem, because we're late;

07:17.600 --> 07:20.920
maybe we can catch up, maybe it's temporary, but most likely it's not: your system

07:20.920 --> 07:23.200
is not able to handle streaming.

07:23.200 --> 07:27.480
But that's how you run a streaming loop, essentially; it's another complexity under the

07:27.480 --> 07:31.280
hood, but at the higher level, that's the idea.
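
NOTE
A minimal OCaml sketch of this streaming loop (illustrative only, not
Liquidsoap's actual internals; generate_frame is a hypothetical stub):
  let frame_duration = 0.02
  let generate_frame () = ()  (* hypothetical: fill 0.02s and send to all outputs *)
  let rec run start =
    generate_frame ();
    let stop = Unix.gettimeofday () in
    let remaining = frame_duration -. (stop -. start) in
    if remaining > 0. then Unix.sleepf remaining  (* on time: sleep the rest *)
    else prerr_endline "late: system cannot keep up with real time";
    run (start +. frame_duration)  (* anchor on the schedule, not on 'now' *)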

07:31.280 --> 07:38.200
So in Liquidsoap, the way we used to do it is that we used to have a static frame of

07:38.200 --> 07:45.920
0.02 seconds of audio PCM data, so think of these bars as little floating-point

07:45.920 --> 07:52.960
PCM audio samples. We used to partially fill it: we would pass it down to every single

07:52.960 --> 07:58.360
operator that creates data, like a file decoder, which fills up your frame, passes it on

07:58.360 --> 08:04.160
to something that composes it, all the way to an output that would take the samples that

08:04.160 --> 08:11.000
had been generated and output them. And the reason we would do it this way is because this

08:11.000 --> 08:16.840
way we didn't have to reallocate content on every cycle; we just had one static frame

08:16.840 --> 08:21.280
that we passed around every cycle, and we would mark how much of the frame had been filled

08:21.280 --> 08:23.200
up, like you can see.

08:23.200 --> 08:30.480
But that was not a very useful format for content when we started adding video frames, because

08:30.480 --> 08:35.840
we want things to be synchronous, so we were constrained to have one video frame

08:35.840 --> 08:44.400
per content frame. That forced us, first of all, to pick an audio sampling rate

08:44.400 --> 08:48.760
that was compatible with the frame rate, so that we could have one single chunk of data that

08:48.760 --> 08:54.200
was on the same boundary, and we would assume a fixed video rate, which is not

08:54.200 --> 08:59.360
the case in a lot of encoded content, and it would force us to have a

08:59.360 --> 09:06.640
0.04-second frame. Because of that, it was not a convenient way of representing content when we

09:06.640 --> 09:08.320
started adding video.

09:08.320 --> 09:13.880
And so what we did in a recent version is that we started thinking of content in a different

09:13.880 --> 09:21.640
way, which is that you have an empty frame and you want to fill it up, and so the intuition

09:21.640 --> 09:25.760
of filling up is that you're going to add a chunk of video content to it, and it's not

09:25.760 --> 09:31.280
filled up yet, so you run it again, you finish with another chunk of content, you do the

09:31.280 --> 09:38.320
same with audio, and at the end of the day you say: boom, right here, that's my

09:38.320 --> 09:46.320
0.02 seconds of frame. So instead of passing around a pre-allocated frame of PCM arrays,

09:46.320 --> 09:53.920
we pass an empty list, and we just dump content into it until it's filled up enough, and

09:53.920 --> 09:58.720
what we want to do also is that if we have a little bit too much, we slice it up right

09:58.720 --> 10:03.760
here, we keep that in our buffer for the next cycle, and we pass that to the output to

10:03.760 --> 10:09.600
encode, and that's a much more rational way of thinking about it.
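
NOTE
A sketch of that fill-then-slice cycle (illustrative; pull, duration and
slice_at are hypothetical helpers over a list of content chunks):
  (* Accumulate chunks until we reach the 0.02s target, then split:
     the head becomes this cycle's frame, the tail is buffered for next time. *)
  let fill ~target ~pull ~duration ~slice_at =
    let rec loop acc len =
      if len >= target then slice_at (List.rev acc) target
      else
        let c = pull () in
        loop (c :: acc) (len +. duration c)
    in
    loop [] 0.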

10:09.600 --> 10:14.400
But now the question arises: how can you programmatically do that? And that's what we

10:14.400 --> 10:20.400
are going to talk about, because what I want to introduce here is this idea of a unified content

10:20.400 --> 10:22.200
composition API.

10:22.200 --> 10:28.920
Can we have an abstract API to compose content? To consider: what is a chunk of content?

10:28.920 --> 10:32.080
What is inside your chunk of content?

10:32.080 --> 10:37.160
How do you take a chunk of content and another one and concatenate them together? That's your

10:37.160 --> 10:42.760
first operation, and the second operation is how do you take a chunk of content and slice

10:42.760 --> 10:48.560
it up? Because if you remember, here we want to slice right here. So what we're going

10:48.560 --> 10:55.360
to explore is: can we have an API that works with both decoded and encoded content, so

10:55.360 --> 11:03.720
content that is still raw MP3 data, raw H.264 data? Can we create abstract chunks

11:03.720 --> 11:09.640
of content that can be concatenated and sliced?

11:09.640 --> 11:14.840
And we want to encapsulate the system where, from a user perspective, I'm going to say,

11:14.840 --> 11:20.000
oh, I have a chunk of content, concatenate those two or slice this, and I don't want to really

11:20.000 --> 11:24.240
know what goes on under the hood.
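
NOTE
One possible shape for such an API, as an OCaml signature (a sketch of the
idea, not Liquidsoap's actual module):
  module type Content = sig
    type chunk  (* opaque: PCM audio, images, or raw encoded packets *)
    val duration : chunk -> float  (* seconds *)
    val concat : chunk -> chunk -> chunk
    val slice : chunk -> offset:float -> duration:float -> chunk
  end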

11:24.240 --> 11:27.840
Yeah, so let's go.

11:27.840 --> 11:32.440
So the first thing we need is content elements, so what are our content elements?

11:32.440 --> 11:41.240
This is something that I want to take a minute to think about, because audio content

11:41.240 --> 11:47.040
is vastly different from video content, and you don't manipulate them the same way.

11:47.040 --> 11:52.800
Typically, if you think about decoded data, audio content would be a PCM array, which

11:52.800 --> 12:00.960
is here five samples of audio floats that represent digitized points on a curve that's

12:00.960 --> 12:03.360
like a sine, a frequency-based signal.

12:03.360 --> 12:08.080
But for video, a single video image is your base content, and its nature is vastly

12:08.080 --> 12:09.080
different.

12:09.080 --> 12:13.720
In video, you can do nearest-image resampling; most of the time it's going to work because

12:13.720 --> 12:16.400
of retinal persistence.

12:16.400 --> 12:21.040
With audio, no: you can't just manipulate elementary samples.

12:21.040 --> 12:25.560
They need to come together in a significant enough amount of data.

12:25.560 --> 12:30.760
We'll come back to that, because the same problem arises in encoded formats.

12:30.760 --> 12:34.880
So let's go into it and say, what is a packet?

12:34.880 --> 12:40.960
So I'm going to use the terminology of FFmpeg here, where a frame usually means

12:40.960 --> 12:42.960
decoded content.

12:42.960 --> 12:45.720
A packet is encoded content.

12:45.720 --> 12:51.320
A packet is encoded content, meaning there are going to be natural boundaries for it.

12:51.320 --> 12:59.200
So for instance, if you have MP3 data, there's a natural notion of MP3 frames.

12:59.200 --> 13:04.080
It's in the spec of MP3 data, and so that's a fixed amount, I think 1,152 samples

13:04.080 --> 13:05.080
of audio.

13:05.080 --> 13:08.840
That just gives you a chunk of the encoded data.

13:08.880 --> 13:14.080
For Ogg you have a notion of packet; that's the packetization in the muxing system.

13:14.080 --> 13:20.840
For video, most video formats will say one chunk is a frame, which is an image, but they

13:20.840 --> 13:22.080
have different natures.

13:22.080 --> 13:24.120
Depending on the format, you have an I-frame,

13:24.120 --> 13:28.760
which can be decoded separately, or a P-frame, which needs an I-frame to be decoded.

13:28.760 --> 13:33.080
But what FFmpeg does here is that it abstracts away what the packet is, and it just

13:33.080 --> 13:34.080
tells you:

13:34.080 --> 13:35.080
Here's a packet.

13:35.080 --> 13:39.200
That's just an opaque and unbreakable amount of data.

13:39.200 --> 13:42.960
It holds a little piece of your encoded data, but you can't really look into it.

13:42.960 --> 13:46.880
You can't really break it up into a smaller chunk.

13:46.880 --> 13:51.560
So the second thing we need is to create a chunk.

13:51.560 --> 13:57.720
This is our timeline, and here, for this point in time, there are two ways of describing it.

13:57.720 --> 14:02.040
If you have decoded content, you have a PTS; that's the time at which it needs

14:02.040 --> 14:05.560
to be presented to the people watching the video.

14:05.560 --> 14:09.480
If you're in the encoded world, it's the DTS, the decoding timestamp; that's the time it needs to be

14:09.480 --> 14:12.120
given to the decoder, and we'll come back to that.

14:12.120 --> 14:16.800
You have a duration for your chunk of data, and that gives you right there your elementary

14:16.800 --> 14:18.760
first chunk of data.

14:18.760 --> 14:20.520
That's the simplest element.

14:20.520 --> 14:22.760
You have a frame, it has a certain duration.

14:22.760 --> 14:23.760
Boom.

14:23.760 --> 14:25.800
You have a little tiny bit of content.

14:25.800 --> 14:32.000
You have a packet, like an MP3 frame: it has 1,152 samples of audio at a 44.1

14:32.040 --> 14:34.720
kHz rate, boom, that's a little amount of data.

14:34.720 --> 14:38.240
You have one image, it has a little duration.
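
NOTE
The elementary content element, as a sketched OCaml record (payload stands
for PCM samples, one image, or one FFmpeg packet; names are illustrative):
  type element = {
    timestamp : float;  (* seconds: PTS for decoded data, DTS for packets *)
    duration : float;   (* e.g. 1152. /. 44100. for one MP3 frame *)
    payload : bytes;    (* opaque data, never broken up below this level *)
  }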

14:38.240 --> 14:40.520
Now you want to compose your stuff.

14:40.520 --> 14:44.680
You have two chunks here, and, remember, we have durations for all these

14:44.680 --> 14:49.400
things; so imagine that those are no longer samples, they are frames, packets, little

14:49.400 --> 14:50.400
elementary content elements.

14:50.400 --> 14:56.320
We have a duration for each, and we have a last PTS or DTS, a last timestamp, for the first chunk.

14:56.320 --> 15:01.680
So what we can do is look at the other chunk here, at its first timestamp, and

15:01.680 --> 15:06.080
we can basically say that the new timestamp of the first content element of this second

15:06.080 --> 15:08.800
chunk is going to follow the last timestamp of the first chunk.

15:08.800 --> 15:13.360
And then, one by one, we can adjust based on the durations, which gives the timestamps

15:13.360 --> 15:18.320
of number two, number three, and now it's going to flow, right? Because the first timestamp

15:18.320 --> 15:23.720
of the first content element here can be placed right here, and we have a concatenation,

15:23.720 --> 15:29.400
that's simple, and this will work at the muxing level.
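
NOTE
A sketch of that concatenation, rewriting the second chunk's timestamps so
it starts where the first one ends (uses the element record above):
  let concat a b =
    match List.rev a with
    | [] -> b
    | last :: _ ->
      let start = last.timestamp +. last.duration in
      let _, rev_b =
        List.fold_left
          (fun (t, acc) e -> (t +. e.duration, { e with timestamp = t } :: acc))
          (start, []) b
      in
      a @ List.rev rev_b  (* timestamps now flow continuously *)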

15:29.400 --> 15:35.600
Now the second problem we have is about slicing content.

15:35.600 --> 15:39.680
Remember I was saying, we want to slice this, we want to be able to divide this content

15:39.680 --> 15:41.600
into a smaller chunk.

15:41.600 --> 15:48.000
So the way we're going to do it is we're going to take, again, a chunk of content, five packets,

15:48.000 --> 15:52.680
five frames, and we're going to say, well, here's an abstract offset, and here's an abstract

15:52.680 --> 15:56.400
duration, and we're not going to do anything.

15:56.400 --> 16:02.720
We're just going to say our content is three elements: the raw content, an array or

16:02.720 --> 16:08.120
list of the data that's inside, and just an abstract offset and duration.

16:08.120 --> 16:14.520
And then if we have another one, like here, another content with its own offset and duration,

16:14.520 --> 16:21.600
we can now just say that they are placed next to each other as a composition.

16:21.600 --> 16:27.160
And what we're going to do, once we compose our final content

16:27.160 --> 16:32.160
chunk, is extract the content elements that are relevant within the boundaries

16:32.160 --> 16:33.920
of our offset and duration.

16:33.920 --> 16:37.280
So when I want to create the final larger chunk, I'm going to take everything that's in

16:37.280 --> 16:46.720
the yellow and orange area and adjust the timestamps, and that's my new content.
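
NOTE
A sketch of that lazy slicing: a chunk keeps its elements untouched plus an
abstract window, and extraction only happens at materialization time:
  type chunk = { elements : element list; offset : float; duration : float }
  let materialize { elements; offset; duration } =
    let stop = offset +. duration in
    let _, kept =
      List.fold_left
        (fun (t, acc) e ->
          let t' = t +. e.duration in
          (* keep whole elements overlapping the [offset, stop) window *)
          (t', if t' > offset && t < stop then e :: acc else acc))
        (0., []) elements
    in
    List.rev kept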

16:46.720 --> 16:49.560
That is, on the surface, what we've done.

16:49.560 --> 16:58.320
We've implemented a notion of content in Liquidsoap that extracts chunks of data from the decoders,

16:58.320 --> 17:02.240
which are packets from FFmpeg or frames from FFmpeg.

17:02.240 --> 17:04.240
They are placed into arrays.

17:04.240 --> 17:08.360
When we want to do a smaller one, we just adjust the offset and duration.

17:08.360 --> 17:16.120
If I want to concatenate, we just list them together, and eventually the encoder, the output

17:16.120 --> 17:21.400
is in charge of recomposing that content description to generate the data that's going

17:21.400 --> 17:24.480
to flow out.

17:24.480 --> 17:32.200
It works, but it doesn't work in all situations. Because, as I said, the beauty of

17:32.200 --> 17:39.360
good engineering work is about exposing that to the user, while internally being like:

17:39.360 --> 17:42.760
it's a bit more complex, I'm just not showing you that.

17:43.080 --> 17:47.440
If you think about it, that's already what FFmpeg does, because FFmpeg gives you an API that

17:47.440 --> 17:54.240
says: read a stream frame, it returns a packet, you don't know what the packet is.

17:54.240 --> 17:58.320
Decode a packet, you get a frame, you don't know what the frame is; send a frame to a muxer,

17:58.320 --> 18:00.520
you don't know what the muxer does.

18:00.520 --> 18:04.920
That's what we're trying to do here; it's the same idea. Under the hood, FFmpeg knows

18:04.920 --> 18:12.720
that if your muxer is an MP4 container or an MPEG-TS, then it's going to do different things,

18:12.800 --> 18:19.040
but you don't need to know, and so what I've presented here is really nice, it's elegant,

18:19.040 --> 18:23.520
it works, but there are other things that I'm hoping we can solve, that the user doesn't need to know about,

18:23.520 --> 18:30.120
that are pretty complex. So this is what we're going to talk about now: the hard bits.

18:30.120 --> 18:36.360
First of all, let's talk about what we just saw, which is slicing off a smaller chunk

18:36.360 --> 18:41.280
of content based on a larger one.

18:41.280 --> 18:46.960
It's not always possible, because you have limitations: chunks

18:46.960 --> 18:53.360
have a minimal size, so if you take one sample of PCM audio, its duration is one over the

18:53.360 --> 19:00.240
sample rate, and you can't really slice it to less than that, so here typically if my

19:00.240 --> 19:06.240
chunk was PCM audio, I would be incorrect to assume I can slice anywhere, because I can't really take an

19:06.320 --> 19:12.480
offset that is in between samples; it doesn't make sense, my granularity

19:12.480 --> 19:20.720
is at minimum one over the sample rate. You have the same problem with video frames: a single

19:20.720 --> 19:26.400
image is one over 25, one over 30, whatever your frame rate is; you can't really slice it to less than

19:26.400 --> 19:32.560
that. And then for encoded packets, it's even more complex than that, because we don't

19:32.640 --> 19:36.320
really know what's inside; FFmpeg just tells you "here's a packet", so you can't really slice it.

19:39.360 --> 19:43.280
So here's what we've done here; I'm going to show some of the ideas we have,

19:43.920 --> 19:52.560
I'm going to expose other problems, and then we can talk about it later, so one of the solutions

19:52.560 --> 19:58.160
we found for the PCM audio: if you think about it, in your system you're going to have audio

19:58.240 --> 20:04.800
and video that flow around internally, and every computer system is, internally,

20:06.080 --> 20:12.160
digitized at a specific sample rate; there's nothing continuous. So your application has a

20:12.160 --> 20:19.600
master rate that is then converted into the audio rate, and most of the time the audio rate is much higher

20:19.600 --> 20:26.800
than the video rate; it's like 44.1K samples per second versus 25 frames per second.

20:26.880 --> 20:32.400
So what you can do is just set your application's main rate to be the same as the audio rate:

20:32.400 --> 20:39.120
internally you flow everything in 48K, your application handles everything at 48K, and that means that

20:39.120 --> 20:46.160
your offset is always going to be on a sample boundary, you can't really mess it up. That's a solution.
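
NOTE
The trick in a few lines of OCaml: with the main rate equal to the audio
rate, positions are integer ticks and can never fall between two samples
(a sketch; the rate value is illustrative):
  let main_rate = 48000
  let ticks_of_seconds s = int_of_float (Float.round (s *. float_of_int main_rate))
  let seconds_of_ticks t = float_of_int t /. float_of_int main_rate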

20:48.000 --> 20:53.600
In video you can do the nearest-image resampling; as I said, retinal persistence works.

20:53.680 --> 20:58.480
And for encoded packets, it might work for some formats, you might be able to do something;

20:58.480 --> 21:05.680
for some others you can't. And so typically for images, I said that an encoded frame can be a single image,

21:05.680 --> 21:12.240
so if it's a single image, you can always change its location in the timeline, you change its timestamps,

21:12.960 --> 21:17.040
and retinal persistence makes it possible to put it at any offset you want, so that's going to work.

21:18.000 --> 21:23.520
The other thing you have is that if you think about an example of content

21:23.520 --> 21:28.560
that is not at the exact boundaries, then you're going to have a little bit of delta in

21:28.560 --> 21:36.320
each case. But what you could do in your API is compose your content and keep a delta

21:36.320 --> 21:44.160
that's within acceptable boundaries, for instance guaranteeing that the average delta from your,

21:44.160 --> 21:49.040
you know, little boundaries that were not exactly where you wanted to slice is still on average zero,

21:49.040 --> 21:55.920
or within the accepted delta between audio and video; there are different standards

21:57.200 --> 22:01.280
in Europe and the US for how much the audio can lag compared to the video.

22:01.280 --> 22:05.280
So if you keep that within the boundaries, you can still have a very high level API

22:05.280 --> 22:09.840
that allows for content composition and hides a little bit of this complexity: you

22:10.160 --> 22:15.200
cannot actually slice precisely, but overall it's going to function as if you had.
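
NOTE
A sketch of that bookkeeping: accumulate the per-cut error and only accept a
cut while the running A/V delta stays inside a tolerance (the threshold is
illustrative, not a standard's actual figure):
  let tolerance = 0.040  (* seconds *)
  let drift = ref 0.
  let accept_cut ~requested ~actual =
    let d = !drift +. (actual -. requested) in
    if Float.abs d <= tolerance then (drift := d; true) else false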

22:16.480 --> 22:24.160
So the last bit that I want to show before we finish is, of course, I've saved the most important

22:24.160 --> 22:31.040
complexity for the end. This is a timeline of video frames, but in encoded video, usually you have

22:31.120 --> 22:39.200
an I-frame, which is a frame that can be decoded separately. You take this frame, you decode it,

22:39.200 --> 22:44.560
you get an image back, and then usually you have frames after that that need the first I-frame

22:44.560 --> 22:51.680
to be decoded. And you might as well have frames before that, which also need to be

22:51.680 --> 22:56.960
referencing the I-frame after them to be decoded. So that's the timeline in presentation time-

22:56.960 --> 23:04.080
stamps; that's what you want to see in the final video. But that means that the timeline in

23:04.080 --> 23:10.960
decoding order is going to be way different, and it's going to be completely jumbled, because in

23:10.960 --> 23:17.040
terms of decoding, you need to give the I-frame to the decoder first, so that it has it,

23:17.040 --> 23:23.120
and then you give the blue frames that are before, so that they can be decoded first,

23:23.200 --> 23:30.160
then you give the frames after, so that they can be decoded after. And so that means that your

23:30.160 --> 23:36.560
timeline in terms of encoded content is going to look very different from your timeline in terms

23:36.560 --> 23:41.760
of decoded content. And so if you don't pay attention to that detail, and you say: I'm going to take

23:41.760 --> 23:46.560
a chunk of content that's right here, and just ignore the I-frame, and say that's my chunk of

23:46.640 --> 23:52.400
content, it has the two blue frames, two green frames, well, in and of itself it's not going to be

23:52.400 --> 23:57.360
decodable, because you don't have the I-frame that they refer to. And so one of the problems we

23:57.360 --> 24:02.720
have to solve is that if you're handling dynamic content, and you're composing from two different

24:02.720 --> 24:08.240
streams, you have to ask yourself: is this stream the same as the previous one? Because in this

24:08.240 --> 24:13.440
case, I can probably concatenate whatever I want, because I'm assuming that whatever I-frame was

24:13.520 --> 24:18.880
required to decode that chunk of content was already given to the decoder in the previous one.

24:18.880 --> 24:25.760
But if you're switching streams to a different source of encoded data, and you want to

24:25.760 --> 24:30.560
compose the content, you're going to have to have a limitation that says: I want the first

24:30.560 --> 24:35.840
chunk that comes out of the new stream to start with an I-frame, so that we don't have a glitch

24:35.840 --> 24:40.800
for the user. Or you might want a glitch, but that's a presentation for later. So that's another

24:40.800 --> 24:44.960
problem that you have to solve when you want to create a nice composition algebra.
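NOTE
A sketch of that constraint (is_keyframe is a hypothetical predicate, e.g.
backed by the keyframe flag FFmpeg sets on packets):
  let align_to_keyframe ~same_stream ~is_keyframe packets =
    if same_stream then packets  (* decoder already has the references *)
    else
      (* drop leading packets until the first I-frame of the new stream *)
      let rec drop = function
        | p :: rest when not (is_keyframe p) -> drop rest
        | l -> l
      in
      drop packets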

24:46.160 --> 24:52.320
And there are more questions that we haven't solved. So again, these go into

24:53.120 --> 24:57.760
problems that are very low level and dependent on your muxer and codec format.

24:59.360 --> 25:07.120
For instance, if you look at H.264 in MP4, or MPEG-TS: can you compose two

25:07.120 --> 25:12.000
streams that have different frame sizes? Can I change the frame size dynamically? Can you change the

25:12.000 --> 25:17.120
pixel format, YUV420, YUV422? Is it part of the standard, is it not part of the

25:17.120 --> 25:23.600
standard. Can you change the audio sample format? Another problem that you have is what in FFM

25:23.600 --> 25:29.120
FFmpeg is called extradata. So basically, when you have encoded content, you need certain

25:29.120 --> 25:34.960
data to be able to decode, like tables for the compression, and depending on your container:

25:34.960 --> 25:39.680
in MP4, this is going to be in a global header at the beginning of the file. In MPEG-TS,

25:39.680 --> 25:45.520
it's going to be in every frame. So if I'm composing content dynamically, I need to be able to

25:45.520 --> 25:52.320
detect that and say: no, no, you need to convert your H.264 from global to local data,

25:52.320 --> 25:57.440
so that you can compose dynamically and be able to decode. Typically, FFmpeg has what they call

25:57.440 --> 26:03.920
bitstream filters that can be used to manipulate that extradata and go from a global header format

26:03.920 --> 26:10.400
to a local, per-frame format. Again, I'm hoping that if we write an abstract API that does these two

26:10.400 --> 26:15.920
things, concatenation and slicing, it can also detect those things and do it for you. The same

26:15.920 --> 26:21.120
way that when you compose content with the FFmpeg API, it's able to do it on a file: if you

26:21.120 --> 26:25.600
write an MP4 file, it's going to do a global header; if you write MPEG-TS, it's going to do it for you.
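
NOTE
For reference (my addition, not from the talk): the stock ffmpeg tool exposes
exactly this conversion as the h264_mp4toannexb bitstream filter, e.g. with
"-c copy -bsf:v h264_mp4toannexb" when remuxing MP4 to MPEG-TS; an abstract
composition API could apply the equivalent filter automatically.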

26:27.360 --> 26:31.280
Yeah, and you know, the other thing is that it's not always going to work, so what are the

26:31.280 --> 26:36.080
optimal parameters? How many I-frames do you need? Those are the hard questions.

26:36.960 --> 26:43.600
So yes, I'm finished now, and I think we're going to forego the demo, I wish I could have done it,

26:43.600 --> 26:52.400
but to recap what I was trying to present is if we want to manipulate content, and especially

26:52.400 --> 26:59.920
manipulate, well, encoded content, binary content: can we create an API that allows

27:00.640 --> 27:08.080
you to consider your content as little pieces that you can slice up, patch together, send to an

27:08.080 --> 27:12.400
output, and create a stream that is a valid stream without having to re-encode everything,

27:12.400 --> 27:17.840
while also abstracting the complexity. In the demo I wanted to show, there was a 4K video,

27:18.800 --> 27:24.720
well, a bunch of 4K files and some audio, and basically I click on some button,

27:24.720 --> 27:29.840
and it switches the file on the fly, and the content keeps being decoded without any glitch.

27:30.480 --> 27:34.400
And I think that there's a lot of value to that, I think that a lot of the

27:35.760 --> 27:40.720
computing power that's required to re-encode to make that work in most applications could be saved,

27:40.720 --> 27:48.640
if we can work on specific expert knowledge about all these binary formats and export that

27:48.640 --> 27:54.800
in a very neat API that would solve the problem for people. Thank you. So, yeah, maybe there

27:54.800 --> 27:56.240
are remarks or questions.

28:05.120 --> 28:10.160
Yes. In your experience, are downstream receivers going to accept the level of

28:10.160 --> 28:17.360
A/V synchronization that comes out of your solution? In our current solution,

28:19.120 --> 28:25.840
so what we're solving at the moment is a very simple subset, meaning that it works if your system

28:25.840 --> 28:33.040
has been preemptively designed to work properly. So, I haven't done a lot of testing in

28:33.040 --> 28:40.880
specific situations where the A/V drifts; like, the idea of maintaining an average A/V desynchronization,

28:40.880 --> 28:44.400
we don't have it yet. Those are ideas for future development.

28:51.920 --> 28:58.640
Well, if I have a couple of minutes, I can try that demo. The problem I was having is that the screen

28:58.640 --> 29:10.800
is not mirroring, or is it? You know, display settings. There you go.

29:12.800 --> 29:21.920
All right, arranged? No. Now, mirror. Can you see what I'm seeing? Yes.

29:21.920 --> 29:27.120
All right, so let's try the demo here. This is my script in Liquidsoap. I'm going to start it.

29:27.840 --> 29:34.720
It's going to create an HLS stream. Yep, the files have started. So, the music is mine,

29:34.720 --> 29:41.360
and the videos are free. So, we're going to look at the telnet interface, and we have two

29:41.360 --> 29:48.400
commands: audio next, which says what the next audio file is, and audio skip. So, I'm going to skip audio

29:48.400 --> 29:56.640
two times. All right, and now I'm going to do the same with video. So, the next video is

29:56.720 --> 30:08.000
going to be number one. Then number four. All right, and now let's listen to the result.

30:08.800 --> 30:26.880
That's the video we switched, two times; and now let's do the audio, and now let's do the video.

30:26.880 --> 30:42.400
Ah, wait, it's a 20-second loop. There you go. So, yes, what we're seeing is that we're switching

30:42.400 --> 30:47.600
seamlessly between the different files, and they're all 4K. I'm not re-encoding them, but

30:47.600 --> 30:53.040
we're able to do that streaming loop where we do 0.02 seconds at a time. We slice the chunks up,

30:53.040 --> 30:57.920
we compose them, and there's no glitch, because the I-frame is properly placed first. All right,

30:57.920 --> 31:02.160
I hope that makes more sense now, and I want to thank you, everyone, for being here today.

31:11.200 --> 31:14.240
Oh, sorry. What do you mean exactly?

31:24.000 --> 31:33.440
No, again, I am not doing a lot on the video side yet; it's on the other side. I should have

31:33.440 --> 31:38.080
made it more clear at the beginning: Liquidsoap is more high-level. So, I'm not myself an expert

31:38.080 --> 31:44.320
in all the different binary content. I have been working straight up with what

31:44.320 --> 31:49.600
FFmpeg gives me as an API. And part of the reason I'm giving this talk, to be honest with you, is

31:49.760 --> 31:54.800
because I'm hoping that more expert knowledge in the binary formats, in the different encoders,

31:54.800 --> 31:59.680
in the different muxers, could help inform that. The set of questions I had at the end,

32:00.960 --> 32:07.520
about things like, you know, what I was saying: I don't have the answers to those.

32:07.520 --> 32:11.680
I'm actually seeking them myself, like: can you change the frame size, can you change the pixel

32:11.680 --> 32:17.920
format, you know, what are the limitations? We have solved the problem in a very specific case,

32:18.080 --> 32:22.640
we're very strict: when we compose, we want the exact same codec, the exact same frame size,

32:22.640 --> 32:30.240
the exact same pixel format. And it works, we've tested it, basically. But thank you for your question.

32:30.960 --> 32:34.960
And then we, yes, absolutely. Yeah, I would love to learn more, absolutely.

32:36.400 --> 32:41.440
All right, I think I'm done. All right, yep, time's up. Thanks everyone. You all have a good day.

