WEBVTT

00:00.000 --> 00:08.160
Thanks, everyone, for coming to this event.

00:08.160 --> 00:15.640
So today we are going to talk about VPP: what the TLS plugin inside VPP is, and how we

00:15.640 --> 00:19.440
have optimized it for better performance.

00:19.440 --> 00:28.240
So today, Varun and I will go through the presentation.

00:28.240 --> 00:34.240
So my friends from the VPP community have already given you an introduction to VPP, so I will not

00:34.240 --> 00:40.080
go deep into the VPP internals, but just a question to everyone.

00:40.080 --> 00:48.280
So how many of you have studied CPU architecture in a university class or somewhere, right?

00:48.280 --> 00:49.280
Yeah.

00:49.280 --> 00:55.240
So I see a lot of people. One statement I want to make is that VPP is one of

00:55.240 --> 01:01.720
the pieces of software that leverage the basic concepts of CPU architecture. Everybody

01:01.720 --> 01:07.600
says that you need to have your data or instructions in the L1 cache, right?

01:07.600 --> 01:13.360
But how much software makes sure that you always have your instructions or data in the L1 cache

01:13.360 --> 01:16.520
before you actually execute the instructions?

01:16.520 --> 01:21.000
So, from a programming point of view, I feel that VPP is one of the very few pieces of software

01:21.040 --> 01:27.320
that leverage this and get the best performance out of your software.

01:27.320 --> 01:29.120
So yeah we will move on.

01:29.120 --> 01:34.360
So the agenda is that we will go over what the TLS plugin is and the contributions

01:34.360 --> 01:40.960
we have made to the TLS plugin to enhance its performance. We will also briefly talk

01:40.960 --> 01:48.440
about the DPDK user-space crypto driver, because we internally leverage DPDK for enhanced

01:48.520 --> 01:56.960
crypto performance. We will describe the actual functionality we have implemented

01:56.960 --> 02:02.880
to increase the performance, show some performance comparison numbers, and cover

02:02.880 --> 02:07.920
the future scope of the work we are going to do.

02:07.920 --> 02:10.560
So, what is the VPP TLS plugin?

02:10.640 --> 02:18.640
So, as earlier sessions have already mentioned, everything in VPP is a loadable plugin,

02:18.640 --> 02:19.640
okay?

02:19.640 --> 02:25.160
So even the TLS functionality in VPP is a loadable plugin.

02:25.160 --> 02:32.480
So TLS is a transport type exposed by the VPP framework, just like you would say,

02:32.480 --> 02:39.200
I want a transport socket of TCP type or UDP type, right?

02:39.280 --> 02:45.400
Similarly, in VPP you can just say, I want to do a data transfer with a transport

02:45.400 --> 02:46.400
of TLS.

02:46.400 --> 02:51.280
So you just go and write to this socket, and the underlying TLS implementation will encrypt

02:51.280 --> 02:54.080
the data and then send it out.

02:54.080 --> 02:59.520
So it also provides integration with multiple TLS libraries.

02:59.520 --> 03:05.160
Some examples are OpenSSL, picotls, and mbedTLS.

03:05.160 --> 03:08.440
For the scope of this talk we will concentrate on OpenSSL, because that is where

03:08.520 --> 03:12.360
we have done our tests and performance comparisons.

03:12.360 --> 03:18.480
And in the VPP TLS plugin we have the session management, which is done within the TLS plugin,

03:18.480 --> 03:25.400
so that for any application data sent to the TLS layer, we know which TLS

03:25.400 --> 03:31.440
context is to be used and what the underlying TCP connection is on which you

03:31.440 --> 03:33.640
have to send the packet.

03:33.640 --> 03:37.560
It also provides some configurability on which keys and certificates you have

03:37.560 --> 03:42.720
to use for the application data.

03:42.720 --> 03:51.640
So, going over the contributions we have made to the TLS plugin: for those who are aware

03:51.640 --> 03:59.360
of the OpenSSL software, it is not very recent, but in an earlier OpenSSL release

03:59.360 --> 04:03.120
they introduced the concept of asynchronous crypto jobs.

04:03.120 --> 04:08.160
So we have leveraged that and adapted the same thing in the TLS plugin, where you

04:08.160 --> 04:12.440
can have a crypto job; let us say you want to do an RSA sign operation.

04:12.440 --> 04:16.760
So you can go and submit the operation as an asynchronous job; if there is an

04:16.760 --> 04:21.080
accelerator which can work on it on the other side, you do not need to wait for the operation to

04:21.080 --> 04:25.880
complete. The core comes back, and it can do the application work, and then after some

04:25.880 --> 04:31.680
time, if you go and check and the crypto operation is done, you can go receive the response

04:31.680 --> 04:35.240
and then complete the overall request.

04:35.240 --> 04:41.920
So that way we can save the time in which the RSA operation, or any

04:41.920 --> 04:46.240
other crypto operation, is getting done.
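
The submit-and-poll flow described above can be sketched as a toy model (illustrative Python, not the actual VPP/OpenSSL C code; all class and function names here are invented for the example):

```python
import queue

# Simplified model of an async crypto offload: jobs are submitted to a
# "device" queue, the calling core keeps doing application work, and later
# polls for completions instead of blocking on each operation.
class AsyncCryptoDevice:
    def __init__(self):
        self._pending = queue.Queue()
        self._done = []

    def submit(self, job):
        # Non-blocking submit; returns to the caller immediately.
        self._pending.put(job)

    def process(self):
        # Stand-in for the hardware accelerator doing the actual work.
        while not self._pending.empty():
            op, data = self._pending.get()
            self._done.append((op, f"signed({data})"))

    def poll(self):
        # Collect whatever completions are ready; may be empty.
        done, self._done = self._done, []
        return done

dev = AsyncCryptoDevice()
dev.submit(("rsa-sign", "hello"))   # core is now free for app work
dev.process()                        # accelerator completes the job
completions = dev.poll()             # later, pick up the result
print(completions)                   # [('rsa-sign', 'signed(hello)')]
```

The point of the pattern is only that `submit` never blocks: the time the RSA operation takes overlaps with useful application work on the core.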

04:46.240 --> 04:52.560
So after adding this, we have verified this functionality with the

04:52.560 --> 04:57.760
Marvell OpenSSL engine which we have written, so that we can leverage the hardware accelerator

04:57.760 --> 05:02.400
capabilities of Marvell's DPU.

05:02.400 --> 05:09.040
So these are the patches we have pushed to add the performance enhancements; we can

05:09.040 --> 05:11.640
go over them later.

05:11.640 --> 05:17.240
So the Marvell OpenSSL engine which I mentioned on the earlier slide internally

05:17.240 --> 05:24.120
uses a DPDK crypto PMD. Basically, what it does is go and submit operations using

05:24.120 --> 05:30.320
the DPDK RTE APIs, and then the underlying PMD will take care of the submission to the

05:30.320 --> 05:32.720
actual hardware.

05:32.720 --> 05:41.200
So, just a brief introduction to DPDK user-space crypto and why it is faster: it works on a poll-mode driver (PMD)

05:41.200 --> 05:42.200
basis.

05:42.200 --> 05:47.520
So there is no interrupt handling or interrupt context switching, so the performance will

05:47.520 --> 05:54.040
be higher because of that. It is asynchronous, so it allows you to submit the operations

05:54.280 --> 06:00.040
asynchronously, and you can do it on a burst basis, so that if you want to do 10

06:00.040 --> 06:06.040
requests at a time, you can go and submit 10 requests and then go back and do your

06:06.040 --> 06:11.000
application work, come back, and see how many are done; if X number of requests

06:11.000 --> 06:16.480
are done, you go and resume those operations. It also provides flexibility, because there

06:16.480 --> 06:24.000
is a software OpenSSL-based crypto implementation in DPDK, so you can have a comparison

06:24.000 --> 06:30.400
of your hardware-accelerated PMD versus the software OpenSSL crypto implementation.

06:30.400 --> 06:37.440
So that way you can say my implementation is X percent better than the software-

06:37.440 --> 06:44.600
based implementation. And also, we maintain sessions within the engine, so that for a certain

06:44.600 --> 06:51.000
session we do not need to set it up multiple times; basically, if you have hundreds of

06:51.000 --> 06:55.360
packets flowing through the same session, the same session can be used once it is set up the first

06:55.360 --> 07:02.880
time. And it is a standard API, so if you are moving from one implementation

07:02.880 --> 07:08.880
to another implementation underneath in DPDK, you do not need to change your code

07:08.880 --> 07:12.280
much.
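
The burst pattern can be modeled like this (a simplified Python sketch of the enqueue/dequeue-burst idea; in real DPDK code these would be the C APIs `rte_cryptodev_enqueue_burst` and `rte_cryptodev_dequeue_burst`, and the class here is invented for illustration):

```python
# Toy model of burst enqueue/dequeue: submit N requests in one call,
# do application work, then dequeue however many have completed.
class BurstCryptoQueue:
    def __init__(self):
        self._inflight = []

    def enqueue_burst(self, ops):
        # Returns the number of ops actually accepted, as DPDK does.
        self._inflight.extend(ops)
        return len(ops)

    def dequeue_burst(self, max_ops):
        # Returns up to max_ops completed operations; the rest stay queued.
        done = self._inflight[:max_ops]
        self._inflight = self._inflight[max_ops:]
        return done

q = BurstCryptoQueue()
submitted = q.enqueue_burst([f"op{i}" for i in range(10)])  # 10 at once
# ... go back and do application work here ...
first = q.dequeue_burst(4)    # some completions ready on the first poll
rest = q.dequeue_burst(16)    # pick up the remainder on a later poll
print(submitted, first, rest)
```

Amortizing per-call overhead over a whole burst, rather than paying it per request, is where much of the speedup comes from.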

07:12.280 --> 07:16.880
So Varun will take over here, and he will explain the actual work done at

07:17.040 --> 07:18.040
the code level.

07:18.040 --> 07:19.040
Yeah.

07:19.040 --> 07:27.840
So this particular flow explains how the TLS plugin handles the TLS processing.

07:27.840 --> 07:36.560
So whenever the TLS plugin, which acts as a server, receives any TLS traffic, it will see the

07:36.560 --> 07:43.160
handshake request type, and it will send it to the OpenSSL engine using the SSL library

07:43.160 --> 07:44.160
calls.

07:44.240 --> 07:49.400
So the underlying library will check if the asynchronous mode is enabled or not.

07:49.400 --> 07:54.880
If it is enabled, then it checks the return value of the API.

07:54.880 --> 08:00.720
If it says want-async, then it passes the request to the next level, which is the

08:00.720 --> 08:06.840
Marvell OpenSSL engine, which internally offloads the actual processing to the hardware, where

08:06.840 --> 08:12.400
the actual work will be done, and once the work is done the hardware accelerator will respond

08:12.480 --> 08:15.480
back to the engine saying that the job is finished.
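
The return-value check can be sketched roughly like this (a hypothetical Python model; in OpenSSL the corresponding real signal is the `SSL_ERROR_WANT_ASYNC` error code returned when async mode is enabled, and the function here is invented for the example):

```python
# Toy model of the want-async control flow: the SSL call either completes
# synchronously or reports that the job went to the accelerator, and the
# caller retries later instead of blocking.
WANT_ASYNC = "WANT_ASYNC"

def do_handshake_step(job_offloaded):
    # Stand-in for an SSL handshake call; returns WANT_ASYNC while the
    # accelerator still owns the job.
    return WANT_ASYNC if job_offloaded else "OK"

results = []
rc = do_handshake_step(job_offloaded=True)   # job handed to hardware
results.append(rc)
if rc == WANT_ASYNC:
    # ... do other application work, then retry once the job is done ...
    rc = do_handshake_step(job_offloaded=False)
results.append(rc)
print(results)  # ['WANT_ASYNC', 'OK']
```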

08:15.480 --> 08:21.880
Then the TLS plugin will continuously poll the engine for asynchronous

08:21.880 --> 08:27.720
request completions. Whenever a request is completed, we will get a callback, and in the call-

08:27.720 --> 08:31.680
back we will see what kind of event was completed.

08:31.680 --> 08:36.520
So we are maintaining two different queues: one for the handshake messages and the other

08:36.520 --> 08:38.960
one for read and write events.

08:38.960 --> 08:43.880
So based on the event type, the particular event will be enqueued into the respective

08:43.880 --> 08:48.640
queue, and it goes for the next level of processing in the plugin.

08:48.640 --> 08:53.760
So in this way the plugin will keep polling the engine, and in the next iteration what

08:53.760 --> 08:59.640
it does is it dequeues from the queues, and it processes each event based on the handler which

08:59.640 --> 09:03.040
was written for the particular event type.

09:03.040 --> 09:08.480
So there the actual work will be done, and it completes the request and will send the

09:08.520 --> 09:10.240
response back to the application.
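
The two-queue dispatch described above can be sketched as a toy model (illustrative Python; the queue names, event types, and handlers are invented for the example, not the plugin's actual symbols):

```python
from collections import deque

# Completed async events are enqueued by type, then drained on the
# next polling iteration and dispatched to per-type handlers.
handshake_q, rw_q = deque(), deque()

def on_completion(event_type, data):
    # Callback fired when the engine reports a finished job.
    (handshake_q if event_type == "handshake" else rw_q).append(data)

handlers = {
    "handshake": lambda d: f"handshake-done:{d}",
    "rw":        lambda d: f"io-done:{d}",
}

def poll_iteration():
    # One polling iteration: drain both queues through their handlers.
    results = []
    while handshake_q:
        results.append(handlers["handshake"](handshake_q.popleft()))
    while rw_q:
        results.append(handlers["rw"](rw_q.popleft()))
    return results

on_completion("handshake", "clienthello")
on_completion("rw", "record1")
print(poll_iteration())  # ['handshake-done:clienthello', 'io-done:record1']
```

Keeping handshake and read/write completions in separate queues lets each event type reach a handler written specifically for it on the next iteration.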

09:10.240 --> 09:15.480
So here the TLS plugin acts as a server and it will respond back to the client.

09:15.480 --> 09:20.160
So this particular flow is showing the same thing.

09:20.160 --> 09:29.160
Yeah these are the performance comparison numbers which we have captured with the software

09:29.160 --> 09:31.920
and with the hardware acceleration.

09:31.920 --> 09:37.720
On the left-hand side, you can see we have tested with one and two cores with different

09:37.720 --> 09:43.160
file sizes, and on the right-hand side you can see the hardware accelerator performance.

09:43.160 --> 09:50.400
And the setup is a simple DUT which is connected back-to-back with a traffic generator, where

09:50.400 --> 09:55.600
it sends the packets and gets the encrypted packets back out.

09:55.600 --> 10:05.760
So we could see that with software we are able to achieve up to 6 or

10:05.760 --> 10:12.560
7 Gbps across different file sizes, whereas with the hardware accelerator the performance is much

10:12.560 --> 10:19.560
better than the software one with asynchronous mode enabled, and the testing conditions are

10:19.640 --> 10:27.520
mentioned in the slide. We have used RSA 4K, and the per-second improvement

10:27.520 --> 10:33.960
that we observed was approximately 20%, and when we are submitting this particular asynchronous

10:33.960 --> 10:39.800
work to the hardware accelerator, we saw an improvement of up to 50% in the CPU utilization,

10:39.800 --> 10:45.320
where the CPU was free for processing something else in the VPP context.

10:45.320 --> 10:55.800
We have also introduced this pipelining feature in the recent patch, so these are the numbers

10:55.800 --> 11:05.000
with different pipes and with different record sizes. What pipes means is that we can configure

11:05.000 --> 11:12.960
a number of pipes, where a particular record will be given to those many pipes,

11:13.040 --> 11:20.240
and the hardware will process those many pipes of packets in parallel.

11:20.240 --> 11:27.840
So here we can see the performance with different numbers of pipes configured.

11:27.840 --> 11:35.680
So what we see here is that we expected much better performance with a larger number of pipes, but

11:35.680 --> 11:41.680
it was not observed that way; but yeah, this is something which we will

11:41.760 --> 11:48.480
be taking up as future scope.

11:48.480 --> 11:54.160
So this is the future scope that we are planning: we are planning to upstream our

11:54.160 --> 12:03.200
OpenSSL engine for the Marvell accelerator to VPP as a separate plugin, and also we are planning

12:03.200 --> 12:11.120
to add the provider support in the TLS plugin and to optimize the performance with multiple

12:11.200 --> 12:24.400
workers and multiple pipes. These are the things that we are planning in the future.

12:24.400 --> 12:33.440
So along with TLS, we have multiple dataplane-based accelerated solutions which can work on

12:33.440 --> 12:39.360
Marvell DPUs. So we have an open-source GitHub repo where people can directly go and look at

12:39.360 --> 12:44.960
what solutions we are providing on our DPU. So we call it the Marvell DAO, the Dataplane

12:44.960 --> 12:53.520
Acceleration Offload, GitHub. So there are multiple solutions based on DPDK, VPP, OVS, and ML-

12:53.520 --> 12:58.480
related solutions. So this is a GitHub repo, so if there is anybody who is interested in data-

12:58.480 --> 13:03.840
plane networking solutions, there are some cool solutions you can try out.

13:03.920 --> 13:12.400
You would not expect that on small hardware like a DPU you can achieve something like 100 Mpps of

13:12.400 --> 13:19.440
L3 forwarding in an application. So you can just go over it and try it out; if you have any issues or any

13:19.440 --> 13:25.520
ideas, we are ready to help and work on it. That's all we had. Thank you.

13:25.600 --> 13:33.360
Thank you. Any questions?

13:45.200 --> 13:50.640
Hi, thanks for this work. Just a quick question: are you relying on the crypto

13:50.640 --> 13:57.760
async framework in VPP, or is it your own kind of framework or engine that's implemented

13:57.760 --> 14:06.160
with Marvell? Yes. So this is a TLS implementation based on OpenSSL. OpenSSL has its own internal

14:06.160 --> 14:10.960
async crypto framework. Okay, so you use the async framework of OpenSSL, not of

14:10.960 --> 14:24.240
VPP. Okay, thanks. Okay, thank you very much. Thank you.

