[Libva] How to detect the type of memory returned...

Jean-Yves Avenard jyavenard at gmail.com
Tue Jun 17 04:03:05 PDT 2014


Hi,

On 17 June 2014 18:04, Peter Frühberger <peter.fruehberger at gmail.com> wrote:

> We don't support broken wrappers that have not been maintained for
> several years. We support VDPAU for AMD OSS and NVIDIA, and use VAAPI
> for Intel. We implemented XVBA for AMD a while back, but that code
> died from a constant lack of support.

We (MythTV) haven't implemented XVBA, only VAAPI and VDPAU.
AMD OSS's VDPAU is actually pretty good now, almost as good as NVIDIA's.

With AMD's closed-source drivers, VAAPI is as good as it gets. On my
AMD 6970, however, all you get is VC-1 and H.264 decoding.


>> We always get back to the problem I mentioned in my first email.
>> Unfortunately, there's no generic solution that can be adapted.
>> If the memory used is USWC, you must use SSE4; if not, you certainly
>> don't want to use SSE4 and a buffer.
>
> Yes, I see that problem, and I find all the methods we currently have
> quite suboptimal. Look, for example, at how NVIDIA does it with their
> glinterop, which even Mesa implements. I think the proposed API changes
> here go in a similar direction. I hope that the many "sync", "locks"
> and so on that I see in the patches won't make things
> too slow or even slow down multithreaded approaches (decoder + vpp +
> output in different threads), but we will see.

Here are my attempts and results so far:
https://github.com/MythTV/mythtv/blob/master/mythtv/libs/libmythtv/mythframe.cpp#L586

There are 4 primary routines implemented.

For planar YV12 frame copy:
SSE_copyplane (this is very similar to the example in Intel's whitepaper,
with various optimisations added; it's a tad faster than their example and,
obviously, than XBMC's, seeing as that is the same code).
It makes use of a 64-byte-aligned, 4kB buffer.

For deinterleaving the U/V channels in an NV12->YV12 conversion:
SSE_splitplanes (with buffer)
As above, it makes use of the buffer.

Those two routines have copy functions that use movntdqa, and they
work extremely well with USWC-based memory.

SSE_splitplanes (without buffer)
This one is an SSE3-optimised routine that deinterleaves the UV
channels and works directly between the source and destination
frames, regardless of their memory alignment (16-byte aligned or not).

copyplane: a plain C implementation, using memcpy.
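
As a rough illustration of the "with buffer" idea, here is a plain C sketch that splits an interleaved NV12 chroma plane into U and V planes through a 64-byte-aligned 4kB staging buffer. It is not the MythTV code: memcpy stands in for the movntdqa load loop, the deinterleave is scalar rather than SSE, and the names split_uv_buffered and STAGE_SIZE are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STAGE_SIZE 4096  /* 4kB staging buffer (hypothetical name) */

/*
 * Split an interleaved UV plane (NV12 chroma) into separate U and V planes.
 * width is the number of U (or V) samples per row, so each source row
 * holds 2 * width bytes. All pitches are in bytes.
 */
static void split_uv_buffered(uint8_t *dst_u, size_t dst_u_pitch,
                              uint8_t *dst_v, size_t dst_v_pitch,
                              const uint8_t *src, size_t src_pitch,
                              size_t width, size_t height)
{
    /* 64-byte alignment keeps the buffer on cache-line boundaries. */
    _Alignas(64) uint8_t stage[STAGE_SIZE];

    assert(src_pitch >= 2 * width);
    assert(dst_u_pitch >= width && dst_v_pitch >= width);

    for (size_t y = 0; y < height; y++) {
        const uint8_t *s = src + y * src_pitch;
        uint8_t *u = dst_u + y * dst_u_pitch;
        uint8_t *v = dst_v + y * dst_v_pitch;
        size_t remaining = 2 * width;  /* bytes left in this source row */

        while (remaining > 0) {
            /* chunk is always even: 2 * width and STAGE_SIZE are both even */
            size_t chunk = remaining < STAGE_SIZE ? remaining : STAGE_SIZE;

            /* Step 1: pull a chunk out of the (possibly USWC) source.
               An SSE4 version would use movntdqa loads here; memcpy is
               only a stand-in. */
            memcpy(stage, s, chunk);

            /* Step 2: deinterleave from the cached staging buffer. */
            for (size_t i = 0; i < chunk / 2; i++) {
                u[i] = stage[2 * i];
                v[i] = stage[2 * i + 1];
            }

            s += chunk;
            u += chunk / 2;
            v += chunk / 2;
            remaining -= chunk;
        }
    }
}
```

The point of the two-step structure is that the slow reads from uncached memory happen once, in large sequential chunks, and all the byte shuffling then runs on ordinary cached memory.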

My findings are as follows (i7-4650U with HD5000). The test converts 2000
H.264 frames, extracts each image with either vaDeriveImage or vaGetImage,
and measures the conversion for either NV12->YV12 or plain YV12->YV12
(within VLC playback).

If memory is USWC:
NV12->YV12:
1-One call to SSE_copyplane + one to SSE_splitplanes (with buffer):
2.07ms per 1080 frame

2-One call to C copyplane + SSE_splitplanes (without buffer):
10.96ms per 1080 frame

If memory isn't USWC:
1-One call to SSE_copyplane + one to SSE_splitplanes (with buffer):
1.05ms per 1080 frame

2-One call to SSE_copyplane + one to SSE_splitplanes (without buffer):
0.97ms per 1080 frame

3-One call to C copyplane + one to SSE_splitplanes (without buffer):
0.96ms per 1080 frame

I can't give a USWC figure for a simple YV12->YV12 frame copy, as I
can't get USWC-mapped memory in that case.

YV12->YV12
If memory isn't USWC:
1-three calls to SSE_copyplane:
0.94ms per 1080 frame

2-three calls to C copyplane:
0.94ms per 1080 frame


Running those tests made me realise I could gain some speed with an
SSE_copyplane variant that doesn't use any buffer but still uses SSE4. I
had written the routine before, but discarded it after comparing the
original SSE_copyplane with the C version; I didn't think of comparing
the C version with that routine...

With the new SSE copy routine, I get:
Non-USWC memory:
NV12->YV12
4-One call to SSE_copyplane (without buffer) + one to SSE_splitplanes
(without buffer):
0.80ms

YV12->YV12
3-Three calls to SSE_copyplane (without buffer)
0.68ms

Conclusion, if speed is the main concern:
Use YV12 whenever possible with vaGetImage.
If memory is USWC, use SSE4 code, via a 4kB buffer.
If memory isn't USWC, use SSE4/movntdqa (if the line is aligned) or
SSE2/movdqu (if it isn't), and don't bother with a buffer.
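
The selection rules above can be written down as a small decision function. This is only a sketch: the enum, the function names, and the choice to test just the source pointer and pitch for 16-byte alignment are assumptions for illustration, and how src_is_uswc gets determined is exactly the open question of this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three copy paths in the conclusion. */
typedef enum {
    COPY_SSE4_BUFFERED,   /* movntdqa loads through a 4kB staging buffer */
    COPY_SSE4_DIRECT,     /* movntdqa, no buffer; needs 16-byte alignment */
    COPY_SSE2_UNALIGNED   /* movdqu, no buffer */
} copy_strategy;

/* Every row starts 16-byte aligned only if both the base pointer and
   the pitch are multiples of 16. */
static bool rows_aligned16(const void *p, size_t pitch)
{
    return ((uintptr_t)p % 16 == 0) && (pitch % 16 == 0);
}

static copy_strategy choose_copy_strategy(bool src_is_uswc,
                                          const void *src, size_t src_pitch)
{
    assert(src != NULL);

    if (src_is_uswc)
        return COPY_SSE4_BUFFERED;

    return rows_aligned16(src, src_pitch) ? COPY_SSE4_DIRECT
                                          : COPY_SSE2_UNALIGNED;
}
```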

So I'm still keen on getting a reliable way of knowing which type of
memory we're using... though my method of simply checking the running
speed first may well be the easiest approach.
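
A minimal sketch of the "check the running speed first" idea could look like the following: time two candidate copy routines on the actual frame memory and keep the faster one. Everything here is illustrative. The copy_fn type, pick_faster, and the two candidates are hypothetical names; copy_memcpy and copy_bytewise are plain C stand-ins for the SSE routines, and clock() is used only because it is standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t size);

/* Stand-in candidate 1: whatever memcpy does on this platform. */
static void copy_memcpy(uint8_t *dst, const uint8_t *src, size_t size)
{
    memcpy(dst, src, size);
}

/* Stand-in candidate 2: a deliberately naive byte-by-byte copy. */
static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t size)
{
    volatile uint8_t *d = dst;  /* volatile keeps the loop from being optimised away */
    for (size_t i = 0; i < size; i++)
        d[i] = src[i];
}

/* CPU time, in seconds, taken by `runs` calls of fn on the same buffers. */
static double time_copy(copy_fn fn, uint8_t *dst, const uint8_t *src,
                        size_t size, int runs)
{
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn(dst, src, size);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Return 0 if candidate a is at least as fast as b on this memory, else 1. */
static int pick_faster(copy_fn a, copy_fn b,
                       uint8_t *dst, const uint8_t *src,
                       size_t size, int runs)
{
    assert(runs > 0);

    /* One untimed call each, so first-touch effects don't skew the result. */
    a(dst, src, size);
    b(dst, src, size);

    double ta = time_copy(a, dst, src, size, runs);
    double tb = time_copy(b, dst, src, size, runs);
    return ta <= tb ? 0 : 1;
}
```

With a gap as large as the 2.07ms versus 10.96ms measured above on USWC memory, a probe over the first few frames should give a clear answer; on non-USWC memory the candidates are close enough that either choice costs little.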

