[Libva] How to detect the type of memory returned...

Tue Jun 17 04:03:05 PDT 2014

hi

On 17 June 2014 18:04, Peter Frühberger <peter.fruehberger at gmail.com> wrote:

> We don't support broken wrappers, that are not maintained since
> several years. We support vdpau for amd oss and nvidia and use vaapi
> for intel. We had implemented XVBA for AMD a while back, but that code
> died of constant no support.

We (mythtv) haven't implemented XVBA, only VAAPI and VDPAU.
AMD OSS' vdpau is actually pretty good now, almost as good as nvidia's.

With AMD's closed-source drivers, VAAPI is as good as it gets. On my
AMD 6970 however, all you get is VC1 and H264 decoding.

>> We always get back to the problem I mentioned in my first email.
>> Unfortunately, there's not a generic solution that can be adapted.
>> If memory used is USWC, you must use SSE4, if not, you certainly don't
>> want to use SSE4 and a buffer
>
> Yes, I see that problem and I find all methods that we currently have
> quite suboptimal. If you see how for example nvidia does it with their
> glinterop, that even mesa implements. I think the proposed API changes
> here go into a similar direction. I hope that the lot of "sync",
> "locks" and so in there that I see in the patches won't make things
> too slow or even slow down multithreaded approaches (decoder + vpp +
> output in different threads), but we will see.

Here are my attempts and results so far:
https://github.com/MythTV/mythtv/blob/master/mythtv/libs/libmythtv/mythframe.cpp#L586

There are 4 primary routines implemented:
For planar YV12 frame copy:
SSE_copyplane (this is very similar to the one in Intel's whitepaper,
but with various optimisations added; it's a tad faster than their
example, and obviously faster than XBMC's, seeing as theirs is the same)
Makes use of a 64-byte aligned, 4kB buffer.

For deinterleaving the U/V channels in an NV12->YV12 conversion:
SSE_splitplanes (with buffer)
As above, makes use of the buffer.

Those two routines have copy functions making use of movntdqa, and
work extremely well with USWC-based memory.
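
For readers who have not seen the technique, the buffered copy can be
sketched roughly as follows. This is a minimal illustration, not the
MythTV routine: the name copy_plane_buffered and its parameters are made
up for the example. The movntdqa path is only compiled when the compiler
defines __SSE4_1__ (e.g. with -msse4.1); otherwise, or when a source row
isn't 16-byte aligned, it falls back to plain memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* _mm_stream_load_si128 (movntdqa) */
#endif

#define CACHE_BUF_SIZE 4096   /* 4kB bounce buffer, as in the post */

/* Hypothetical helper: copies `height` rows of `width` bytes from src
 * to dst, each with its own pitch (bytes per row). */
static void copy_plane_buffered(uint8_t *dst, size_t dst_pitch,
                                const uint8_t *src, size_t src_pitch,
                                size_t width, size_t height)
{
    assert(width <= src_pitch && width <= dst_pitch);

    for (size_t y = 0; y < height; y++) {
        const uint8_t *s = src + y * src_pitch;
        uint8_t *d = dst + y * dst_pitch;
        size_t x = 0;
#if defined(__SSE4_1__)
        _Alignas(64) uint8_t cache[CACHE_BUF_SIZE];
        /* movntdqa needs a 16-byte aligned source address. */
        if (((uintptr_t)s & 15) == 0) {
            while (width - x >= 16) {
                size_t chunk = width - x;
                if (chunk > CACHE_BUF_SIZE)
                    chunk = CACHE_BUF_SIZE;
                chunk &= ~(size_t)15;          /* whole 16-byte blocks */
                /* Step 1: streaming loads into the cached buffer. */
                for (size_t i = 0; i < chunk; i += 16) {
                    __m128i v = _mm_stream_load_si128((__m128i *)(s + x + i));
                    _mm_store_si128((__m128i *)(cache + i), v);
                }
                /* Step 2: ordinary copy from the buffer to the destination. */
                memcpy(d + x, cache, chunk);
                x += chunk;
            }
        }
#endif
        /* Tail bytes, or the whole row when the fast path is unavailable. */
        memcpy(d + x, s + x, width - x);
    }
}
```

The point of the bounce buffer is that the reads from the surface and
the writes to the destination happen in two separate passes over a block
small enough to stay in cache.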

SSE_splitplanes (without buffer)
This one is an SSE3-optimised routine that deinterleaves the UV
channels and works directly between source and destination frames,
regardless of their memory alignment (16-byte aligned or not).
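
As an illustration of the unbuffered deinterleave, here is a minimal
sketch of one row. It is not the MythTV code: split_uv_row is a
hypothetical name, and this version restricts itself to SSE2 intrinsics
(mask, shift and pack) with unaligned loads and stores, plus a scalar
loop for the tail and for builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: splits `pairs` interleaved U/V byte pairs
 * (NV12 chroma row: U0 V0 U1 V1 ...) into separate U and V rows. */
static void split_uv_row(uint8_t *u, uint8_t *v,
                         const uint8_t *uv, size_t pairs)
{
    size_t i = 0;
    assert(u != NULL && v != NULL && uv != NULL);
#if defined(__SSE2__)
    const __m128i low_mask = _mm_set1_epi16(0x00FF);
    /* 32 input bytes (16 pairs) per iteration; no alignment required. */
    for (; i + 16 <= pairs; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(uv + 2 * i));
        __m128i b = _mm_loadu_si128((const __m128i *)(uv + 2 * i + 16));
        /* Low byte of each 16-bit lane is U, high byte is V. */
        __m128i u8 = _mm_packus_epi16(_mm_and_si128(a, low_mask),
                                      _mm_and_si128(b, low_mask));
        __m128i v8 = _mm_packus_epi16(_mm_srli_epi16(a, 8),
                                      _mm_srli_epi16(b, 8));
        _mm_storeu_si128((__m128i *)(u + i), u8);
        _mm_storeu_si128((__m128i *)(v + i), v8);
    }
#endif
    /* Scalar tail (and full fallback when SSE2 is not available). */
    for (; i < pairs; i++) {
        u[i] = uv[2 * i];
        v[i] = uv[2 * i + 1];
    }
}
```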

copyplane: a plain C implementation, using memcpy.

My findings are as follows (i7-4650U with HD5000): convert 2000 h264
frames, extract the image with either vaDeriveImage or vaGetImage, and
measure the conversion from either NV12->YV12 or plain YV12->YV12
(within VLC playback).

if memory is USWC:
NV12->YV12:
1-One call to SSE_copyplane + one to SSE_splitplanes (with buffer):
2.07ms per 1080 frame

2-One call to C copyplane + SSE_splitplanes (without buffer):
10.96ms per 1080 frame

If memory isn't USWC:
1-One call to SSE_copyplane + one to SSE_splitplanes (with buffer):
1.05ms per 1080 frame

2-One call to SSE_copyplane + one to SSE_splitplanes (without buffer):
0.97ms per 1080 frame

3-One call to C copyplane + one to SSE_splitplanes (without buffer):
0.96ms per 1080 frame

I can't give a USWC comparison for a simple YV12->YV12 frame copy,
seeing as I can't get USWC-mapped memory for it.

YV12->YV12
If memory isn't USWC:
1-three calls to SSE_copyplane:
0.94ms per 1080 frame

2-three calls to C copyplane:
0.94ms per 1080 frame

Running those tests made me realise I could gain some speed with an
SSE_copyplane that doesn't use any buffer but still uses SSE4. I had
written that routine before, but discarded it after comparing the
original SSE_copyplane with the C version; I didn't think of comparing
the C version with that routine...

With the new SSE copy routine, I get:
Non-USWC memory:
NV12->YV12
4-One call to SSE_copyplane (without buffer) + one to SSE_splitplanes
(without buffer):
0.80ms

YV12->YV12
3-Three calls to SSE_copyplane (without buffer)
0.68ms

Conclusion, if speed is the main concern:
Use YV12 whenever possible with vaGetImage.
If memory is USWC, use SSE4 code, via a 4kB buffer.
If memory isn't USWC, use SSE4/movntdqa (if the line is aligned) or
SSE2/movdqu if it isn't; don't bother with a buffer.
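
The decision rules above could be expressed as a small selector. This is
only a sketch: the enum, the function name pick_strategy and the is_uswc
flag are all hypothetical, and how is_uswc gets set is exactly the open
question of this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three copy paths in the conclusion. */
typedef enum {
    COPY_BUFFERED_MOVNTDQA,  /* USWC: movntdqa through a 4kB buffer  */
    COPY_DIRECT_MOVNTDQA,    /* non-USWC, 16-byte aligned lines      */
    COPY_DIRECT_MOVDQU       /* non-USWC, unaligned lines            */
} copy_strategy;

/* is_uswc is an assumption: the caller must already know (or guess)
 * the memory type. src is the first line, pitch the bytes per line. */
static copy_strategy pick_strategy(int is_uswc, const void *src, size_t pitch)
{
    assert(src != NULL);
    if (is_uswc)
        return COPY_BUFFERED_MOVNTDQA;
    /* Every line is 16-byte aligned only if both the base address
     * and the pitch are multiples of 16. */
    if ((((uintptr_t)src | (uintptr_t)pitch) & 15) == 0)
        return COPY_DIRECT_MOVNTDQA;
    return COPY_DIRECT_MOVDQU;
}
```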

So I'm still keen on getting a reliable way of knowing which type of
memory we're using... though my method of simply checking the running
speed first may well be the easiest approach.
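
A speed check of that kind might look like the sketch below. It is a
heuristic under stated assumptions, not a tested detector: the names
time_copy_ms and looks_like_uswc, the repeat count and the 4x threshold
are all invented for illustration, clock_gettime assumes a POSIX system,
and the empty asm barrier assumes GCC or Clang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumed threshold: treat memory as USWC if a plain copy out of it is
 * at least this many times slower than a copy from ordinary memory. */
#define USWC_SLOWDOWN_THRESHOLD 4.0

/* Times `repeats` plain memcpy calls from src to dst, in milliseconds. */
static double time_copy_ms(uint8_t *dst, const uint8_t *src,
                           size_t bytes, int repeats)
{
    struct timespec t0, t1;
    assert(dst != NULL && src != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < repeats; r++) {
        memcpy(dst, src, bytes);
        /* Compiler barrier so the copies are not optimised away. */
        __asm__ volatile("" ::: "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Returns 1 if reading `surface` is much slower than reading a freshly
 * allocated reference buffer of the same size, 0 otherwise. */
static int looks_like_uswc(const uint8_t *surface, size_t bytes)
{
    uint8_t *ref = malloc(bytes);
    uint8_t *dst = malloc(bytes);
    int result = 0;

    if (ref != NULL && dst != NULL) {
        memset(ref, 0, bytes);
        memset(dst, 0, bytes);
        double t_ref  = time_copy_ms(dst, ref, bytes, 16);
        double t_surf = time_copy_ms(dst, surface, bytes, 16);
        result = (t_ref > 0.0 &&
                  t_surf > USWC_SLOWDOWN_THRESHOLD * t_ref);
    }
    free(ref);
    free(dst);
    return result;
}
```

Such a probe would be run once on the first mapped image, on a region of
a few lines, and the result cached; timing noise means the threshold
would need tuning on real hardware.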