vaapidecode GPU to CPU download performance

Dan Williams dwilliams at cernium.com
Mon Dec 1 13:45:27 PST 2014


On 11/26/2014 11:20 PM, Arun Raghavan wrote:
> On 25 November 2014 at 18:05, Gwenole Beauchesne <gb.devel at gmail.com> wrote:
>> Hi Dan,
>>
>> 2014-11-24 22:06 GMT+01:00 Dan Williams <dwilliams at cernium.com>:
>>> Hi all,
>>>
>>> I want to use gstreamer and VAAPI to do accelerated H.264 decoding of
>>> 1920x1080 video frames into memory where I can do subsequent analysis
>>> with the CPU. I am able to use the luma plane of the resulting I420
>>> format buffer for what I need to do.
>>>
>>> I am quite happy with the decode performance but getting the video
>>> from the GPU to the CPU is a bottleneck and I'd like to get some
>>> advice on how to improve that. Both latency and CPU usage are issues.
>>> I need better performance because I want to process many streams of
>>> video at the same time.
>>>
>>> My example program is attached. The pipeline is: filesrc ! qtdemux ! vaapidecode ! appsink
>>>
>>> If I run the program to pull each sample from a file with 18000 frames
>>> as quickly as possible (but not actually gst_buffer_map the resulting
>>> buffer) I get:
>>>
>>> $ /usr/bin/time ./test-appsink ../media/hd-30m.mp4 0
>>> 10.21user 27.15system 0:49.60elapsed 75%CPU
>>>
>>> If I then run with the same input but map the buffer from each
>>> sample I get:
>>>
>>> $ /usr/bin/time ./test-appsink ../media/hd-30m.mp4 1
>>> 19.55user 38.73system 2:28.15elapsed 39%CPU
>>>
>>> I get 55% of my CPU in the wait state (according to top) in this case.
>>>
>>> I can subtract the two results and get the performance of the
>>> gst_buffer_map operation itself:
>>>
>>> 2:28.15 - 0:49.60 = 98.55s; 98.55s / 18000 frames = 5.5ms / frame,
>>> or about 546MB/s (since each frame ~= 3MB)
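
For anyone who wants to redo the per-frame arithmetic above with their own timings, here is a minimal sketch in Python. The helper names (elapsed_seconds, map_cost) are made up for illustration and are not part of the attached test program; the only inputs are the two elapsed times and the frame count quoted above, plus the assumption that each frame is a full 1920x1080 I420 buffer.

```python
# Timings copied from the two /usr/bin/time runs quoted above.
FRAMES = 18000
WIDTH, HEIGHT = 1920, 1080


def elapsed_seconds(text):
    """Convert a /usr/bin/time 'M:SS.ss' elapsed string to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


def map_cost(no_map, with_map, frames=FRAMES):
    """Return (extra seconds, ms per frame, MB/s) attributed to mapping."""
    extra = elapsed_seconds(with_map) - elapsed_seconds(no_map)
    per_frame = extra / frames
    # Assumption: I420 layout, a full-size luma plane plus two
    # quarter-size chroma planes, i.e. 1.5 bytes per pixel.
    frame_bytes = WIDTH * HEIGHT * 3 // 2
    mb_per_s = frame_bytes / per_frame / 1e6
    return extra, per_frame * 1000, mb_per_s


extra, ms, rate = map_cost("0:49.60", "2:28.15")
print(f"{extra:.2f} s extra, {ms:.2f} ms/frame, {rate:.0f} MB/s")
```

With the exact I420 frame size (3,110,400 bytes) the rate comes out slightly higher than the 546MB/s quoted above, which rounds the frame to 3MB and the per-frame time to 5.5ms.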
>>
>> When you map the buffer, you get a GstVaapiSurfaceProxy, but what do
>> you do with it next?
>>
>> Are you:
>> 1. Using vaGetImage() + map the resulting pixels + direct read; or
>> 2. Using vaDeriveImage() + map buffer + use Uncacheable Speculative
>> Write Combining (USWC) memory copy?
> 
> I just filed a bug about what looks like the same issue:
> https://bugzilla.gnome.org/show_bug.cgi?id=740774
> 
> The summary is filesrc ! demux ! parse ! vaapidecode ! xvimagesink is
> incredibly slow -- is that expected? I'm on an Ivy Bridge-based desktop.
> 
I think our problems are different. I tried your example pipeline and data
on my setup and it's quite smooth.

Dan

