vaapidecode GPU to CPU download performance

Wed Nov 26 14:03:35 PST 2014

On 25 Nov 2014 13:35:24 +0100, Gwenole Beauchesne <gb.devel at gmail.com> wrote:
> Hi Dan,
>
> 2014-11-24 22:06 GMT+01:00 Dan Williams <dwilliams at cernium.com>:
>> > Hi all,
>> >
>> > I want to use gstreamer and VAAPI to do accelerated H.264 decoding of
>> > 1920x1080 video frames into memory where I can do subsequent analysis
>> > with the CPU. I am able to use the luma plane of the resulting I420
>> > format buffer for what I need to do.
>> >
>> > I am quite happy with the decode performance but getting the video
>> > from the GPU to the CPU is a bottleneck and I'd like to get some
>> > advice on how to improve that. Both latency and CPU usage are issues.
>> > I need better performance because I want to process many streams of
>> > video at the same time.
>> >
>> > My example program is attached. The pipeline is: filesrc ! qtdemux ! vaapidecode ! appsink
>> >
>> > If I run the program to pull each sample from a file with 18000 frames
>> > as quickly as possible (but not actually gst_buffer_map the resulting
>> > buffer) I get:
>> >
>> > $ /usr/bin/time ./test-appsink ../media/hd-30m.mp4 0
>> > 10.21user 27.15system 0:49.60elapsed 75%CPU
>> >
>> > If I then run with the same input but map the buffer from each
>> > sample I get:
>> >
>> > $ /usr/bin/time ./test-appsink ../media/hd-30m.mp4 1
>> > 19.55user 38.73system 2:28.15elapsed 39%CPU
>> >
>> > I get 55% of my CPU in the wait state (according to top) in this case.
>> >
>> > I can subtract the two results and get the performance of the
>> > gst_buffer_map operation itself:
>> >
>> > 2:28.15 - 0:49.60 = 98.55s / 18000 frames = 5.5ms / frame or 546MB/s
>> > (since each frame ~= 3MB)
> When you map the buffer, you get a GstVaapiSurfaceProxy, but what do
> you do with it next?
> 
> Are you:
> 1. Using vaGetImage() + map the resulting pixels + direct read ; or
> 2. Using vaDeriveImage() + map buffer + use Uncacheable Speculative
> Write Combining (USWC) memory copy?
> 
Gwenole,

Thanks for your reply. The pointer to your repo of testing tools is
very useful.

When I map the buffer with 'gst_buffer_map' it populates a
'GstMapInfo' structure. I am able to just memcpy from the 'data' field
of that structure and get the correct bytes for the video frame. I
have verified this by writing them out and then displaying them like
this:

  $ gst-launch-1.0 filesrc location="frame00010.raw" ! \
     videoparse format="i420" width=1920 height=1088 ! \
     imagefreeze ! autovideosink

So I'm not sure what is going on under the hood, but I'm doing nothing
explicit like you are in your
dec_gstreamer.c::app_handle_hw_surface_vaapi function. Is it because
you are using fakesink and I am using appsink?
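
To make the copy step concrete, here is a minimal sketch of pulling only
the luma plane out of a mapped I420 frame. It is not the attached
program: the function names are hypothetical, 'mapped' and 'mapped_size'
stand in for the 'data' and 'size' fields of 'GstMapInfo', and it assumes
a tightly packed frame with no row padding (stride equal to width), which
may not hold for every buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Size in bytes of one tightly packed I420 frame: a full-resolution Y
 * plane followed by U and V planes subsampled by 2 in each direction.
 * Assumption: even width and height, no row padding. */
size_t i420_frame_size(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + 2 * ((width / 2) * (height / 2));
}

/* Copy only the luma (Y) plane, which comes first in I420.
 * Returns the number of bytes copied, or 0 if the mapped region is too
 * small to hold a whole frame of the given dimensions. */
size_t copy_luma_plane(const uint8_t *mapped, size_t mapped_size,
                       size_t width, size_t height, uint8_t *dst)
{
    size_t luma_size = width * height;

    if (mapped_size < i420_frame_size(width, height))
        return 0;
    memcpy(dst, mapped, luma_size);
    return luma_size;
}
```

With width 1920 and height 1088, as in the videoparse command above,
i420_frame_size() gives 3133440 bytes, which is the "~3MB" per frame I
quoted, and the luma plane is the first 2088960 bytes of that.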

>> > When I use oprofile I see that 44% of the time spent is in the routine
>> > drm_clflush_page:
>> >
>> > samples  %        image name               symbol name
>> > -------------------------------------------------------------------------------
>> > 363013   44.1111  /lib/modules/3.13.0-24-generic/updates/dkms/drm.ko drm_clflush_page
>> >
>> > See http://lxr.free-electrons.com/source/drivers/gpu/drm/drm_cache.c?v=3.13
>> >
>> > I am interested in knowing:
>> >
>> >   1) can I make this run faster and use less CPU? how?
>> >   2) ultimately, how much faster can I make it run?
>> >   3) how much faster would it be with a faster CPU or GPU?
> Using approach (2) above, I can decode + copy + hash (adler32) each
> frame of the 1080p BBB in 0:29.372. That's around 1.4 GB/sec on a Core
> i7-3770 (HD 4000). Without hashing, this task completes in 0:25.101,
> that's around 1.65 GB/sec.
> 
> You probably could use dec_gstreamer from:
> <https://github.com/gbeauchesne/mvt_tools>
> 
> e.g. dec_gstreamer --vaapi /path/to/some/video -r /dev/null
> 
I modified your dec_gstreamer program so that the md5 hash was instead
just a no-op, then ran it on my input:

  /usr/local/time ./dec_gstreamer --hwaccel=vaapi --checksum=md5 ~/media/hd-30m.mp4 -r /dev/null
  390.85user 8.54system 6:43.73elapsed 98%CPU

That is just 134 MB/s.
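
That rate comes from the frame count, the approximate 3 MB frame size,
and the elapsed time. A small sketch of the arithmetic, with hypothetical
helper names and 1 MB taken as 10^6 bytes:

```c
#include <assert.h>

/* Convert the "minutes:seconds" elapsed value printed by time(1)
 * into seconds. */
double elapsed_seconds(unsigned minutes, double seconds)
{
    return 60.0 * (double)minutes + seconds;
}

/* Average download rate in MB/s (1 MB = 1e6 bytes) for `frames` frames
 * of `bytes_per_frame` bytes each, moved in `elapsed_s` seconds. */
double download_rate_mb_per_s(unsigned long frames, double bytes_per_frame,
                              double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return (double)frames * bytes_per_frame / elapsed_s / 1e6;
}
```

Calling download_rate_mb_per_s(18000, 3e6, elapsed_seconds(6, 43.73))
gives roughly 134 MB/s. Using the 98.55 s difference between my two
test-appsink runs instead gives roughly 548 MB/s, in line with the
546 MB/s I estimated from the rounded 5.5 ms per frame.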

What is confusing is that this is more than twice as slow as my
example code (I added a memcpy to what I attached previously) that
supposedly does about the same thing. I tried both DRM and X11
renderers, with no difference.

Here is what a profile of your program shows:

samples  %        image name                                            symbol name
5002423  91.4669  /home/user/mvt_tools/src/.libs/libmvt_utils.so.0.0.0  CopyFromUswc
119736    2.1893  /home/user/mvt_tools/src/.libs/libmvt_utils.so.0.0.0  Copy2d
103810    1.8981  /home/user/mvt_tools/src/.libs/libmvt_utils.so.0.0.0  SSE_SplitUV
21453     0.3923  /lib/modules/3.13.0-24-generic/updates/dkms/drm.ko    drm_clflush_page
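
For context, the general idea behind a USWC copy is that ordinary loads
from write-combining memory are uncached and slow, so the copy uses SSE4.1
streaming loads (MOVNTDQA) to pull whole cache lines at a time. The
sketch below illustrates that general technique only. It is not the
mvt_tools CopyFromUswc code, the function name is hypothetical, and it
assumes 16-byte aligned pointers and a size that is a multiple of 64.
When the compiler is not targeting SSE4.1 it simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif

/* Copy `size` bytes from `src` (ideally a write-combining mapping) to
 * ordinary cached memory at `dst`.
 * Assumptions: both pointers are 16-byte aligned and `size` is a
 * multiple of 64 (one cache line per loop iteration). */
void copy_from_wc_sketch(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert((uintptr_t)src % 16 == 0 && (uintptr_t)dst % 16 == 0);
    assert(size % 64 == 0);
#ifdef __SSE4_1__
    for (size_t i = 0; i < size; i += 64) {
        /* Four streaming loads cover one 64-byte cache line. */
        __m128i a = _mm_stream_load_si128((__m128i *)(src + i));
        __m128i b = _mm_stream_load_si128((__m128i *)(src + i + 16));
        __m128i c = _mm_stream_load_si128((__m128i *)(src + i + 32));
        __m128i d = _mm_stream_load_si128((__m128i *)(src + i + 48));
        /* Regular aligned stores into cached destination memory. */
        _mm_store_si128((__m128i *)(dst + i), a);
        _mm_store_si128((__m128i *)(dst + i + 16), b);
        _mm_store_si128((__m128i *)(dst + i + 32), c);
        _mm_store_si128((__m128i *)(dst + i + 48), d);
    }
#else
    /* No SSE4.1 at compile time: plain copy, no streaming loads. */
    memcpy(dst, src, size);
#endif
}
```

On ordinary cached memory the streaming loads behave like normal loads,
so a routine like this only pays off when the source really is mapped
write-combining.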

Your download rates on faster hardware are enviable, but there is
something else going on here that I don't understand if my uninformed
code is twice as fast as yours.

>> > My hardware is:
>> >
>> > CPU: Intel Atom E3815
>> > GPU: HD 2500 (Ivy Bridge)
> Ah, I have not tried on Baytrail yet.
> 
>> > I am using gstreamer and gstreamer-vaapi built from git master branch
>> > as of today, so 1.5.X I guess.
>> >
>> > The rest of the software stack is:
>> >
>> > Ubuntu 14.04.1 LTS
>> > $ uname -a
>> > Linux nuc-atom-testsys 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>> > $ vainfo
>> > libva info: VA-API version 0.35.1
>> > vainfo: VA-API version: 0.35 (libva 1.3.1)
>> > vainfo: Driver version: Intel i965 driver for Intel(R) Bay Trail - 1.3.2
>> >
>> > ii  libva-dev:amd64                                       1.3.1-3
>> > ii  libdrm2:amd64                                         2.4.54-1
>> > ii  i965-va-driver:amd64                                  1.3.2-1
>> > ii  xserver-common                                        2:1.15.1-0ubuntu2.1
>> > ii  xserver-xorg-video-intel                              2:2.99.911-0intel1
>> >
>> > The input file is 30 minutes of 10fps 1920x1080 H.264 video which I
>> > can make available if that helps.
>> >
>> > Thanks in advance for any help (or even just for reading to the end of
>> > this information-dense post.)
> Regards,
> --
> Gwenole Beauchesne
> Intel Corporation SAS / 2 rue de Paris, 92196 Meudon Cedex, France
> Registration Number (RCS): Nanterre B 302 456 199