vaapidecode GPU to CPU download performance

Gwenole Beauchesne gb.devel at gmail.com
Tue Nov 25 04:35:24 PST 2014


Hi Dan,

2014-11-24 22:06 GMT+01:00 Dan Williams <dwilliams at cernium.com>:
> Hi all,
>
> I want to use gstreamer and VAAPI to do accelerated H.264 decoding of
> 1920x1080 video frames into memory where I can do subsequent analysis
> with the CPU. I am able to use the luma plane of the resulting I420
> format buffer for what I need to do.
>
> I am quite happy with the decode performance but getting the video
> from the GPU to the CPU is a bottleneck and I'd like to get some
> advice on how to improve that. Both latency and CPU usage are issues.
> I need better performance because I want to process many streams of
> video at the same time.
>
> My example program is attached. The pipeline is: filesrc ! qtdemux ! vaapidecode ! appsink
>
> If I run the program to pull each sample from a file with 18000 frames
> as quickly as possible (but not actually gst_buffer_map the resulting
> buffer) I get:
>
> $ /usr/bin/time ./test-appsink ../media/hd-30m.mp4 0
> 10.21user 27.15system 0:49.60elapsed 75%CPU
>
> If I then run with the same input but map the buffer from each
> sample I get:
>
> $ /usr/bin/time ./test-appsink ../media/hd-30m.mp4 1
> 19.55user 38.73system 2:28.15elapsed 39%CPU
>
> I get 55% of my CPU in the wait state (according to top) in this case.
>
> I can subtract the two results and get the performance of the
> gst_buffer_map operation itself:
>
> 2:28.15 - 0:49.60 = 98.55s
> 98.55s / 18000 frames = 5.5ms / frame, or about 546MB/s
> (since each frame ~= 3MB)

When you map the buffer, you get a GstVaapiSurfaceProxy, but what do
you do with it next?

Are you:
1. Using vaGetImage(), mapping the resulting image buffer and reading
the pixels directly; or
2. Using vaDeriveImage(), mapping the buffer and copying the pixels
out with an Uncacheable Speculative Write Combining (USWC) memory copy?

> When I use oprofile I see that 44% of the time spent is in the routine
> drm_clflush_page:
>
> samples  %        image name               symbol name
> -------------------------------------------------------------------------------
> 363013   44.1111  /lib/modules/3.13.0-24-generic/updates/dkms/drm.ko drm_clflush_page
>
> See http://lxr.free-electrons.com/source/drivers/gpu/drm/drm_cache.c?v=3.13
>
> I am interested in knowing:
>
>   1) can I make this run faster and use less CPU? how?
>   2) ultimately, how much faster can I make it run?
>   3) how much faster would it be with a faster CPU or GPU?

Using approach (2) above, I can decode + copy + hash (adler32) each
frame of the 1080p BBB (Big Buck Bunny) clip in 0:29.372. That's around
1.4 GB/sec on a Core i7-3770 (HD 4000). Without hashing, the same task
completes in 0:25.101, or around 1.65 GB/sec.

You could probably use dec_gstreamer from:
<https://github.com/gbeauchesne/mvt_tools>

e.g. dec_gstreamer --vaapi /path/to/some/video -r /dev/null

> My hardware is:
>
> CPU: Intel Atom E3815
> GPU: HD 2500 (Ivy Bridge)

Ah, I have not tried this on Bay Trail yet.

> I am using gstreamer and gstreamer-vaapi built from git master branch
> as of today, so 1.5.X I guess.
>
> The rest of the software stack is:
>
> Ubuntu 14.04.1 LTS
> $ uname -a
> Linux nuc-atom-testsys 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> $ vainfo
> libva info: VA-API version 0.35.1
> vainfo: VA-API version: 0.35 (libva 1.3.1)
> vainfo: Driver version: Intel i965 driver for Intel(R) Bay Trail - 1.3.2
>
> ii  libva-dev:amd64                                       1.3.1-3
> ii  libdrm2:amd64                                         2.4.54-1
> ii  i965-va-driver:amd64                                  1.3.2-1
> ii  xserver-common                                        2:1.15.1-0ubuntu2.1
> ii  xserver-xorg-video-intel                              2:2.99.911-0intel1
>
> The input file is 30 minutes of 10fps 1920x1080 H.264 video which I
> can make available if that helps.
>
> Thanks in advance for any help (or even just for reading to the end of
> this information-dense post.)

Regards,
-- 
Gwenole Beauchesne
Intel Corporation SAS / 2 rue de Paris, 92196 Meudon Cedex, France
Registration Number (RCS): Nanterre B 302 456 199


More information about the gstreamer-devel mailing list