[Libva] Parse and render in different threads? (+ potential bug in the mpeg2 decoder for US15W chipsets)

Andreas Larsson andreas.larsson at multiq.se
Thu Mar 21 02:32:52 PDT 2013


Very good information! Thanks for the insight!

Another problem I've stumbled upon is that when using intra_dc_precision = 3 (i.e. 11-bit DC predictors), the chip doesn't decode the stream properly. The DC levels are completely off in both luma and chroma. Decoding the same stream with VLC or software mplayer etc. displays it correctly.
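
For reference, the setting in question ends up in the VA picture parameters roughly like this (a trimmed sketch rather than my actual code; dpy/context and all other fields are placeholders, error handling omitted). Per the MPEG-2 spec the coded value 0..3 corresponds to 8..11 bits of DC precision:

    #include <va/va.h>

    /* Sketch: forwarding the coded intra_dc_precision (0..3 = 8..11 bits)
     * from the picture coding extension into the VA picture parameters.
     * Surfaces and the remaining fields are omitted. */
    static VABufferID make_pic_param(VADisplay dpy, VAContextID context,
                                     unsigned coded_intra_dc_precision)
    {
        VAPictureParameterBufferMPEG2 pic = {0};
        pic.picture_coding_extension.bits.intra_dc_precision =
            coded_intra_dc_precision;   /* 3 decodes wrong here, 2 is fine */

        VABufferID buf;
        vaCreateBuffer(dpy, context, VAPictureParameterBufferType,
                       sizeof(pic), 1, &pic, &buf);
        return buf;
    }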

VA: v0.31.1 VAProfileMPEG2Main + VAEntrypointVLD
CPU: Intel Atom Z510
GFX: Intel US15W
OS: Linux Fedora 14
X.Org: X.Org X Server 1.9.4
GFX-driver: Intel(R) Embedded Media and Graphics Driver version 1.14.2443

Switching to intra_dc_precision = 2 (i.e. 10-bit DC predictors) decodes correctly, though. Do you have any errata on this, or any helpful insights?

Kind regards, Andreas Larsson


On 20 Mar 2013, at 14:14, Daniel Vetter <daniel at ffwll.ch> wrote:

> On Wed, Mar 20, 2013 at 9:02 AM, Andreas Larsson
> <andreas.larsson at multiq.se> wrote:
>> Yep ok, that's nice!
>> 
>> Followup question(s):
>> 
>> I'm currently filling in my buffers by calling vaCreateBuffer with NULL as the data pointer to let the server allocate the memory for me, then I use vaMapBuffer when I fill them in. This works very nicely and performance is good.
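>> 
>> Roughly what that pattern looks like (a trimmed sketch rather than my exact code; va_dpy and va_context are my handles, error handling omitted):
>> 
>>     #include <string.h>
>>     #include <va/va.h>
>> 
>>     enum { SLICE_BUF_SIZE = 8 * 1024 };   /* my current fixed allocation */
>> 
>>     /* Server-allocated slice data buffer (data == NULL), filled in via
>>      * vaMapBuffer and unmapped again before rendering. */
>>     static VABufferID upload_slice(VADisplay va_dpy, VAContextID va_context,
>>                                    const void *slice_bits, size_t slice_size)
>>     {
>>         VABufferID buf;
>>         void *p = NULL;
>> 
>>         vaCreateBuffer(va_dpy, va_context, VASliceDataBufferType,
>>                        SLICE_BUF_SIZE, 1, NULL, &buf);
>> 
>>         vaMapBuffer(va_dpy, buf, &p);
>>         memcpy(p, slice_bits, slice_size);
>>         vaUnmapBuffer(va_dpy, buf);
>>         return buf;
>>     }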
>> 
>> However, I see that when I over-allocate my slice buffers the performance drops a bit. Currently I allocate 8 kB for each slice, fill it in, and then specify in the slice parameters exactly how much of the buffer I actually use. As I understand it, the over-allocation shouldn't affect performance, but it does. Are the buffers copied even though I allocate them in the server?
> 
> If you are on ironlake or on sandybridge with an old kernel it might
> be that you're hitting the clflush overhead of cpu mmaps, which is
> proportional to the size of the buffer, not the data actually used.
> I'm not sure of the internals, or whether libva has an interface to
> upload with copies. But if that exists we could use the kernel's
> pwrite interface, which generally has higher throughput in this case
> and overhead proportional only to the data actually written.
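> 
> Assuming vaCreateBuffer does copy client data when it's handed a
> non-NULL pointer, something along these lines (buffer sized exactly
> to the slice) would avoid touching an over-sized mapping at all.
> Untested sketch, the names are placeholders:
> 
>     #include <va/va.h>
> 
>     /* Copy-style upload: the buffer is created at exactly slice_size
>      * and vaCreateBuffer copies slice_bits into it, so no vaMapBuffer
>      * of an over-allocated 8k buffer is involved. */
>     static VABufferID upload_slice_copy(VADisplay dpy, VAContextID ctx,
>                                         const void *slice_bits,
>                                         unsigned int slice_size)
>     {
>         VABufferID buf;
>         vaCreateBuffer(dpy, ctx, VASliceDataBufferType,
>                        slice_size, 1, (void *)slice_bits, &buf);
>         return buf;
>     }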
> 
>> Moreover, the documentation says the buffers are automatically de-allocated once they've been sent to vaRenderPicture. Is there any way to re-use the buffers and avoid the automatic de-allocation? I really don't see why I should create and destroy all buffers each frame; it seems like a complete waste of resources.
> 
> Buffers, including any cpu mappings already set up, are _very_
> aggressively cached internally in libva (actually libdrm) already. So
> there should be no need for you to do any caching on top.
> -Daniel
> 
>> 
>> Kind regards, Andreas Larsson
>> 
>> 
>> On 20 Mar 2013, at 01:50, "Xiang, Haihao" <haihao.xiang at intel.com> wrote:
>> 
>>> On Tue, 2013-03-19 at 08:02 +0000, Andreas Larsson wrote:
>>>> Hi!
>>>> 
>>>> Do I have to perform bitstream parsing and vaRenderPicture in separate threads to maintain best performance? I.e. does vaRenderPicture block, or are those calls buffered and handled asynchronously by the driver/chip, like OpenGL?
>>> 
>>> VA runs in asynchronous mode.
>>> 
>>>> 
>>>> As it is, I generate MPEG-2 data and call vaRenderPicture for each slice before I generate the next one, so if vaRenderPicture blocks, this would drain my performance completely.
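>>>> 
>>>> Roughly the per-frame pattern I mean (a trimmed sketch, not my exact code; generate_slice() and the va_* handles are placeholders, error handling omitted):
>>>> 
>>>>     #include <va/va.h>
>>>> 
>>>>     /* Placeholder for my slice generator: produces the slice
>>>>      * parameter buffer and the slice data buffer for slice i. */
>>>>     extern void generate_slice(int i, VABufferID *param, VABufferID *data);
>>>> 
>>>>     static void submit_frame(VADisplay va_dpy, VAContextID va_context,
>>>>                              VASurfaceID target, VABufferID pic_param_buf,
>>>>                              int num_slices)
>>>>     {
>>>>         vaBeginPicture(va_dpy, va_context, target);
>>>>         vaRenderPicture(va_dpy, va_context, &pic_param_buf, 1);
>>>> 
>>>>         /* One vaRenderPicture per slice, interleaved with generating
>>>>          * the next slice, which is why blocking here would hurt. */
>>>>         for (int i = 0; i < num_slices; i++) {
>>>>             VABufferID bufs[2];
>>>>             generate_slice(i, &bufs[0], &bufs[1]);
>>>>             vaRenderPicture(va_dpy, va_context, bufs, 2);
>>>>         }
>>>> 
>>>>         vaEndPicture(va_dpy, va_context);
>>>>         /* Sync on the target surface only when the decoded picture is
>>>>          * actually consumed, so submission itself never has to wait. */
>>>>     }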
>>>> 
>>>> Kind regards, Andreas Larsson
>>>> 
>>>> _______________________________________________
>>>> Libva mailing list
>>>> Libva at lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/libva
>>> 
>>> 
>> 
>> _______________________________________________
>> Libva mailing list
>> Libva at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/libva
> 
> 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


