[Mesa-dev] 10bit HEVC decoding for RadeonSI

Christian König deathsimple at vodafone.de
Sat Jan 28 15:32:55 UTC 2017


Am 27.01.2017 um 20:44 schrieb Mark Thompson:
> On 27/01/17 14:27, Christian König wrote:
>> Am 27.01.2017 um 13:51 schrieb Mark Thompson:
>>> On 26/01/17 16:59, Christian König wrote:
>>> [SNIP]
>>> (For that matter, is there a list somewhere of the set of formats/layouts and what they are used for?)
>> Well, taking a look at all the use cases and options you can program into the decoder, you easily end up with 100+ format combinations.
>>
>> For example the decoder can split a frame into the top and bottom fields during decode, apply different tiling modes and lay out the YUV data either packed or in separate planes.
> Maybe I should have phrased that as "how many might sane people actually use?" :)  But yes, I can see that this is going to be more than just two or three such that you do need to treat it generically.

Mhm, let's see. I will just try to sum up the different use cases:

1. Decoding of interlaced Video + Postprocessing + Displaying all on the 
same GPU:

In this case you want the output split by field for easy deinterlacing 
and a tiling config for a good memory access pattern from the shaders.

2. Decoding of progressive Video + Postprocessing + Displaying on the 
same GPU:

Single frame output and only simple tiling because the CSC shader has 
only a very linear access pattern.

3. Decoding + direct display using Overlay including CSC.

Single frame output and a special tiling mode the output engine 
understands directly (unfortunately at least three different kinds of 
output engine are possible here, which again all have their own 
separate tiling requirements).

4. Decoding + Encoding it again (e.g. transcoding) on the same GPU.

Single frame output and one of the tiling modes both UVD and VCE 
understand.

5. Decoding + Post processing on a different GPU.

Single frame output and most likely linear or a tiling mode both GPUs 
can understand.

6. Decoding + accessing it with the CPU.

In this case we should probably directly decode into the format the 
application requested. Normally we only support NV12, but could at least 
in theory enable mostly everything else as well.

In addition to all of the above, when you need to scale the decoded 
frame during post processing you might even want a different tiling 
mode than when you just keep it the same size (again a different 
memory access pattern). A rough sketch of how the layout choice 
follows from the use case is below.
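Something like the following pseudo-logic is all I mean by that; the 
types and names are made up purely to illustrate how the decode target 
layout follows from the use case, they are not an existing 
Mesa/radeonsi interface:

#include <stdbool.h>

/* Purely illustrative types, not an existing Mesa/radeonsi API. */
enum decode_use_case {
   INTERLACED_POSTPROC_SAME_GPU,  /* 1. */
   PROGRESSIVE_POSTPROC_SAME_GPU, /* 2. */
   OVERLAY_SCANOUT,               /* 3. */
   TRANSCODE_SAME_GPU,            /* 4. */
   POSTPROC_OTHER_GPU,            /* 5. */
   CPU_ACCESS,                    /* 6. */
};

enum tiling { TILE_LINEAR, TILE_1D, TILE_2D, TILE_SCANOUT, TILE_UVD_VCE };

struct decode_target_layout {
   bool split_fields;   /* top and bottom field stored separately */
   enum tiling tiling;
};

static struct decode_target_layout
pick_layout(enum decode_use_case uc)
{
   struct decode_target_layout l = { .tiling = TILE_LINEAR };

   switch (uc) {
   case INTERLACED_POSTPROC_SAME_GPU:
      l.split_fields = true;
      l.tiling = TILE_2D;       /* good shader access pattern for deinterlacing */
      break;
   case PROGRESSIVE_POSTPROC_SAME_GPU:
      l.tiling = TILE_1D;       /* CSC shader reads are almost linear */
      break;
   case OVERLAY_SCANOUT:
      l.tiling = TILE_SCANOUT;  /* whatever the display engine variant needs */
      break;
   case TRANSCODE_SAME_GPU:
      l.tiling = TILE_UVD_VCE;  /* a mode both UVD and VCE understand */
      break;
   case POSTPROC_OTHER_GPU:
   case CPU_ACCESS:
      l.tiling = TILE_LINEAR;   /* lowest common denominator; for CPU access
                                   ideally the format the app requested */
      break;
   }
   return l;
}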

>>
>>> Since the user can't access the surface directly, it can be whatever is most suitable for the hardware and the user can't tell.
>> That won't work. An example we ran into was that a customer wanted to black out the first and last line of an image for BOB deinterlacing.
>>
>> To do so he mapped the surface and just used memset() on the appropriate addresses. On Intel I was told this works because the mapping seems to be bidirectional and all changes done to it are reflected in the original video surface/image.
> Tbh I think both of these examples are user error, if unfortunately understandable in the circumstances.

Completely agree that those are user errors, but APIs should be 
programmed defensively.

> The API is clear that the direct mapping via vaDeriveImage() doesn't necessarily work:
>
> <https://cgit.freedesktop.org/libva/tree/va/va.h#n2719>
> """
>                                 When the operation is not possible this interface
>   * will return VA_STATUS_ERROR_OPERATION_FAILED. Clients should then fall back
>   * to using vaCreateImage + vaPutImage to accomplish the same task in an
>   * indirect manner.
> """
>
> and therefore that you need code which looks like <https://git.libav.org/?p=libav.git;a=blob;f=libavutil/hwcontext_vaapi.c;h=b2e212c1fe518f310576ea14125266fbd5e7ce48;hb=HEAD#l720> to do any sort of CPU mapping reliably.  The indirect path does indeed work on AMD now (and actually is often what you want on Intel as well if you aren't anticipating the weird properties of the direct mapping).
>
> I admit that this isn't necessarily a useful thing to tell your customers when their programs don't work.  Nor is it helpful that the libva examples don't follow it: <https://cgit.freedesktop.org/libva/tree/test/loadsurface.h#n314>.

Ok, first of all, can we please fix the libva example? That explains 
why so many people come to us doing such nonsense.
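For reference, a minimal sketch of the fallback pattern the va.h 
comment above asks for, in plain libva C; error handling is trimmed 
and the NV12 format choice is just an assumption for illustration:

#include <va/va.h>

/* Try the direct mapping first; if the driver can't derive the surface,
 * fall back to an explicit copy through vaCreateImage()/vaGetImage(). */
static VAStatus map_surface_for_reading(VADisplay dpy, VASurfaceID surface,
                                        unsigned width, unsigned height,
                                        VAImage *image, void **data,
                                        int *derived)
{
    VAStatus vas = vaDeriveImage(dpy, surface, image);

    *derived = (vas == VA_STATUS_SUCCESS);
    if (!*derived) {
        /* e.g. VA_STATUS_ERROR_OPERATION_FAILED: copy into a new image. */
        VAImageFormat fmt = { .fourcc = VA_FOURCC_NV12 };

        vas = vaCreateImage(dpy, &fmt, width, height, image);
        if (vas != VA_STATUS_SUCCESS)
            return vas;
        vas = vaGetImage(dpy, surface, 0, 0, width, height, image->image_id);
        if (vas != VA_STATUS_SUCCESS)
            return vas;
    }
    /* Either way the pixel data sits behind the image buffer. */
    return vaMapBuffer(dpy, image->buf, data);
}

Writes only reach the surface directly in the derived case; with the 
copy path they have to go back through vaPutImage(), which is exactly 
the asymmetry that trips people up.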

>>> The API certainly admits the possibility that vaDeriveImage() just can't expose surfaces to the CPU directly, or that there are extra implicit copies so that it all appears consistent from the point of view of the user.
>> Yeah, but it doesn't enforce that. E.g. it isn't properly defined whether the derived image is a copy, and since on Intel it just seems to work, people tend to use it.
> True.  On Intel it is never copying for vaMapBuffer() or vaAcquireBuffer() so writes by the CPU or other GPU operations are reflected immediately (and that goes as far as letting you scribble on decoder reference images while they are being used and breaking everything if you want).  None of that is codified anywhere, though.

Second, would it be possible to just mark the whole mapping API in 
libva as deprecated and provide VDPAU-like accessors to the data in 
the surfaces?

Alternatively, mapping the surfaces read-only so that possible abusers 
run into issues could help as well. It doesn't seem to be a good idea 
on Intel either when you can do such nasty things as modifying a 
reference frame needed for further decoding.
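Just as a sketch of what such accessors could look like (the two entry 
points below are hypothetical, loosely modeled on VDPAU's 
VdpVideoSurfaceGetBitsYCbCr/VdpVideoSurfacePutBitsYCbCr; nothing like 
this exists in libva today):

/* Hypothetical copy-only accessors: the application only ever touches
 * its own linear buffers and never gets a pointer into the (possibly
 * tiled) video surface, so it can't scribble on reference frames the
 * decoder still needs. */
VAStatus vaGetSurfaceBits(VADisplay dpy, VASurfaceID surface,
                          unsigned int fourcc,           /* e.g. VA_FOURCC_NV12 */
                          void *const *plane_data,       /* caller-allocated planes */
                          const unsigned int *plane_pitches);

VAStatus vaPutSurfaceBits(VADisplay dpy, VASurfaceID surface,
                          unsigned int fourcc,
                          const void *const *plane_data,
                          const unsigned int *plane_pitches);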

>>> I think my use of the word "mapping" wasn't helping there: I was using it to refer both to mapping to the CPU address space (which need not be supported) and to other APIs (OpenGL, OpenCL, whatever) which will use it on the GPU (which is far more important).  My real question on the tiling issue was: is tiling/layout/whatever a properly of the DRM object, such that other APIs interacting with it can do the right thing without the user needing to know about it?
>> No, they can't. For example when we import the UV plane of an NV12 surface into OpenGL, then OpenGL needs to know that this originally comes from a decoded image. Otherwise it won't use the correct tiling parameters.
>>
>> Currently we make an educated guess in the GL driver, based on the offset, about what is being imported, but that isn't really good engineering.
>>
>> To be fair, I've run into exactly the same problem with VDPAU as well, which is the reason we currently have tiling disabled for video surfaces there too.
> Right, the reason Intel mostly manages this is that they carry the information via private DRM calls (drm_intel_bo_get_tiling()...) and then add extra copies in some receivers to overcome problem cases (and you can still find holes in the abstraction where it doesn't work).  Also there are many fewer cases, which helps a lot.

AMD actually works exactly the same way. The problem is that for Intel 
every plane in an NV12 surface is a separate DRM object and the 
decoded planes can be used directly for sampling, but for AMD the 
whole NV12 surface is one object. So you can't easily separate the 
NV12 planes into one R8 and one R8G8 image to sample from; you need to 
handle that specially when importing the surfaces.

Currently I have a couple of nasty hacks to work around that for both 
VA-API and VDPAU, but that is really not a good long term solution.
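To illustrate the import case I mean (just a sketch; the fd, offsets 
and pitches are placeholders and the helper name is made up): the luma 
and chroma planes of a single NV12 dma-buf get handed to EGL as two 
separate R8/GR88 images, and on AMD the only hint the GL driver gets 
about the original tiling is the plane offset.

#include <stdint.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <drm_fourcc.h>

static EGLImageKHR import_plane(EGLDisplay dpy, int dmabuf_fd,
                                uint32_t fourcc,   /* DRM_FORMAT_R8 or _GR88 */
                                int width, int height, int offset, int pitch)
{
    const EGLint attribs[] = {
        EGL_WIDTH,                     width,
        EGL_HEIGHT,                    height,
        EGL_LINUX_DRM_FOURCC_EXT,      (EGLint)fourcc,
        EGL_DMA_BUF_PLANE0_FD_EXT,     dmabuf_fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, offset,  /* UV offset: the only tiling hint */
        EGL_DMA_BUF_PLANE0_PITCH_EXT,  pitch,
        EGL_NONE
    };
    PFNEGLCREATEIMAGEKHRPROC create_image =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");

    return create_image(dpy, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT, NULL, attribs);
}

/* Y plane:  import_plane(dpy, fd, DRM_FORMAT_R8,   w,   h,   0,       pitch);
 * UV plane: import_plane(dpy, fd, DRM_FORMAT_GR88, w/2, h/2, uv_offs, pitch); */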

>>> If not, then the VAAPI buffer-sharing construction (vaCreateSurfaces() and vaAcquireBuffer() with VA_SURFACE_ATTRIB_TYPE_DRM_PRIME) doesn't contain enough information to ever work and we should be trying to fix it in the API.
>> Those problems won't be resolved by changes to VA-API alone. Essentially we need to follow NVidia's proposal for the "Unix Device Memory Allocation project".
>>
>> Hopefully this should in the end result in a library which, based on the use case, allows drivers to negotiate the surface format.
> Yay, bring on the glorious new API that covers everyone's use-cases!  <https://xkcd.com/927/>

Yeah, exactly what we sent around as well when we first heard of it :)

But honestly I don't think we have competing standards for surface 
allocation; we just have individual APIs (Vulkan, OpenGL, VA-API, 
OpenMAX, etc.) which each only cover their special use cases.

Maybe we can just stick to extending Vulkan more and more towards 
being the unified API. That at least seems to have gotten a lot of 
things rather sane.

Regards,
Christian.

>
> More seriously, a consistent way to get this right everywhere (insofar as it is even possible) would indeed be great if it actually achieves that.  (As much as Intel mostly works with VAAPI, it does still have weird holes and copy cases which shouldn't be there.)
>
> Thanks,
>
> - Mark



