Try to address the DMA-buf coherency problem

Wed Nov 2 12:21:46 UTC 2022

Hi Lucas,

On 02.11.22 at 12:39, Lucas Stach wrote:
> Hi Christian,
>
> going to reply in more detail when I have some more time, so just some
> quick thoughts for now.
>
> On Wednesday, 02.11.2022 at 12:18 +0100, Christian König wrote:
>> On 01.11.22 at 22:09, Nicolas Dufresne wrote:
>>> [SNIP]
>> As far as I can see it you guys just allocate a buffer from a V4L2
>> device, fill it with data and send it to Wayland for displaying.
>>
>> To be honest I'm really surprised that the Wayland guys haven't pushed
>> back on this practice already.
>>
>> This only works because the Wayland as well as X display pipeline is
>> smart enough to insert an extra copy when it finds that an imported
>> buffer can't be used as a framebuffer directly.
>>
> With bracketed access you could even make this case work, as the dGPU
> would be able to slurp a copy of the dma-buf into LMEM for scanout.

Well, this copy is what we are trying to avoid here. The codec should 
pump the data into LMEM in the first place.

>>> The only case that commonly fails is whenever we try to display UVC
>>> created dmabuf,
>> Well, exactly that's not correct! The whole x86 use cases of direct
>> display for dGPUs are broken because media players think they can do the
>> simple thing and offload all the problematic cases to the display server.
>>
>> This is absolutely *not* the common use case you describe here, but
>> rather something completely special to ARM.
> It's the normal case for a lot of ARM SoCs.

Yeah, but it's not the normal case for everybody.

We had numerous projects where customers wanted to pump video data
directly from a decoder into a GPU frame or from a GPU frame into an
encoder.

The fact that media frameworks don't support that out of the box is
simply a bug.

> That world is certainly not
> any less big than the x86 dGPU world. A huge number of devices are ARM
> based set-top boxes and other video players. Just because it is a
> special case for you doesn't mean it's a global special case.

Ok, let's stop with that. This isn't helpful in the technical discussion.

>
>> That we haven't heard anybody screaming that x86 doesn't work is just
>> because we handle the case that a buffer isn't directly displayable in
>> X/Wayland anyway, but this is absolutely not the optimal solution.
>>
>> The argument that you want to keep the allocation on the codec side is
>> completely false as far as I can see.
>>
>> We already had numerous projects where we reported this practice as bugs
>> to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.
>>
> And on a lot of ARM SoCs it's exactly the right thing to do.

Yeah and that's fine, it just doesn't seem to work in all cases.

For both x86 and the case here, where the CPU cache might be dirty,
the exporter needs to be the device with the requirements.

For x86 dGPUs the requirement is that the backing store is some local
memory. For the non-coherent ARM devices it's that the CPU cache is
not dirty.

For a device driver which solely works with cached system memory,
inserting cache flush operations is something it would never do for
itself. It would just be doing this for the importer, and exactly that
would be bad design because we would then have handling for the
display driver outside of that driver.

>> This is just a software solution which works because of coincidence and
>> not because of engineering.
> By mandating a software fallback for the cases where you would need
> bracketed access to the dma-buf, you simply shift the problem into
> userspace. Userspace then creates the bracket by falling back to some
> other import option that mostly does a copy and then the appropriate
> cache maintenance.
>
> While I understand your sentiment about the DMA-API design being
> inconvenient when things are just coherent by system design, the DMA-
> API design wasn't done this way due to bad engineering, but due to the
> fact that performant DMA access on some systems just requires this kind
> of bracketing.

Well, this is exactly what I'm criticizing about the DMA-API. Instead
of giving you a proper error code when something won't work in a
specific way, it just tries to hide the requirements inside the DMA
layer.

For example, when your device can only address 32 bits, the DMA-API
transparently inserts bounce buffers instead of giving you a proper
error code that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it 
into the upper layer where it can be handled much more gracefully.
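
To illustrate the difference between the two behaviours, here is a
minimal userspace sketch. It is not kernel code and does not use the
real DMA-API: the names map_transparent(), map_strict(),
device_can_reach(), DEV_DMA_MASK and the bounce pool are all
hypothetical and only model a device limited to 32-bit addresses.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device that can only address the low 32 bits. */
#define DEV_DMA_MASK    0xFFFFFFFFull

/* Hypothetical bounce pool living at a bus address the device can reach. */
#define BOUNCE_BUS_ADDR 0x1000ull
#define BOUNCE_SIZE     4096
unsigned char bounce_pool[BOUNCE_SIZE];

/* True if the whole range [bus_addr, bus_addr + len) fits under the mask. */
int device_can_reach(uint64_t bus_addr, size_t len)
{
    return bus_addr <= DEV_DMA_MASK && len - 1 <= DEV_DMA_MASK - bus_addr;
}

/*
 * Policy A: transparent bounce. If the device cannot reach the buffer,
 * the data is silently copied into the bounce pool. The caller gets 0
 * either way and cannot tell that an extra copy happened.
 */
int map_transparent(uint64_t bus_addr, const void *cpu_ptr, size_t len,
                    uint64_t *dev_addr)
{
    if (len == 0)
        return -EINVAL;
    if (device_can_reach(bus_addr, len)) {
        *dev_addr = bus_addr;
        return 0;
    }
    if (len > BOUNCE_SIZE)
        return -ENOMEM;
    memcpy(bounce_pool, cpu_ptr, len);   /* the hidden copy */
    *dev_addr = BOUNCE_BUS_ADDR;
    return 0;
}

/*
 * Policy B: strict. If the device cannot reach the buffer, report that
 * to the caller, who can then pick another allocator or copy explicitly.
 */
int map_strict(uint64_t bus_addr, size_t len, uint64_t *dev_addr)
{
    if (len == 0)
        return -EINVAL;
    if (!device_can_reach(bus_addr, len))
        return -EFAULT;
    *dev_addr = bus_addr;
    return 0;
}
```

With a buffer above 4 GiB, map_transparent() succeeds and hands back
the bounce address, while map_strict() returns -EFAULT and leaves the
decision to the upper layer. The error value is an arbitrary choice
for this sketch.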

Regards,
Christian.

>
> Regards,
> Lucas
>