Try to address the DMA-buf coherency problem

Christian König christian.koenig at amd.com
Tue Nov 22 17:33:59 UTC 2022


On 22.11.22 at 15:36, Daniel Vetter wrote:
> On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:
>> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne <nicolas at ndufresne.ca> wrote:
>>> On Thursday, November 17, 2022 at 13:10 +0100, Christian König wrote:
>>>>>> DMA-Buf lets the exporter set up the DMA addresses the importer uses and
>>>>>> so directly decide where a certain operation should go. E.g. we have
>>>>>> cases where a P2P write doesn't even go to memory, but rather to a
>>>>>> doorbell BAR to trigger another operation. Throwing in CPU round trips
>>>>>> for explicit ownership transfer completely breaks that concept.
>>>>> It sounds like we should have a dma_dev_is_coherent_with_dev() which
>>>>> accepts two devices (or an array?) and tells the caller whether the
>>>>> devices need explicit ownership transfer.
>>>> No, exactly that's the concept I'm pushing back on very hard here.
>>>>
>>>> In other words, explicit ownership transfer is not something we would
>>>> want as a requirement in the framework, because otherwise we break tons
>>>> of use cases which require concurrent access to the underlying buffer.
>>> I'm not pushing for this solution, but really felt the need to correct you here.
>>> I have quite some experience with ownership transfer mechanisms, as this is how
>>> the GStreamer framework has worked since 2000. Concurrent access is a really
>>> common use case and it is quite well defined in that context. The bracketing
>>> system (in this case called map()/unmap(), with flags stating the usage
>>> intention, like read or write) is combined with the refcount. The basic rules
>>> are simple:
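
For illustration, the bracketing described above looks roughly like this with
GStreamer's GstBuffer API (a minimal sketch; buffer creation, error handling
and the hypothetical process() consumer are not from the mail):

    #include <gst/gst.h>

    /* Sketch of the map()/unmap() bracketing with usage-intention flags;
     * assumes a valid GstBuffer *buf obtained elsewhere. */
    GstMapInfo info;

    if (gst_buffer_map (buf, &info, GST_MAP_READ)) {
            /* The CPU may read info.data/info.size while the mapping is
             * held; the intention (read here) is declared via the flags. */
            process (info.data, info.size);
            gst_buffer_unmap (buf, &info);
    }
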
>> This is all CPU oriented. I think Christian is talking about the case
>> where ownership transfer happens without CPU involvement, such as via
>> the GPU waiting on a fence.
> Yeah for pure device2device handover the rule pretty much has to be that
> any coherency management that needs to be done must be done from the
> device side (flushing device side caches and stuff like that) only. But
> under the assumption that _all_ cpu side management has been done already
> before the first device access started.
>
> And then map/unmap respectively begin/end_cpu_access can be used for what
> they were meant for, with cpu side invalidation/flushing and stuff like
> that, while having pretty clear handover/ownership rules and hopefully not
> doing any unnecessary flushes. And all that while allowing device access to
> be pipelined. Worst case the exporter has to insert some pipelined cache
> flushes as dma_fence pipelined work of its own between the device accesses
> when moving from one device to the other. That last part sucks a bit right
> now, because we don't have any dma_buf_attachment method which does this
> syncing without recreating the mapping, but in reality this is solved by
> caching mappings in the exporter (well, the dma-buf layer) nowadays.
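
To make the begin/end_cpu_access bracketing concrete, a kernel-side importer
doing a CPU read would look roughly like this (a sketch only; locking and
error paths are simplified, and "dmabuf" is assumed to be an already attached
struct dma_buf *):

    /* Bracketed CPU access from the importer's point of view.  The
     * exporter does whatever cache maintenance it needs in its
     * begin_cpu_access()/end_cpu_access() callbacks. */
    struct iosys_map map;
    int ret;

    ret = dma_buf_begin_cpu_access(dmabuf, DMA_FROM_DEVICE);
    if (ret)
            return ret;

    ret = dma_buf_vmap(dmabuf, &map);
    if (!ret) {
            /* CPU reads through map.vaddr here. */
            dma_buf_vunmap(dmabuf, &map);
    }

    dma_buf_end_cpu_access(dmabuf, DMA_FROM_DEVICE);
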
>
> True concurrent access like vk/compute expects is still a model that
> dma-buf needs to support on top, but that's a special case and pretty much
> needs hw that supports such concurrent access without explicit handover
> and fencing.
>
> Aside from some historical accidents and still a few warts, I do think
> dma-buf does support both of these models.

We should have come up with dma-heaps earlier and made it clear that
exporting a DMA-buf from a device gives you something device specific
which might or might not work with others.
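
For context, allocating the buffer from a dma-heap instead of exporting it
from a device looks roughly like this from userspace (a sketch using the
standard dma-heap UAPI; heap name and size are just examples):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/dma-heap.h>

    /* Allocate a DMA-buf from a well-known heap instead of exporting one
     * from some device with unknown placement constraints. */
    int heap_fd = open("/dev/dma_heap/system", O_RDWR | O_CLOEXEC);
    struct dma_heap_allocation_data alloc = {
            .len = 4096,
            .fd_flags = O_RDWR | O_CLOEXEC,
    };

    if (heap_fd >= 0 && ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc) == 0) {
            /* alloc.fd is now a DMA-buf fd that importers can attach to. */
    }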

Apart from that I agree, DMA-buf should be capable of handling this.
The remaining question is: what documentation is missing to make it clear
how things are supposed to work?

Regards,
Christian.

>   Of course in the case of
> gpu/drm drivers, userspace must know what's possible and act accordingly,
> otherwise you just get to keep all the pieces.
> -Daniel


