Try to address the DMA-buf coherency problem

Fri Dec 9 08:26:06 UTC 2022

On Mon, Dec 5, 2022 at 5:29 PM Christian König <christian.koenig at amd.com> wrote:
>
> Hi Tomasz,
>
> Am 05.12.22 um 07:41 schrieb Tomasz Figa:
> > [SNIP]
> >> In other words explicit ownership transfer is not something we would
> >> want as requirement in the framework, cause otherwise we break tons of
> >> use cases which require concurrent access to the underlying buffer.
> >>
> >> When a device driver needs explicit ownership transfer it's perfectly
> >> possible to implement this using the dma_fence objects mentioned above.
> >> E.g. drivers can already look at who is accessing a buffer currently and
> >> can even grab explicit ownership of it by adding their own dma_fence
> >> objects.
> >>
> >> The only exception is CPU based access, e.g. when something is written
> >> with the CPU a cache flush might be necessary and when something is read
> >> with the CPU a cache invalidation might be necessary.
> >>
> > Okay, that's much clearer now, thanks for clarifying this. So we
> > should be covered for the cache maintenance needs originating from CPU
> > accesses already, +/- the broken cases which don't call the begin/end
> > CPU access routines that I mentioned above.
> >
> > Similarly, for any ownership transfer between different DMA engines,
> > we should be covered either by the userspace explicitly flushing the
> > hardware pipeline or attaching a DMA-buf fence to the buffer.
> >
> > But then, what's left to be solved? :) (Besides the cases of missing
> > begin/end CPU access calls.)
>
> Well there are multiple problems here:
>
> 1. A lot of userspace applications/frameworks assume that it can
> allocate the buffer anywhere and it just works.
>
> This isn't true at all, we have tons of cases where device can only
> access their special memory for certain use cases.
> Just look at scanout for displaying on dGPU, neither AMD nor NVidia
> supports system memory here. Similar cases exists for audio/video codecs
> where intermediate memory is only accessible by certain devices because
> of content protection.

Ack.

Although I think the most common case on mainstream Linux today is
properly allocating for device X (e.g. V4L2 video decoder or DRM-based
GPU) and hoping that other devices would accept the buffers just fine,
which isn't a given on most platforms (although often it's just about
pixel format, width/height/stride alignment, tiling, etc. rather than
the memory itself). That's why ChromiumOS has minigbm and Android has
gralloc that act as the central point of knowledge on buffer
allocation.

>
> 2. We don't properly communicate allocation requirements to userspace.
>
> E.g. even if you allocate from DMA-Heaps userspace can currently only
> guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace,
because now it needs to pick the right heap. With allocation built
into the specific UAPI (like V4L2), it's at least possible to allocate
for one specific device without having any knowledge about allocation
constraints in the userspace.

>
> 3. We seem to lack some essential parts of those restrictions in the
> documentation.
>

Ack.

> >>>> So if a device driver uses cached system memory on an architecture which
> >>>> devices which can't access it the right approach is clearly to reject
> >>>> the access.
> >>> I'd like to accent the fact that "requires cache maintenance" != "can't access".
> >> Well that depends. As said above the exporter exports the buffer as it
> >> was allocated.
> >>
> >> If that means the the exporter provides a piece of memory which requires
> >> CPU cache snooping to access correctly then the best thing we can do is
> >> to prevent an importer which can't do this from attaching.
> > Could you elaborate more about this case? Does it exist in practice?
> > Do I assume correctly that it's about sharing a buffer between one DMA
> > engine that is cache-coherent and another that is non-coherent, where
> > the first one ends up having its accesses always go through some kind
> > of a cache (CPU cache, L2/L3/... cache, etc.)?
>
> Yes, exactly that. What happens in this particular use case is that one
> device driver wrote to it's internal buffer with the CPU (so some cache
> lines where dirty) and then a device which couldn't deal with that tried
> to access it.

If so, shouldn't that driver surround its CPU accesses with
begin/end_cpu_access() in the first place?

The case that I was suggesting was of a hardware block that actually
sits behind the CPU cache and thus dirties it on writes, not the
driver doing that. (I haven't personally encountered such a system,
though.)

>
> We could say that all device drivers must always look at the coherency
> of the devices which want to access their buffers. But that would
> horrible complicate things for maintaining the drivers because then
> drivers would need to take into account requirements from other drivers
> while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA
mapping API, but currently it's only handling requirements of one
device. And we don't have any information from the userspace what
other devices the buffer would be used with...

Actually, do we even have such information in the userspace today?
Let's say I do a video call in a web browser on a typical Linux
system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
V4L2 camera fills in buffers with video frames and both encoder and
display consume them. Do we have a central place which would know that
a buffer needs to be allocated that works with the producer and all
consumers?

>
> That's essentially the reason why we have DMA-buf heaps. Those heaps
> expose system memory buffers with certain properties (size, CMA, DMA bit
> restrictions etc...) and also implement the callback functions for CPU
> cache maintenance.
>

How do DMA-buf heaps change anything here? We already have CPU cache
maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
for CPU accesses and dmabuf_map/unmap_attachment() for device accesses
(which arguably shouldn't actually do CPU cache maintenance, unless
that's implied by how either of the involved DMA engines work).

> >> We do have the system and CMA dma-buf heap for cases where a device
> >> driver which wants to access the buffer with caches enabled. So this is
> >> not a limitation in functionality, it's just a matter of correctly using it.
> >>
> > V4L2 also has the ability to allocate buffers and map them with caches enabled.
>
> Yeah, when that's a requirement for the V4L2 device it also makes
> perfect sense.
>
> It's just that we shouldn't force any specific allocation behavior on a
> device driver because of the requirements of a different device.
>
> Giving an example a V4L2 device shouldn't be forced to use
> videobuf2-dma-contig because some other device needs CMA. Instead we
> should use the common DMA-buf heaps which implement this as neutral
> supplier of system memory buffers.

Agreed, but that's only possible if we have a central entity that
understands what devices the requested buffer would be used with. My
understanding is that we're nowhere close to that with mainstream
Linux today.

// As a side note, videobuf2-dma-contig can actually allocate
discontiguous memory, if the device is behind an IOMMU. Actually with
the recent DMA API improvements, we could probably coalesce
vb2-dma-contig and vb2-dma-sg into one vb2-dma backend.

>
> >> The problem is that in this particular case the exporter provides the
> >> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
> >> with that.
> > Why does the exporter leave the buffer with dirty CPU caches?
>
> Because exporters always export the buffers as they would use it. And in
> this case that means writing with the CPU to it.
>

Sorry for the question not being very clear. I meant: How do the CPU
caches get dirty in that case?

> >> Either reverting the roles of the importer or exporter or switching over
> >> to the common DMA-heaps implementation for the buffer would solve that.
> >>
> >>>> It's the job of the higher level to prepare the buffer a device work
> >>>> with, not the one of the lower level.
> >>> What are higher and lower levels here?
> >> The MM as higher level and the DMA mapping framework as lower level.
> >>
> > Hmm, I generally consider the DMA mapping framework a part of MM,
> > especially its allocation capabilities. It heavily relies on low level
> > MM primitives internally and exposes another set of low level
> > primitives that is more useful for device drivers. At least it should
> > be seen this way, but I agree that it currently includes things that
> > shouldn't necessarily be there, like the transparent bouncing.
>
> Exactly that, yes! Good to hear that more people think that way.
>
> Christoph made some comments which sounded like he agreed to some of the
> points as well, but nobody ever said it so clearly.
>

An interesting anecdote would be a case where we had the dma_mask set
incorrectly for one of the devices, which ended up triggering the
implicit bouncing and hurting the performance a lot, despite all of
the functional testing passing just fine...

> >>> As per the existing design of the DMA mapping framework, the framework
> >>> handles the system DMA architecture details and DMA master drivers
> >>> take care of invoking the right DMA mapping operations around the DMA
> >>> accesses. This makes sense to me, as DMA master drivers have no idea
> >>> about the specific SoCs or buses they're plugged into, while the DMA
> >>> mapping framework has no idea when the DMA accesses are taking place.
> >> Yeah and exactly that's a bit problematic. In an ideal world device
> >> drivers wouldn't see memory they can't access in the first place.
> >>
> >> For example what we currently do is the following:
> >> 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
> >> 2. DMA mapping framework is called by the driver to create an sg-table.
> >> 3. The DMA mapping framework sees that this won't work and inserts
> >> bounce buffers.
> >> 4. The driver does the DMA to the bounce buffers instead.
> >> 5. The DMA mapping framework copies the data to the real location.
> >>
> >> This is highly problematic since it removes the control of what happens
> >> here. E.g. drivers can't prevent using bounce buffers when they need
> >> coherent access for DMA-buf for example.
> >>
> >> What we should have instead is that bounce buffers are applied at a
> >> higher level, for example by get_user_pages() or more general in the MM.
> >>
> > I tend to agree with you on this, but I see a lot of challenges that
> > would need to be solved if we want to make it otherwise. The whole
> > reason for transparent bouncing came from the fact that many existing
> > subsystems didn't really care about the needs of the underlying
> > hardware, e.g. block or network subsystems would just pass random
> > pages to the drivers. I think the idea of DMA mapping handling this
> > internally was to avoid implementing the bouncing here and there in
> > each block or network driver that needs it. (Arguably, an optional
> > helper could be provided instead for use in those subsystems...)
>
> Yes, totally agree. The problem is really that we moved bunch of MM and
> DMA functions in one API.
>
> The bounce buffers are something we should really have in a separate
> component.
>
> Then the functionality of allocating system memory for a specific device
> or devices should be something provided by the MM.
>
> And finally making this memory or any other CPU address accessible to a
> device (IOMMU programming etc..) should then be part of an DMA API.
>

Remember that actually making the memory accessible to a device often
needs to be handled already as a part of the allocation (e.g. dma_mask
in the non-IOMMU case). So I tend to think that the current division
of responsibilities is mostly fine - the dma_alloc family should be
seen as a part of MM already, especially with all the recent
improvements from Christoph, like dma_alloc_pages().

That said, it may indeed make sense to move things like IOMMU mapping
management out of the dma_alloc() and just reduce those functions to
simply returning a set of pages that satisfy the allocation
constraints. One would need to call dma_map() after the allocation,
but that's actually a fair expectation. Sometimes it might be
preferred to pre-allocate the memory, but only map it into the device
address space when it's really necessary.

> >>>> In other words in a proper design the higher level would prepare the
> >>>> memory in a way the device driver can work with it, not the other way
> >>>> around.
> >>>>
> >>>> When a device driver gets memory it can't work with the correct response
> >>>> is to throw an error and bubble that up into a layer where it can be
> >>>> handled gracefully.
> >>>>
> >>>> For example instead of using bounce buffers under the hood the DMA layer
> >>>> the MM should make sure that when you call read() with O_DIRECT that the
> >>>> pages in question are accessible by the device.
> >>>>
> >>> I tend to agree with you if it's about a costly software "emulation"
> >>> like bounce buffers, but cache maintenance is a hardware feature
> >>> existing there by default and it's often much cheaper to operate on
> >>> cached memory and synchronize the caches rather than have everything
> >>> in uncached (or write-combined) memory.
> >> Well that we should have proper cache maintenance is really not the
> >> question. The question is where to do that?
> >>
> >> E.g. in the DMA-mapping framework which as far as I can see should only
> >> take care of translating addresses.
> > The DMA mapping framework is actually a DMA allocation and mapping
> > framework. Generic memory allocation primitives (like alloc_pages(),
> > kmalloc()) shouldn't be used for buffers that are expected to be
> > passed to DMA engines - there exist DMA-aware replacements, like
> > dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
> > etc.
> >
> >> Or the higher level (get_user_pages() is just one example of that) which
> >> says ok device X wants to access memory Y I need to make sure that
> >> caches are flushed/invalidated.
> > Okay, so here comes the userptr case, which I really feel like is just
> > doomed at the very start, because it does nothing to accommodate
> > hardware needs at allocation time and then just expects some magic to
> > happen to make it possible for the hardware to make use of the buffer.
> >
> > That said, it should be still possible to handle that pretty well for
> > hardware that allows it, i.e. without much memory access constraints.
> > What would be still missing if we just use the existing gup() +
> > dma_map() + dma_sync() sequence?
>
> Error or rather fallback handling and *not* transparently inserting
> bounce buffers.
>
> The whole implicit bounce buffers concept falls apart as soon as you
> don't have a stream access any more.
>

I see. So I think we agree on that. :)

> >>>>> It's a use-case that is working fine today with many devices (e.g. network
> >>>>> adapters) in the ARM world, exactly because the architecture specific
> >>>>> implementation of the DMA API inserts the cache maintenance operations
> >>>>> on buffer ownership transfer.
> >>>> Yeah, I'm perfectly aware of that. The problem is that exactly that
> >>>> design totally breaks GPUs on Xen DOM0 for example.
> >>>>
> >>>> And Xen is just one example, I can certainly say from experience that
> >>>> this design was a really really bad idea because it favors just one use
> >>>> case while making other use cases practically impossible if not really
> >>>> hard to implement.
> >>> Sorry, I haven't worked with Xen. Could you elaborate what's the
> >>> problem that this introduces for it?
> >> That's a bit longer topic. The AMD XEN devs are already working on this
> >> as far as I know. I can ping internally how far they got with sending
> >> the patches out to avoid this problem.
> > Hmm, I see. It might be a good data point to understand in which
> > direction we should be going, but I guess we can wait until they send
> > some patches.
>
> There was just recently a longer thread on the amd-gfx mailing list
> about that. I think looking in there as well might be beneficial.

Okay, let me check. Thanks,

Best regards,
Tomasz