Try to address the DMA-buf coherency problem

Mon Dec 5 06:41:00 UTC 2022

Hi Christian,

On Thu, Nov 17, 2022 at 9:11 PM Christian König
<christian.koenig at amd.com> wrote:
>
> Hi Tomasz,
>
> Am 17.11.22 um 10:35 schrieb Tomasz Figa:
> > Hi Christian and everyone,
> >
> > On Thu, Nov 3, 2022 at 4:14 AM Christian König
> > <ckoenig.leichtzumerken at gmail.com> wrote:
> >> Am 02.11.22 um 18:10 schrieb Lucas Stach:
> >>> Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> >>> [SNIP]
> >>>> It would just be doing this for the importer and exactly that
> >>>> would be bad design because we then have handling for the display driver
> >>>> outside of the driver.
> >>>>
> >>> The driver would have to do those cache maintenance operations if it
> >>> directly worked with a non-coherent device. Doing it for the importer
> >>> is just doing it for another device, not the one directly managed by
> >>> the exporter.
> >>>
> >>> I really don't see the difference to the other dma-buf ops: in
> >>> dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
> >>> the address space of the importer. Why would cache maintenance be any
> >>> different?
> >> The issue here is the explicit ownership transfer.
> >>
> >> We intentionally decided against that because it breaks tons of use
> >> cases and is at least by me and a couple of others seen as generally
> >> design failure of the Linux DMA-API.
> > First of all, thanks for starting the discussion and sorry for being
> > late to the party. May I ask you to keep me on CC for any changes that
> > touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay
> > being copied on the entire series, no need to pick the specific
> > patches. Thanks in advance.
>
> Sorry for that, I've only added people involved in the previous
> discussion. Going to keep you in the loop.
>

No worries. Thanks.

Sorry, for being late with the reply, had a bit of vacation and then
some catching up last week.

> > I agree that we have some design issues in the current DMA-buf
> > framework, but I'd try to approach it a bit differently. Instead of
> > focusing on the issues in the current design, could we write down our
> > requirements and try to come up with how a correct design would look
> > like? (A lot of that has been already mentioned in this thread, but I
> > find it quite difficult to follow and it might not be a complete view
> > either.)
>
> Well, exactly that's what we disagree on.
>
> As far as I can see the current design of DMA-buf is perfectly capable
> of handling all the requirements.
>
> A brief summary of the requirements with some implementation notes:
>
> 1. Device drivers export their memory as it is. E.g. no special handling
> for importers on the exporter side.
>      If an importer can't deal with a specific format, layout, caching
> etc... of the data the correct approach is to reject the attachment.
>      Those parameters are controlled by userspace and negotiating them
> is explicitly not part of the framework.

Ack. I believe it matches the current implementation of the DMA-buf
framework, although as you mentioned, the swiotlb feature of the DMA
mapping framework kind of violates this.

>
> 2. Accesses of the CPU to a buffer are bracket int begin_cpu_access()
> and end_cpu_access() calls.
>      Here we can insert the CPU cache invalidation/flushes as necessary.

Ack. I think a part of the problem today is that there exist userspace
and kernel code instances that don't insert them and assume that some
magic keeps the cache clean...

>
> 3. Accesses of the device HW to a buffer are represented with dma_fence
> objects.
>      It's explicitly allowed to have multiple devices access the buffer
> at the same time.
>

Ack. Again there exists kernel code that doesn't honor or provide DMA
fences (e.g. V4L2).

> 4. Access to the DMA-buf by the HW of an importer is setup by the exporter.
>      In other words the exporter provides a bunch of DMA addresses the
> importer should access.
>      The importer should not try to come up with those on its own.
>
> > That said, let me address a few aspects already mentioned, to make
> > sure that everyone is on the same page.
> >
> >> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> >> be able to directly decided where a certain operation should go. E.g. we
> >> have cases where for example a P2P write doesn't even go to memory, but
> >> rather a doorbell BAR to trigger another operation. Throwing in CPU
> >> round trips for explicit ownership transfer completely breaks that concept.
> > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > accepts two (or an array?) of devices and tells the caller whether the
> > devices need explicit ownership transfer.
>
> No, exactly that's the concept I'm pushing back on very hard here.
>
> In other words explicit ownership transfer is not something we would
> want as requirement in the framework, cause otherwise we break tons of
> use cases which require concurrent access to the underlying buffer.
>
> When a device driver needs explicit ownership transfer it's perfectly
> possible to implement this using the dma_fence objects mentioned above.
> E.g. drivers can already look at who is accessing a buffer currently and
> can even grab explicit ownership of it by adding their own dma_fence
> objects.
>
> The only exception is CPU based access, e.g. when something is written
> with the CPU a cache flush might be necessary and when something is read
> with the CPU a cache invalidation might be necessary.
>

Okay, that's much clearer now, thanks for clarifying this. So we
should be covered for the cache maintenance needs originating from CPU
accesses already, +/- the broken cases which don't call the begin/end
CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines,
we should be covered either by the userspace explicitly flushing the
hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing
begin/end CPU access calls.)

> >> So if a device driver uses cached system memory on an architecture which
> >> devices which can't access it the right approach is clearly to reject
> >> the access.
> > I'd like to accent the fact that "requires cache maintenance" != "can't access".
>
> Well that depends. As said above the exporter exports the buffer as it
> was allocated.
>
> If that means the the exporter provides a piece of memory which requires
> CPU cache snooping to access correctly then the best thing we can do is
> to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice?
Do I assume correctly that it's about sharing a buffer between one DMA
engine that is cache-coherent and another that is non-coherent, where
the first one ends up having its accesses always go through some kind
of a cache (CPU cache, L2/L3/... cache, etc.)?

>
> We do have the system and CMA dma-buf heap for cases where a device
> driver which wants to access the buffer with caches enabled. So this is
> not a limitation in functionality, it's just a matter of correctly using it.
>

V4L2 also has the ability to allocate buffers and map them with caches enabled.

> >> What we can do is to reverse the role of the exporter and importer and
> >> let the device which needs uncached memory take control. This way this
> >> device can insert operations as needed, e.g. flush read caches or
> >> invalidate write caches.
> >>
> > (Putting aside the cases when the access is really impossible at all.)
> > Correct me if I'm wrong, but isn't that because we don't have a proper
> > hook for the importer to tell the DMA-buf framework to prepare the
> > buffer for its access?
>
> No, we do have the explicit begin_cpu_access() and end_cpu_access()
> calls which are even exporter to userspace through IOCTLs for this.

Ack.

>
> The problem is that in this particular case the exporter provides the
> buffer as is, e.g. with dirty CPU caches. And the importer can't deal
> with that.

Why does the exporter leave the buffer with dirty CPU caches?

>
> Either reverting the roles of the importer or exporter or switching over
> to the common DMA-heaps implementation for the buffer would solve that.
>
> >> It's the job of the higher level to prepare the buffer a device work
> >> with, not the one of the lower level.
> > What are higher and lower levels here?
>
> The MM as higher level and the DMA mapping framework as lower level.
>

Hmm, I generally consider the DMA mapping framework a part of MM,
especially its allocation capabilities. It heavily relies on low level
MM primitives internally and exposes another set of low level
primitives that is more useful for device drivers. At least it should
be seen this way, but I agree that it currently includes things that
shouldn't necessarily be there, like the transparent bouncing.

> >
> > As per the existing design of the DMA mapping framework, the framework
> > handles the system DMA architecture details and DMA master drivers
> > take care of invoking the right DMA mapping operations around the DMA
> > accesses. This makes sense to me, as DMA master drivers have no idea
> > about the specific SoCs or buses they're plugged into, while the DMA
> > mapping framework has no idea when the DMA accesses are taking place.
>
> Yeah and exactly that's a bit problematic. In an ideal world device
> drivers wouldn't see memory they can't access in the first place.
>
> For example what we currently do is the following:
> 1. get_user_pages() grabs a reference to the pages we want to DMA to/from.
> 2. DMA mapping framework is called by the driver to create an sg-table.
> 3. The DMA mapping framework sees that this won't work and inserts
> bounce buffers.
> 4. The driver does the DMA to the bounce buffers instead.
> 5. The DMA mapping framework copies the data to the real location.
>
> This is highly problematic since it removes the control of what happens
> here. E.g. drivers can't prevent using bounce buffers when they need
> coherent access for DMA-buf for example.
>
> What we should have instead is that bounce buffers are applied at a
> higher level, for example by get_user_pages() or more general in the MM.
>

I tend to agree with you on this, but I see a lot of challenges that
would need to be solved if we want to make it otherwise. The whole
reason for transparent bouncing came from the fact that many existing
subsystems didn't really care about the needs of the underlying
hardware, e.g. block or network subsystems would just pass random
pages to the drivers. I think the idea of DMA mapping handling this
internally was to avoid implementing the bouncing here and there in
each block or network driver that needs it. (Arguably, an optional
helper could be provided instead for use in those subsystems...)

> >> In other words in a proper design the higher level would prepare the
> >> memory in a way the device driver can work with it, not the other way
> >> around.
> >>
> >> When a device driver gets memory it can't work with the correct response
> >> is to throw an error and bubble that up into a layer where it can be
> >> handled gracefully.
> >>
> >> For example instead of using bounce buffers under the hood the DMA layer
> >> the MM should make sure that when you call read() with O_DIRECT that the
> >> pages in question are accessible by the device.
> >>
> > I tend to agree with you if it's about a costly software "emulation"
> > like bounce buffers, but cache maintenance is a hardware feature
> > existing there by default and it's often much cheaper to operate on
> > cached memory and synchronize the caches rather than have everything
> > in uncached (or write-combined) memory.
>
> Well that we should have proper cache maintenance is really not the
> question. The question is where to do that?
>
> E.g. in the DMA-mapping framework which as far as I can see should only
> take care of translating addresses.

The DMA mapping framework is actually a DMA allocation and mapping
framework. Generic memory allocation primitives (like alloc_pages(),
kmalloc()) shouldn't be used for buffers that are expected to be
passed to DMA engines - there exist DMA-aware replacements, like
dma_alloc_pages(), dma_alloc_coherent(), dma_alloc_noncontiguous(),
etc.

>
> Or the higher level (get_user_pages() is just one example of that) which
> says ok device X wants to access memory Y I need to make sure that
> caches are flushed/invalidated.

Okay, so here comes the userptr case, which I really feel like is just
doomed at the very start, because it does nothing to accommodate
hardware needs at allocation time and then just expects some magic to
happen to make it possible for the hardware to make use of the buffer.

That said, it should be still possible to handle that pretty well for
hardware that allows it, i.e. without much memory access constraints.
What would be still missing if we just use the existing gup() +
dma_map() + dma_sync() sequence?

>
> >>> It's a use-case that is working fine today with many devices (e.g. network
> >>> adapters) in the ARM world, exactly because the architecture specific
> >>> implementation of the DMA API inserts the cache maintenance operations
> >>> on buffer ownership transfer.
> >> Yeah, I'm perfectly aware of that. The problem is that exactly that
> >> design totally breaks GPUs on Xen DOM0 for example.
> >>
> >> And Xen is just one example, I can certainly say from experience that
> >> this design was a really really bad idea because it favors just one use
> >> case while making other use cases practically impossible if not really
> >> hard to implement.
> > Sorry, I haven't worked with Xen. Could you elaborate what's the
> > problem that this introduces for it?
>
> That's a bit longer topic. The AMD XEN devs are already working on this
> as far as I know. I can ping internally how far they got with sending
> the patches out to avoid this problem.

Hmm, I see. It might be a good data point to understand in which
direction we should be going, but I guess we can wait until they send
some patches.

Best regards,
Tomasz