Use of pci_map_page in nouveau, radeon TTM.

Lucas Stach l.stach at pengutronix.de
Tue Oct 1 04:56:32 PDT 2013


On Tuesday, October 01, 2013 at 13:13 +0200, Thomas Hellstrom wrote:
> On 10/01/2013 12:34 PM, Lucas Stach wrote:
> > On Tuesday, October 01, 2013 at 12:16 +0200, Thomas Hellstrom wrote:
> >> Jerome, Konrad
> >>
> >> Forgive an ignorant question, but it appears that both Nouveau and
> >> Radeon may use pci_map_page() when populating TTMs with pages obtained
> >> from the ordinary page pool (not the DMA pool). These pages will, if I
> >> understand things correctly, not be pages allocated with
> >> dma_alloc_coherent().
> >>
> >> From what I understand, at least for the underlying dma_map_page(),
> >> it's illegal for the CPU to access these pages without first calling
> >> dma_sync_xx_for_cpu(), and before the device is allowed to access them
> >> again you need to call dma_sync_xx_for_device().
> >> So mapping for PCI really invalidates the TTM interleaved CPU / device
> >> access model.
> >>
> > That's right. The API says you need to sync for device or CPU, but on
> > x86 you can get away with not doing so, as the sync calls there end up
> > being no more than write-buffer flushes.
> 
> OK, but what about the cases where the dma subsystem allocates a bounce
> buffer? (Although I think the TTM page selection works around this
> situation.) Perhaps at the very least this deserves a comment in the
> code...

Not doing the sync_for_* calls is always a violation of the dma-mapping
API and will rightfully fail on systems that rely on those mechanisms
for proper DMA memory handling; bounce buffers are just one such case.
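
For reference, the pattern the dma-mapping API expects for a streaming
mapping looks roughly like the sketch below; example_map_and_sync() is
just a made-up helper for illustration, not actual TTM or driver code:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

static int example_map_and_sync(struct device *dev, struct page *page)
{
        dma_addr_t addr;

        /* Streaming mapping, as pci_map_page()/dma_map_page() create. */
        addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, addr))
                return -ENOMEM;

        /* Before the CPU may touch the page again: */
        dma_sync_single_for_cpu(dev, addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
        /* ... CPU reads/writes ... */

        /* Before handing the page back to the device: */
        dma_sync_single_for_device(dev, addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
        /* ... device DMA ... */

        dma_unmap_page(dev, addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
        return 0;
}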

> >
> > For ARM, or similar non-coherent architectures, you absolutely have to
> > do the syncs, or you'll end up with different contents in cache vs.
> > sysram. For my nouveau-on-ARM work I introduced some simple helpers to
> > do the right thing, and it really isn't hard to do the syncs at the
> > right points in time: just sync for CPU when getting a cpu_prep ioctl,
> > and sync for device when validating a buffer for GPU use.
> 
> Yes, this will probably work for drivers where a buffer is either bound
> for CPU or for GPU. However, for drivers using user-space sub-allocation
> of buffers, or for partial updates of vertex buffers etc., that isn't
> sufficient. In that case one either has to use coherent memory, or
> implement an elaborate scheme where we sync for device and kill
> user-space mappings on validation, and sync for CPU in the CPU fault
> handler. Unfortunately the latter triggers a fence wait for the whole
> buffer, not just the part of the buffer we want to write to.
> >
Yeah, either you have to use DMA coherent memory, or implement some
scheme where you only sync subregions of a buffer. Though having to call
a cpu_prepare_subbuffer ioctl might just kill all the benefit you get
from using userspace suballocation, so using coherent memory for those
buffers seems like a safe bet.
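
For what it's worth, the coherent alternative boils down to something
like the sketch below (example_bo and the helpers are made-up names,
not nouveau structures); the CPU and the device then always see the
same contents, with no dma_sync_*() calls needed in between:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

struct example_bo {
        void *vaddr;
        dma_addr_t dma_addr;
        size_t size;
};

static int example_bo_alloc_coherent(struct device *dev,
                                     struct example_bo *bo, size_t size)
{
        bo->size = size;
        /* Coherent allocation: no sync calls needed for CPU/device access. */
        bo->vaddr = dma_alloc_coherent(dev, size, &bo->dma_addr, GFP_KERNEL);
        if (!bo->vaddr)
                return -ENOMEM;

        /* Userspace can suballocate from this mapping without extra syncs. */
        return 0;
}

static void example_bo_free_coherent(struct device *dev, struct example_bo *bo)
{
        dma_free_coherent(dev, bo->size, bo->vaddr, bo->dma_addr);
}

The trade-off, of course, is that on a non-coherent arch such a mapping
is typically uncached or write-combined, so CPU reads from it are slow.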

I already implemented some of this in the nouveau nv50 Mesa driver,
which uses userspace suballocation, but unfortunately I can't do any
serious performance measurements, as the system setup has other,
unrelated bottlenecks.

Regards,
Lucas
-- 
Pengutronix e.K.                           | Lucas Stach                 |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-5076 |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


