[Nouveau] [PATCH v2 2/3] drm/ttm: introduce dma cache sync helpers

Thu Jun 26 07:50:04 PDT 2014

On Tue, Jun 24, 2014 at 10:58 PM, Lucas Stach <l.stach at pengutronix.de> wrote:
> Am Dienstag, den 24.06.2014, 22:52 +0900 schrieb Alexandre Courbot:
>> On Tue, Jun 24, 2014 at 10:25 PM, Lucas Stach <l.stach at pengutronix.de> wrote:
>> > Am Dienstag, den 24.06.2014, 14:27 +0200 schrieb Maarten Lankhorst:
>> >> op 24-06-14 14:23, Alexandre Courbot schreef:
>> >> > On Tue, Jun 24, 2014 at 7:55 PM, Alexandre Courbot <acourbot at nvidia.com> wrote:
>> >> >> On 06/24/2014 07:33 PM, Alexandre Courbot wrote:
>> >> >>> On 06/24/2014 07:02 PM, Russell King - ARM Linux wrote:
>> >> >>>> On Tue, Jun 24, 2014 at 06:54:26PM +0900, Alexandre Courbot wrote:
>> >> >>>>> From: Lucas Stach <dev at lynxeye.de>
>> >> >>>>>
>> >> >>>>> On architectures for which access to GPU memory is non-coherent,
>> >> >>>>> caches need to be flushed and invalidated explicitly at the
>> >> >>>>> appropriate places. Introduce two small helpers to make things
>> >> >>>>> easy for TTM-based drivers.
>> >> >>>>
>> >> >>>> Have you run this with DMA API debugging enabled?  I suspect you haven't,
>> >> >>>> and I recommend that you do.
>> >> >>>
>> >> >>> # cat /sys/kernel/debug/dma-api/error_count
>> >> >>> 162621
>> >> >>>
>> >> >>> (╯°□°）╯︵ ┻━┻)
>> >> >>
>> >> >> *puts table back on its feet*
>> >> >>
>> >> >> So, yeah - TTM memory is not allocated using the DMA API, hence we cannot
>> >> >> use the DMA API to sync it. Thanks Russell for pointing it out.
>> >> >>
>> >> >> The only alternative I see here is to flush the CPU caches when syncing for
>> >> >> the device, and invalidate them for the other direction. Of course if the
>> >> >> device has caches on its side as well the opposite operation must also be
>> >> >> done for it. Guess the only way is to handle it all by ourselves here. :/
>> >> > ... and it really sucks. Basically if we cannot use the DMA API here
>> >> > we will lose the convenience of having a portable API that does just
>> >> > the right thing for the underlying platform. Without it we would have
>> >> > to duplicate arm_iommu_sync_single_for_cpu/device() and we would only
>> >> > have support for ARM.
>> >> >
>> >> > The usage of the DMA API that we are doing might be illegal, but in
>> >> > essence it does exactly what we need - at least for ARM. What are the
>> >> > alternatives?
>> >> Convert TTM to use the dma api? :-)
>> >
>> > Actually TTM already has a page alloc backend using the DMA API. It's
>> > just not used for the standard case right now.
>>
>> Indeed, and Nouveau even already makes use of it if CONFIG_SWIOTLB is
>> set apparently.
>>
>> > I would argue that we should just use this page allocator (which has the
>> > side effect of getting pages from CMA if available -> you are actually
>> > free to change the caching) and do away with the other allocator in the
>> > ARM case.
>>
>> Mm? Does it mean that CMA memory is not mapped into lowmem? That would
>> certainly help in the present case, but I wonder how useful it will be
>> once the iommu support is in place. Will also need to consider
>> performance of such coherent memory for e.g. user-space mappings.
>>
>> Anyway, I will experiment a bit with this tomorrow, thanks!
>
> CMA memory is reserved before the lowmem section mapping is set up. It
> is then mapped with individual 4k pages before giving it back to the
> buddy allocator.
> This means CMA pages in use by the kernel are mapped into lowmem, but
> they are actually unmapped from lowmem once you allocate them as DMA
> memory.

Tried enabling the DMA page allocation for GK20A. The great news is
that with it caching works as expected and that DMA API debugging does
not complain anymore when calling the sync functions. Actually, since
the DMA page allocator returns coherent memory, there is no need for
these sync functions anymore, which makes things easier. This seems to
be the simplest way towards enabling GK20A - albeit performance
suffers a little, but we can revisit that later once we have IOMMU
support.

I would not claim that we are fully compliant with what the DMA API
expects though. For instance, pages allocated for TTM via
dma_alloc_coherent() are later re-mapped in ttm_bo_kmap_ttm() using
vmap(), potentially with a different pgprot. The waste of address
space produced by these two simultaneous mappings aside, is this even
allowed? And when it comes to mapping these pages to user-space, TTM
does it without calling dma_mmap_coherent(), and again with whatever
flags we give it. IIUC memory allocated through the DMA API should
only be touched through the vaddr returned by dma_alloc_coherent(), or
through mappings provided by other DMA API functions. So is TTM
misusing the DMA API here?

In general, should we still be afraid of non-identical mappings on
modern CPUs like the A15 found in Tegra K1? I have heard contradictory
information so far and would really like to be able to understand this
once and for all, as it would give us more choices as for which memory
provider we can use.

Thanks,
Alex.