[Nouveau] [PATCH 3/6] mmu: map small pages into big pages(s) by IOMMU if possible

Alexandre Courbot gnurou at gmail.com
Mon Apr 20 00:49:55 PDT 2015


On Sat, Apr 18, 2015 at 12:37 AM, Terje Bergstrom <tbergstrom at nvidia.com> wrote:
>
> On 04/17/2015 02:11 AM, Alexandre Courbot wrote:
>>
>> Tracking the PDE and PTE of each memory chunk can probably be avoided
>> if you change your unmapping strategy. Currently you are going through
>> the list of nvkm_vm_bp_list, but you know your PDE and PTE are always
>> going to be adjacent, since a nvkm_vma represents a contiguous block
>> in the GPU VA. So when unmapping, you can simply check for each PTE
>> entry whether the IOMMU bit is set, and unmap from the IOMMU space
>> after unmapping from the GPU VA space, in a loop similar to that of
>> nvkm_vm_unmap_at().
>>
>> Then we only need priv. You are keeping the nvkm_mm_node of the IOMMU
>> space into it, and you need it to free the IOMMU VA space. If only we
>> could find another way to store it, we could get rid of the whole
>> structure and associated list_head in nvkm_vma...
>>
>> I need to give it some more thoughts, and we will probably need to
>> change a few things in base.c to make the hooks more flexible, so
>> please give me some more time to think about it. :) I just wanted to
>> share my thoughts so far in case this puts you on track.
>
> The way you described it would make GPU MMU and IOMMU mappings 1:1. So when
> we map a buffer to GPU MMU, we always map page by page the buffer also to
> IOMMU. There are disadvantages here.
>
> IOMMU addresses are global, and used in the GPU caches. When a buffer is
> mapped multiple times to different graphics contexts, we want to avoid cache
> aliasing by mapping the buffer only once to IOMMU. We also want to unmap the
> buffer from IOMMU only once after all the instances of the buffer have been
> unmapped, or only when the buffer is actually freed to cache IOMMU mappings.
>
> Doing IOMMU mapping for the whole buffer with dma_map_sg is also faster than
> mapping page by page, because you can do only one TLB invalidate at the end
> of the loop instead of after every page if you use dma_map_single.
>
> All of these argue for keeping the IOMMU and GMMU mapping loops separate.
> This patch set does not implement both the advantages above, but your
> suggestion would take us further away from that than Vince's version.

Aha, looks like both Vince and I overlooked this point. So IIUC we
would need to make sure a GPU buffer is only ever mapped once by the
IOMMU. This means we either need to map it entirely upfront at some
point and just keep a reference count, or keep track of which 128K
ranges are already mapped (again, with a reference count to know when
to unmap them).

The first solution is tempting because it is simpler, but surely there
is something wrong with it?
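
For comparison, the second option would mean keeping per-range state,
which is where the extra complexity comes from. Very roughly (again
with invented names, and assuming 128K big pages as on GK20A):

/* Sketch only: per-128K-range IOMMU refcounting, names invented. */
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/types.h>
#include <core/mm.h>              /* struct nvkm_mm_node */

#define GK20A_BIG_PAGE_SIZE	(128 << 10)

struct gk20a_iommu_range {
	u64 offset;                   /* 128K-aligned offset in the buffer */
	struct nvkm_mm_node *node;    /* IOMMU space backing this range */
	int refcount;                 /* GMMU mappings covering this range */
	struct list_head head;
};

/* Find an already-mapped 128K range, or NULL if it must be mapped. */
static struct gk20a_iommu_range *
gk20a_iommu_range_find(struct list_head *ranges, u64 offset)
{
	struct gk20a_iommu_range *range;

	offset = rounddown(offset, GK20A_BIG_PAGE_SIZE);
	list_for_each_entry(range, ranges, head) {
		if (range->offset == offset)
			return range;
	}

	return NULL;
}

On map you would find-or-create the range and bump its refcount, on
unmap decrement it and free the IOMMU space once it reaches zero. That
bookkeeping (and the list walk, or a better lookup structure) is
exactly what the first option avoids.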

