[Nouveau] [PATCH 3/6] mmu: map small pages into big page(s) by IOMMU if possible

Vince Hsu vinceh at nvidia.com
Mon Apr 20 04:51:25 PDT 2015


On 04/20/2015 03:49 PM, Alexandre Courbot wrote:
> On Sat, Apr 18, 2015 at 12:37 AM, Terje Bergstrom <tbergstrom at nvidia.com> wrote:
>> On 04/17/2015 02:11 AM, Alexandre Courbot wrote:
>>> Tracking the PDE and PTE of each memory chunk can probably be avoided
>>> if you change your unmapping strategy. Currently you are going through
>>> the list of nvkm_vm_bp_list, but you know your PDE and PTE are always
>>> going to be adjacent, since a nvkm_vma represents a contiguous block
>>> in the GPU VA. So when unmapping, you can simply check for each PTE
>>> entry whether the IOMMU bit is set, and unmap from the IOMMU space
>>> after unmapping from the GPU VA space, in a loop similar to that of
>>> nvkm_vm_unmap_at().
>>>
>>> Then we only need priv. You are keeping the nvkm_mm_node of the IOMMU
>>> space in it, and you need it to free the IOMMU VA space. If only we
>>> could find another way to store it, we could get rid of the whole
>>> structure and associated list_head in nvkm_vma...
>>>
>>> I need to give it some more thought, and we will probably need to
>>> change a few things in base.c to make the hooks more flexible, so
>>> please give me some more time to think about it. :) I just wanted to
>>> share my thoughts so far in case this puts you on track.
>> The way you described it would make GPU MMU and IOMMU mappings 1:1. So when
>> we map a buffer to the GPU MMU, we would also always map the buffer page by
>> page to the IOMMU. There are disadvantages here.
>>
>> IOMMU addresses are global, and are used in the GPU caches. When a buffer is
>> mapped multiple times into different graphics contexts, we want to avoid cache
>> aliasing by mapping the buffer only once to the IOMMU. We also want to unmap
>> the buffer from the IOMMU only once, after all the instances of the buffer
>> have been unmapped, or only when the buffer is actually freed, in order to
>> cache IOMMU mappings.
>>
>> Doing the IOMMU mapping for the whole buffer with dma_map_sg is also faster
>> than mapping page by page, because you can do a single TLB invalidate at the
>> end of the loop instead of one after every page, as with dma_map_single.
>>
>> All of these argue for keeping the IOMMU and GMMU mapping loops separate.
>> This patch set does not yet implement both of the advantages above, but your
>> suggestion would take us further away from them than Vince's version does.
> Aha, looks like both Vince and I overlooked this point. So IIUC we
> would need to make sure a GPU buffer is only ever mapped once by the
> IOMMU. This means we either need to preemptively map it entirely at
> some point and just keep a reference count, or keep track of which
> 128k ranges are already mapped (again, with a reference count to know
> when to unmap them).
>
> First solution is tempting because it is simpler, but surely there is
> something wrong with it?
We can maintain a list in mmu/gk20a.c and use the first small page's 
address as a search index for the corresponding 128K page.

e.g.

--- a/drm/nouveau/nvkm/subdev/mmu/gk20a.c
+++ b/drm/nouveau/nvkm/subdev/mmu/gk20a.c
@@ -38,6 +38,9 @@ struct gk20a_mmu_priv {
  struct gk20a_mmu_iommu_mapping {
         struct nvkm_mm_node *node;
         u64 iova;
+       u64 index;
+       u32 refcnt;
+       struct list_head head;
  };

  extern const u8 gf100_pte_storage_type_map[256];
@@ -79,6 +82,14 @@ gk20a_vm_map_iommu(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,

         plat = nv_device_to_platform(nv_device(&mmu->base));

+       list_for_each_entry(mapping, mapping_list, head) {
+               if (mapping->index == *list) {
+                       mapping->refcnt++;
+                       *priv = mapping;
+                       return;
+               }
+       }
+
         *priv = kzalloc(sizeof(struct gk20a_mmu_iommu_mapping), GFP_KERNEL);
         if (!*priv)
                 return;
@@ -95,6 +106,10 @@ gk20a_vm_map_iommu(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,
         if (ret)
                 return;

+       p = *priv;
+       p->index = *list;
+       p->refcnt = 1;
+
         for (i = 0; i < npages; i++, list++) {
                 ret = iommu_map(plat->gpu->iommu.domain,
                                 (node->offset + i) << PAGE_SHIFT,
@@ -117,6 +132,7 @@ gk20a_vm_map_iommu(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,
         p = *priv;
         p->node = node;
         p->iova = node->offset << PAGE_SHIFT;
+       list_add_tail(&p->head, mapping_list);
  }
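
For completeness, the unmap side would then only drop a reference, and
tear down the IOMMU mapping once the last GPU VA mapping of the buffer
is gone. Very roughly, something like this (the hook signature, the way
we reach gk20a_mmu_priv and the iommu.mm field are just assumptions
here, and locking around the list and refcount is omitted):

/* illustrative sketch only, not part of the patch */
static void
gk20a_vm_unmap_iommu(struct gk20a_mmu_priv *mmu, void *priv)
{
        struct gk20a_mmu_iommu_mapping *p = priv;
        struct nouveau_platform_device *plat;

        /* other GPU VA mappings of the buffer still use the big pages */
        if (--p->refcnt > 0)
                return;

        plat = nv_device_to_platform(nv_device(&mmu->base));

        /* one unmap call for the whole range, so a single TLB flush */
        iommu_unmap(plat->gpu->iommu.domain, p->iova,
                    p->node->length << PAGE_SHIFT);

        /* return the IOMMU VA range to the allocator */
        nvkm_mm_free(plat->gpu->iommu.mm, &p->node);

        list_del(&p->head);
        kfree(p);
}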


Thanks,
Vince




