[Nouveau] [PATCH 3/6] mmu: map small pages into big page(s) by IOMMU if possible
Vince Hsu
vinceh at nvidia.com
Mon Apr 20 04:55:01 PDT 2015
On 04/20/2015 07:51 PM, Vince Hsu wrote:
> On 04/20/2015 03:49 PM, Alexandre Courbot wrote:
>> On Sat, Apr 18, 2015 at 12:37 AM, Terje Bergstrom
>> <tbergstrom at nvidia.com> wrote:
>>> On 04/17/2015 02:11 AM, Alexandre Courbot wrote:
>>>> Tracking the PDE and PTE of each memory chunk can probably be avoided
>>>> if you change your unmapping strategy. Currently you are going through
>>>> the list of nvkm_vm_bp_list, but you know your PDE and PTE are always
>>>> going to be adjacent, since a nvkm_vma represents a contiguous block
>>>> in the GPU VA. So when unmapping, you can simply check for each PTE
>>>> entry whether the IOMMU bit is set, and unmap from the IOMMU space
>>>> after unmapping from the GPU VA space, in a loop similar to that of
>>>> nvkm_vm_unmap_at().
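
A rough sketch of that unmap loop, for illustration only; the IOMMU flag bit and the 128K big-page size are assumptions, not the real nouveau PTE layout, and none of the names below come from the patch:

#include <linux/iommu.h>
#include <linux/bitops.h>
#include <linux/sizes.h>

#define BIG_PAGE_SIZE	SZ_128K		/* assumed big-page size */
#define PTE_IOMMU_FLAG	BIT_ULL(63)	/* placeholder bit, not the real PTE layout */

/*
 * Illustration of the loop described above: walk the PTEs covered by one
 * contiguous GPU VA range, clear each GPU PTE, and release the matching
 * IOMMU mapping whenever the (placeholder) IOMMU flag was set in it.
 */
static void
unmap_range_sketch(struct iommu_domain *domain, u64 *pte, u32 num_ptes,
		   unsigned long iommu_base)
{
	u32 i;

	for (i = 0; i < num_ptes; i++) {
		bool was_iommu = pte[i] & PTE_IOMMU_FLAG;

		pte[i] = 0;		/* unmap from the GPU VA space first */
		if (was_iommu)		/* then release the IOMMU side */
			iommu_unmap(domain, iommu_base + i * BIG_PAGE_SIZE,
				    BIG_PAGE_SIZE);
	}
}
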
>>>>
>>>> Then we only need priv. You are keeping the nvkm_mm_node of the IOMMU
>>>> space into it, and you need it to free the IOMMU VA space. If only we
>>>> could find another way to store it, we could get rid of the whole
>>>> structure and associated list_head in nvkm_vma...
>>>>
>>>> I need to give it some more thought, and we will probably need to
>>>> change a few things in base.c to make the hooks more flexible, so
>>>> please give me some more time to think about it. :) I just wanted to
>>>> share my thoughts so far in case this puts you on track.
>>> The way you described it would make GPU MMU and IOMMU mappings 1:1.
>>> So when we map a buffer to the GPU MMU, we always also map the buffer
>>> page by page to the IOMMU. There are disadvantages here.
>>>
>>> IOMMU addresses are global and used in the GPU caches. When a buffer
>>> is mapped multiple times into different graphics contexts, we want to
>>> avoid cache aliasing by mapping the buffer only once through the
>>> IOMMU. We also want to unmap the buffer from the IOMMU only once,
>>> after all the instances of the buffer have been unmapped, or only
>>> when the buffer is actually freed, so that IOMMU mappings are cached.
>>>
>>> Doing the IOMMU mapping for the whole buffer with dma_map_sg is also
>>> faster than mapping page by page, because you can do a single TLB
>>> invalidate at the end instead of one after every page, as you would
>>> with dma_map_single.
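
For illustration, the two approaches look roughly like this with the generic DMA API; the buffer is assumed to already exist as an sg_table or an array of pages:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Whole-buffer mapping: the IOMMU driver can batch the page mappings and
 * flush the IOTLB once for the entire scatterlist.
 */
static int map_whole_buffer(struct device *dev, struct sg_table *sgt)
{
	int mapped = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);

	return mapped ? 0 : -ENOMEM;
}

/*
 * Page-by-page mapping: every dma_map_single() call may pay for its own
 * IOTLB maintenance, which is the overhead described above.
 */
static int map_page_by_page(struct device *dev, struct page **pages,
			    unsigned int npages, dma_addr_t *dma_addrs)
{
	unsigned int i;

	for (i = 0; i < npages; i++) {
		dma_addrs[i] = dma_map_single(dev, page_address(pages[i]),
					      PAGE_SIZE, DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_addrs[i]))
			return -ENOMEM;
	}
	return 0;
}
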
>>>
>>> All of this argues for keeping the IOMMU and GMMU mapping loops
>>> separate. This patch set does not implement both of the advantages
>>> above, but your suggestion would take us further away from that than
>>> Vince's version does.
>> Aha, looks like both Vince and I overlooked this point. So IIUC we
>> would need to make sure a GPU buffer is only ever mapped once by the
>> IOMMU. This means we either need to preemptively map it entirely at
>> some point and just keep a reference count, or keep track of which
>> 128k ranges are already mapped (again, with a reference count to know
>> when to unmap them).
>>
>> First solution is tempting because it is simpler, but surely there is
>> something wrong with it?
> We can maintain a list in mmu/gk20a.c and use the first small page's
> address as a search index for the corresponding 128K page.
>
> e.g.
>
> --- a/drm/nouveau/nvkm/subdev/mmu/gk20a.c
> +++ b/drm/nouveau/nvkm/subdev/mmu/gk20a.c
> @@ -38,6 +38,8 @@ struct gk20a_mmu_priv {
>  struct gk20a_mmu_iommu_mapping {
>  	struct nvkm_mm_node *node;
>  	u64 iova;
> +	u64 index;
> +	struct list_head head;
>  };
>
>  extern const u8 gf100_pte_storage_type_map[256];
> @@ -79,6 +81,14 @@ gk20a_vm_map_iommu(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,
>
>  	plat = nv_device_to_platform(nv_device(&mmu->base));
>
> +	list_for_each_entry(mapping, mapping_list, head) {
> +		if (mapping->index == *list) {
> +			mapping->refcnt++;
> +			*priv = mapping;
Bad example. We have to program the PTE for the new GVA before returning here.
> +			return;
> +		}
> +	}
> +
>  	*priv = kzalloc(sizeof(struct gk20a_mmu_iommu_mapping),
>  			GFP_KERNEL);
>  	if (!*priv)
>  		return;
> @@ -95,6 +105,8 @@ gk20a_vm_map_iommu(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,
>  	if (ret)
>  		return;
>
> +	priv->index = *list;
> +
>  	for (i = 0; i < npages; i++, list++) {
>  		ret = iommu_map(plat->gpu->iommu.domain,
>  				(node->offset + i) << PAGE_SHIFT,
> @@ -117,6 +129,7 @@ gk20a_vm_map_iommu(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,
>  	p = *priv;
>  	p->node = node;
>  	p->iova = node->offset << PAGE_SHIFT;
> +	list_add_tail(&priv->head, mapping_list);
>  }
>
>
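
For reference, a self-contained sketch of the lookup idea above: a list of cached IOMMU mappings keyed by the first small page's address, with a reference count so a 128K region is mapped through the IOMMU only once and unmapped only when its last user goes away. All names are illustrative, locking and iommu_map() error handling are omitted, and this is not the actual gk20a.c code:

#include <linux/iommu.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/sizes.h>

struct cached_iommu_mapping {
	struct list_head head;
	u64 key;		/* address of the first small page in the region */
	unsigned long iova;	/* IOMMU VA of the 128K big page */
	unsigned int refcnt;	/* GPU VA mappings currently using this region */
};

static LIST_HEAD(cached_mappings);	/* needs a lock in real code */

/*
 * Look up or create the single IOMMU mapping for a group of small pages,
 * keyed by the first page's address.  A hit only bumps the refcount, so the
 * same buffer is never mapped through the IOMMU twice; the caller still has
 * to program its own GPU PTEs in both cases.
 */
static struct cached_iommu_mapping *
cached_iommu_map(struct iommu_domain *domain, unsigned long iova,
		 const phys_addr_t *pages, unsigned int npages)
{
	struct cached_iommu_mapping *m;
	unsigned int i;

	list_for_each_entry(m, &cached_mappings, head) {
		if (m->key == pages[0]) {
			m->refcnt++;
			return m;
		}
	}

	m = kzalloc(sizeof(*m), GFP_KERNEL);
	if (!m)
		return NULL;

	for (i = 0; i < npages; i++)	/* error handling omitted */
		iommu_map(domain, iova + i * PAGE_SIZE, pages[i],
			  PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);

	m->key = pages[0];
	m->iova = iova;
	m->refcnt = 1;
	list_add_tail(&m->head, &cached_mappings);
	return m;
}

/* Drop one user; tear down the IOMMU mapping only when the last one is gone. */
static void cached_iommu_unmap(struct iommu_domain *domain,
			       struct cached_iommu_mapping *m)
{
	if (--m->refcnt)
		return;

	iommu_unmap(domain, m->iova, SZ_128K);
	list_del(&m->head);
	kfree(m);
}
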
> Thanks,
> Vince
>
>
>