[Intel-gfx] [Linaro-mm-sig] [PATCH 1/2] dma-buf: Require VM_PFNMAP vma for mmap

Thu Mar 11 10:22:06 UTC 2021

On 3/1/21 3:09 PM, Daniel Vetter wrote:
> On Mon, Mar 1, 2021 at 11:17 AM Christian König
> <christian.koenig at amd.com> wrote:
>>
>>
>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
>>>>>> <thomas_os at shipmail.org> wrote:
>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>>>>>>> So I think it stops gup. But I haven't verified at all. Would be
>>>>>>>> good
>>>>>>>> if Christian can check this with some direct io to a buffer in
>>>>>>>> system
>>>>>>>> memory.
>>>>>>> Hmm,
>>>>>>>
>>>>>>> Docs (again vm_normal_page() say)
>>>>>>>
>>>>>>>      * VM_MIXEDMAP mappings can likewise contain memory with or
>>>>>>> without "struct
>>>>>>>      * page" backing, however the difference is that _all_ pages
>>>>>>> with a struct
>>>>>>>      * page (that is, those where pfn_valid is true) are refcounted
>>>>>>> and
>>>>>>> considered
>>>>>>>      * normal pages by the VM. The disadvantage is that pages are
>>>>>>> refcounted
>>>>>>>      * (which can be slower and simply not an option for some PFNMAP
>>>>>>> users). The
>>>>>>>      * advantage is that we don't have to follow the strict
>>>>>>> linearity rule of
>>>>>>>      * PFNMAP mappings in order to support COWable mappings.
>>>>>>>
>>>>>>> but it's true __vm_insert_mixed() ends up in the insert_pfn()
>>>>>>> path, so
>>>>>>> the above isn't really true, which makes me wonder if and in that
>>>>>>> case
>>>>>>> why there could any longer ever be a significant performance
>>>>>>> difference
>>>>>>> between MIXEDMAP and PFNMAP.
>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
>>>>>> what sticks.
>>>>>>
>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that
>>>>>>> devmap
>>>>>>> hack, so they are (for the non-gup case) relying on
>>>>>>> vma_is_special_huge(). For the gup case, I think the bug is still
>>>>>>> there.
>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
>>>>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
>>>>>> to find the underlying page.
>>>>>> -Daniel
>>>>> Hmm perhaps it might, but I don't think so. The fix I tried out was
>>>>> to set
>>>>>
>>>>> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be
>>>>> true, and
>>>>> then
>>>>>
>>>>> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and
>>>>> gup_fast()
>>>>> backs off,
>>>>>
>>>>> in the end that would mean setting in stone that "if there is a huge
>>>>> devmap
>>>>> page table entry for which we haven't registered any devmap struct
>>>>> pages
>>>>> (get_dev_pagemap returns NULL), we should treat that as a "special"
>>>>> huge
>>>>> page table entry".
>>>>>
>>>>>   From what I can tell, all code calling get_dev_pagemap() already
>>>>> does that,
>>>>> it's just a question of getting it accepted and formalizing it.
>>>> Oh I thought that's already how it works, since I didn't spot anything
>>>> else that would block gup_fast from falling over. I guess really would
>>>> need some testcases to make sure direct i/o (that's the easiest to test)
>>>> fails like we expect.
>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
>>> Otherwise pmd_devmap() will not return true and since there is no
>>> pmd_special() things break.
>> Is that maybe the issue we have seen with amdgpu and huge pages?
> Yeah, essentially when you have a hugepte inserted by ttm, and it
> happens to point at system memory, then gup will work on that. And
> create all kinds of havoc.
>
>> Apart from that I'm lost guys, that devmap and gup stuff is not
>> something I have a good knowledge of apart from a one mile high view.
> I'm not really better, hence would be good to do a testcase and see.
> This should provoke it:
> - allocate nicely aligned bo in system memory
> - mmap, again nicely aligned to 2M
> - do some direct io from a filesystem into that mmap, that should trigger gup
> - before the gup completes free the mmap and bo so that ttm recycles
> the pages, which should trip up on the elevated refcount. If you wait
> until the direct io is completely, then I think nothing bad can be
> observed.
>
> Ofc if your amdgpu+hugepte issue is something else, then maybe we have
> another issue.
>
> Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
> -Daniel

So I did the following quick experiment on vmwgfx, and it turns out that 
with it,
fast gup never succeeds. Without the "| PFN_MAP", it typically succeeds

I should probably craft an RFC formalizing this.

/Thomas

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 6dc96cf66744..72b6fb17c984 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct 
vm_fault *vmf,
         pfn_t pfnt;
         struct ttm_tt *ttm = bo->ttm;
         bool write = vmf->flags & FAULT_FLAG_WRITE;
+       struct dev_pagemap *pagemap;

         /* Fault should not cross bo boundary. */
         page_offset &= ~(fault_page_size - 1);
@@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct 
vm_fault *vmf,
         if ((pfn & (fault_page_size - 1)) != 0)
                 goto out_fallback;

+       /*
+        * Huge entries must be special, that is marking them as devmap
+        * with no backing device map range. If there is a backing
+        * range, Don't insert a huge entry.
+        */
+       pagemap = get_dev_pagemap(pfn, NULL);
+       if (pagemap) {
+               put_dev_pagemap(pagemap);
+               goto out_fallback;
+       }
+
         /* Check that memory is contiguous. */
         if (!bo->mem.bus.is_iomem) {
                 for (i = 1; i < fault_page_size; ++i) {
@@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct 
vm_fault *vmf,
                 }
         }

-       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
+       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
         if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
                 ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct 
vm_fault *vmf,
         if (ret != VM_FAULT_NOPAGE)
                 goto out_fallback;

+#if 1
+       {
+               int npages;
+               struct page *page;
+
+               npages = get_user_pages_fast_only(vmf->address, 1, 0, 
&page);
+               if (npages == 1) {
+                       DRM_WARN("Fast gup succeeded. Bad.\n");
+                       put_page(page);
+               } else {
+                       DRM_INFO("Fast gup failed. Good.\n");
+               }
+       }
+#endif
+
         return VM_FAULT_NOPAGE;
  out_fallback:
         count_vm_event(THP_FAULT_FALLBACK);