[Intel-gfx] [Linaro-mm-sig] [PATCH 1/2] dma-buf: Require VM_PFNMAP vma for mmap
Thomas Hellström (Intel)
thomas_os at shipmail.org
Thu Mar 11 15:37:45 UTC 2021
On 3/11/21 2:17 PM, Daniel Vetter wrote:
> On Thu, Mar 11, 2021 at 2:12 PM Thomas Hellström (Intel)
> <thomas_os at shipmail.org> wrote:
>> Hi!
>>
>> On 3/11/21 2:00 PM, Daniel Vetter wrote:
>>> On Thu, Mar 11, 2021 at 11:22:06AM +0100, Thomas Hellström (Intel) wrote:
>>>> On 3/1/21 3:09 PM, Daniel Vetter wrote:
>>>>> On Mon, Mar 1, 2021 at 11:17 AM Christian König
>>>>> <christian.koenig at amd.com> wrote:
>>>>>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
>>>>>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
>>>>>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
>>>>>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
>>>>>>>>>> <thomas_os at shipmail.org> wrote:
>>>>>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>>>>>>>>>>> So I think it stops gup. But I haven't verified at all. Would be good
>>>>>>>>>>>> if Christian can check this with some direct io to a buffer in system memory.
>>>>>>>>>>> Hmm,
>>>>>>>>>>>
>>>>>>>>>>> Docs (again, vm_normal_page()) say:
>>>>>>>>>>>
>>>>>>>>>>>  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>>>>>>>>>>>  * page" backing, however the difference is that _all_ pages with a struct
>>>>>>>>>>>  * page (that is, those where pfn_valid is true) are refcounted and considered
>>>>>>>>>>>  * normal pages by the VM. The disadvantage is that pages are refcounted
>>>>>>>>>>>  * (which can be slower and simply not an option for some PFNMAP users). The
>>>>>>>>>>>  * advantage is that we don't have to follow the strict linearity rule of
>>>>>>>>>>>  * PFNMAP mappings in order to support COWable mappings.
>>>>>>>>>>>
>>>>>>>>>>> but it's true that __vm_insert_mixed() ends up in the insert_pfn() path, so
>>>>>>>>>>> the above isn't really true, which makes me wonder if, and in that case why,
>>>>>>>>>>> there could any longer ever be a significant performance difference between
>>>>>>>>>>> MIXEDMAP and PFNMAP.
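For reference, a minimal sketch of the two insertion paths being compared
above, as a driver fault handler might use them. The function and its
surroundings are illustrative only, not code from any particular driver:

#include <linux/mm.h>
#include <linux/pfn_t.h>

/* Illustrative sketch: insert a single pfn either the VM_MIXEDMAP way or
 * the VM_PFNMAP way. On archs with pte_special() both end up creating a
 * special pte via insert_pfn(), which is what the question above is about.
 */
static vm_fault_t example_insert(struct vm_fault *vmf, unsigned long pfn)
{
	struct vm_area_struct *vma = vmf->vma;

	if (vma->vm_flags & VM_MIXEDMAP)
		/* pfn may or may not have a struct page behind it */
		return vmf_insert_mixed(vma, vmf->address,
					__pfn_to_pfn_t(pfn, PFN_DEV));

	/* VM_PFNMAP: never refcounted, gup must stay away */
	return vmf_insert_pfn(vma, vmf->address, pfn);
}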
>>>>>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
>>>>>>>>>> what sticks.
>>>>>>>>>>
>>>>>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that devmap
>>>>>>>>>>> hack, so they are (for the non-gup case) relying on vma_is_special_huge().
>>>>>>>>>>> For the gup case, I think the bug is still there.
>>>>>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
>>>>>>>>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
>>>>>>>>>> to find the underlying page.
>>>>>>>>>> -Daniel
>>>>>>>>> Hmm perhaps it might, but I don't think so. The fix I tried out was to set
>>>>>>>>> PFN_DEV | PFN_MAP for huge PTEs, which causes pfn_devmap() to be true, and
>>>>>>>>> then follow_devmap_pmd()->get_dev_pagemap() returns NULL and gup_fast()
>>>>>>>>> backs off.
>>>>>>>>>
>>>>>>>>> In the end that would mean setting in stone that "if there is a huge devmap
>>>>>>>>> page table entry for which we haven't registered any devmap struct pages
>>>>>>>>> (get_dev_pagemap returns NULL), we should treat that as a 'special' huge
>>>>>>>>> page table entry".
>>>>>>>>>
>>>>>>>>> From what I can tell, all code calling get_dev_pagemap() already does that,
>>>>>>>>> it's just a question of getting it accepted and formalizing it.
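To make the back-off mechanism above concrete, here is a simplified sketch
(not the exact mm/gup.c code) of the pattern a fast-gup style consumer
relies on when it meets a devmap entry:

#include <linux/memremap.h>

/* Simplified sketch: before taking any page references on a devmap entry,
 * a lockless walker looks up the registered dev_pagemap for the pfn. If
 * none is registered it backs off, which is exactly what makes a devmap
 * entry without backing struct pages behave as "special".
 */
static bool devmap_pin_allowed(unsigned long pfn)
{
	struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);

	if (!pgmap)
		return false;	/* no pagemap registered: back off */

	put_dev_pagemap(pgmap);
	return true;		/* refcounted devmap pages, pinning is fine */
}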
>>>>>>>> Oh I thought that's already how it works, since I didn't spot anything
>>>>>>>> else that would block gup_fast from falling over. I guess we'd really
>>>>>>>> need some testcases to make sure direct i/o (that's the easiest to test)
>>>>>>>> fails like we expect.
>>>>>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
>>>>>>> Otherwise pmd_devmap() will not return true and since there is no
>>>>>>> pmd_special() things break.
>>>>>> Is that maybe the issue we have seen with amdgpu and huge pages?
>>>>> Yeah, essentially when you have a hugepte inserted by ttm, and it
>>>>> happens to point at system memory, then gup will work on that. And
>>>>> create all kinds of havoc.
>>>>>
>>>>>> Apart from that I'm lost guys, that devmap and gup stuff is not
>>>>>> something I have a good knowledge of apart from a one mile high view.
>>>>> I'm not really better, hence would be good to do a testcase and see.
>>>>> This should provoke it:
>>>>> - allocate nicely aligned bo in system memory
>>>>> - mmap, again nicely aligned to 2M
>>>>> - do some direct io from a filesystem into that mmap, that should trigger gup
>>>>> - before the gup completes, free the mmap and bo so that ttm recycles
>>>>> the pages, which should trip up on the elevated refcount. If you wait
>>>>> until the direct io is complete, then I think nothing bad can be
>>>>> observed.
>>>>>
>>>>> Ofc if your amdgpu+hugepte issue is something else, then maybe we have
>>>>> another issue.
>>>>>
>>>>> Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
>>>>> -Daniel
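For illustration, the recipe above as a rough userspace sketch. The device
node, the data file and the way the BO fd is obtained are assumptions; a
real test would allocate and map the BO through the driver's own ioctls:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define BO_SIZE (2UL << 20)	/* 2 MiB, so a huge PMD entry is possible */

int main(void)
{
	/* Assumption: bo_fd is an mmap-able handle to a system-memory BO
	 * (normally obtained via driver-specific allocation ioctls). */
	int bo_fd = open("/dev/dri/renderD128", O_RDWR);
	int data_fd = open("testdata.bin", O_RDONLY | O_DIRECT);
	void *map;

	if (bo_fd < 0 || data_fd < 0)
		return EXIT_FAILURE;

	/* Steps 1+2: map the BO; a real test would also force 2 MiB
	 * alignment of the mapping, e.g. by over-allocating and trimming. */
	map = mmap(NULL, BO_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, bo_fd, 0);
	if (map == MAP_FAILED)
		return EXIT_FAILURE;

	/* Step 3: direct I/O into the mapping should trigger gup on the
	 * BO pages. */
	if (read(data_fd, map, BO_SIZE) < 0)
		perror("read");

	/* Step 4: tear down the mapping and BO while the pages may still be
	 * pinned; TTM recycling them should then trip over the elevated
	 * refcount. */
	munmap(map, BO_SIZE);
	close(bo_fd);
	close(data_fd);
	return EXIT_SUCCESS;
}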
>>>> So I did the following quick experiment on vmwgfx, and it turns out that
>>>> with it, fast gup never succeeds. Without the "| PFN_MAP", it typically
>>>> succeeds.
>>>>
>>>> I should probably craft an RFC formalizing this.
>>> Yeah I think that would be good. Maybe even more formalized if we also
>>> switch over to VM_PFNMAP, since afaiui these pte flags here only stop the
>>> fast gup path. And slow gup can still peek through VM_MIXEDMAP. Or
>>> something like that.
>>>
>>> Otoh your description of when it only sometimes succeeds would indicate my
>>> understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.
>> My understanding from reading the vmf_insert_mixed() code is that iff
>> the arch has pte_special(), VM_MIXEDMAP should be harmless. But that's
>> not consistent with the vm_normal_page() doc. For architectures without
>> pte_special, VM_PFNMAP must be used, and then we must also block COW
>> mappings.
>>
>> If we can get someone to commit to verifying that the potential PAT WC
>> performance issue is gone with PFNMAP, I can put together a series with
>> that included.
> Iirc when I checked there aren't many archs without pte_special, so I
> guess that's why we luck out. Hopefully.
>
>> As for existing userspace using COW TTM mappings, I once had a couple of
>> test cases to verify that it actually worked, in particular together
>> with huge PMDs and PUDs where breaking COW would imply splitting those,
>> but I can't think of anything else actually wanting to do that other
>> than by mistake.
> Yeah disallowing MAP_PRIVATE mappings would be another good thing to
> lock down. Really doesn't make much sense.
> -Daniel
Yes, we can't allow them with PFNMAP + a non-linear address space...
/Thomas
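A hedged sketch of what that lock-down could look like in a TTM/dma-buf
mmap path; the function name and the exact flag set are illustrative, not
the actual patch:

#include <linux/mm.h>

/* Illustrative sketch: enforce a pure PFNMAP mapping and reject writable
 * private (COW) mappings, which a non-linear PFN mapping cannot support. */
static int example_lockdown_mmap(struct vm_area_struct *vma)
{
	/* Writable but not shared, i.e. a COW mapping (open-coded
	 * is_cow_mapping()): refuse it. */
	if ((vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
		return -EINVAL;

	/* PFNMAP: ptes are special, never refcounted, gup cannot pin them. */
	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;

	/* pfn insertion then happens from the fault handler, e.g. with
	 * vmf_insert_pfn() or vmf_insert_pfn_pmd_prot() as in the diff
	 * quoted below. */
	return 0;
}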
>> /Thomas
>>
>>
>>> Christian, what's your take?
>>> -Daniel
>>>
>>>> /Thomas
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> index 6dc96cf66744..72b6fb17c984 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>          pfn_t pfnt;
>>>>          struct ttm_tt *ttm = bo->ttm;
>>>>          bool write = vmf->flags & FAULT_FLAG_WRITE;
>>>> +        struct dev_pagemap *pagemap;
>>>>
>>>>          /* Fault should not cross bo boundary. */
>>>>          page_offset &= ~(fault_page_size - 1);
>>>> @@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>          if ((pfn & (fault_page_size - 1)) != 0)
>>>>                  goto out_fallback;
>>>>
>>>> +        /*
>>>> +         * Huge entries must be special, that is marking them as devmap
>>>> +         * with no backing device map range. If there is a backing
>>>> +         * range, don't insert a huge entry.
>>>> +         */
>>>> +        pagemap = get_dev_pagemap(pfn, NULL);
>>>> +        if (pagemap) {
>>>> +                put_dev_pagemap(pagemap);
>>>> +                goto out_fallback;
>>>> +        }
>>>> +
>>>>          /* Check that memory is contiguous. */
>>>>          if (!bo->mem.bus.is_iomem) {
>>>>                  for (i = 1; i < fault_page_size; ++i) {
>>>> @@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>                  }
>>>>          }
>>>>
>>>> -        pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
>>>> +        pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
>>>>          if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
>>>>                  ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
>>>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> @@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault *vmf,
>>>>          if (ret != VM_FAULT_NOPAGE)
>>>>                  goto out_fallback;
>>>>
>>>> +#if 1
>>>> +        {
>>>> +                int npages;
>>>> +                struct page *page;
>>>> +
>>>> +                npages = get_user_pages_fast_only(vmf->address, 1, 0, &page);
>>>> +                if (npages == 1) {
>>>> +                        DRM_WARN("Fast gup succeeded. Bad.\n");
>>>> +                        put_page(page);
>>>> +                } else {
>>>> +                        DRM_INFO("Fast gup failed. Good.\n");
>>>> +                }
>>>> +        }
>>>> +#endif
>>>> +
>>>>          return VM_FAULT_NOPAGE;
>>>>  out_fallback:
>>>>          count_vm_event(THP_FAULT_FALLBACK);
>>>>
>>>>
>>>>
>>>>
>>>>
>
>