[Linaro-mm-sig] [PATCH 1/2] dma-buf: Require VM_PFNMAP vma for mmap

Fri Feb 26 13:28:39 UTC 2021

On Fri, Feb 26, 2021 at 10:41 AM Thomas Hellström (Intel)
<thomas_os at shipmail.org> wrote:
>
>
> On 2/25/21 4:49 PM, Daniel Vetter wrote:
> > On Thu, Feb 25, 2021 at 11:44 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> >> On Thu, Feb 25, 2021 at 11:28:31AM +0100, Christian König wrote:
> >>> Am 24.02.21 um 10:31 schrieb Daniel Vetter:
> >>>> On Wed, Feb 24, 2021 at 10:16 AM Thomas Hellström (Intel)
> >>>> <thomas_os at shipmail.org> wrote:
> >>>>> On 2/24/21 9:45 AM, Daniel Vetter wrote:
> >>>>>> On Wed, Feb 24, 2021 at 8:46 AM Thomas Hellström (Intel)
> >>>>>> <thomas_os at shipmail.org> wrote:
> >>>>>>> On 2/23/21 11:59 AM, Daniel Vetter wrote:
> >>>>>>>> tldr; DMA buffers aren't normal memory, expecting that you can use
> >>>>>>>> them like that (like calling get_user_pages works, or that they're
> >>>>>>>> accounting like any other normal memory) cannot be guaranteed.
> >>>>>>>>
> >>>>>>>> Since some userspace only runs on integrated devices, where all
> >>>>>>>> buffers are actually all resident system memory, there's a huge
> >>>>>>>> temptation to assume that a struct page is always present and useable
> >>>>>>>> like for any more pagecache backed mmap. This has the potential to
> >>>>>>>> result in a uapi nightmare.
> >>>>>>>>
> >>>>>>>> To stop this gap require that DMA buffer mmaps are VM_PFNMAP, which
> >>>>>>>> blocks get_user_pages and all the other struct page based
> >>>>>>>> infrastructure for everyone. In spirit this is the uapi counterpart to
> >>>>>>>> the kernel-internal CONFIG_DMABUF_DEBUG.
> >>>>>>>>
> >>>>>>>> Motivated by a recent patch which wanted to swich the system dma-buf
> >>>>>>>> heap to vm_insert_page instead of vm_insert_pfn.
> >>>>>>>>
> >>>>>>>> v2:
> >>>>>>>>
> >>>>>>>> Jason brought up that we also want to guarantee that all ptes have the
> >>>>>>>> pte_special flag set, to catch fast get_user_pages (on architectures
> >>>>>>>> that support this). Allowing VM_MIXEDMAP (like VM_SPECIAL does) would
> >>>>>>>> still allow vm_insert_page, but limiting to VM_PFNMAP will catch that.
> >>>>>>>>
> >>>>>>>>     From auditing the various functions to insert pfn pte entires
> >>>>>>>> (vm_insert_pfn_prot, remap_pfn_range and all it's callers like
> >>>>>>>> dma_mmap_wc) it looks like VM_PFNMAP is already required anyway, so
> >>>>>>>> this should be the correct flag to check for.
> >>>>>>>>
> >>>>>>> If we require VM_PFNMAP, for ordinary page mappings, we also need to
> >>>>>>> disallow COW mappings, since it will not work on architectures that
> >>>>>>> don't have CONFIG_ARCH_HAS_PTE_SPECIAL, (see the docs for vm_normal_page()).
> >>>>>> Hm I figured everyone just uses MAP_SHARED for buffer objects since
> >>>>>> COW really makes absolutely no sense. How would we enforce this?
> >>>>> Perhaps returning -EINVAL on is_cow_mapping() at mmap time. Either that
> >>>>> or allowing MIXEDMAP.
> >>>>>
> >>>>>>> Also worth noting is the comment in  ttm_bo_mmap_vma_setup() with
> >>>>>>> possible performance implications with x86 + PAT + VM_PFNMAP + normal
> >>>>>>> pages. That's a very old comment, though, and might not be valid anymore.
> >>>>>> I think that's why ttm has a page cache for these, because it indeed
> >>>>>> sucks. The PAT changes on pages are rather expensive.
> >>>>> IIRC the page cache was implemented because of the slowness of the
> >>>>> caching mode transition itself, more specifically the wbinvd() call +
> >>>>> global TLB flush.
> >>> Yes, exactly that. The global TLB flush is what really breaks our neck here
> >>> from a performance perspective.
> >>>
> >>>>>> There is still an issue for iomem mappings, because the PAT validation
> >>>>>> does a linear walk of the resource tree (lol) for every vm_insert_pfn.
> >>>>>> But for i915 at least this is fixed by using the io_mapping
> >>>>>> infrastructure, which does the PAT reservation only once when you set
> >>>>>> up the mapping area at driver load.
> >>>>> Yes, I guess that was the issue that the comment describes, but the
> >>>>> issue wasn't there with vm_insert_mixed() + VM_MIXEDMAP.
> >>>>>
> >>>>>> Also TTM uses VM_PFNMAP right now for everything, so it can't be a
> >>>>>> problem that hurts much :-)
> >>>>> Hmm, both 5.11 and drm-tip appears to still use MIXEDMAP?
> >>>>>
> >>>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/ttm/ttm_bo_vm.c#L554
> >>>> Uh that's bad, because mixed maps pointing at struct page wont stop
> >>>> gup. At least afaik.
> >>> Hui? I'm pretty sure MIXEDMAP stops gup as well. Otherwise we would have
> >>> already seen tons of problems with the page cache.
> >> On any architecture which has CONFIG_ARCH_HAS_PTE_SPECIAL vm_insert_mixed
> >> boils down to vm_insert_pfn wrt gup. And special pte stops gup fast path.
> >>
> >> But if you don't have VM_IO or VM_PFNMAP set, then I'm not seeing how
> >> you're stopping gup slow path. See check_vma_flags() in mm/gup.c.
> >>
> >> Also if you don't have CONFIG_ARCH_HAS_PTE_SPECIAL then I don't think
> >> vm_insert_mixed even works on iomem pfns. There's the devmap exception,
> >> but we're not devmap. Worse ttm abuses some accidental codepath to smuggle
> >> in hugepte support by intentionally not being devmap.
> >>
> >> So I'm really not sure this works as we think it should. Maybe good to do
> >> a quick test program on amdgpu with a buffer in system memory only and try
> >> to do direct io into it. If it works, you have a problem, and a bad one.
> > That's probably impossible, since a quick git grep shows that pretty
> > much anything reasonable has special ptes: arc, arm, arm64, powerpc,
> > riscv, s390, sh, sparc, x86. I don't think you'll have a platform
> > where you can plug an amdgpu in and actually exercise the bug :-)
>
> Hm. AFAIK _insert_mixed() doesn't set PTE_SPECIAL on system pages, so I
> don't see what should be stopping gup to those?

If you have an arch with pte special we use insert_pfn(), which afaict
will use pte_mkspecial for the !devmap case. And ttm isn't devmap
(otherwise our hugepte abuse of devmap hugeptes would go rather
wrong).

So I think it stops gup. But I haven't verified at all. Would be good
if Christian can check this with some direct io to a buffer in system
memory.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch