[RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
Kasireddy, Vivek
vivek.kasireddy at intel.com
Tue Aug 1 07:11:09 UTC 2023
Hi Peter,
> >
> > > > > > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > > > > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > > > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > > > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > > > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > > > > > use DMA fences to ensure that the components on the Host and
> > > > > > > > Guest do not access the buffer simultaneously.
> > > > > > >
> > > > > > > So why do you need to track updates proactively like this?
> > > > > > As David noted in the earlier series, if Qemu punches a hole in its memfd
> > > > > > that goes through pages that are registered against a udmabuf fd, then
> > > > > > udmabuf needs to update its list with new pages when the hole gets
> > > > > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > > > > problem (between udmabuf and memfd) as demonstrated in the selftest
> > > > > > (patch #3 in this series).
> > > > >
> > > > > Wouldn't this all be very much better if Qemu stopped punching holes there?
> > > > I think holes can be punched anywhere in the memfd for various reasons. Some
> > >
> > > I just start to read this thread, even haven't finished all of them.. but
> > > so far I'm not sure whether this is right at all..
> > >
> > > udmabuf is a file, it means it should follow the file semantics. Mmu
> > Right, it is a file but a special type of file given that it is a dmabuf. So, AFAIK,
> > operations such as truncate, FALLOC_FL_PUNCH_HOLE, etc cannot be done
> > on it. And, in our use-case, since udmabuf driver is sharing (or exporting) its
> > buffer (via the fd), consumers (or importers) of the dmabuf fd are expected
> > to only read from it.
> >
> > > notifier is per-mm, otoh.
> > >
> > > Imagine for some reason QEMU mapped the guest pages twice, udmabuf is
> > > created with vma1, so udmabuf registers the mm changes over vma1 only.
> > Udmabufs are created with pages obtained from the mapping using offsets
> > provided by Qemu.
> >
> > >
> > > However the shmem/hugetlb page cache can be populated in either vma1, or
> > > vma2. It means when populating on vma2 udmabuf won't get update notify at
> > > all, udmabuf pages can still be obsolete. Same thing to when multi-process
> > In this (unlikely) scenario you described above,
>
> IMHO it's very legal for qemu to do that, we won't want this to break so
> easily and silently simply because qemu mapped it twice. I would hope
> it'll not be myself to debug something like that. :)
>
> I actually personally have a tree that does exactly that:
>
> https://github.com/xzpeter/qemu/commit/62050626d6e511d022953165cc0f604bf90c5324
>
> But that's definitely not in main line.. it shouldn't need special
> attention, either. Just want to say that it can always happen for various
> reasons especially in a relatively involved software piece like QEMU.
Ok, I'll keep your use-case in mind but AFAICS, the process that creates
the udmabuf can be considered the owner. So, I think it makes sense that
the owner's VMA range can be registered (via mmu_notifiers) for updates.
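For readers who have not used the driver, the creation step I am referring to looks roughly like the sketch below from userspace. This is only an illustration, not what Qemu actually does: the helper names (create_guest_memfd, create_udmabuf) and the "guest-ram-example" name are made up here, and the memfd merely stands in for Guest ram. It relies on the UDMABUF_CREATE ioctl from <linux/udmabuf.h> and assumes /dev/udmabuf is present and accessible.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/*
 * Hypothetical helper: create a sealable memfd of `size` bytes that
 * stands in for Guest ram. Returns the memfd, or -1 on error.
 */
static int create_guest_memfd(size_t size)
{
	int memfd = memfd_create("guest-ram-example", MFD_ALLOW_SEALING);

	if (memfd < 0)
		return -1;
	/* udmabuf requires F_SEAL_SHRINK so the file cannot shrink under it. */
	if (ftruncate(memfd, size) < 0 ||
	    fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0) {
		close(memfd);
		return -1;
	}
	return memfd;
}

/*
 * Hypothetical helper: wrap [offset, offset + size) of the memfd in a
 * dmabuf. Both values must be page aligned. Returns the dmabuf fd, or
 * -1 if /dev/udmabuf is unavailable or the ioctl fails.
 */
static int create_udmabuf(int memfd, uint64_t offset, uint64_t size)
{
	struct udmabuf_create create = {
		.memfd = memfd,
		.flags = UDMABUF_FLAGS_CLOEXEC,
		.offset = offset,
		.size = size,
	};
	int devfd, buf;

	assert(memfd >= 0);
	devfd = open("/dev/udmabuf", O_RDWR);
	if (devfd < 0)
		return -1;
	/* On success the ioctl returns a new dmabuf fd backed by those pages. */
	buf = ioctl(devfd, UDMABUF_CREATE, &create);
	close(devfd);
	return buf;
}
```

The process that issues this ioctl is the one I would treat as the owner in the sense described above.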
>
> > I think we could still find all the
> > VMAs (and ranges) where the guest buffer pages are mapped (and register
> > for PTE updates) using Qemu's mm_struct. The below code can be modified
> > to create a list of VMAs where the guest buffer pages are mapped.
> > static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
> >                                                  struct mm_struct *vmm_mm)
> > {
> >         struct vm_area_struct *vma = NULL;
> >         MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
> >         unsigned long addr;
> >         pgoff_t pg;
> >
> >         mas_set(&mas, 0);
> >         mmap_read_lock(vmm_mm);
> >         mas_for_each(&mas, vma, ULONG_MAX) {
> >                 for (pg = 0; pg < ubuf->pagecount; pg++) {
> >                         addr = page_address_in_vma(ubuf->pages[pg], vma);
> >                         if (addr == -EFAULT)
> >                                 break;
> >                 }
> >                 if (addr != -EFAULT)
> >                         break;
> >         }
> >         mmap_read_unlock(vmm_mm);
> >
> >         return vma;
> > }
>
> This is hackish to me, and not working when across mm (multi-proc qemu).
The udmabuf backend is still considered experimental for multi-proc Qemu (i.e.,
Qemu + vhost-user-gpu, given our use-case). And, it looks like the udmabuf
driver is used differently in the two cases.
>
> >
> > > QEMU is used, where we can have vma1 in QEMU while vma2 in the other
> > > process like vhost-user.
> > >
> > > I think the trick here is we tried to "hide" the fact that these are
> > > actually normal file pages, but we're doing PFNMAP on them... then we want
> > > the file features back, like hole punching..
> > >
> > > If we used normal file operations, everything will just work fine; TRUNCATE
> > > will unmap the host mapped frame buffers when needed, and when accessed
> > > it'll fault on demand from the page cache. We seem to be trying to
> > > reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
> > > this at least..
> > If we can figure out the VMA ranges where the guest buffer pages are mapped,
> > we should be able to register mmu notifiers for those ranges right?
>
> In general, sorry to say that, but, mmu notifiers still do not sound like
> the right approach here.
What limitations do you see with using mmu notifiers for this use-case?
And, if using mmu notifiers is not the right approach, how do you suggest we
solve this problem?
>
> >
> > >
> > > > of the use-cases where this would be done were identified by David.
> > > > Here is what he said in an earlier discussion:
> > > > "There are *probably* more issues on the QEMU side when udmabuf is paired
> > > > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > > > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"
> > >
> > > Now after seeing this, I'm truly wondering whether we can still simply
> > > use the file semantics we already have (for either shmem/hugetlb/...), or
> > > is it a must we need to use a single fd to represent all?
> > >
> > > Say, can we just use a tuple (fd, page_array) rather than the udmabuf
> > > itself to do host zero-copy mapping? the page_array can be e.g. a list of
> > That (tuple) is essentially what we are doing (with udmabuf) but in a
> > standardized way that follows convention using the dmabuf buffer sharing
> > framework that all the importers (other drivers and userspace components)
> > know and understand.
> >
> > > file offsets that points to the pages (rather than pinning the pages using
> > If we are using the dmabuf framework, the pages must be pinned when the
> > importers map them.
>
> Oh so the pages are for DMAs from hardwares, rather than accessed by the
> host programs?
GPU DMA is the main use-case but the fd (i.e., pages) can be consumed in
different ways. For local display support, the fd can be imported by the Host
GPU driver for DMA (if Qemu is launched with gl=on) or mmap'd by Qemu
UI module (if gl=off). For remote display support, Qemu shares the fd with
Spice which can either encode it using CPU based algorithms or GPU based
ones (H264/H265/VP8/VP9) using Gstreamer.
>
> I really have merely zero knowledge from that aspect, sorry. If so I don't
> know how truncation can work with that, while keeping the page coherent.
>
> Hugh asked why not QEMU just doesn't do that truncation, I'll then ask the
It is not just about truncation. My goal with this patch is to ensure that when
one or more (guest buffer) pages in the memfd are affected in any way (moved,
migrated, etc), udmabuf would take corrective action after getting notified.
And, given that the guest buffer pages can be scattered anywhere in the rather
large memfd, it seems likely that one or more pages might be impacted when
various other features (such as virtio-mem/balloon, memory unplug) are enabled.
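To make the memfd side of this concrete, here is a minimal userspace sketch. It does not involve udmabuf at all, and the function name punch_replaces_page is made up for illustration; it only shows that once a hole is punched through a populated page, the next access lands in a different, zero-filled page. A udmabuf created before the punch would, as I understand it, still hold a reference to the old page, which is the coherency problem the selftest in patch #3 exercises.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo helper. Returns 1 if data written before a hole
 * punch is gone afterwards (the page cache page was replaced), 0 if it
 * survived, and -1 on error.
 */
static int punch_replaces_page(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	int memfd = memfd_create("guest-ram-example", 0);
	char *map;
	int ret = -1;

	assert(psize > 0);
	if (memfd < 0)
		return -1;
	if (ftruncate(memfd, psize) < 0)
		goto out_close;
	map = mmap(NULL, psize, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Populate the page; a udmabuf created now would pin this page. */
	memset(map, 0xaa, psize);

	/* Drop the page from the page cache, as a discard would. */
	if (fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, psize) < 0)
		goto out_unmap;

	/* The next access faults in a fresh, zero-filled page. */
	ret = (map[0] == 0);
	map[0] = 0x55;	/* lands in the new page, not the old one */

out_unmap:
	munmap(map, psize);
out_close:
	close(memfd);
	return ret;
}
```

The same replacement of the backing page is what I would expect after migration or other events that move the page, although this sketch only covers the hole-punch case.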
> same. Probably virtio-mem will not be able to work. I think postcopy will
> not be affected - postcopy only drops pages at very early stage of dest
> QEMU, not after VM started there, so either not affected or maybe there's
> chance it'll work.
>
> IIUC it's then the same as VFIO attached then we try to blow some pages
> away from anything like virtio-balloon - AFAIR qemu just explicitly don't
> allow that to happen. See vfio_ram_block_discard_disable().
>
> >
> > > FOLL_GET). The good thing is then the fd can be the guest memory file
> > > itself. With that, we can mmap() over the shmem/hugetlb in whatever vma
> > > and whatever process. Truncation (and actually everything... e.g. page
> > > migration, swapping, ... which will be disabled if we use PFNMAP pins) will
> > > just all start to work, afaiu.
> > IIUC, we'd not be able to use the fd of the guest memory file because the
> > dmabuf fds are expected to have constant size that reflects the size of the
> > buffer that is being shared. I just don't think it'd be feasible given all the
> > other restrictions:
> > https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html?highlight=dma_buf#userspace-interface-notes
>
> Yeah I also don't know well on the dmabuf APIs, but I think if the page
> must be pinned for real world DMA then it's already another story to me..
Right, the pages need to stay pinned for as long as an importer is using
them (i.e., while a mapping exists).
> what I said on the [guest_mem_fd, offset_array] tuple idea could only (if
> still possible..) work if the udmabuf access is only from the processor
> side, never from the device.
As of now, the GPU is the device that accesses the pages directly, in addition
to the CPU, but there are already patches (on qemu-devel) to facilitate DMA
access from other devices on the Host.
Thanks,
Vivek
>
> Thanks,
>
> --
> Peter Xu
>