[RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

Kasireddy, Vivek vivek.kasireddy at intel.com
Sat Jul 29 00:08:25 UTC 2023


Hi Peter,

> > > > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> > > > > > components
> > > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > > > use DMA fences to ensure that the components on the Host and
> > > > > > Guest do not access the buffer simultaneously.
> > > > >
> > > > > So why do you need to track updates proactively like this?
> > > > As David noted in the earlier series, if Qemu punches a hole in its
> > > > memfd
> > > > that goes through pages that are registered against a udmabuf fd, then
> > > > udmabuf needs to update its list with new pages when the hole gets
> > > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > > problem (between udmabuf and memfd) as demonstrated in the
> > > > selftest
> > > > (patch #3 in this series).
> > >
> > > Wouldn't this all be very much better if Qemu stopped punching holes
> > > there?
> > I think holes can be punched anywhere in the memfd for various reasons.
> > Some
> 
> I just start to read this thread, even haven't finished all of them.. but
> so far I'm not sure whether this is right at all..
> 
> udmabuf is a file, it means it should follow the file semantics. Mmu
Right, it is a file, but a special type of file given that it is a dmabuf.
So, AFAIK, operations such as truncate, FALLOC_FL_PUNCH_HOLE, etc. cannot be
performed on it. And, in our use-case, since the udmabuf driver is sharing
(or exporting) its buffer (via the fd), consumers (or importers) of the
dmabuf fd are expected to only read from it.

> notifier is per-mm, otoh.
> 
> Imagine for some reason QEMU mapped the guest pages twice, udmabuf is
> created with vma1, so udmabuf registers the mm changes over vma1 only.
Udmabufs are created from pages obtained from the memfd's mapping (page
cache) at the offsets provided by Qemu.
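
For reference, here is a minimal userspace sketch (not Qemu's actual code)
of how such a udmabuf would be created from a memfd region via the
UDMABUF_CREATE ioctl; the memfd is assumed to already have F_SEAL_SHRINK
applied, as the udmabuf driver requires:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* Wrap a page-aligned range of the memfd (Guest ram backed by shmem or
 * hugetlbfs) in a dmabuf fd; returns the dmabuf fd or -1 on error. */
static int create_udmabuf(int memfd, __u64 offset, __u64 size)
{
        struct udmabuf_create create = {
                .memfd  = memfd,
                .flags  = UDMABUF_FLAGS_CLOEXEC,
                .offset = offset,  /* must be page-aligned */
                .size   = size,    /* must be a multiple of the page size */
        };
        int devfd, dmabuf_fd;

        devfd = open("/dev/udmabuf", O_RDWR);
        if (devfd < 0)
                return -1;

        dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);
        close(devfd);
        return dmabuf_fd;
}

For buffer pages scattered inside the memfd, multiple such ranges can be
passed in a single call with UDMABUF_CREATE_LIST.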

> 
> However the shmem/hugetlb page cache can be populated in either vma1, or
> vma2.  It means when populating on vma2 udmabuf won't get update notify
> at
> all, udmabuf pages can still be obsolete.  Same thing to when multi-process
In this (unlikely) scenario you described above, I think we could still
find all the VMAs (and ranges) in which the guest buffer pages are mapped
(and register for PTE updates) by walking Qemu's mm_struct. The code below
could be modified to build a list of such VMAs:
static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
                                                 struct mm_struct *vmm_mm)
{
        struct vm_area_struct *vma = NULL;
        MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
        unsigned long addr;
        pgoff_t pg;

        mas_set(&mas, 0);
        mmap_read_lock(vmm_mm);
        mas_for_each(&mas, vma, ULONG_MAX) {
                /* Skip this VMA as soon as one buffer page is not mapped in it */
                for (pg = 0; pg < ubuf->pagecount; pg++) {
                        addr = page_address_in_vma(ubuf->pages[pg], vma);
                        if (addr == -EFAULT)
                                break;
                }
                /* All of the buffer pages are mapped in this VMA; use it */
                if (addr != -EFAULT)
                        break;
        }
        mmap_read_unlock(vmm_mm);

        return vma;
}

> QEMU is used, where we can have vma1 in QEMU while vma2 in the other
> process like vhost-user.
> 
> I think the trick here is we tried to "hide" the fact that these are
> actually normal file pages, but we're doing PFNMAP on them... then we want
> the file features back, like hole punching..
> 
> If we used normal file operations, everything will just work fine; TRUNCATE
> will unmap the host mapped frame buffers when needed, and when
> accessed
> it'll fault on demand from the page cache.  We seem to be trying to
> reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
> this at least..
If we can figure out the VMA ranges where the guest buffer pages are
mapped, we should be able to register mmu notifiers for those ranges,
right?
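
To make that concrete, here is a rough sketch (not what this series does;
the "notifier" member of struct udmabuf and the callback are hypothetical)
of how a per-range registration could look with the existing
mmu_interval_notifier API, once a covering VMA has been found:

static bool udmabuf_invalidate(struct mmu_interval_notifier *mni,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq)
{
        struct udmabuf *ubuf = container_of(mni, struct udmabuf, notifier);

        mmu_interval_set_seq(mni, cur_seq);
        /* Here, the entries of ubuf->pages covered by [range->start,
         * range->end) would be marked stale and re-looked-up from the
         * memfd's mapping before the next use. */
        pr_debug("udmabuf %p: range [%lx, %lx) invalidated\n",
                 ubuf, range->start, range->end);
        return true;
}

static const struct mmu_interval_notifier_ops udmabuf_notifier_ops = {
        .invalidate = udmabuf_invalidate,
};

static int udmabuf_register_range(struct udmabuf *ubuf, struct mm_struct *mm,
                                  struct vm_area_struct *vma)
{
        return mmu_interval_notifier_insert(&ubuf->notifier, mm,
                                            vma->vm_start,
                                            vma->vm_end - vma->vm_start,
                                            &udmabuf_notifier_ops);
}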

> 
> > of the use-cases where this would be done were identified by David. Here
> > is what
> > he said in an earlier discussion:
> > "There are *probably* more issues on the QEMU side when udmabuf is
> > paired
> > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"
> 
> Now after seeing this, I'm truly wondering whether we can still simply
> use the file semantics we already have (for either shmem/hugetlb/...), or
> is it a must we need to use a single fd to represent all?
> 
> Say, can we just use a tuple (fd, page_array) rather than the udmabuf
> itself to do host zero-copy mapping?  the page_array can be e.g. a list of
That (tuple) is essentially what we are doing with udmabuf, but in a
standardized way: we follow the convention of the dmabuf buffer-sharing
framework, which all the importers (other drivers and userspace components)
know and understand.

> file offsets that points to the pages (rather than pinning the pages using
If we are using the dmabuf framework, the pages must be pinned when the
importers map them.
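
For context, that pinning shows up in the exporter's map callback; the
sketch below is roughly what the udmabuf exporter does when an importer
maps the dmabuf (the sg_table is built from the pages the exporter holds
references to, so they must stay resident for the mapping's lifetime):

static struct sg_table *map_udmabuf_sketch(struct dma_buf_attachment *attach,
                                           enum dma_data_direction direction)
{
        struct udmabuf *ubuf = attach->dmabuf->priv;
        struct sg_table *sg;
        int ret;

        sg = kzalloc(sizeof(*sg), GFP_KERNEL);
        if (!sg)
                return ERR_PTR(-ENOMEM);

        ret = sg_alloc_table_from_pages(sg, ubuf->pages, ubuf->pagecount,
                                        0, ubuf->pagecount << PAGE_SHIFT,
                                        GFP_KERNEL);
        if (ret < 0)
                goto err_alloc;

        /* DMA-map the pinned pages for the importer's device */
        ret = dma_map_sgtable(attach->dev, sg, direction, 0);
        if (ret < 0)
                goto err_map;

        return sg;

err_map:
        sg_free_table(sg);
err_alloc:
        kfree(sg);
        return ERR_PTR(ret);
}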

> FOLL_GET).  The good thing is then the fd can be the guest memory file
> itself.  With that, we can mmap() over the shmem/hugetlb in whatever vma
> and whatever process.  Truncation (and actually everything... e.g. page
> migration, swapping, ... which will be disabled if we use PFNMAP pins) will
> just all start to work, afaiu.
IIUC, we'd not be able to use the fd of the guest memory file directly,
because dmabuf fds are expected to have a constant size that reflects the
size of the buffer being shared. I just don't think it'd be feasible given
all the other restrictions:
https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html?highlight=dma_buf#userspace-interface-notes

Thanks,
Vivek


> 
> Thanks,
> 
> --
> Peter Xu


