[RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

Jerome Glisse jglisse at redhat.com
Tue Jan 29 19:44:19 UTC 2019


On Tue, Jan 29, 2019 at 12:24:04PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-01-29 10:47 a.m., jglisse at redhat.com wrote:
> >>
> >>> +	/*
> >>> +	 * Optional for device drivers that want to allow peer to peer (p2p)
> >>> +	 * mapping of their vma (which can be backed by some device memory)
> >>> +	 * to another device.
> >>> +	 *
> >>> +	 * Note that the exporting device driver might not have mapped
> >>> +	 * anything inside the vma for the CPU but might still want to allow
> >>> +	 * a peer device to access the range of memory corresponding to a
> >>> +	 * range in that vma.
> >>> +	 *
> >>> +	 * FOR PREDICTABILITY, IF A DRIVER SUCCESSFULLY MAPS A RANGE ONCE
> >>> +	 * FOR A DEVICE, THEN FURTHER MAPPINGS OF THE SAME RANGE (WHILE THE
> >>> +	 * VMA IS STILL VALID) SHOULD ALSO SUCCEED. Following this rule
> >>> +	 * allows the importing device to map once during setup and report
> >>> +	 * any failure at that time to userspace. Further mappings of the
> >>> +	 * same range might happen after mmu notifier invalidation over the
> >>> +	 * range. The exporting device can use this to move things around
> >>> +	 * (defrag BAR space for instance) or do other similar tasks.
> >>> +	 *
> >>> +	 * THE IMPORTER MUST OBEY mmu_notifier NOTIFICATIONS AND CALL
> >>> +	 * p2p_unmap() WHEN A NOTIFIER IS CALLED FOR THE RANGE! THIS CAN
> >>> +	 * HAPPEN AT ANY POINT IN TIME WITH NO LOCK HELD.
> >>> +	 *
> >>> +	 * In the functions below, the device argument is the importing
> >>> +	 * device; the exporting device is the device to which the vma
> >>> +	 * belongs.
> >>> +	 */
> >>> +	long (*p2p_map)(struct vm_area_struct *vma,
> >>> +			struct device *device,
> >>> +			unsigned long start,
> >>> +			unsigned long end,
> >>> +			dma_addr_t *pa,
> >>> +			bool write);
> >>> +	long (*p2p_unmap)(struct vm_area_struct *vma,
> >>> +			  struct device *device,
> >>> +			  unsigned long start,
> >>> +			  unsigned long end,
> >>> +			  dma_addr_t *pa);
> >>
> >> I don't understand why we need new p2p_[un]map function pointers for
> >> this. In subsequent patches, they never appear to be set anywhere and
> >> are only called by the HMM code. I'd have expected it to be called by
> >> some core VMA code and set by HMM as that's what vm_operations_struct is
> >> for.
> >>
> >> But the code is all very confusing, hard to follow and seems to be
> >> missing significant chunks. So I'm not really sure what is going on.
> > 
> > It is set by the device driver when userspace does mmap(fd), where fd
> > comes from open("/dev/somedevicefile"). So it is set by the device
> > driver, in its mmap callback (the mmap callback of struct
> > file_operations). HMM has nothing to do with this. For this patch you
> > can completely ignore all the HMM patches. Maybe posting this as 2
> > separate patchsets would make it clearer.
> > 
> > For instance see [1] for how a non-HMM driver can export its memory
> > by just setting those callbacks. Note that a proper implementation of
> > this should also include some kind of driver policy on what to allow
> > to map and what not to allow ... All this is driver specific in any
> > case.
> 
> I'd suggest [1] should be a part of the patchset so we can actually see
> a user of the stuff you're adding.

I did not want to clutter the patchset with device-driver-specific usage
of this, as the API can be reasoned about in an abstract way.

> 
> But it still doesn't explain everything as without the HMM code nothing
> calls the new vm_ops. And there are still no callers for the p2p_test
> functions you added. And I still don't understand why we need the new
> vm_ops or who calls them and when. Why can't drivers use the existing
> 'fault' vm_op and call a new helper function to map p2p when appropriate
> or a different helper function to map a large range in its mmap
> operation? Just like regular mmap code...

HMM code is just one user: if you have a driver that uses HMM mirror
then your driver gets support for this for free. If you do not want to
use HMM then you can call this directly from your driver.

The flow is: the device driver wants to set up a mapping for a range of
virtual addresses [va_start, va_end] (a rough sketch follows the list):
    1 - Look up the vma for the range.
    2 - If the vma is a regular vma (not an mmap of a file) then use GUP.
        If the vma is an mmap of a file and it has a p2p_map callback
        then call p2p_map().
    3 - Use either the result of GUP or of p2p_map() to program the
        device.
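Something along these lines on the importer side (the function and
variable names are made up for illustration; assume the caller holds
mmap_sem for read, that pa[] is sized for the range, and that error and
partial-pin handling is elided):

static long importer_map_range(struct device *importer,
			       unsigned long va_start, unsigned long va_end,
			       dma_addr_t *pa, bool write)
{
	unsigned long npages = (va_end - va_start) >> PAGE_SHIFT;
	struct vm_area_struct *vma;
	struct page **pages;
	long ret, i;

	/* 1 - Look up the vma covering the range. */
	vma = find_vma(current->mm, va_start);
	if (!vma || va_start < vma->vm_start || va_end > vma->vm_end)
		return -EINVAL;

	/* 2 - mmap of a file with a p2p_map callback: ask the exporter. */
	if (vma->vm_file && vma->vm_ops && vma->vm_ops->p2p_map)
		return vma->vm_ops->p2p_map(vma, importer, va_start,
					    va_end, pa, write);

	/* 2 - Regular vma: fall back to GUP. */
	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;
	ret = get_user_pages(va_start, npages, write ? FOLL_WRITE : 0,
			     pages, NULL);
	if (ret <= 0)
		goto out;

	/*
	 * 3 - Use the result to program the device (the pages must stay
	 * pinned until the device is done with them).
	 */
	for (i = 0; i < ret; i++)
		pa[i] = dma_map_page(importer, pages[i], 0, PAGE_SIZE,
				     DMA_BIDIRECTIONAL);
out:
	kvfree(pages);
	return ret;
}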

The p2p test function is used by device drivers implementing the
callback; for instance see:

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-p2p&id=401a567696eafb1d4faf7054ab0d7c3a16a5ef06

The vm_fault callback is not suited because here we are mapping to
another device, so this needs special handling: someone must have both
device struct pointers in hand, and someone must be allowed to make
decisions on what to allow and what not to allow.

Moreover an exporting driver like a GPU driver might have a complex
policy in place under which it will only allow export of some memory to
one peer device but not to others.

In the end it means that it is easier and simpler to add a new callback
specifically for that, so the intention is clear to both the caller
and the callee. The exporting device can then do the proper check
using the core helper (ie checking that the devices can actually peer
to each other) and, if that works, it can then decide whether or not
it wants to allow this other device to access its memory or if it
prefers to use main memory for this.
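
For illustration, a minimal sketch of what an exporting driver's
p2p_map callback might look like; pci_test_p2p() stands in for the p2p
test helper mentioned above, while mydrv_allow_peer() and
mydrv_fill_bus_addr() are hypothetical policy and BAR-translation
hooks:

static long mydrv_p2p_map(struct vm_area_struct *vma,
			  struct device *device,
			  unsigned long start, unsigned long end,
			  dma_addr_t *pa, bool write)
{
	struct mydrv_object *obj = vma->vm_private_data;

	/* Core helper: can the two devices actually peer to each other? */
	if (!dev_is_pci(device) ||
	    !pci_test_p2p(obj->pdev, to_pci_dev(device)))
		return -EINVAL;

	/* Driver policy: the exporter may refuse some peers. */
	if (!mydrv_allow_peer(obj, device))
		return -EPERM;

	/*
	 * Fill pa[] with bus addresses inside our BAR for [start, end];
	 * this is also where the driver could move/defrag BAR space.
	 */
	return mydrv_fill_bus_addr(obj, start, end, pa);
}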

To add this to the fault callback we would need to define a bunch of
new flags, set up fake page tables so that we can populate ptes, and
then have the importing device re-interpret everything specially
because it comes from another device. It would look ugly and it would
need to modify a bunch of core mm code.

Note that this callback solution also allows an exporting device to
grant peer access while the CPU can not access the memory, ie while the
pte entries for the range are pte_none.
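
Concretely, an exporting driver would install these callbacks from its
file_operations mmap handler, without necessarily inserting any CPU
pte; a minimal sketch (all names hypothetical):

static const struct vm_operations_struct mydrv_vm_ops = {
	/* a fault handler is optional: CPU access may simply be denied */
	.p2p_map	= mydrv_p2p_map,
	.p2p_unmap	= mydrv_p2p_unmap,
};

static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
	vma->vm_private_data = file->private_data;
	vma->vm_ops = &mydrv_vm_ops;
	/*
	 * No remap_pfn_range()/vmf_insert_pfn() here: the CPU ptes stay
	 * pte_none, yet a peer device can still map the range through
	 * p2p_map().
	 */
	return 0;
}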

Cheers,
Jérôme

