Enabling peer to peer device transactions for PCIe devices

Thu Nov 24 00:40:37 UTC 2016

On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:

> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.

The main issue is how to handle use cases where p2p is
requested/initiated via CPU pointers that may point to non-system
memory, e.g. VRAM.

This would give the user a consistent working model of dealing only
with pointers (HSA, CUDA, OpenCL 2.0 SVM), and would also improve
performance by avoiding double-buffering and extra special-case code
when dealing with PCIe device memory.

Examples are:

- RDMA network operations: RDMA MRs where the registered memory
could be e.g. VRAM. This is currently solved using the so-called
PeerDirect interface, which is out-of-tree and provided as part of OFED.
- File operations (fread/fwrite) where the user wants to transfer file
data directly to/from e.g. VRAM.
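
To make the file case concrete: from user space, the goal is that a
plain read into a pointer works no matter what backs that pointer.
Below is a minimal user-space sketch of that calling pattern. The names
map_fake_device_memory() and read_into() are hypothetical, and the
"device memory" is just an anonymous mmap() standing in for a VRAM
mapping that a GPU driver would provide, so this shows only the
pointer-only model, not real p2p.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a CPU mapping of device memory (e.g. VRAM).
 * Assumption: plain anonymous memory is used so the sketch runs anywhere.
 */
static void *map_fake_device_memory(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Read up to len bytes from fd straight into dst. The caller hands over
 * only a pointer; nothing here knows or cares what kind of memory is
 * behind it, and no intermediate bounce buffer is used.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_into(int fd, void *dst, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, (char *)dst + done, len - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted, retry */
            return -1;
        }
        if (n == 0)
            break;          /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

With real VRAM behind the pointer, whether such a read can avoid a
bounce through system memory is exactly the open question here.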

Challenges are:
- Because the graphics sub-system must support overcommit (at least each
application/process should independently see all resources), ideally
such memory should be movable without changing the CPU pointer value,
and should also support being "paged out", with a "page fault" at least
on access from the CPU.
- We must co-exist with the existing DRM infrastructure, as well as
support sharing VRAM between different processes.
- We should be able to deal with large allocations: tens or hundreds of
MBs, or maybe GBs.
- We may have PCIe devices where p2p does not work.
- Potentially any GPU memory should be supported, including
memory carved out from system RAM (e.g. allocated via
__get_free_pages()).

Note:
- In the case of RDMA MRs, the life-span of the "pinning"
("get_user_pages"/"put_page") may be defined/controlled by the
application, not the kernel, so it may need to be treated
differently, as a special case.

The original proposal was to create "struct pages" for VRAM memory
so that "get_user_pages" works transparently, similar to
how it is/was done for the "DAX Device" case. Unfortunately,
based on my understanding, the "DAX Device" implementation
deals only with permanently "locked" memory (fixed location),
independent of the "get_user_pages"/"put_page" scope,
which doesn't satisfy the requirement for "eviction"/"moving" of
memory while keeping the CPU address intact.

> The desire is for DMA to continue to work
> even after these migrations happen
At a minimum, we need some kind of mm notifier callback to inform about
a change in location (pre- and post-), similar to how it is done for
system pages. My understanding is that it will not solve the RDMA MR
issue, where the "lock" could be held for the whole application
lifetime, but (a) it will not make the RDMA MR case worse and (b) it
should be enough for all other cases where "get_user_pages"/"put_page"
is controlled by the kernel.
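
As a rough illustration of the pre-/post- notification idea, here is a
toy user-space model, not kernel code. All names (struct buf,
struct move_notifier, buf_move(), struct dma_user) are hypothetical.
It assumes that a move is refused while a pin is held, and that a DMA
user quiesces in the "pre" callback and picks up the new bus address in
the "post" callback.

```c
#include <stddef.h>

enum mem_location { LOC_VRAM, LOC_SYSTEM };

struct buf;

/* Hypothetical notifier: called before and after the backing store moves. */
struct move_notifier {
    void (*pre_move)(struct move_notifier *n, struct buf *b);
    void (*post_move)(struct move_notifier *n, struct buf *b);
    struct move_notifier *next;
};

/* Toy buffer: the CPU pointer to it stays fixed, the backing store moves. */
struct buf {
    enum mem_location loc;      /* where the data currently lives */
    unsigned long bus_addr;     /* address a peer device would DMA to */
    int pin_count;              /* get_user_pages()-style pins held */
    struct move_notifier *notifiers;
};

static void buf_register(struct buf *b, struct move_notifier *n)
{
    n->next = b->notifiers;
    b->notifiers = n;
}

/* Move the backing store. Returns 0 on success, -1 if the buffer is pinned. */
static int buf_move(struct buf *b, enum mem_location loc, unsigned long bus_addr)
{
    struct move_notifier *n;

    if (b->pin_count > 0)
        return -1;              /* a long-lived pin (e.g. RDMA MR) blocks the move */
    for (n = b->notifiers; n; n = n->next)
        if (n->pre_move)
            n->pre_move(n, b);
    b->loc = loc;
    b->bus_addr = bus_addr;
    for (n = b->notifiers; n; n = n->next)
        if (n->post_move)
            n->post_move(n, b);
    return 0;
}

/* Example listener: a device doing DMA to the buffer. */
struct dma_user {
    struct move_notifier notifier;  /* must stay first: callbacks cast back */
    int quiesced;                   /* 1 while DMA is stopped for a move */
    unsigned long mapped_addr;      /* bus address currently programmed */
    int pre_calls, post_calls;
};

static void dma_user_pre(struct move_notifier *n, struct buf *b)
{
    struct dma_user *u = (struct dma_user *)n;

    (void)b;
    u->quiesced = 1;                /* stop issuing DMA to the old location */
    u->pre_calls++;
}

static void dma_user_post(struct move_notifier *n, struct buf *b)
{
    struct dma_user *u = (struct dma_user *)n;

    u->mapped_addr = b->bus_addr;   /* re-map to the new location */
    u->quiesced = 0;
    u->post_calls++;
}

static void dma_user_init(struct dma_user *u, struct buf *b)
{
    u->notifier.pre_move = dma_user_pre;
    u->notifier.post_move = dma_user_post;
    u->quiesced = 0;
    u->mapped_addr = b->bus_addr;
    u->pre_calls = 0;
    u->post_calls = 0;
    buf_register(b, &u->notifier);
}
```

In this model the kernel-controlled cases work because pins are short
and the callbacks bracket every move; the RDMA MR case shows up as
buf_move() failing for as long as the application keeps its pin.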