Enabling peer to peer device transactions for PCIe devices

Serguei Sagalovitch serguei.sagalovitch at amd.com
Fri Jan 6 16:56:30 UTC 2017


On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
>> On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
>>
>>>> I still don't understand what you're driving at - you've said in both
>>>> cases a user VMA exists.
>>> In the former case no, there is no VMA directly, but if you want one then
>>> a device can provide one. But such a VMA is useless as CPU access is not
>>> expected.
>> I disagree that it is useless; the VMA is going to be necessary to support
>> upcoming things like CAPI, and you need it to support O_DIRECT from the
>> filesystem, DPDK, etc. This is why I am opposed to any model that is
>> not VMA based for setting up RDMA - that is short-sighted and does
>> not seem to reflect where the industry is going.
>>
>> So focus on having a VMA backed by actual physical memory that covers
>> your GPU objects, and ask how we wire up the '__user *' to the DMA
>> API in the best way so the DMA API still has enough information to
>> set up IOMMUs and whatnot.
> I am talking about two different things. Existing hardware and APIs where you
> _do not_ have a VMA and you do not need one. This is just existing stuff.
I do not understand why you assume that the existing APIs don't need one.
I would say that a lot of __existing__ user-level APIs and their support
in the kernel (especially outside of the graphics domain) assume that we
have a VMA and deal with __user * pointers.
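
For example, with IB Verbs the application just hands a plain pointer
to the library, and the kernel side resolves it through the VMA and
pins the pages. A minimal sketch (error handling mostly omitted):

  #include <stdlib.h>
  #include <infiniband/verbs.h>

  /* Register an ordinary malloc'ed buffer for RDMA. The verbs stack
   * only ever sees a void * and a length; underneath, the kernel
   * walks the VMA and pins the pages.                               */
  struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
  {
          void *buf = malloc(len);

          if (!buf)
                  return NULL;

          return ibv_reg_mr(pd, buf, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
  }
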
> Some closed drivers provide functionality on top of this design. The question
> is do we want to do the same? If yes, and you insist on having a VMA, we
> could provide one, but this does not apply and is useless for where we
> are going with new hardware.
>
> With new hardware you just use malloc or mmap to allocate memory and then
> you use it directly with the device. The device driver can migrate any part of
> the process address space to device memory. In this scheme you have your
> usual VMAs but there is nothing special about them.
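
If I understand the model you describe, from userspace it would look
roughly like the sketch below, where launch_on_device() is purely
hypothetical and only stands for "hand the plain pointer to the device":

  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical device entry point - the name is made up, it only
   * illustrates that the buffer is plain malloc() memory and the
   * driver migrates pages to/from device memory behind the scenes.  */
  int launch_on_device(void *data, size_t size);

  int run_example(void)
  {
          size_t size = 1 << 20;
          char *data = malloc(size);

          if (!data)
                  return -1;

          memset(data, 0, size);                /* touched by the CPU    */
          return launch_on_device(data, size);  /* ... and by the device */
  }
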
Assuming that the whole device memory is CPU accessible (and it looks
like that is the direction we are going):
- You forgot about the use case where we want or need to allocate memory
directly on the device (why migrate anything if it is not needed?).
- We may want to use the CPU to access such memory on the device to avoid
any unnecessary migration back.
- We may have more device memory than system memory.
E.g. if you have 12 GPUs with 64 GB each, that already gives us ~0.7 TB,
not to mention NVDIMM cards which could also be used as memory
storage for other devices to access.
- We also may want/need to share GPU memory between different
processes.
> Now when you try to do get_user_pages() on any page that is inside the
> device it will fail because we do not allow any device memory to be pinned.
> There are various reasons for that and they are not going away in any hw
> in the planning (so for the next few years).
>
> Still we do want to support peer-to-peer mapping. The plan is to only do so
> with ODP capable hardware. Still we need to solve the IOMMU issue and
> it needs special handling inside the RDMA device. The way it works is
> that the RDMA device asks for a GPU page, the GPU checks if it has room
> inside its PCI BAR to map this page for the device, and this can fail. If it
> succeeds then you need the IOMMU to let the RDMA device access the GPU PCI BAR.
>
> So here we have 2 orthogonal problems. The first one is how to make 2 drivers
> talk to each other to set up the mapping to allow peer to peer, and the second
> is about the IOMMU.
>
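
Just to restate the flow you describe as pseudo-C (every helper below
is hypothetical, nothing like this exists today):

  /* Pseudo-C sketch of the peer-to-peer setup described above.      */
  int rdma_map_gpu_page(struct rdma_dev *rdma, struct gpu_dev *gpu,
                        unsigned long gpu_pfn, dma_addr_t *out)
  {
          phys_addr_t bar_addr;
          int ret;

          /* 1. The GPU driver tries to expose the page through its
           *    PCI BAR; this can fail if the BAR window is full.     */
          ret = gpu_map_page_to_bar(gpu, gpu_pfn, &bar_addr);
          if (ret)
                  return ret;

          /* 2. The IOMMU must then allow the RDMA device to reach
           *    that BAR address (the peer-to-peer part).             */
          return rdma_iommu_map_peer(rdma, bar_addr, PAGE_SIZE, out);
  }
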
I think that there is a third problem: a lot of existing user-level APIs
(MPI, IB Verbs, file I/O, etc.) deal with pointers to buffers.
Ideally we would support use cases where those buffers are
located in device memory, avoiding any unnecessary migration /
double-buffering.
Currently a lot of infrastructure in the kernel assumes that this is a user
pointer and calls get_user_pages() to get an s/g list. What is your opinion on
how it should be changed to deal with cases when the "buffer" is in
device memory?
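
For reference, the pattern I have in mind is roughly the following
(heavily simplified); it is exactly the get_user_pages step that has
no obvious answer once the pages live in device memory:

  #include <linux/mm.h>
  #include <linux/scatterlist.h>
  #include <linux/dma-mapping.h>

  /* Roughly what a lot of kernel code does with a __user pointer
   * today: pin the pages, build an s/g table, hand it to the DMA
   * API. If the "buffer" is unmappable device memory, the pinning
   * step has nothing it can return.                                 */
  static int map_user_buffer(struct device *dev, unsigned long uaddr,
                             int npages, struct page **pages,
                             struct sg_table *sgt)
  {
          if (get_user_pages_fast(uaddr, npages, 1, pages) != npages)
                  return -EFAULT;

          if (sg_alloc_table_from_pages(sgt, pages, npages, 0,
                                        (size_t)npages << PAGE_SHIFT,
                                        GFP_KERNEL))
                  return -ENOMEM;

          /* Returns the number of mapped segments, 0 on failure.     */
          return dma_map_sg(dev, sgt->sgl, sgt->nents,
                            DMA_BIDIRECTIONAL);
  }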




