Making drm_gpuvm work across gpu devices

Christian König christian.koenig at amd.com
Fri Mar 8 10:07:22 UTC 2024


Hi Oak,

sorry, the mail sounded like you didn't expect a reply.

And yes, the approaches outlined in the mail sound really good to me.

Regards,
Christian.

On 08.03.24 at 05:43, Zeng, Oak wrote:
> Hello all,
>
> Since I didn't get a reply to this one, I assume the points below are agreed on. But feel free to let us know if you don't agree.
>
> Thanks,
> Oak
>
> -----Original Message-----
> From: dri-devel <dri-devel-bounces at lists.freedesktop.org> On Behalf Of Zeng, Oak
> Sent: Thursday, February 29, 2024 1:23 PM
> To: Christian König <christian.koenig at amd.com>; Daniel Vetter <daniel at ffwll.ch>; David Airlie <airlied at redhat.com>
> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>; Brost, Matthew <matthew.brost at intel.com>; Felix Kuehling <felix.kuehling at amd.com>; Welty, Brian <brian.welty at intel.com>; dri-devel at lists.freedesktop.org; Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah <krishnaiah.bommu at intel.com>; Gupta, saurabhg <saurabhg.gupta at intel.com>; Vishwanathapura, Niranjana <niranjana.vishwanathapura at intel.com>; intel-xe at lists.freedesktop.org; Danilo Krummrich <dakr at redhat.com>; Shah, Ankur N <ankur.n.shah at intel.com>; jglisse at redhat.com; rcampbell at nvidia.com; apopple at nvidia.com
> Subject: RE: Making drm_gpuvm work across gpu devices
>
> Hi Christian/Daniel/Dave/Felix/Thomas, and all,
>
> We have been refining our design internally over the past month. Below is our plan. Please let us know if you have any concerns.
>
> 1) Remove the pseudo /dev/xe-svm device. All system allocator interfaces will go through the /dev/dri render node devices, not a global interface.
>
> 2) Unify the userptr and system allocator code. We will treat userptr as a special case of the system allocator without migration capability. We will introduce the hmmptr concept for the system allocator. We will extend the vm_bind API to map a range A..B of the process address space to a range C..D of the GPU address space for an hmmptr. For an hmmptr, if a GPU program accesses an address which is not backed by a core-mm VMA, it is a fatal error. (A sketch of such an extended interface follows after point 4.)
>
> 3) Multiple device support. We have identified p2p use cases where we might want to leave memory on a foreign device or direct migrations to a foreign device, and therefore might need a global structure that tracks or caches the migration state per process address space. We haven't completely settled this design yet. We will come back when we have more details.
>
> 4) We will first work on this code in xekmd, then look at moving some common code to the drm layer so it can also be used by other vendors.
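>
> To make point 2 concrete, below is a minimal sketch of what such an extended
> vm_bind request could look like. The struct and all field names are invented
> for illustration only and are not existing uAPI:
>
>     #include <linux/types.h>
>
>     /* Hypothetical, simplified bind request for an hmmptr. */
>     struct hmmptr_bind {
>             __u64 cpu_start; /* A: start of the CPU (process) VA range */
>             __u64 gpu_start; /* C: start of the GPU VA range it maps to */
>             __u64 size;      /* B - A, which must equal D - C */
>             __u32 vm_id;     /* GPU VM to bind the range into */
>             __u32 flags;     /* e.g. a hypothetical HMMPTR vs. USERPTR flag */
>     };
>
> A classic userptr would be expressed as the same kind of request, just without
> migration capability behind it; for an hmmptr, a GPU access in the range that
> is not backed by a core-mm VMA is the fatal error described in point 2.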
>
> Thomas and I still have open questions for Christian. We will follow up.
>
> Thanks all for this discussion.
>
> Regards,
> Oak
>
>> -----Original Message-----
>> From: Christian König <christian.koenig at amd.com>
>> Sent: Thursday, February 1, 2024 3:52 AM
>> To: Zeng, Oak <oak.zeng at intel.com>; Daniel Vetter <daniel at ffwll.ch>; David
>> Airlie <airlied at redhat.com>
>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>; Brost, Matthew
>> <matthew.brost at intel.com>; Felix Kuehling <felix.kuehling at amd.com>; Welty,
>> Brian <brian.welty at intel.com>; dri-devel at lists.freedesktop.org; Ghimiray, Himal
>> Prasad <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
>> <krishnaiah.bommu at intel.com>; Gupta, saurabhg <saurabhg.gupta at intel.com>;
>> Vishwanathapura, Niranjana <niranjana.vishwanathapura at intel.com>; intel-
>> xe at lists.freedesktop.org; Danilo Krummrich <dakr at redhat.com>; Shah, Ankur N
>> <ankur.n.shah at intel.com>; jglisse at redhat.com; rcampbell at nvidia.com;
>> apopple at nvidia.com
>> Subject: Re: Making drm_gpuvm work across gpu devices
>>
>> Hi Oak,
>>
>> On 31.01.24 at 21:17, Zeng, Oak wrote:
>>> Hi Sima, Dave,
>>>
>>> I am well aware the nouveau driver is not what Nvidia does with their customers.
>> The key question is: can we move forward with the concept of a shared virtual
>> address space between CPU and GPU? This is the foundation of HMM. We already
>> have split address space support with other driver APIs. SVM, as its name says,
>> means a shared address space. Are we allowed to implement another driver model
>> that makes SVM work, alongside other APIs supporting a split address space?
>> Those two schemes can co-exist in harmony. We actually have real use cases that
>> use both models in one application.
>>> Hi Christian, Thomas,
>>>
>>> In your scheme, GPU VA can != CPU VA. This does introduce some flexibility.
>> But this scheme alone doesn't solve the proxy process / para-virtualization
>> problem. You will still need a second mechanism to partition the GPU VA space
>> between guest process1 and guest process2, because the proxy process (or the
>> host hypervisor, whatever you call it) uses one single GPU page table for all
>> the guest/client processes, and the GPU VAs of different guest processes can't
>> overlap. If this second mechanism exists, we can of course use the same
>> mechanism to partition the CPU VA space between guest processes as well; then
>> we can still use a shared VA between CPU and GPU inside one process, while
>> process1's and process2's address spaces (for both CPU and GPU) don't overlap.
>> This second mechanism is the key to solving the proxy process problem, not the
>> flexibility you introduced.
>>
>> That approach was suggested before, but it doesn't work. First of all, you
>> create a massive security hole when you give the GPU full access to the QEMU
>> CPU process which runs the virtualization.
>>
>> So even if you say CPU VA == GPU VA, you still need some kind of flexibility;
>> otherwise you can't implement this use case securely.
>>
>> In addition to this, the CPU VAs are usually controlled by the OS and not by
>> some driver, so to make sure that host and guest VAs don't overlap you would
>> need to add some kind of synchronization between the guest and host OS kernels.
>>
>>> In practice, your scheme also has the risk of running out of process address
>> space, because you have to partition the whole address space between processes.
>> Apparently, allowing each guest process to own the whole process address space
>> and using separate GPU/CPU page tables for different processes is a better
>> solution than using a single page table and partitioning the process address
>> space between processes.
>>
>> Yeah, that you run out of address space is certainly possible. But as I said,
>> CPUs are switching to 5-level page tables, and if you look at, for example, a
>> "cat maps | cut -c-4 | sort -u" of a process, you will find that only a handful
>> of 4GiB segments are actually used, and thanks to recoverable page faults you
>> can map those between host and client on demand. This gives you at least enough
>> address space to handle a couple of thousand clients.
>>
>>> For Intel GPUs, para-virtualization (XenGT, see
>> https://github.com/intel/XenGT-Preview-kernel; it is a similar idea to the
>> proxy process in Felix's email, and they are all SW-based GPU virtualization
>> technologies) is an old project. It has been replaced with HW-accelerated
>> SRIOV/system virtualization; XenGT was abandoned a long time ago. So agreed,
>> your scheme adds some flexibility. The question is, do we have a valid use case
>> for such flexibility? I don't see a single one ATM.
>>
>> Yeah, we have SRIOV functionality on AMD hw as well, but for some use
>> cases it's just too inflexible.
>>
>>> I also looked into how to implement your scheme. You basically rejected the
>> very foundation of the HMM design, which is a shared address space between CPU
>> and GPU. In your scheme, GPU VA = CPU VA + offset. In every single place where
>> the driver needs to call HMM facilities such as hmm_range_fault or
>> migrate_vma_setup, and in the mmu notifier callback, you need to offset the GPU
>> VA to get a CPU VA. From the application writer's perspective, whenever they
>> want to use a CPU pointer in their GPU program, they have to add that offset.
>> Do you think this is awkward?
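>>
>> Just to illustrate the translation I mean, here is a minimal sketch of a fault
>> handler under the GPU VA = CPU VA + offset scheme. struct gpu_vm and its
>> va_offset field are made up for illustration, and the mmap lock plus the
>> notifier retry loop are omitted:
>>
>>     #include <linux/types.h>
>>     #include <linux/hmm.h>
>>     #include <linux/mmu_notifier.h>
>>
>>     /* Hypothetical driver VM carrying a per-VM CPU/GPU VA offset. */
>>     struct gpu_vm {
>>             struct mmu_interval_notifier notifier;
>>             unsigned long *pfns;
>>             u64 va_offset;  /* GPU VA = CPU VA + va_offset */
>>     };
>>
>>     static int fault_gpu_range(struct gpu_vm *vm, u64 gpu_start, u64 gpu_end)
>>     {
>>             struct hmm_range range = {
>>                     .notifier      = &vm->notifier,
>>                     /* Translate GPU VAs back to the CPU VAs HMM understands. */
>>                     .start         = gpu_start - vm->va_offset,
>>                     .end           = gpu_end - vm->va_offset,
>>                     .hmm_pfns      = vm->pfns,
>>                     .default_flags = HMM_PFN_REQ_FAULT,
>>             };
>>
>>             /* Real code must hold mmap_read_lock() and retry on invalidation. */
>>             return hmm_range_fault(&range);
>>     }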
>>
>> What? This flexibility is there exactly to prevent the application writer from
>> having to add any offset.
>>
>>> Finally, to implement SVM, we need to implement some memory hint API which
>> applies to a virtual address range across all GPU devices. For example, the
>> user would say: for this virtual address range, I prefer the backing store
>> memory to be on GPU deviceX (because the user knows deviceX will use this
>> address range much more than other GPU devices or the CPU). It doesn't make
>> sense to me to make such an API per-device. For example, if you tell device A
>> that the preferred memory location is device B's memory, that doesn't sound
>> correct to me, because in your scheme device A is not even aware of the
>> existence of device B, right?
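>>
>> To make the problem concrete: a purely per-device hint uAPI would have to
>> identify the foreign device somehow, e.g. by its render node fd. The struct
>> below is invented for illustration only and is not existing uAPI:
>>
>>     #include <linux/types.h>
>>
>>     /* Hypothetical hint submitted to device A, naming device B as preferred. */
>>     struct mem_hint_prefer_device {
>>             __u64 start;        /* start of the VA range (CPU VA == GPU VA) */
>>             __u64 size;
>>             __s32 preferred_fd; /* render node fd of "device B", -1 for system memory */
>>             __u32 pad;
>>     };
>>
>> Device A would then have to resolve preferred_fd to a device it otherwise knows
>> nothing about, which is exactly the cross-device dependency I am questioning.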
>>
>> Correct, and while the additional flexibility is somewhat optional, I
>> strongly think that not having a centralized approach for device driver
>> settings is mandatory.
>>
>> Going away from the well-defined file descriptor based handling of
>> device driver interfaces was one of the worst ideas I've ever seen in
>> roughly thirty years of working with Unix-like operating systems. It
>> basically broke everything, from reverse lockup handling for mmap() to
>> file system privileges for hardware access.
>>
>> As far as I can see, anything which goes in the direction of opening
>> /dev/kfd or /dev/xe_svm or something similar and saying that this then
>> results in implicit SVM for your render nodes is an absolute no-go
>> and would require an explicit acknowledgement from Linus on the design
>> to do something like that.
>>
>> What you can do is have an IOCTL on the render node file descriptor
>> which says this device should do SVM with the current CPU address space,
>> and another IOCTL which says range A..B is preferred to migrate to this
>> device for HMM when the device runs into a page fault.
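>>
>> Purely as an illustration, roughly like this, where the target device is
>> always implied by the render node fd the ioctl is issued on. The names,
>> structs and ioctl numbers below are invented and are not existing uAPI:
>>
>>     #include <drm/drm.h>
>>
>>     /* "Do SVM with the CPU address space of the calling process." */
>>     struct drm_foo_svm_init {
>>             __u32 vm_id;
>>             __u32 flags;
>>     };
>>
>>     /* "On a device page fault, prefer migrating range A..B to this device." */
>>     struct drm_foo_svm_prefer_range {
>>             __u64 start;  /* A: CPU VA, which is also the GPU VA */
>>             __u64 end;    /* B */
>>             __u32 vm_id;
>>             __u32 flags;
>>     };
>>
>>     #define DRM_IOCTL_FOO_SVM_INIT \
>>             DRM_IOW(DRM_COMMAND_BASE + 0x40, struct drm_foo_svm_init)
>>     #define DRM_IOCTL_FOO_SVM_PREFER_RANGE \
>>             DRM_IOW(DRM_COMMAND_BASE + 0x41, struct drm_foo_svm_prefer_range)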
>>
>> And yes, that obviously means shitty performance for device drivers,
>> because pages play ping-pong if userspace gives contradicting information
>> for migrations, but that is how it is supposed to be.
>>
>> Everything else which works across the borders of a device driver's scope
>> should be implemented as a system call with the relevant review process
>> around it.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> Oak
>>>> -----Original Message-----
>>>> From: Daniel Vetter <daniel at ffwll.ch>
>>>> Sent: Wednesday, January 31, 2024 4:15 AM
>>>> To: David Airlie <airlied at redhat.com>
>>>> Cc: Zeng, Oak <oak.zeng at intel.com>; Christian König
>>>> <christian.koenig at amd.com>; Thomas Hellström
>>>> <thomas.hellstrom at linux.intel.com>; Daniel Vetter <daniel at ffwll.ch>; Brost,
>>>> Matthew <matthew.brost at intel.com>; Felix Kuehling
>>>> <felix.kuehling at amd.com>; Welty, Brian <brian.welty at intel.com>; dri-
>>>> devel at lists.freedesktop.org; Ghimiray, Himal Prasad
>>>> <himal.prasad.ghimiray at intel.com>; Bommu, Krishnaiah
>>>> <krishnaiah.bommu at intel.com>; Gupta, saurabhg
>> <saurabhg.gupta at intel.com>;
>>>> Vishwanathapura, Niranjana <niranjana.vishwanathapura at intel.com>; intel-
>>>> xe at lists.freedesktop.org; Danilo Krummrich <dakr at redhat.com>; Shah,
>> Ankur N
>>>> <ankur.n.shah at intel.com>; jglisse at redhat.com; rcampbell at nvidia.com;
>>>> apopple at nvidia.com
>>>> Subject: Re: Making drm_gpuvm work across gpu devices
>>>>
>>>> On Wed, Jan 31, 2024 at 09:12:39AM +1000, David Airlie wrote:
>>>>> On Wed, Jan 31, 2024 at 8:29 AM Zeng, Oak <oak.zeng at intel.com> wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>>
>>>>>>
>>>>>> The Nvidia Nouveau driver uses exactly the same concept of SVM with HMM:
>>>> the GPU address in the same process is exactly the same as the CPU virtual
>>>> address. It is already in the upstream Linux kernel. We at Intel just follow
>>>> the same direction for our customers. Why are we not allowed to?
>>>>> Oak, this isn't how upstream works; you don't get to appeal to
>>>>> customers or internal design. nouveau isn't NVIDIA's and it certainly
>>>>> isn't something NVIDIA would ever suggest to their customers. We also
>>>>> likely wouldn't just accept NVIDIA's current solution upstream without
>>>>> some serious discussions. The implementation in nouveau was more of a
>>>>> sample HMM use case than a serious implementation. I suspect that if
>>>>> we do go down the road of making nouveau an actual compute driver for
>>>>> SVM etc. then it would have to change severely.
>>>> Yeah, on the nouveau hmm code specifically, my gut feeling is that we
>>>> didn't really make friends with that among core kernel maintainers. It's
>>>> a bit too much of a tech demo, just there to be able to merge the hmm
>>>> core APIs for nvidia's out-of-tree driver.
>>>>
>>>> Also, a few years of learning and experience gaining happened in the
>>>> meantime - you always have to look at an API design in the context of when
>>>> it was designed, and that context changes all the time.
>>>>
>>>> Cheers, Sima
>>>> --
>>>> Daniel Vetter
>>>> Software Engineer, Intel Corporation
>>>> http://blog.ffwll.ch


