Making drm_gpuvm work across gpu devices

Christian König christian.koenig at amd.com
Fri Mar 1 07:01:15 UTC 2024


Hi Thomas,

Am 29.02.24 um 18:12 schrieb Thomas Hellström:
> Hi, Christian.
>
> On Thu, 2024-02-29 at 10:41 +0100, Christian König wrote:
>> Am 28.02.24 um 20:51 schrieb Zeng, Oak:
>>> The mail wasn’t indented/prefaced correctly. Manually reformatted it.
>>>
>>> From: Christian König <christian.koenig at amd.com>
>>> Sent: Tuesday, February 27, 2024 1:54 AM
>>> To: Zeng, Oak <oak.zeng at intel.com>; Danilo Krummrich <dakr at redhat.com>; Dave Airlie <airlied at redhat.com>; Daniel Vetter <daniel at ffwll.ch>; Felix Kuehling <felix.kuehling at amd.com>; jglisse at redhat.com
>>> Cc: Welty, Brian <brian.welty at intel.com>; dri-devel at lists.freedesktop.org; intel-xe at lists.freedesktop.org; Bommu, Krishnaiah <krishnaiah.bommu at intel.com>; Ghimiray, Himal Prasad <himal.prasad.ghimiray at intel.com>; Thomas.Hellstrom at linux.intel.com; Vishwanathapura, Niranjana <niranjana.vishwanathapura at intel.com>; Brost, Matthew <matthew.brost at intel.com>; Gupta, saurabhg <saurabhg.gupta at intel.com>
>>> Subject: Re: Making drm_gpuvm work across gpu devices
>>>
>>> Hi Oak,
>>>
>>> Am 23.02.24 um 21:12 schrieb Zeng, Oak:
>>>
>>>      Hi Christian,
>>>
>>>      I went back to this old email to ask a question.
>>>
>>>
>>> Sorry, I totally missed that one.
>>>
>>>      Quote from your email:
>>>
>>>      “Those ranges can then be used to implement the SVM feature
>>>      required for higher level APIs and not something you need at the
>>>      UAPI or even inside the low level kernel memory management.”
>>>
>>>      “SVM is a high level concept of OpenCL, Cuda, ROCm etc.. This
>>>      should not have any influence on the design of the kernel UAPI.”
>>>
>>>      There are two categories of SVM:
>>>
>>>      1. Driver SVM allocator: this is implemented in user space, e.g.,
>>>      cudaMallocManaged (CUDA), zeMemAllocShared (L0) or
>>>      clSVMAlloc (OpenCL). Intel already has gem_create/vm_bind in xekmd,
>>>      and our UMD implemented clSVMAlloc and zeMemAllocShared on top of
>>>      gem_create/vm_bind. Range A..B of the process address space is
>>>      mapped into a range C..D of the GPU address space, exactly as you
>>>      said.
>>>
>>>      2. System SVM allocator: this doesn’t introduce an extra driver API
>>>      for memory allocation. Any valid CPU virtual address can be used
>>>      directly and transparently in a GPU program without any extra
>>>      driver API call. Quote from kernel Documentation/vm/hmm.rst: “Any
>>>      application memory region (private anonymous, shared memory, or
>>>      regular file backed memory) can be used by a device transparently”
>>>      and “to share the address space by duplicating the CPU page table
>>>      in the device page table so the same address points to the same
>>>      physical memory for any valid main memory address in the process
>>>      address space”. With the system SVM allocator, we don’t need that
>>>      A..B to C..D mapping.
>>>
>>>      It looks like you were talking about 1). Were you?
>>>
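To illustrate the distinction drawn above, here is a minimal userspace
sketch. The Level Zero calls (zeMemAllocShared, zeKernelSetArgumentValue)
are the real API; the wrapper function names are made up and error handling
is omitted:

#include <stdlib.h>
#include <level_zero/ze_api.h>

/* Category 1: driver SVM allocator. The allocation call itself creates the
 * GPU mapping; in the xekmd case described above the UMD implements this
 * on top of gem_create + vm_bind. */
void *svm_driver_alloc(ze_context_handle_t ctx, ze_device_handle_t dev,
		       size_t size)
{
	ze_device_mem_alloc_desc_t dev_desc = {
		.stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC,
	};
	ze_host_mem_alloc_desc_t host_desc = {
		.stype = ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC,
	};
	void *ptr = NULL;

	zeMemAllocShared(ctx, &dev_desc, &host_desc, size, 4096, dev, &ptr);
	return ptr;			/* usable by both CPU and GPU */
}

/* Category 2: system SVM allocator. No extra allocation API at all: any
 * valid CPU pointer (malloc, mmap, file backed, ...) is passed straight to
 * the GPU kernel and the KMD mirrors the CPU page tables. */
void *svm_system_alloc(ze_kernel_handle_t kernel, size_t size)
{
	void *ptr = malloc(size);

	zeKernelSetArgumentValue(kernel, 0, sizeof(ptr), &ptr);
	return ptr;
}
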
>>>
>>> No, even when you fully mirror the whole address space from a process
>>> into the GPU you still need to enable this somehow with an IOCTL.
>>>
>>> And while enabling this you absolutely should specify to which part of
>>> the address space this mirroring applies and where it maps to.
>>>
>>> [Zeng, Oak]
>>>
>>> Let’s say we have a hardware platform where both CPU and GPU support a
>>> 57-bit virtual address range (used as an example; the statement applies
>>> to any address range). How do you decide “which part of the address
>>> space this mirroring applies” to? You have to mirror the whole address
>>> space [0~2^57-1], don’t you? As you designed it, the gigantic
>>> vm_bind/mirroring happens at process initialization time, and at that
>>> time you don’t know which part of the address space will be used for
>>> the GPU program. Remember that for the system allocator, *any* valid
>>> CPU address can be used by a GPU program. If you add an offset to
>>> [0~2^57-1], you get an address outside the 57-bit address range. Is
>>> this a valid concern?
>>>
>> Well, you can perfectly well mirror on demand. You just need something
>> similar to userfaultfd() for the GPU. This way you don't need to mirror
>> the full address space, but can rather work with large chunks created
>> on demand, let's say 1GiB or something like that.
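As a rough kernel-side sketch of that on-demand chunk idea: the
hmm_range_fault()/mmu_interval_notifier calls below are the real HMM
interfaces, while the gpusvm_* names, the chunk bookkeeping and the trimmed
error/teardown paths are purely illustrative.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sizes.h>
#include <linux/slab.h>

#define GPUSVM_CHUNK_SIZE	SZ_1G	/* granularity suggested above */

/* Made-up per-chunk bookkeeping; only the notifier is a real kernel type. */
struct gpusvm_chunk {
	struct mmu_interval_notifier notifier;
	unsigned long *pfns;
};

static bool gpusvm_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	/* A real driver would zap the GPU PTEs covering this range here. */
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

static const struct mmu_interval_notifier_ops gpusvm_notifier_ops = {
	.invalidate = gpusvm_invalidate,
};

/*
 * Called from the GPU page fault handler: instead of mirroring the whole
 * process address space up front, mirror just the chunk containing @addr.
 */
static int gpusvm_mirror_chunk(struct gpusvm_chunk *chunk,
			       struct mm_struct *mm, unsigned long addr)
{
	unsigned long start = ALIGN_DOWN(addr, GPUSVM_CHUNK_SIZE);
	unsigned long npages = GPUSVM_CHUNK_SIZE >> PAGE_SHIFT;
	struct hmm_range range = {
		.notifier	= &chunk->notifier,
		.start		= start,
		.end		= start + GPUSVM_CHUNK_SIZE,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	int ret;

	chunk->pfns = kvcalloc(npages, sizeof(*chunk->pfns), GFP_KERNEL);
	if (!chunk->pfns)
		return -ENOMEM;
	range.hmm_pfns = chunk->pfns;

	ret = mmu_interval_notifier_insert(&chunk->notifier, mm, start,
					   GPUSVM_CHUNK_SIZE,
					   &gpusvm_notifier_ops);
	if (ret)
		return ret;

again:
	range.notifier_seq = mmu_interval_read_begin(&chunk->notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);	/* faults CPU pages in as needed */
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;		/* e.g. -EFAULT: no CPU vma there */

	/*
	 * A real driver takes its page table lock here, retries if the
	 * snapshot went stale, and otherwise writes GPU PTEs from
	 * range.hmm_pfns.
	 */
	if (mmu_interval_read_retry(&chunk->notifier, range.notifier_seq))
		goto again;

	return 0;
}

The point being that nothing above requires knowing up front which parts
of the CPU address space the application will eventually touch.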
>
> What we're looking at as the current design is an augmented userptr
> (A..B -> C..D mapping) which is internally sparsely populated in
> chunks. The KMD manages the population using gpu pagefaults. We
> acknowledge that some parts of this mirror will not have a valid CPU
> mapping; that is, there is no vma, so a gpu page-fault that resolves to
> such a mirror address will cause an error. Would you have any concerns /
> objections against such an approach?

Nope, as far as I can see that sounds like a perfectly valid design to me.
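For illustration only, the bookkeeping for such an augmented userptr might
look roughly like this. All names are hypothetical, not existing xe
structures, and the chunk population itself would follow the
hmm_range_fault() pattern sketched further up.

#include <linux/sizes.h>
#include <linux/types.h>
#include <linux/xarray.h>

#define XE_SVM_CHUNK_SIZE	SZ_1G	/* population granularity, made up */

/*
 * Hypothetical augmented userptr: one CPU range A..B bound to a GPU range
 * starting at C, internally populated chunk by chunk from the GPU page
 * fault handler.
 */
struct xe_svm_userptr {
	u64 cpu_start;		/* A */
	u64 cpu_end;		/* B */
	u64 gpu_start;		/* C, so D = C + (B - A) */
	struct xarray chunks;	/* sparsely populated, indexed by chunk number */
};

/* Made-up helper: mirrors one chunk using the hmm_range_fault() pattern. */
int xe_svm_populate_chunk(struct xe_svm_userptr *uptr, unsigned long idx);

static u64 xe_svm_gpu_to_cpu(struct xe_svm_userptr *uptr, u64 gpu_addr)
{
	return uptr->cpu_start + (gpu_addr - uptr->gpu_start);
}

static int xe_svm_handle_gpu_fault(struct xe_svm_userptr *uptr, u64 gpu_addr)
{
	u64 cpu_addr = xe_svm_gpu_to_cpu(uptr, gpu_addr);
	unsigned long idx = (cpu_addr - uptr->cpu_start) / XE_SVM_CHUNK_SIZE;

	if (xa_load(&uptr->chunks, idx))
		return 0;	/* chunk already mirrored */

	/*
	 * Populate the chunk; if there is no CPU vma behind cpu_addr this
	 * fails (-EFAULT) and the GPU fault is reported as an error,
	 * exactly as described above.
	 */
	return xe_svm_populate_chunk(uptr, idx);
}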

Regards,
Christian.

>
> Thanks,
> Thomas


